GR00T-N1.7-3B-PickOrange (self-trained, ckpt-6000)

A GR00T policy (NVIDIA GR00T N1.7: Cosmos-Reason2-2B VLM + DiT action head, ~3B params) fine-tuned from nvidia/GR00T-N1.7-3B for the LeIsaac SO-101 PickOrange task.

🔗 Project repos:

vitorcen/isaaclab-experience: Isaac Lab + LeIsaac multi-policy benchmark comparison (parent project)
vitorcen/LeIsaac-Training: LeIsaac fork (training scripts + design docs)

Highlights

ckpt-6000: 3/3 oranges placed perfectly — env reports success

TL;DR

Task: SO-101 single-arm picks 3 oranges sequentially and places each in a plate (LeIsaac PickOrange).
Architecture: GR00T N1.7 — Cosmos-Reason2-2B VLM (frozen) + Gr00tN1d7 DiT action head (trainable, ~880M trainable / 2.2B total). action_horizon=40, 4-step rectified-flow denoising.
Training: 6000 steps / batch=32 (micro-batch=8 × grad_accum=4) / adamw_torch / bf16 / no gradient_checkpointing.
Hardware: single RTX 5090 32GB on AutoDL (cold-start from the Cosmos-Reason2-2B backbone, skip_weight_loading=True).
🏆 BENCH-aligned eval (5 rounds, 120 s sim limit and 180 s wall_cap per round, step_hz=60, h=40) vs LeIsaac leaderboard:

Model	Strict envs	Oranges placed	Avg time
hi-space N1.7 (public SOTA)	4/5	14/15	69s
🏆 This ckpt-6000	4/5	14/15 ⭐	116s
wsagi N1.6 ckpt-6500 (previous self-trained SOTA)	4/5	13/15	92s
hi-space N1.6	2/5	9/15	122s

Tied with the hi-space N1.7 SOTA on strict envs and total oranges in a single 5-round run, but took about 68% longer per round (116 s vs 69 s). The gap likely comes from demonstration collection speed or action smoothing rather than from a difference in model capability.
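The "4-step rectified-flow denoising" with action_horizon=40 listed in the TL;DR can be pictured as four Euler steps along a velocity field, starting from Gaussian noise and ending at a chunk of 40 actions. The sketch below is only a toy illustration, not the GR00T code: `toy_velocity` is a hypothetical stand-in for the DiT action head, and `ACTION_DIM = 6` is an assumption about the SO-101 action space.

```python
import random

ACTION_HORIZON = 40   # from the model card
NUM_STEPS = 4         # from the model card
ACTION_DIM = 6        # assumption: 5 arm joints + gripper


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the DiT head: points straight at `target`."""
    return [[(tg - xi) / (1.0 - t) for xi, tg in zip(row_x, row_t)]
            for row_x, row_t in zip(x, target)]


def sample_action_chunk(target, num_steps=NUM_STEPS, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
         for _ in range(ACTION_HORIZON)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t = 0, 0.25, 0.5, 0.75
        v = toy_velocity(x, t, target)
        x = [[xi + dt * vi for xi, vi in zip(row_x, row_v)]
             for row_x, row_v in zip(x, v)]
    return x


# Synthetic target chunk, used only so the toy velocity has somewhere to point.
target = [[0.1 * j for j in range(ACTION_DIM)] for _ in range(ACTION_HORIZON)]
chunk = sample_action_chunk(target)
print(len(chunk), len(chunk[0]))
```

Because the toy velocity follows a straight path, four Euler steps land on the target; a learned velocity field is only approximately straight, which is why the number of denoising steps is a quality/latency trade-off.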

📊 Tighter statistics: 20-round BENCH distribution (20 episodes, 60 oranges)

Same EPISODE_LENGTH_S=120 / MAX_ROUND_WALL_S=180 / step_hz=60 / h=40 protocol, 60 oranges total:

Placed per episode	Count	P(placed=k)
3 (perfect)	10	50.0%
2	4	20.0%
1	3	15.0%
0	3	15.0%

E(oranges/ep) = 2.05 / 3 = 68.3% pick rate (41/60 oranges placed).

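The 5-round std σ ≈ 0.83 oranges (5.5% of 15) is consistent with splitting the 20 rounds into four consecutive 5-round blocks, whose totals are 11, 11, 9, and 10 oranges. By that estimate a single 14/15 measurement sits about 4σ above the long-run mean (~10/15) and is likely a lucky outlier, although four blocks are a small sample for estimating σ. The published 14/15 for hi-space's N1.7 (and ours) should be read as an upper-bound single-run result, not the steady-state SOTA.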

Per-episode raw data: placed_per_ep = [2, 0, 3, 3, 3, 2, 2, 3, 3, 1, 0, 3, 3, 0, 3, 1, 3, 3, 2, 1]. Note that ep 12 placed 3/3 oranges but the env did not fire task_done, because the model did not return the arm to its rest pose within wall_cap. See "Wire protocol fix" and the retracted-middle early-end patch in LeIsaac/scripts/evaluation/policy_inference.py for future runs.
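The figures in this section can be reproduced from the raw list with a few lines of standard-library Python. The last quantity is an extra check that is not in the original analysis: it assumes each round is an independent draw from the empirical distribution, which is a simplification.

```python
from collections import Counter
from statistics import mean, pstdev

placed_per_ep = [2, 0, 3, 3, 3, 2, 2, 3, 3, 1, 0, 3, 3, 0, 3, 1, 3, 3, 2, 1]

counts = Counter(placed_per_ep)                  # histogram of oranges placed
total_placed = sum(placed_per_ep)                # oranges placed over all rounds
pick_rate = total_placed / (3 * len(placed_per_ep))

# Totals of consecutive 5-round blocks, comparable to one leaderboard run.
block_totals = [sum(placed_per_ep[i:i + 5]) for i in range(0, 20, 5)]
block_std = pstdev(block_totals)                 # population std over 4 blocks

# Chance of >= 14/15 in one 5-round run: five 3s, or four 3s and one 2.
# Assumes independent rounds drawn from the empirical distribution.
p3 = counts[3] / len(placed_per_ep)
p2 = counts[2] / len(placed_per_ep)
p_at_least_14 = p3 ** 5 + 5 * p3 ** 4 * p2

print(dict(sorted(counts.items())))   # {0: 3, 1: 3, 2: 4, 3: 10}
print(mean(placed_per_ep), round(pick_rate, 3))   # 2.05 0.683
print(block_totals, round(block_std, 2))          # [11, 11, 9, 10] 0.83
print(round(p_at_least_14, 3))                    # 0.094
```

Under the independence assumption a 14/15 run has roughly a 9% chance, so it is an optimistic draw either way; the block-based σ and the independence-based estimate simply weigh the same 20 episodes differently.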

Architecture / training recipe

base_model_path                  /path/to/cosmos_raw (cold-start, skip_weight_loading=True)
model_name                       nvidia/Cosmos-Reason2-2B (kept upstream default)
tune_llm                         False
tune_visual                      False
tune_projector                   True
tune_diffusion_model             True
tune_top_llm_layers              0          ← match hi-space N1.7 recipe (default 4 produces broken ckpt with save_only_model=True)
backbone_trainable_params_fp32   True       ← kept upstream default (5090 32GB has room)
optim                            adamw_torch ← 5090 32GB allows full AdamW
gradient_checkpointing           False      ← off on 5090 32GB (~25% throughput gain)
bf16                             True
global_batch_size                32
gradient_accumulation_steps      4          ← per-step micro-batch = 8
dataloader_num_workers           4
max_steps                        6000
save_steps                       1200
save_only_model                  False      ← CRITICAL: True loses 332/1030 frozen-backbone keys
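The save_only_model warning above boils down to a set difference between the parameter names the model expects and the names actually stored in the checkpoint. A minimal sketch of that check follows; the key names are hypothetical, and in practice the two lists would come from something like `model.state_dict().keys()` and the checkpoint's weight index.

```python
def find_missing_keys(expected_keys, saved_keys):
    """Return parameter names the model expects but the checkpoint lacks."""
    return sorted(set(expected_keys) - set(saved_keys))


# Hypothetical key names, for illustration only.
expected = [
    "backbone.layer0.weight",
    "backbone.layer1.weight",
    "action_head.proj.weight",
]
saved = ["action_head.proj.weight"]   # e.g. only trainable weights were written

missing = find_missing_keys(expected, saved)
print(f"{len(missing)}/{len(expected)} keys missing: {missing}")
```

Running a check like this right after the first save_steps checkpoint would catch a truncated checkpoint long before the 6000-step run finishes.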

Performance optimizations (used during training)

Trained with LEISAAC_FRAME_CACHE_DIR + non_blocking H2D + prefetch_factor=4 perf stack — see LeIsaac/docs/training/gpu_dataloader_zero_copy.html:

Pre-decoded video memmap cache: 60 ep × 2 cam → 62 GB uint8 cache, eliminates per-step H264 decode (~17× single-thread fps)
non_blocking H2D copy: Trainer._prepare_input overlap with previous step's forward/backward (+5-11 pp GPU util)
DataLoader prefetch_factor=4: workers stay 4 batches ahead of trainer (HF default 2)
Step rate: a stable 1.23 s/it on the 5090 32GB (vs 2.2 s/it plus stalls without the perf stack; the 4.4× speedup is end to end, matching the wall-clock figures below)
Wall clock: 6000 steps ≈ 2 hours on the 5090 (vs ~9 hours without the perf stack)
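The pre-decoded cache idea can be sketched with a NumPy memory-mapped file: frames are decoded once, written as raw uint8, and later read by index without touching the video decoder. This is a minimal sketch under assumptions, not the LeIsaac implementation; the file name, frame shape, and helper names are hypothetical.

```python
import os
import tempfile

import numpy as np


def build_frame_cache(path, frames):
    """Write already-decoded frames (N, H, W, C) uint8 into a .npy memmap."""
    cache = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.uint8, shape=frames.shape)
    cache[:] = frames
    cache.flush()
    return path


def read_frame(path, index):
    """Fetch one frame without decoding video or loading the whole file."""
    cache = np.load(path, mmap_mode="r")
    return np.array(cache[index])   # copy only this frame into RAM


# Tiny synthetic "episode": 8 frames of 4x6 RGB standing in for decoded H.264.
frames = (np.arange(8 * 4 * 6 * 3) % 256).astype(np.uint8).reshape(8, 4, 6, 3)
cache_path = os.path.join(tempfile.mkdtemp(), "ep000_cam0.npy")
build_frame_cache(cache_path, frames)
frame5 = read_frame(cache_path, 5)
print(frame5.shape, frame5.dtype)
```

A dataset's `__getitem__` would call something like `read_frame` per sample, so DataLoader workers spend their time on page-cache reads instead of H.264 decoding.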

Wire protocol fix (LeIsaac eval client)

N1.7 server (--use-sim-policy-wrapper) requires a 4-layer wire fix in LeIsaac/source/leisaac/leisaac/policy/service_policy_clients.py — see LeIsaac/docs/training/gr00t_n17_sim_wire_protocol_debug.html:

wrap obs in {"observation": flat_obs} envelope (sim_wrapper kwargs unpack)
add T=1 axis to video/state — wrapper requires (B, T, H, W, C) 5D uint8 + (B, T, D) 3D float32
msgpack-numpy bytes-key dict decode (server returns {b'nd': True, b'data': bytes, ...})
mnp.decode requires bytes keys (str → bytes re-encoding required)

Use POLICY_TYPE=gr00tn1.5 in LeIsaac eval (N1.5 wire flow is compatible with N1.7 + sim_wrapper after the above fix).
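The four fixes listed above can be sketched in plain NumPy. This is an illustration under assumptions rather than the patched client: the observation key names and image size are hypothetical, and the decoder assumes the bytes-keyed dict also carries `b"type"` and `b"shape"` entries next to `b"nd"` and `b"data"`.

```python
import numpy as np


def add_time_axis(arr):
    """(B, ...) -> (B, 1, ...): insert the T=1 axis the wrapper expects."""
    return np.expand_dims(arr, axis=1)


def build_request(video, state):
    """Wrap flat observations in the {"observation": ...} envelope."""
    flat_obs = {
        # Hypothetical key names; use the ones your modality config defines.
        "video.front": add_time_axis(video.astype(np.uint8)),         # (B,T,H,W,C)
        "state.single_arm": add_time_axis(state.astype(np.float32)),  # (B,T,D)
    }
    return {"observation": flat_obs}


def to_bytes_keys(obj):
    """Re-encode str keys as bytes so a bytes-key decoder accepts the dict."""
    return {k.encode() if isinstance(k, str) else k: v for k, v in obj.items()}


def decode_ndarray(obj):
    """Rebuild an array from a bytes-keyed msgpack-numpy style dict."""
    if isinstance(obj, dict) and obj.get(b"nd"):
        dtype = obj[b"type"]
        if isinstance(dtype, bytes):
            dtype = dtype.decode()
        flat = np.frombuffer(obj[b"data"], dtype=np.dtype(dtype))
        return flat.reshape(obj[b"shape"])
    return obj   # not an encoded array: pass through unchanged


request = build_request(np.zeros((1, 48, 64, 3)), np.zeros((1, 6)))
action = np.arange(12, dtype=np.float32).reshape(2, 6)
wire = {"nd": True, "type": "<f4", "shape": (2, 6), "data": action.tobytes()}
decoded = decode_ndarray(to_bytes_keys(wire))
print(request["observation"]["video.front"].shape, decoded.shape)
```

Keeping each fix in its own small function makes it easier to test the layers separately when the server rejects a request.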

Training cost

Optimized (perf stack): 6000 step on 5090 ≈ 2 hours × ¥4/h ≈ ¥8
Without optimization: 6000 step on 5090 ≈ 9 hours × ¥4/h ≈ ¥36
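As a quick sanity check, the cost figures follow from the step rate and the hourly price. The sketch below uses only numbers quoted in this card; the function name is arbitrary.

```python
MAX_STEPS = 6000
SEC_PER_IT_OPTIMIZED = 1.23   # measured step rate with the perf stack
PRICE_CNY_PER_HOUR = 4.0      # hourly price quoted above


def training_cost(steps, sec_per_it, price_per_hour):
    """Return (hours, cost) for a run at a constant step rate."""
    hours = steps * sec_per_it / 3600.0
    return hours, hours * price_per_hour


hours, cost = training_cost(MAX_STEPS, SEC_PER_IT_OPTIMIZED, PRICE_CNY_PER_HOUR)
print(f"optimized: {hours:.2f} h, about {cost:.1f} CNY")   # 2.05 h, 8.2 CNY

# The unoptimized figure is quoted as wall clock (~9 h), so invert it to get
# the implied average step time including stalls.
implied_sec_per_it = 9 * 3600 / MAX_STEPS
print(f"unoptimized: {implied_sec_per_it:.1f} s/it on average")   # 5.4 s/it
```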

Total cumulative cost for this project including initial setup, broken runs, and additional ckpt-7200/8400/9600 (later confirmed overfit): ~¥60.

Inference

Use Isaac-GR00T's run_gr00t_server.py directly:

cd /path/to/Isaac-GR00T
uv run --no-sync python gr00t/eval/run_gr00t_server.py \
    --embodiment-tag NEW_EMBODIMENT \
    --model-path wsagi/GR00T-N1.7-PickOrange \
    --host 0.0.0.0 --port 5555 \
    --use-sim-policy-wrapper

Then on the Isaac Sim eval side (LeIsaac), apply the wire-protocol fixes referenced above and run:

POLICY_PORT=5555 \
ACTION_HORIZON=40 POLICY_TYPE=gr00tn1.5 \
EVAL_ROUNDS=5 EPISODE_LENGTH=120 MAX_ROUND_WALL_S=180 \
PROMPT="Pick up the orange and put it in the plate" \
bash server/eval_gr00t.sh
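The eval settings imply a rough per-round budget. The arithmetic below assumes the client executes a full 40-action chunk before querying the policy again, which may not match the actual eval loop.

```python
STEP_HZ = 60            # sim control rate (step_hz)
EPISODE_LENGTH_S = 120  # sim-time limit per round
ACTION_HORIZON = 40     # actions returned per policy call
MAX_ROUND_WALL_S = 180  # wall-clock cap per round

sim_steps = STEP_HZ * EPISODE_LENGTH_S              # control steps per round
policy_calls = -(-sim_steps // ACTION_HORIZON)      # ceiling division
chunk_sim_seconds = ACTION_HORIZON / STEP_HZ        # sim time covered per call
wall_budget_per_chunk = MAX_ROUND_WALL_S / policy_calls

print(sim_steps, policy_calls)                      # 7200 180
print(round(chunk_sim_seconds, 3), wall_budget_per_chunk)   # 0.667 1.0
```

Under that assumption, inference plus simulation of one chunk has to finish in about 1 s of wall time on average, otherwise the wall cap ends the round before the sim-time limit does.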

License

Apache-2.0 / NVIDIA Open Model License (inherited from base nvidia/GR00T-N1.7-3B + Cosmos-Reason2-2B). See base model cards.
