GR00T-N1.7-3B-PickOrange (self-trained, ckpt-6000)
An NVIDIA GR00T N1.7 policy (Cosmos-Reason2-2B VLM + DiT action head, ~3B params) fine-tuned from nvidia/GR00T-N1.7-3B for the LeIsaac SO-101 PickOrange task.
🔗 Project repos:
- vitorcen/isaaclab-experience: Isaac Lab + LeIsaac multi-policy comparison (parent project)
- vitorcen/LeIsaac-Training: LeIsaac fork (training scripts + design docs)
Highlights
ckpt-6000: 3/3 oranges placed perfectly — env reports success
TL;DR
- Task: SO-101 single-arm picks 3 oranges sequentially and places each in a plate (LeIsaac PickOrange).
- Architecture: GR00T N1.7, with a Cosmos-Reason2-2B VLM (frozen) + Gr00tN1d7 DiT action head (trainable, ~880M trainable / 2.2B total). action_horizon=40, 4-step rectified-flow denoising.
- Training: 6000 steps / batch=32 (per-step=8 × grad_accum=4) / adamw_torch / bf16 / no gradient_checkpointing.
- Hardware: single RTX 5090 32GB on AutoDL (cold-start from the Cosmos-Reason2-2B backbone, skip_weight_loading=True).
- 🏆 BENCH-aligned eval (5 rounds × 120s sim × 180s wall_cap, step_hz=60, h=40) vs LeIsaac leaderboard:
| Model | Strict envs | Oranges placed | Avg time |
|---|---|---|---|
| hi-space N1.7 (public SOTA) | 4/5 | 14/15 | 69s |
| 🏆 This ckpt-6000 | 4/5 | 14/15 ⭐ | 116s |
| wsagi N1.6 ckpt-6500 (previous self-trained SOTA) | 4/5 | 13/15 | 92s |
| hi-space N1.6 | 2/5 | 9/15 | 122s |
Tied with the hi-space N1.7 SOTA on strict envs and total oranges (single 5-round run), but about 68% slower per round (116s vs 69s). The gap likely comes from data collection speed or action smoothing rather than task capability, which matched in this run.
📊 Tighter statistics: 20-round (20 episodes, 60 oranges) BENCH distribution
Same EPISODE_LENGTH_S=120 / MAX_ROUND_WALL_S=180 / step_hz=60 / h=40 protocol, 60 oranges total:
| Placed per episode | Count | P(placed=k) |
|---|---|---|
| 3 (perfect) | 10 | 50.0% |
| 2 | 4 | 20.0% |
| 1 | 3 | 15.0% |
| 0 | 3 | 15.0% |
E(oranges/ep) = 2.05 / 3 = 68.3% pick rate (41/60 oranges placed).
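The four consecutive 5-round blocks in this run total 11, 11, 9, and 10 oranges, so the 5-round total has std σ ≈ 0.83 oranges (5.5% of 15) around a mean of ~10/15. A single 14/15 measurement is therefore more than 4σ above that mean and likely a lucky outlier, although a σ estimated from only four blocks is itself rough. The published 14/15 for hi-space's N1.7 (and ours) should be read as an upper-bound single-run result, not the steady-state SOTA.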
Per-episode raw data: placed_per_ep = [2, 0, 3, 3, 3, 2, 2, 3, 3, 1, 0, 3, 3, 0, 3, 1, 3, 3, 2, 1]. Note ep 12 placed 3/3 oranges but env didn't fire task_done (model didn't return arm to rest pose within wall_cap) — see "Wire protocol fix" + LeIsaac/scripts/evaluation/policy_inference.py retracted-middle early-end patch for future runs.
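The distribution table and summary numbers can be re-derived from the raw list. The following is a minimal sketch using only the Python standard library; it assumes that consecutive, non-overlapping blocks of five episodes are a fair stand-in for independent 5-round BENCH runs, which is an interpretation rather than something the eval script guarantees.

```python
from collections import Counter
from statistics import mean, pstdev

# Per-episode results copied from the raw data above (oranges placed, max 3).
placed_per_ep = [2, 0, 3, 3, 3, 2, 2, 3, 3, 1, 0, 3, 3, 0, 3, 1, 3, 3, 2, 1]

counts = Counter(placed_per_ep)                 # number of episodes per outcome k
total_placed = sum(placed_per_ep)               # oranges placed over the whole run
mean_per_ep = mean(placed_per_ep)               # E(oranges/ep)
pick_rate = total_placed / (3 * len(placed_per_ep))

# Assumption: each consecutive block of 5 episodes acts as one 5-round run.
block_totals = [sum(placed_per_ep[i:i + 5]) for i in range(0, len(placed_per_ep), 5)]
block_sigma = pstdev(block_totals)              # population std of the block totals
z_14 = (14 - mean(block_totals)) / block_sigma  # distance of a 14/15 run, in sigma

for k in (3, 2, 1, 0):
    share = counts[k] / len(placed_per_ep)
    print(f"placed={k}: {counts[k]:2d} episodes ({share:.1%})")
print(f"mean/ep={mean_per_ep:.2f}  pick rate={pick_rate:.1%}  ({total_placed}/60)")
print(f"5-round totals={block_totals}  sigma={block_sigma:.2f}  z(14/15)={z_14:.1f}")
```

With only four blocks the σ estimate is coarse, so the z-score is best read as an order-of-magnitude indication.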
Architecture / training recipe
base_model_path /path/to/cosmos_raw (cold-start, skip_weight_loading=True)
model_name nvidia/Cosmos-Reason2-2B (kept upstream default)
tune_llm False
tune_visual False
tune_projector True
tune_diffusion_model True
tune_top_llm_layers 0 ← match hi-space N1.7 recipe (default 4 produces broken ckpt with save_only_model=True)
backbone_trainable_params_fp32 True ← kept upstream default (5090 32GB has room)
optim adamw_torch ← 5090 32GB allows full AdamW
gradient_checkpointing False ← off on 5090 32GB (~25% throughput gain)
bf16 True
global_batch_size 32
gradient_accumulation_steps 4 ← per-step micro-batch = 8
dataloader_num_workers 4
max_steps 6000
save_steps 1200
save_only_model False ← CRITICAL: True loses 332/1030 frozen-backbone keys
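A few quantities follow directly from the recipe. The sketch below assumes a single GPU (as stated in the TL;DR), so the micro-batch is the global batch divided by the accumulation steps, and it assumes checkpoints are written every save_steps up to max_steps.

```python
# Values copied from the recipe above.
global_batch_size = 32
gradient_accumulation_steps = 4
max_steps = 6000
save_steps = 1200

# Assumption: single GPU, so micro-batch = global batch / accumulation steps.
micro_batch = global_batch_size // gradient_accumulation_steps

# Assumption: one checkpoint every save_steps, including the final step.
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))

# Samples consumed over the whole run (optimizer steps x global batch).
samples_seen = max_steps * global_batch_size

print(micro_batch)       # 8
print(checkpoint_steps)  # [1200, 2400, 3600, 4800, 6000]
print(samples_seen)      # 192000
```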
Performance optimizations (used during training)
Trained with LEISAAC_FRAME_CACHE_DIR + non_blocking H2D + prefetch_factor=4 perf stack — see LeIsaac/docs/training/gpu_dataloader_zero_copy.html:
- Pre-decoded video memmap cache: 60 ep × 2 cam → 62 GB uint8 cache, eliminates per-step H264 decode (~17× single-thread fps)
- non_blocking H2D copy: Trainer._prepare_input overlaps with the previous step's forward/backward (+5-11 pp GPU util)
- DataLoader prefetch_factor=4: workers stay 4 batches ahead of trainer (HF default 2)
- Step rate: 1.23 s/it stable on 5090 32GB (vs 2.2 s/it plus stalls without the perf stack; about 4.4× faster in wall-clock terms, ~9 h vs ~2 h)
- Wall clock: 6000 step ≈ 2 hours on 5090 (vs ~9 hours without perf)
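The pre-decoded cache idea can be illustrated with a NumPy memmap. This is a minimal sketch, not the LeIsaac implementation: the shapes, the file name, and the helper names build_cache and read_frame are hypothetical, and random bytes stand in for frames that would be decoded once from the H264 videos.

```python
import os
import tempfile

import numpy as np

# Hypothetical tiny shapes; the real cache covers 60 episodes x 2 cameras.
NUM_FRAMES, HEIGHT, WIDTH, CHANNELS = 8, 4, 6, 3


def build_cache(path, frames):
    """Write already-decoded uint8 frames into a memmap-backed .npy file."""
    cache = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.uint8, shape=frames.shape
    )
    cache[:] = frames
    cache.flush()
    del cache  # drop the writable mapping


def read_frame(path, index):
    """Fetch one frame without loading the file or running a video decoder."""
    cache = np.load(path, mmap_mode="r")
    return np.array(cache[index])  # copy the slice out of the mapping


# Stand-in for frames decoded once, ahead of training.
rng = np.random.default_rng(0)
frames = rng.integers(
    0, 256, size=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS), dtype=np.uint8
)

cache_path = os.path.join(tempfile.mkdtemp(), "cam0.npy")
build_cache(cache_path, frames)
frame = read_frame(cache_path, 5)
print(frame.shape, frame.dtype)
```

Each dataloader worker can open the same file read-only, so per-sample access becomes a page-cache read instead of a decode.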
Wire protocol fix (LeIsaac eval client)
N1.7 server (--use-sim-policy-wrapper) requires a 4-layer wire fix in LeIsaac/source/leisaac/leisaac/policy/service_policy_clients.py — see LeIsaac/docs/training/gr00t_n17_sim_wire_protocol_debug.html:
- wrap obs in a {"observation": flat_obs} envelope (sim_wrapper kwargs unpack)
- add T=1 axis to video/state: wrapper requires (B, T, H, W, C) 5D uint8 + (B, T, D) 3D float32
- msgpack-numpy bytes-key dict decode (server returns {b'nd': True, b'data': bytes, ...})
- mnp.decode requires bytes keys (str → bytes re-encoding required)
Use POLICY_TYPE=gr00tn1.5 in LeIsaac eval (N1.5 wire flow is compatible with N1.7 + sim_wrapper after the above fix).
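The four fixes can be sketched in a few lines. This is an illustration under stated assumptions, not the patch in service_policy_clients.py: the observation key names, the image resolution, and D=6 are hypothetical, and decode_ndarray is a simplified stand-in for msgpack-numpy's decoder that assumes the payload also carries b'type' and b'shape' entries.

```python
import numpy as np


def add_time_axis(batch):
    """Insert a T=1 axis after the batch axis: (B, ...) -> (B, 1, ...)."""
    return np.expand_dims(batch, axis=1)


def build_request(video, state):
    """Wrap flat observations in the {"observation": ...} envelope.

    The key names "video.front" and "state.single_arm" are hypothetical.
    """
    flat_obs = {
        "video.front": add_time_axis(video).astype(np.uint8),         # (B, T, H, W, C)
        "state.single_arm": add_time_axis(state).astype(np.float32),  # (B, T, D)
    }
    return {"observation": flat_obs}


def to_bytes_keys(obj):
    """Recursively re-encode str dict keys as bytes, e.g. "nd" -> b"nd"."""
    if isinstance(obj, dict):
        return {
            (k.encode() if isinstance(k, str) else k): to_bytes_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [to_bytes_keys(v) for v in obj]
    return obj


def decode_ndarray(payload):
    """Simplified stand-in for msgpack-numpy decoding; expects bytes keys."""
    if payload.get(b"nd") is not True:
        raise ValueError("not an ndarray payload (bytes key b'nd' missing)")
    dtype = payload[b"type"]
    if isinstance(dtype, bytes):
        dtype = dtype.decode()
    flat = np.frombuffer(payload[b"data"], dtype=dtype)
    return flat.reshape(payload[b"shape"])


# Request side: hypothetical 480x640 camera and 6-dim state.
video = np.zeros((1, 480, 640, 3), dtype=np.uint8)  # (B, H, W, C)
state = np.zeros((1, 6), dtype=np.float32)          # (B, D)
request = build_request(video, state)

# Response side: a str-keyed dict, as it might look after unpacking.
action = np.arange(6).astype("<f4")
str_keyed = {"nd": True, "type": "<f4", "shape": [6], "data": action.tobytes()}
decoded = decode_ndarray(to_bytes_keys(str_keyed))
print(request["observation"]["video.front"].shape, decoded)
```

Passing str_keyed to decode_ndarray directly raises ValueError, which mirrors why the str → bytes re-encoding step is needed before decoding.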
Training cost
- Optimized (perf stack): 6000 step on 5090 ≈ 2 hours × ¥4/h ≈ ¥8
- Without optimization: 6000 step on 5090 ≈ 9 hours × ¥4/h ≈ ¥36
Total cumulative cost for this project including initial setup, broken runs, and additional ckpt-7200/8400/9600 (later confirmed overfit): ~¥60.
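The per-run figures above are simple products of wall-clock hours and the ¥4/h rental price. The sketch below re-derives them from the numbers quoted in this card; the helper name wall_hours is hypothetical, and the last value is only the average step time that a 9-hour run would imply.

```python
STEPS = 6000
PRICE_CNY_PER_HOUR = 4  # ¥4/h, from the cost lines above


def wall_hours(steps, sec_per_step):
    """Convert a step count and a step time into wall-clock hours."""
    return steps * sec_per_step / 3600


# Optimized run: 1.23 s/it measured with the perf stack.
optimized_hours = wall_hours(STEPS, 1.23)

# Costs use the rounded hour figures quoted above (2 h and 9 h).
optimized_cost = 2 * PRICE_CNY_PER_HOUR
baseline_cost = 9 * PRICE_CNY_PER_HOUR

# Average step time a 9-hour, 6000-step run would imply.
implied_baseline_sec_per_step = 9 * 3600 / STEPS

print(f"optimized: {optimized_hours:.2f} h, about ¥{optimized_cost}")
print(f"baseline: 9 h, about ¥{baseline_cost}")
print(f"implied baseline step time: {implied_baseline_sec_per_step:.1f} s/it")
```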
Inference
Use Isaac-GR00T's run_gr00t_server.py directly:
cd /path/to/Isaac-GR00T
uv run --no-sync python gr00t/eval/run_gr00t_server.py \
--embodiment-tag NEW_EMBODIMENT \
--model-path wsagi/GR00T-N1.7-PickOrange \
--host 0.0.0.0 --port 5555 \
--use-sim-policy-wrapper
Then on the Isaac Sim eval side (LeIsaac), apply the wire-protocol fixes referenced above and run:
POLICY_PORT=5555 \
ACTION_HORIZON=40 POLICY_TYPE=gr00tn1.5 \
EVAL_ROUNDS=5 EPISODE_LENGTH=120 MAX_ROUND_WALL_S=180 \
PROMPT="Pick up the orange and put it in the plate" \
bash server/eval_gr00t.sh
License
Apache-2.0 / NVIDIA Open Model License (inherited from base nvidia/GR00T-N1.7-3B + Cosmos-Reason2-2B). See base model cards.