	---
	license: other
	license_name: nvidia-open-model-license
	license_link: https://developer.nvidia.com/open-model-license
	language:
	- en
	library_name: transformers
	tags:
	- robotics
	- vision-language-action
	- manipulation
	- gr00t
	- nvidia
	- physical-ai
	- humanoid
	- reachy2
	- lerobot
	datasets:
	- ganatrask/NOVA
	base_model:
	- nvidia/GR00T-N1.6-3B
	pipeline_tag: robotics
	---

	# NOVA Model - GR00T N1.6 Fine-tuned for Reachy 2

	<p align="center">
	<img src="https://img.shields.io/badge/NVIDIA-GR00T%20N1.6-76B900?style=for-the-badge&logo=nvidia" alt="GR00T N1.6"/>
	<img src="https://img.shields.io/badge/Robot-Reachy%202-0066CC?style=for-the-badge" alt="Reachy 2"/>
	<img src="https://img.shields.io/badge/Task-Pick%20%26%20Place-green?style=for-the-badge" alt="Pick & Place"/>
	</p>

	NOVA (Neural Open Vision Actions) is a fine-tuned version of NVIDIA's GR00T N1.6 vision-language-action model, trained specifically for [Pollen Robotics' Reachy 2](https://www.pollen-robotics.com/reachy/) humanoid robot.

	## Model Description

	This model is part of an end-to-end Physical AI pipeline that combines:
	- Voice Input: Parakeet CTC 0.6B for speech-to-text
	- Scene Reasoning: Cosmos Reason 2 for object detection and spatial understanding
	- Action Policy: This fine-tuned GR00T N1.6 model for manipulation
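
The three stages can be pictured as a simple chain. The sketch below only illustrates the data flow: `transcribe`, `locate_objects`, and `act` are hypothetical stand-ins for the Parakeet, Cosmos Reason 2, and GR00T calls, not the real interfaces of those models. How the scene-reasoning output conditions the policy is not specified here, so the sketch simply returns it alongside the action.

```python
def run_pipeline(audio, image, joints, transcribe, locate_objects, act):
    """Chain speech-to-text, scene reasoning, and the action policy.

    All three callables are hypothetical placeholders:
      transcribe(audio) -> instruction text
      locate_objects(image, instruction) -> scene description
      act(image, joints, instruction) -> action vector
    """
    instruction = transcribe(audio)
    scene = locate_objects(image, instruction)
    action = act(image, joints, instruction)
    return {"instruction": instruction, "scene": scene, "action": action}
```

Passing the stages in as arguments keeps the sketch independent of any particular model loader, so each stage can be swapped for a stub while testing the others.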

	### Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B) |
	| Parameters | ~3B |
	| Embodiment | Reachy 2 (custom embodiment tag) |
	| Action Space | 8-DOF (7 arm joints + gripper) |
	| Training Steps | 30,000 |
	| Final Loss | ~0.008 to 0.01 |

	### Action Space

	```python
	action = [
	    shoulder_pitch,  # -180° to 90°
	    shoulder_roll,   # -180° to 10°
	    elbow_yaw,       # -90° to 90°
	    elbow_pitch,     # -125° to 0°
	    wrist_roll,      # -100° to 100°
	    wrist_pitch,     # -45° to 45°
	    wrist_yaw,       # -30° to 30°
	    gripper,         # 0 (closed) to 1 (open)
	]
	```
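
As a safeguard before sending commands to the robot, the limits listed above can be enforced in code. This is a minimal sketch, not part of the NOVA codebase: `clamp_action` is a hypothetical helper, and it assumes the policy output has already been converted to degrees for the joints and to the 0 to 1 range for the gripper.

```python
# Joint limits in degrees, copied from the action layout above.
JOINT_LIMITS_DEG = [
    ("shoulder_pitch", -180.0, 90.0),
    ("shoulder_roll", -180.0, 10.0),
    ("elbow_yaw", -90.0, 90.0),
    ("elbow_pitch", -125.0, 0.0),
    ("wrist_roll", -100.0, 100.0),
    ("wrist_pitch", -45.0, 45.0),
    ("wrist_yaw", -30.0, 30.0),
]
GRIPPER_RANGE = (0.0, 1.0)  # 0 = closed, 1 = open


def clamp_action(action):
    """Clip an 8-element action (7 joint angles + gripper) to its limits."""
    if len(action) != len(JOINT_LIMITS_DEG) + 1:
        raise ValueError(f"expected 8 values, got {len(action)}")
    bounds = [(lo, hi) for _, lo, hi in JOINT_LIMITS_DEG] + [GRIPPER_RANGE]
    return [min(max(float(v), lo), hi) for v, (lo, hi) in zip(action, bounds)]


print(clamp_action([120, 0, 0, -130, 0, 0, 45, 1.2]))
# [90.0, 0.0, 0.0, -125.0, 0.0, 0.0, 30.0, 1.0]
```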

	## Intended Use

	This model is designed for:
	- Pick-and-place manipulation tasks on the Reachy 2 robot
	- Language-conditioned control ("Pick up the red cube")
	- Research in vision-language-action models and robotic manipulation

	### Supported Tasks

	- Pick up objects (cube, cylinder, capsule, rectangular box)
	- Place objects in target locations
	- Handle 8 color variations (red, green, blue, yellow, cyan, magenta, orange, purple)

	## Training

	### Training Data

	Trained on the [ganatrask/NOVA dataset](https://huggingface.co/datasets/ganatrask/NOVA):
	- 100 episodes of expert demonstrations
	- 32 task variations (4 objects × 8 colors)
	- Domain randomization (position, lighting, camera jitter)
	- LeRobot v2.1 format
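
The 32 task variations are the cross product of the four object shapes and eight colors listed under Supported Tasks. The sketch below enumerates them; the prompt template is an assumption modeled on the example instruction "Pick up the red cube", and the dataset's actual wording may differ.

```python
from itertools import product

OBJECTS = ["cube", "cylinder", "capsule", "rectangular box"]
COLORS = ["red", "green", "blue", "yellow", "cyan", "magenta", "orange", "purple"]

# Assumed prompt template, modeled on "Pick up the red cube".
task_prompts = [f"Pick up the {color} {obj}" for obj, color in product(OBJECTS, COLORS)]

print(len(task_prompts))  # 32
print(task_prompts[0])    # Pick up the red cube
```

With 100 episodes, this works out to roughly three demonstrations per variation if the episodes are spread evenly, which the dataset card does not state.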

	### Training Configuration

	| Parameter | Value |
	|-----------|-------|
	| GPU Model | NVIDIA A100-SXM4-80GB |
	| GPU Count | 2 |
	| Batch Size | 64 |
	| Max Steps | 30,000 |
	| Save Steps | 3,000 |
	| Video Backend | decord |

	### Training Command

	```bash
	python -m gr00t.train \
	    --dataset_repo_id ganatrask/NOVA \
	    --embodiment_tag reachy2 \
	    --video_backend decord \
	    --num_gpus 2 \
	    --batch_size 64 \
	    --max_steps 30000 \
	    --save_steps 3000 \
	    --output_dir ./checkpoints/groot-reachy2
	```
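
The step settings in the command imply a checkpoint schedule and a total sample count, which the following arithmetic sketch works out. It assumes a checkpoint is written every `--save_steps` steps through the final step, and that `--batch_size` is the global batch per optimizer step; neither is confirmed here.

```python
MAX_STEPS = 30_000
SAVE_STEPS = 3_000
BATCH_SIZE = 64

# Steps at which a checkpoint would be written (assumed: one save every SAVE_STEPS).
checkpoint_steps = list(range(SAVE_STEPS, MAX_STEPS + 1, SAVE_STEPS))

# Samples consumed (assumed: BATCH_SIZE is the global batch, not per GPU).
samples_seen = MAX_STEPS * BATCH_SIZE

print(len(checkpoint_steps))  # 10
print(checkpoint_steps[-1])   # 30000
print(samples_seen)           # 1920000
```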

	## Usage

	### Prerequisites

	Before loading the model, patch Isaac-GR00T to add the Reachy 2 embodiment tag:

	```bash
	cd Isaac-GR00T
	patch -p1 < ../patches/add_reachy2_embodiment.patch
	```

	### Inference

	```python
	import importlib.util

	import numpy as np

	from gr00t.data.embodiment_tags import EmbodimentTag
	from gr00t.policy.gr00t_policy import Gr00tPolicy

	# Load modality config first
	spec = importlib.util.spec_from_file_location(
	    "modality_config",
	    "configs/reachy2_modality_config.py",
	)
	module = importlib.util.module_from_spec(spec)
	spec.loader.exec_module(module)

	# Load policy
	policy = Gr00tPolicy(
	    embodiment_tag=EmbodimentTag.REACHY2,
	    model_path="ganatrask/NOVA",  # or local checkpoint path
	    device="cuda",
	    strict=True,
	)

	# Placeholder inputs so the snippet runs end to end; replace with real data.
	# Shapes follow the comments below; the dtypes are assumptions.
	image = np.zeros((224, 224, 3), dtype=np.uint8)  # front camera frame (H, W, 3)
	joints = np.zeros(7, dtype=np.float32)           # 7 arm joint positions

	# Run inference
	obs = {
	    "video": {"front_cam": image[None, None, :, :, :]},  # (1, 1, H, W, 3)
	    "state": {"arm_joints": joints[None, None, :]},      # (1, 1, 7)
	    "language": {"annotation.human.task_description": [["Pick up the red cube"]]},
	}
	action, _ = policy.get_action(obs)
	```
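
Shape mistakes in the observation dictionary are easy to make, so it can help to validate it before calling the policy. The helper below is a hypothetical sketch, not part of Isaac-GR00T: it only checks the layout indicated by the shape comments in the inference example, and works with any array type that exposes a `.shape` attribute, such as NumPy arrays.

```python
EXPECTED_STATE_DIM = 7  # arm joints, per the (1, 1, 7) shape in the inference example


def check_observation(obs):
    """Raise ValueError if the observation does not match the expected layout."""
    image = obs["video"]["front_cam"]
    joints = obs["state"]["arm_joints"]
    text = obs["language"]["annotation.human.task_description"]

    if len(image.shape) != 5 or image.shape[-1] != 3:
        raise ValueError(f"front_cam must be (B, T, H, W, 3), got {tuple(image.shape)}")
    if len(joints.shape) != 3 or joints.shape[-1] != EXPECTED_STATE_DIM:
        raise ValueError(f"arm_joints must be (B, T, 7), got {tuple(joints.shape)}")
    if image.shape[0] != joints.shape[0] or len(text) != image.shape[0]:
        raise ValueError("batch size differs between video, state, and language")
    return True
```

Calling `check_observation(obs)` right before `policy.get_action(obs)` turns a silent shape mismatch into an explicit error message.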

	## Performance

	| Metric | Value |
	|--------|-------|
	| Inference Speed | ~40 ms/step (A100) |
	| VRAM Usage | ~44 GB of 80 GB |
	| Training Time | ~6 hours (30K steps) |
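
These figures translate into rates with a little arithmetic. The sketch below assumes that inference latency is the only cost in the control loop, so the resulting rate is an upper bound rather than a measured control frequency.

```python
INFERENCE_MS_PER_STEP = 40  # ~40 ms/step on an A100
TRAINING_HOURS = 6          # ~6 hours
TRAINING_STEPS = 30_000

# Upper bound on policy calls per second if inference is the only cost.
max_policy_rate_hz = 1000 / INFERENCE_MS_PER_STEP

# Average wall-clock time per training step.
seconds_per_train_step = TRAINING_HOURS * 3600 / TRAINING_STEPS

print(max_policy_rate_hz)      # 25.0
print(seconds_per_train_step)  # 0.72
```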

	## Limitations

	- Simulation-trained: trained primarily on MuJoCo simulation data
	- Single-arm: currently supports right-arm manipulation only
	- Fixed camera setup: Expects front camera input at 224×224 resolution
	- Task scope: Optimized for pick-and-place; may not generalize to other manipulation tasks

	## Ethical Considerations

	- This model is intended for research use
	- Human supervision is recommended for real-robot deployment
	- Not intended for safety-critical applications without extensive testing

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{nova2025,
	title={NOVA: Neural Open Vision Actions},
	author={ganatrask},
	year={2025},
	publisher={HuggingFace},
	url={https://huggingface.co/ganatrask/NOVA}
	}
	```

	## Acknowledgments

	- [NVIDIA](https://developer.nvidia.com/) - GR00T N1.6 base model
	- [Pollen Robotics](https://www.pollen-robotics.com/) - Reachy 2 robot
	- [HuggingFace](https://huggingface.co/) - LeRobot framework
	- [VESSL AI](https://vessl.ai/) - GPU compute for training

	## License

	This model inherits the [NVIDIA Open Model License](https://developer.nvidia.com/open-model-license) from the base GR00T N1.6 model.

	## Links

	- GitHub: [ganatrask/NOVA](https://github.com/ganatrask/NOVA)
	- Dataset: [ganatrask/NOVA](https://huggingface.co/datasets/ganatrask/NOVA)
	- Base Model: [nvidia/GR00T-N1.6-3B](https://huggingface.co/nvidia/GR00T-N1.6-3B)