USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots
- **Model ID:** Vincent2025hello/u0_final
- **Base Model:** nvidia/GR00T-N1.5-3B
- **License:** Apache 2.0
- **Paper:** [USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots](https://arxiv.org/abs/2510.07869)
This model is a Vision-Language-Action (VLA) policy fine-tuned from NVIDIA GR00T N1.5 (3B parameters) for the U0 underwater robot (based on BlueROV2). It takes dual-camera visual observations and multi-sensor state inputs, and outputs 16-step action trajectories for autonomous underwater tasks.
| Item | Value |
|---|---|
| Base Model | GR00T-N1.5-3B |
| Fine-Tuning Method | Full Fine-Tuning (with visual tuning) |
| Action Horizon | 16 steps |
| Denoising Steps | 4 (inference) |
| Embodiment Tag | new_embodiment |
| Data Config | u0_bot |
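To illustrate what the "Action Horizon: 16" and "Denoising Steps: 4" settings mean at inference time, here is a toy Euler-style iterative denoising loop. This is purely illustrative: the velocity model below is a stand-in, not the actual GR00T N1.5 action head, which conditions on vision, language, and state.

```python
import numpy as np

# Inference settings from the table above; action dim (14) is the sum
# of the two action outputs listed in this card (6 + 8).
ACTION_HORIZON, ACTION_DIM, DENOISE_STEPS = 16, 14, 4

def toy_velocity_model(actions, t):
    # Stand-in for the learned denoiser: simply pulls the chunk toward zero.
    # The real model predicts a velocity field conditioned on observations.
    return -actions

def denoise_action_chunk(rng):
    """Start from Gaussian noise and integrate for DENOISE_STEPS Euler steps."""
    actions = rng.standard_normal((ACTION_HORIZON, ACTION_DIM))
    dt = 1.0 / DENOISE_STEPS
    for step in range(DENOISE_STEPS):
        t = step * dt
        actions = actions + dt * toy_velocity_model(actions, t)
    return actions

chunk = denoise_action_chunk(np.random.default_rng(0))
print(chunk.shape)  # (16, 14)
```

The point is only that a full 16-step action trajectory is produced in a handful of denoising iterations, which keeps inference latency low on the robot.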
**State inputs:**

- `joint_pos` (6): joint positions
- `pwm` (8): thruster PWM values
- `joint_v` (5): joint velocities
- `dvl_v` (3): DVL velocity
- `imu_av` (3): IMU angular velocity
- `imu_la` (3): IMU linear acceleration
- `pressure` (1): depth pressure
- `dvl_h` (1): DVL altitude

**Action outputs:**

- `joint_pos` (6): target joint positions
- `pwm` (8): target thruster PWM values

**Download:**

```shell
pip install huggingface_hub
hf download Vincent2025hello/u0_final --local-dir ./u0_final
```
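As a sanity check on the interface described above, the following sketch builds a dummy observation matching the state keys and dimensions listed in this card. The dictionary layout is an assumption for illustration; consult the released data config (`u0_bot`) for the exact schema.

```python
import numpy as np

# State keys and dimensions as listed in this card (30-D in total).
STATE_DIMS = {
    "joint_pos": 6,   # joint positions
    "pwm": 8,         # thruster PWM values
    "joint_v": 5,     # joint velocities
    "dvl_v": 3,       # DVL velocity
    "imu_av": 3,      # IMU angular velocity
    "imu_la": 3,      # IMU linear acceleration
    "pressure": 1,    # depth pressure
    "dvl_h": 1,       # DVL altitude
}
ACTION_DIMS = {"joint_pos": 6, "pwm": 8}  # target positions / target PWM
ACTION_HORIZON = 16

def make_observation(rng):
    """Build a dummy observation dict matching the state spec above."""
    return {k: rng.standard_normal(d).astype(np.float32)
            for k, d in STATE_DIMS.items()}

def action_trajectory_shape():
    """The policy outputs a 16-step chunk over the 14-D action space."""
    return (ACTION_HORIZON, sum(ACTION_DIMS.values()))

obs = make_observation(np.random.default_rng(0))
assert sum(v.shape[0] for v in obs.values()) == 30
print(action_trajectory_shape())  # (16, 14)
```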
The complete fine-tuning and evaluation framework is available at: https://github.com/VincentGu2000/u0
```bibtex
@misc{gu2025usimu0visionlanguageactiondataset,
  title={USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots},
  author={Junwen Gu and Zhiheng Wu and Pengxuan Si and Shuang Qiu and Yukai Feng and Luoyang Sun and Laien Luo and Lianyi Yu and Jian Wang and Zhengxing Wu},
  year={2025},
  eprint={2510.07869},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.07869},
}
```
This model is fine-tuned from NVIDIA GR00T N1.5. We thank the NVIDIA GEAR team for open-sourcing the GR00T model and framework.