SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
🍕AIML, University of Adelaide 🌭Adobe Research 🍔UNC, Chapel Hill 🌮UNSW Sydney
Model Description
SAME (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both high-level category-specific search (e.g., "find a chair") and low-level language-guided navigation (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.
Key Features
- Multi-Task Capability: Single model handles 9 different navigation datasets simultaneously
- State-Adaptive MoE: Dynamic expert routing based on multimodal features (text + visual observations)
- Simulator-Free: Works entirely with pre-computed CLIP ViT-B/16 features - no simulator installation required
- Flexible Architecture: MoE can be placed at attention query, key-value, or feed-forward network positions
Model Architecture
SAME is built on a transformer-based architecture with the following key components:
| Component | Description |
|---|---|
| Language Encoder | 9-layer BERT-based transformer encoder |
| Image Embeddings | Processes 512-dim CLIP ViT-B/16 panoramic features |
| Local VP Encoder | Viewport-level information with crossmodal fusion |
| Global Map Encoder | Global spatial graph with dynamic routing |
| State-Adaptive MoE | 8 experts with top-2 selection, multimodal routing |
MoE Routing
The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This allows the model to adapt its behavior based on:
- The granularity of language instructions
- Current visual observations
- Navigation task requirements
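The routing idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the linear router, the helper names (`route_top_k`, `moe_forward`), and the toy experts are all assumptions. It scores each expert from a fused text + visual state vector, keeps the top 2, and renormalizes their gates.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(state_feature, router_weights, k=2):
    """Pick the top-k experts for one multimodal state feature.

    state_feature:  fused text + visual vector (assumed input).
    router_weights: one weight vector per expert (hypothetical linear router).
    Returns (expert_index, gate) pairs whose gates sum to 1.
    """
    logits = [sum(w * x for w, x in zip(weights, state_feature))
              for weights in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(state_feature, token, experts, router_weights, k=2):
    """Sum the outputs of the selected experts, weighted by their gates."""
    selected = route_top_k(state_feature, router_weights, k)
    out = [0.0] * len(token)
    for idx, gate in selected:
        expert_out = experts[idx](token)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, selected


if __name__ == "__main__":
    rng = random.Random(0)
    dim, num_experts = 4, 8
    router_weights = [[rng.uniform(-1, 1) for _ in range(dim)]
                      for _ in range(num_experts)]
    # Toy experts: each one scales the token by a different constant.
    experts = [lambda t, s=s: [s * v for v in t] for s in range(1, num_experts + 1)]
    state = [rng.uniform(-1, 1) for _ in range(dim)]
    output, selected = moe_forward(state, [1.0, 2.0, 3.0, 4.0], experts, router_weights)
    print(selected)
```

Because the router sees the state feature rather than the token alone, the same token can reach different experts when the instruction or the observation changes.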
Intended Uses
Primary Use Cases
- Vision-and-Language Navigation (VLN): Following natural language instructions in indoor environments
- Object Navigation: Finding target objects given category names
- Dialog-based Navigation: Multi-turn conversational navigation
- Remote Object Grounding: Navigating to and identifying remote objects
Supported Tasks
| Task | Dataset | Description |
|---|---|---|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |
How to Use
Installation
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
Download Data and Models
# Download all datasets and features
python download.py --data
# Download pretrained models
python download.py --pretrain
# Download trained checkpoints (optional)
python download.py --checkpoints
Training
cd src
# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml
# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
run.py --config_dir configs/main_multi_q.yaml
Evaluation
cd src
python run.py --config_dir configs/test.yaml \
--options experiment.resume_file=/path/to/checkpoint.pt
Configuration Options
model:
  use_moe_layer: true
  moe_type: "Task"                # Task-based MoE
  moe_position: "Attn_q"          # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi"   # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2          # Top-2 expert selection
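As a rough sketch of how these options fit together, the snippet below mirrors the `model` keys in a dataclass and checks them before training starts. The class name `MoEConfig`, the helper `from_dict`, and the validation rules are illustrative assumptions; the repository may load and check its YAML differently.

```python
from dataclasses import dataclass

# Positions named in the config comment above.
ALLOWED_POSITIONS = {"Attn_q", "Attn_kv", "FFN"}


@dataclass
class MoEConfig:
    """Mirror of the `model` keys shown above (validation rules are assumed)."""
    use_moe_layer: bool = True
    moe_type: str = "Task"
    moe_position: str = "Attn_q"
    task_routing_feature: str = "multi"
    num_experts: int = 8
    num_experts_per_tok: int = 2

    def __post_init__(self):
        if self.moe_position not in ALLOWED_POSITIONS:
            raise ValueError(
                f"moe_position must be one of {sorted(ALLOWED_POSITIONS)}"
            )
        if not 1 <= self.num_experts_per_tok <= self.num_experts:
            raise ValueError("num_experts_per_tok must be between 1 and num_experts")


def from_dict(raw):
    """Build a MoEConfig from a parsed config dict such as {"model": {...}}."""
    return MoEConfig(**raw.get("model", {}))


if __name__ == "__main__":
    print(from_dict({"model": {"moe_position": "FFN"}}))
```

Failing early on a mistyped `moe_position` is cheaper than discovering it after features and checkpoints have been loaded.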
Training Details
Training Data
SAME is trained on 9 navigation datasets with weighted sampling:
| Dataset | Environment | Sampling Weight |
|---|---|---|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
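To make the effect of the weights concrete, here is a small sketch of weighted dataset sampling with the standard library. Where the table gives a range (10-20, 1-10), the upper value is used purely for illustration; the actual schedule is not stated in this card, and the helper names are hypothetical.

```python
import random

# Weights from the table above; ranged entries use their upper bound (assumption).
SAMPLING_WEIGHTS = {
    "R2R-ScaleVLN": 20,
    "R2R-PREVALENT": 1,
    "R2R": 1,
    "REVERIE-ScaleVLN": 10,
    "REVERIE": 1,
    "RXR-EN": 1,
    "CVDN": 1,
    "SOON": 1,
    "ObjectNav-MP3D": 2,
}


def sample_datasets(num_batches, weights=SAMPLING_WEIGHTS, seed=0):
    """Draw one dataset name per batch, proportionally to its weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_batches)


def expected_share(name, weights=SAMPLING_WEIGHTS):
    """Fraction of batches a dataset receives in expectation."""
    return weights[name] / sum(weights.values())


if __name__ == "__main__":
    draws = sample_datasets(1000)
    for name in SAMPLING_WEIGHTS:
        print(f"{name:18s} expected={expected_share(name):.3f} "
              f"observed={draws.count(name) / len(draws):.3f}")
```

Under these assumed values the two ScaleVLN datasets account for 30 of the 38 weight units, so most batches come from the augmented HM3D data.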
Training Hyperparameters
- Optimizer: AdamW
- Learning Rate: 1e-5
- Total Iterations: 500,000
- Batch Size: 16
- Gradient Clipping: 0.5
- Training Algorithm: DAgger (Dataset Aggregation)
- MoE Auxiliary Loss Coefficient: 0.8
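The card does not say which auxiliary loss the 0.8 coefficient scales. As one plausible reading, the sketch below uses a Switch-Transformer-style load-balancing term and adds it to the task loss; the formulation and the function names are assumptions, not the confirmed training objective.

```python
def load_balancing_loss(router_probs, top_k=2):
    """Switch-Transformer-style balance term (an assumed formulation).

    router_probs: per-token lists of router probabilities, one entry per expert.
    Returns num_experts * sum_i(f_i * P_i), where f_i is the share of top-k
    assignments sent to expert i and P_i is its mean router probability.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    frac = [c / (num_tokens * top_k) for c in counts]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


def total_loss(task_loss, router_probs, aux_coef=0.8, top_k=2):
    """Task loss plus the weighted auxiliary term (0.8 is the listed coefficient)."""
    return task_loss + aux_coef * load_balancing_loss(router_probs, top_k)


if __name__ == "__main__":
    uniform = [[0.25] * 4 for _ in range(6)]
    skewed = [[0.7, 0.3, 0.0, 0.0] for _ in range(6)]
    print(load_balancing_loss(uniform), load_balancing_loss(skewed))
```

With this term, a router that spreads probability evenly scores lower than one that concentrates on a few experts, which is the behavior an auxiliary balance loss is meant to encourage.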
Visual Features
- Feature Extractor: CLIP ViT-B/16
- Feature Dimension: 512
- Format: HDF5 / LMDB
- Environments: MatterSim, Habitat-MP3D, Habitat-HM3D
Evaluation Results
As a single unified model, SAME reports state-of-the-art or highly competitive results on the benchmarks listed below, and in many cases outperforms task-specific approaches.
Main Results (Unified Model)
Room-to-Room (R2R)
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 76 | 66 |
| Test Unseen | 74 | 64 |
REVERIE
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 46.4 | 36.1 |
| Test Unseen | 48.6 | 37.1 |
RxR-EN (English split of the multilingual RxR dataset)
| Split | SR ↑ | nDTW ↑ |
|---|---|---|
| Val Unseen | 50.5 | 51.2 |
CVDN (Dialog Navigation)
| Split | GP ↑ |
|---|---|
| Val | 6.94 |
| Test | 7.07 |
SOON (Object-Oriented Navigation)
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | 38.2 | 27.1 |
ObjectNav-MP3D
| Split | SR ↑ | SPL ↑ |
|---|---|---|
| Val | 76.3 | 42.7 |
Evaluation Metrics
- SR (Success Rate): Percentage of successful navigations (within 3m of goal)
- SPL (Success weighted by Path Length): Efficiency-weighted success rate
- nDTW (normalized Dynamic Time Warping): Path similarity to ground truth
- GP (Goal Progress): Progress towards the goal in dialog navigation
- NE (Navigation Error): Distance to goal at episode end
- OSR (Oracle Success Rate): Success rate with oracle stop action
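The per-episode metrics above can be computed as follows. This is a minimal sketch using the standard definitions (SPL weights each success by shortest-path length over the longer of actual and shortest path); the `Episode` fields and the strict `<` comparison against the 3 m threshold are assumptions about how distances are recorded, not the repository's evaluator.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    final_dist: float       # distance to goal at the stop position (m)
    min_dist: float         # closest distance to goal along the trajectory (m)
    path_length: float      # length of the path the agent actually took (m)
    shortest_length: float  # shortest-path length from start to goal (m), > 0


def evaluate(episodes, threshold=3.0):
    """Return SR, OSR, and SPL as percentages and NE in metres."""
    n = len(episodes)
    sr = sum(e.final_dist < threshold for e in episodes) / n
    osr = sum(e.min_dist < threshold for e in episodes) / n
    ne = sum(e.final_dist for e in episodes) / n
    spl = sum(
        (e.final_dist < threshold)
        * e.shortest_length / max(e.path_length, e.shortest_length)
        for e in episodes
    ) / n
    return {"SR": 100 * sr, "OSR": 100 * osr, "NE": ne, "SPL": 100 * spl}


if __name__ == "__main__":
    episodes = [
        Episode(final_dist=1.0, min_dist=0.5, path_length=10.0, shortest_length=10.0),
        Episode(final_dist=5.0, min_dist=2.0, path_length=12.0, shortest_length=8.0),
        Episode(final_dist=2.0, min_dist=2.0, path_length=20.0, shortest_length=10.0),
        Episode(final_dist=6.0, min_dist=4.0, path_length=5.0, shortest_length=9.0),
    ]
    print(evaluate(episodes))
```

In the toy run, the second episode passes within 3 m of the goal but stops too far away, so it counts toward OSR and not SR; the third succeeds but takes twice the shortest path, so it contributes only 0.5 to SPL.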
Model Variants
| Variant | MoE Position | Routing | Checkpoint |
|---|---|---|---|
| SAME-Q | Attention Query | Multimodal | Attnq_pretrained_ckpt.pt |
| SAME-KV | Attention K/V | Multimodal | Attnkv_pretrained_ckpt.pt |
| SAME-FFN | Feed-Forward | Multimodal | FFN_pretrained_ckpt.pt |
Limitations
- Indoor Environments Only: Trained and evaluated on indoor navigation datasets
- Pre-computed Features: Requires pre-extracted CLIP features; cannot process raw images directly
- English Language: Trained and evaluated on English instructions only (RxR is multilingual, but only its English split, RxR-EN, is used here)
- Static Environments: Assumes static environments without dynamic obstacles or agents
Environmental Impact
- Hardware: Training conducted on NVIDIA A100 GPUs
- Training Time: Approximately 2-3 days on 4x A100 GPUs
Citation
If you find this work helpful, please cite:
@article{zhou2024same,
title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
journal={arXiv preprint arXiv:2412.05552},
year={2024},
}
Authors
- Gengze Zhou - AIML, University of Adelaide (Website)
- Yicong Hong - Adobe Research (Website)
- Zun Wang - UNC Chapel Hill (Website)
- Chongyang Zhao - UNSW Sydney (GitHub)
- Mohit Bansal - UNC Chapel Hill (Website)
- Qi Wu - University of Adelaide (Website)
Acknowledgements
We extend our gratitude to:
- Matterport3D for the open-source platform
- DUET for the foundational architecture
- ScaleVLN for augmented training data
- NaviLLM for additional insights
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For questions or issues, please open an issue on the GitHub repository or contact the authors.