Instructions to use zzzrw/GEM-2B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zzzrw/GEM-2B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="zzzrw/GEM-2B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForSeq2SeqLM

processor = AutoProcessor.from_pretrained("zzzrw/GEM-2B")
model = AutoModelForSeq2SeqLM.from_pretrained("zzzrw/GEM-2B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use zzzrw/GEM-2B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zzzrw/GEM-2B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zzzrw/GEM-2B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/zzzrw/GEM-2B

SGLang

How to use zzzrw/GEM-2B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zzzrw/GEM-2B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zzzrw/GEM-2B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zzzrw/GEM-2B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zzzrw/GEM-2B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use zzzrw/GEM-2B with Docker Model Runner:
```
docker model run hf.co/zzzrw/GEM-2B
```

GEM: Generative Supervision Helps Embodied Intelligence

Ruowen Zhao¹, Bangguo Li¹, Zuyan Liu^1,2,†, Yinan Liang¹, Junliang Ye¹, Fangfu Liu¹,
Diankun Wu¹, Zhengyi Wang¹, Xumin Yu², Yongming Rao^2,✉, Han Hu², Jun Zhu^1,✉
^†Project Lead.^✉Corresponding Author.
¹Tsinghua University, ²Tencent Hunyuan

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations.

If you find our work helpful, please consider citing:

@article{zhao2026gem,
  title={GEM: Generative Supervision Helps Embodied Intelligence},
  author={Zhao, Ruowen and Li, Bangguo and Liu, Zuyan and Liang, Yinan and Ye, Junliang and Liu, Fangfu and Wu, Diankun and Wang, Zhengyi and Yu, Xumin and Rao, Yongming and others},
  journal={arXiv preprint arXiv:2605.28548},
  year={2026}
}

Downloads last month: 82

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for zzzrw/GEM-2B

Base model

Qwen/Qwen3-VL-2B-Instruct

Finetuned

(218)

this model

Dataset used to train zzzrw/GEM-2B

Paper for zzzrw/GEM-2B

GEM: Generative Supervision Helps Embodied Intelligence

Paper • 2605.28548 • Published 7 days ago • 40