Instructions to use rayruiyang/VST-3B-SFT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use rayruiyang/VST-3B-SFT with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="rayruiyang/VST-3B-SFT") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] pipe(text=messages)# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("rayruiyang/VST-3B-SFT") model = AutoModelForImageTextToText.from_pretrained("rayruiyang/VST-3B-SFT") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use rayruiyang/VST-3B-SFT with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "rayruiyang/VST-3B-SFT" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "rayruiyang/VST-3B-SFT", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker
docker model run hf.co/rayruiyang/VST-3B-SFT
- SGLang
How to use rayruiyang/VST-3B-SFT with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "rayruiyang/VST-3B-SFT" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "rayruiyang/VST-3B-SFT", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "rayruiyang/VST-3B-SFT" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "rayruiyang/VST-3B-SFT", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }' - Docker Model Runner
How to use rayruiyang/VST-3B-SFT with Docker Model Runner:
docker model run hf.co/rayruiyang/VST-3B-SFT
Visual Spatial Tuning: VST-3B-SFT
This model is described in the paper Visual Spatial Tuning.
TL;DR: VST is a comprehensive framework designed to cultivate Vision-Language Models (VLMs) with human-like visuospatial abilities—from spatial perception to advanced reasoning.
💡 Key Highlights
✨ VST-P: 4.1M samples across 19 skills, spanning single images, multi-image scenarios, and videos—boosting spatial perception in VLMs.
✨ VST-R: 135K curated samples that teach models to reason in space, including step-by-step reasoning and rule-based data for reinforcement learning.
✨ Progressive Training Pipeline: Start with supervised fine-tuning to build foundational spatial knowledge, then reinforce spatial reasoning abilities via RL. VST achieves state-of-the-art results on spatial benchmarks (34.8% on MMSI-Bench, 61.2% on VSIBench) without compromising general capabilities.
✨ Vision-Language-Action Models Enhanced: The VST paradigm significantly strengthens spatial tuning, paving the way for more physically grounded AI.
📈 Spatial & General Benchmarks
| Models | CV | 3DSR | MMSI | BLINK | VSI | MMStar | MMB | RealworldQA | MMMU | OCRB | AI2D |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VST-3B-SFT | 84.4 | 54.1 | 30.2 | 59.1 | 57.9 | 58.0 | 80.9 | 68.4 | 45.2 | 83.7 | 82.5 |
| VST-3B-RL | 84.2 | 56.5 | 31.3 | 57.2 | 57.7 | 58.9 | 80.5 | 68.5 | 49.8 | 80.9 | 82.4 |
| VST-7B-SFT | 85.5 | 54.6 | 32.0 | 62.1 | 60.6 | 63.1 | 83.3 | 72.2 | 50.6 | 85.5 | 84.9 |
| VST-7B-RL | 86.5 | 60.1 | 34.8 | 62.6 | 61.2 | 63.5 | 83.0 | 68.5 | 49.4 | 86.1 | 83.5 |
📈 VSIBench
| Methods | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|
| VST-3B-SFT | 57.9 | 69.3 | 45.4 | 71.8 | 62.4 | 59.0 | 46.0 | 38.7 | 70.2 |
| VST-3B-RL | 57.7 | 66.6 | 45.0 | 72.8 | 60.9 | 59.9 | 47.6 | 40.7 | 68.3 |
| VST-7B-SFT | 60.6 | 72.0 | 44.4 | 74.3 | 68.3 | 59.7 | 55.8 | 44.9 | 65.2 |
| VST-7B-RL | 61.2 | 71.6 | 43.8 | 75.5 | 69.2 | 60.0 | 55.6 | 44.3 | 69.2 |
Quickstart
pip install transformers
# It's highly recommanded to use `[decord]` feature for faster video loading.
pip install qwen-vl-utils
Using 🤗 Transformers to Chat
Here we show a code snippet to show you how to use the chat model with transformers and qwen_vl_utils:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path="rayruiyang/VST-3B-SFT"
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
# default processer
processor = AutoProcessor.from_pretrained(model_path, min_pixels = 256*28*28, max_pixels=1280*28*28)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "http://images.cocodataset.org/train2017/000000039685.jpg",
},
{"type": "text", "text": "Consider the real-world 3D locations of the objects. Is the flag directly underneath the airplane?"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
Citation
If you find our work helpful, feel free to give us a cite.
@article{vst,
title={Visual Spatial Tuning},
author={Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao},
journal={arXiv preprint arXiv:2511.05491},
year={2025}
}
- Downloads last month
- 74