ViGOS-7B: Visual Grounding On-Policy Self-Distillation
Model Details
| Field | Value |
|---|---|
| Model name | ViGOS-7B |
| Repository ID | OedoSoldier/ViGOS-7B |
| Model family | ViGOS |
| Model type | Multimodal image-text-to-text / vision-language reasoning model |
| Base model | Qwen/Qwen2.5-VL-7B-Instruct |
| Training method | Segment-wise multimodal on-policy self-distillation |
| Weight format | Merged full weights |
| Training data | LMMs-Lab-Turtle/Vision-SR1-47K |
| Output format | <description>...</description><think>...</think>\boxed{...} |
| Paper | Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation |
| Authors | Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han |
| Code | https://github.com/OedoSoldier/ViGOS |
| License | Apache License 2.0 |
This repository is for the 7B-scale ViGOS model only. The 3B-scale model should use a separate Hugging Face repository and model card.
Model Summary
ViGOS stands for Visual Grounding On-Policy Self-Distillation. It is a multimodal post-training method for reducing shortcut behavior in on-policy self-distillation (OPSD) for vision-language models. In vanilla OPSD, the privileged teacher can see the reference answer while supervising the whole student rollout. For multimodal large language models (MLLMs), this can make the dense training signal overly answer-driven before the model has grounded its response in image evidence.
ViGOS changes the supervision path by asking the student to first produce a visual description, then reason, then answer:
<description> visual description </description>
<think> reasoning process </think>
\boxed{FINAL ANSWER}
For valid training rollouts, ViGOS uses segment-wise teachers:
- an image-only perception teacher supervises the description tokens;
- a privileged reasoning teacher supervises reasoning and final-answer tokens after the student-generated description prefix exists;
- a reference fallback teacher is used only for invalid or malformed rollouts to recover the required output format.
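The routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expression, the function name `assign_teacher_spans`, the role labels, and the choice to place the segment boundary right after the closing `</description>` tag are all assumptions made for illustration, and it works on character spans rather than token masks.

```python
import re

# Assumed layout: <description>...</description><think>...</think>\boxed{...}
ROLLOUT_PATTERN = re.compile(
    r"\s*<description>(?P<description>.*?)</description>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"\\boxed\{.*\}\s*",
    re.DOTALL,
)
CLOSE_TAG = "</description>"


def assign_teacher_spans(rollout: str) -> dict:
    """Map character spans of a rollout to a (hypothetical) teacher role label."""
    match = ROLLOUT_PATTERN.fullmatch(rollout)
    if match is None:
        # Invalid or malformed rollout: only the reference fallback teacher applies.
        return {"reference_fallback": (0, len(rollout))}
    # Assumption: the description segment ends right after its closing tag.
    boundary = match.end("description") + len(CLOSE_TAG)
    return {
        "perception": (0, boundary),             # description segment
        "reasoning": (boundary, len(rollout)),   # reasoning and final answer
    }


valid = "<description>A red cube.</description><think>It is red.</think>\\boxed{red}"
print(assign_teacher_spans(valid))
print(assign_teacher_spans("\\boxed{red}"))
```

In an actual training loop the same boundary would have to be expressed in token positions so that each teacher's loss is masked to its own segment.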
At inference time, all teachers, reference answers, and segment masks are removed. The model receives only the image, the question or instruction, and the output-format prompt.
Intended Use
This model is intended for research and development in multimodal reasoning tasks, including visual question answering, visual math and diagram reasoning, OCR- or chart-grounded reasoning, spatial reasoning, visual grounding, and shortcut/prior-sensitivity analysis.
Out-of-Scope Use
This model should not be used as the sole decision-maker in high-stakes settings such as medical diagnosis, legal judgment, financial decision-making, safety-critical robotics, surveillance, identity verification, or other contexts where hallucinated or incorrect visual reasoning could cause harm.
How to Use
pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
MODEL_ID = "OedoSoldier/ViGOS-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
image_path = "path/to/image.jpg"
question = "What is the answer to the visual question?"
prompt = f"""Problem: {question}
You are tasked with analyzing an image to generate a detailed description that can help you answer the question. First analyze the image and produce a self-contained description, detailed enough to lead to the correct answer. Do not include the final answer in the description. Wrap the entire description in <description> </description> tags.
Next, reason step by step based on the image description and the image, and enclose this part within <think> </think> tags.
Finally, provide a single word or phrase answer to the question in \\boxed{{}}.
The output format should be: <description> image description here </description><think> reasoning process here </think> \\boxed{{FINAL ANSWER here}}.
"""
# Build a single-turn chat message containing the image and the formatted prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }
]
# Render the chat template to text; tokenization happens in the processor call below.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
# Sample with the same decoding settings as the evaluation protocol.
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=True,
        temperature=1.0,
        top_p=0.90,
        top_k=20,
    )
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output_text)
Recommended Answer Extraction
For benchmark-style evaluation, the paper extracts the final answer from the last \boxed{...} span. Outputs without a parseable final answer are counted as incorrect.
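One way to implement this extraction rule is sketched below. The function name `extract_last_boxed` is hypothetical and the brace-matching approach is an assumption rather than the paper's evaluation code; it is meant to handle nested braces such as `\boxed{\frac{1}{2}}`, which a simple non-greedy regex would truncate.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} span, or None if none parses."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer at all
    content_start = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i].strip()
    return None  # unbalanced braces: treat as unparseable


print(extract_last_boxed("<think>first \\boxed{3}</think> \\boxed{\\frac{1}{2}}"))
print(extract_last_boxed("no final answer here"))
```

A `None` result corresponds to the "no parseable final answer" case, which would be scored as incorrect.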
Training Details
The paper trains this model for one epoch on Vision-SR1-47K using 8 NVIDIA A100 GPUs. The student is trained on on-policy rollouts, and the frozen teacher roles are used only to score the student-generated prefixes during training.
| Parameter | Value |
|---|---|
| Training epochs | 1 |
| GPUs | 8 × A100 |
| Effective batch size | 32 |
| Optimizer | Fused AdamW |
| Learning rate | 5e-6 |
| LR scheduler | Linear |
| Maximum gradient norm | 0.1 |
| Precision | bf16 |
| Distributed training | ZeRO-2 |
| Maximum prompt length | 32,768 |
| Maximum completion length | 4,096 |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rollout temperature | 1.1 |
| Rollout top-p / top-k | 0.95 / 20 |
| λ_perc | 1.0 |
| λ_rea | 1.0 |
| λ_ref | 2.0 |
| Distillation temperature | 1.0 |
| KL clipping | 0.05 |
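For readers who want to reproduce the adapter setup, the LoRA rows of the table can be written as keyword arguments. This is a sketch under assumptions: the keys follow the argument names of `peft.LoraConfig`, but the card does not say which LoRA library was used, and the `lora_kwargs` name is hypothetical. The scaling value assumes the standard LoRA convention of alpha divided by rank.

```python
# LoRA settings copied from the table above, keyed by peft.LoraConfig argument names
# (assumption: the training code may use a different library or naming).
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Under the standard LoRA convention, the adapter update is scaled by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")
```

If `peft` is installed, the dictionary could be passed as `LoraConfig(**lora_kwargs)`. Because the released weights are merged, no adapter needs to be loaded at inference time.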
Evaluation Protocol
For the eight main benchmarks, the paper samples five stochastic responses per example and reports Pass@5 / Avg@5. Pass@5 checks whether at least one of the five sampled answers is correct, while Avg@5 is the mean correctness across all five samples.
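The two metrics as defined here can be computed from a per-example list of correctness flags. The sketch below is illustrative only; the function name `pass_and_avg_at_k` and the input layout (one row of booleans per example, one flag per sampled response) are assumptions. Note that this Pass@5 is the plain "at least one correct" rate, not the unbiased pass@k estimator used in some code-generation benchmarks.

```python
from typing import Sequence, Tuple


def pass_and_avg_at_k(correct: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Return (Pass@k, Avg@k) in percent from per-example correctness flags."""
    if not correct:
        raise ValueError("need at least one example")
    n = len(correct)
    # Pass@k: fraction of examples with at least one correct sample.
    pass_hits = sum(1 for row in correct if any(row))
    # Avg@k: mean per-example accuracy across the k samples.
    avg_sum = sum(sum(row) / len(row) for row in correct)
    return 100.0 * pass_hits / n, 100.0 * avg_sum / n


flags = [
    [True, False, False, False, False],   # one of five samples correct
    [False, False, False, False, False],  # no sample correct
]
print(pass_and_avg_at_k(flags))
```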
For ViLP, the paper generates one response per prompt and reports Score / Prior. Score measures accuracy on visually diagnostic questions where the model must use the image, and Prior measures accuracy on prior-aligned questions where the common visual-language prior is correct.
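As a rough illustration of the Score / Prior split, the sketch below computes accuracy separately on the two question types. The subset labels `"diagnostic"` and `"prior"`, the record layout, and the function name `vilp_score_and_prior` are hypothetical and do not come from the ViLP benchmark code.

```python
from typing import Iterable, Tuple


def vilp_score_and_prior(records: Iterable[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (Score, Prior) in percent from (subset, is_correct) records."""
    totals = {"diagnostic": 0, "prior": 0}
    hits = {"diagnostic": 0, "prior": 0}
    for subset, is_correct in records:
        totals[subset] += 1
        hits[subset] += int(is_correct)
    # Score: accuracy where the image must be used; Prior: accuracy where the prior is right.
    score = 100.0 * hits["diagnostic"] / totals["diagnostic"]
    prior = 100.0 * hits["prior"] / totals["prior"]
    return score, prior


records = [
    ("diagnostic", True), ("diagnostic", False),
    ("prior", True), ("prior", True),
]
print(vilp_score_and_prior(records))
```

A large gap between the two numbers would suggest that the model leans on the language prior when the image contradicts it.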
Evaluation decoding settings:
| Parameter | Value |
|---|---|
| Maximum generated tokens | 4,096 |
| Number of samples per main benchmark question | 5 |
| Temperature | 1.0 |
| Top-p | 0.90 |
| Top-k | 20 |
| Random seed | 42 |
Evaluation Results
Main Benchmarks
Pass@5 / Avg@5, in percent:
| Benchmark | ViGOS-7B |
|---|---|
| MM-Vet | 72.02 / 54.40 |
| MMMU | 80.11 / 51.42 |
| MMMU-Pro | 64.81 / 36.48 |
| MathVerse | 68.91 / 44.77 |
| MathVista | 80.90 / 58.78 |
| MMSI | 61.10 / 25.58 |
| RealWorldQA | 85.88 / 62.88 |
| CV-Bench | 91.09 / 73.58 |
| Mean across 8 benchmarks | 75.60 / 50.99 |
Prior-Sensitive ViLP Results
Score / Prior, in percent:
| Setting | ViGOS-7B |
|---|---|
| ViLP-F | 62.67 / 97.00 |
| ViLP-P | 61.67 / 91.67 |
Ethical Considerations
Users should validate the model carefully before deployment. The model can generate plausible but incorrect visual descriptions and rationales. In user-facing applications, consider presenting only concise final answers, or clearly marking generated descriptions and rationales as model-generated rather than authoritative evidence.
Citation
Please cite the ViGOS paper if you use this model or method.
@misc{wang2026seeing,
title={Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation},
author={Wang, Sihan and Liu, Xiyao and Liu, Lianqing and Han, Zhi},
year={2026},
eprint={2606.19120},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2606.19120}
}