
Libraries

How to use prithivMLmods/DeepCaption-VLA-V2.0-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/DeepCaption-VLA-V2.0-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-V2.0-7B")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/DeepCaption-VLA-V2.0-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prithivMLmods/DeepCaption-VLA-V2.0-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/DeepCaption-VLA-V2.0-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/DeepCaption-VLA-V2.0-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
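
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_payload` and `caption_image` are hypothetical, and it assumes the vLLM server started above is listening on `http://localhost:8000`.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/DeepCaption-VLA-V2.0-7B"


def build_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same chat-completions body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def caption_image(prompt: str, image_url: str,
                  base_url: str = "http://localhost:8000") -> str:
    """POST the payload to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

`caption_image` only works while the server is running. Building the payload in a separate function keeps the request body easy to inspect without a network call.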

Use Docker

docker model run hf.co/prithivMLmods/DeepCaption-VLA-V2.0-7B

SGLang

How to use prithivMLmods/DeepCaption-VLA-V2.0-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/DeepCaption-VLA-V2.0-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/DeepCaption-VLA-V2.0-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
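
The curl request above passes a public image URL. For a local file, a common approach with OpenAI-compatible servers is to embed the image as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs; `bytes_to_data_url` and `file_to_data_url` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a local image and return it as a data URL."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    return bytes_to_data_url(Path(path).read_bytes(), mime_type)
```

The returned string would replace the `url` value in the request body. Whether a given server version accepts it should be checked against that server's documentation.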

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/DeepCaption-VLA-V2.0-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/DeepCaption-VLA-V2.0-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use prithivMLmods/DeepCaption-VLA-V2.0-7B with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/DeepCaption-VLA-V2.0-7B
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

DeepCaption-VLA-V2.0-7B

DeepCaption-VLA-V2.0-7B is a fine-tuned version of Qwen2.5-VL-7B-Instruct, specialized for image captioning and Vision Language Attribution (VLA). This release focuses on generating precise, attribute-rich captions that capture visual properties, object attributes, and scene details across diverse image types and aspect ratios.

V2.0 improves multilingual inference, with higher captioning quality and attribution accuracy in languages including Chinese (Zh) and Thai (Th), among others.

Key Highlights

Vision Language Attribution (VLA): Fine-tuned to attribute and define visual properties of objects, scenes, and environments with greater semantic precision.
Detailed Object Definitions: Generates attribute-rich captions, offering deeper visual understanding compared to generic captioning models.
High-Fidelity Descriptions: Excels at describing general, artistic, technical, abstract, and low-context images with enhanced descriptive detail.
Robust Across Aspect Ratios: Maintains caption accuracy across wide, tall, square, and irregular formats.
Variational Detail Control: Supports both concise summaries and fine-grained visual attributions depending on prompt structure.
Enhanced Multilingual Inference (New in V2.0): Optimized for generating accurate and descriptive captions in multiple languages, including English, Chinese (Zh), Thai (Th), and more.
Built on Qwen2.5-VL Architecture: Leverages the multimodal reasoning capabilities and instruction-following strengths of Qwen2.5-VL-7B.
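
As a rough illustration of prompt-driven detail control and multilingual captioning, the sketch below builds chat messages for either a concise or a detailed caption. The two prompt strings are taken from the usage examples in this model card; the `build_messages` helper and the "Respond in ..." suffix are assumptions, not a documented interface of the model.

```python
from typing import Optional

# Prompt wording for each detail level, reused from this model card's examples.
PROMPTS = {
    "concise": "Describe this image in one sentence.",
    "detailed": "Describe this image with detailed attributes and properties.",
}


def build_messages(image_url: str, detail: str = "detailed",
                   language: Optional[str] = None) -> list:
    """Build a single-turn chat message list (hypothetical helper)."""
    if detail not in PROMPTS:
        raise ValueError(f"detail must be one of {sorted(PROMPTS)}")
    text = PROMPTS[detail]
    if language:
        # Assumed phrasing; the model card does not prescribe a language cue.
        text += f" Respond in {language}."
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": text},
            ],
        }
    ]
```

For example, `build_messages(url, detail="concise", language="Thai")` yields a message list in the shape used by the Transformers examples in this card.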

model type: experimental

Sample Inferences [en, zh, thai] - [DeepCaption-VLA-V2.0-7B]

[Image grid: ten sample outputs, Image 1 through Image 10. Images 7 and 9 are Chinese (zh) examples; Image 10 is a Thai example.]

Comparison of Inference: Qwen2.5-VL-7B vs. DeepCaption-VLA-V2.0-7B

[Image comparison: sample outputs from Qwen2.5-VL-7B-Instruct and DeepCaption-VLA-V2.0-7B, shown side by side.]

Example of a Recommended System Instruction

CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.  
   - Use the syntax: `{class_name==write_the_core_theme}`  
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`  

4. Maintain the following strict format in your output:
   - **Caption:** <one-sentence description>  
   - **Attributes:** <comma-separated list of visual attributes>  
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.
"""
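
Because the system instruction asks for a fixed three-part format, the response can be split into fields with a few regular expressions. The following is a minimal sketch; `ParsedCaption` and `parse_caption_output` are hypothetical names, the sample string is hand-written for illustration rather than real model output, and the parser assumes the model follows the format closely.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedCaption:
    caption: Optional[str] = None
    attributes: List[str] = field(default_factory=list)
    class_name: Optional[str] = None


def parse_caption_output(text: str) -> ParsedCaption:
    """Extract caption, attributes, and class_name; missing fields stay empty."""
    # Tolerate optional Markdown bold markers after the field label.
    caption = re.search(r"Caption:\*{0,2}[ \t]*(.+)", text)
    attributes = re.search(r"Attributes:\*{0,2}[ \t]*(.+)", text)
    class_name = re.search(r"\{class_name==([^}]+)\}", text)
    attribute_list = []
    if attributes:
        # Attributes are a comma-separated list on one line.
        attribute_list = [a.strip() for a in attributes.group(1).split(",") if a.strip()]
    return ParsedCaption(
        caption=caption.group(1).strip() if caption else None,
        attributes=attribute_list,
        class_name=class_name.group(1).strip() if class_name else None,
    )


# Hand-written sample in the requested format (not real model output).
sample = (
    "**Caption:** A brown dog runs across a grassy park.\n"
    "**Attributes:** dog, brown fur, running, green grass, daylight\n"
    "**{class_name==dog_playing}**"
)
parsed = parse_caption_output(sample)
print(parsed.class_name)  # dog_playing
```

Returning empty fields instead of raising keeps the parser usable when a response deviates from the format, which the caller can then detect and handle.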

Quick Start with Transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-V2.0-7B", torch_dtype="auto", device_map="auto"
)

# The processor bundles the tokenizer and the image preprocessor.
processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-V2.0-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

# Render the chat template to a prompt string, then collect the image and video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the new caption is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Intended Use

Generating attribute-rich image captions for research, dataset creation, and AI training.
Vision-language attribution for object detection, scene understanding, and dataset annotation.
Supporting creative, artistic, and technical applications requiring descriptive image understanding.
Captioning across varied aspect ratios, non-standard datasets, and multilingual contexts.

Limitations

May over-attribute or infer properties not explicitly visible in ambiguous or low-resolution images.
Caption tone and level of detail may vary depending on prompt phrasing.
Not intended for tasks that require content-filtered captions; explicit or sensitive content may still appear in the output.
Performance may degrade slightly on highly synthetic or abstract visual domains.