How to use nkkbr/ViCA-ScanNet with Transformers:

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("nkkbr/ViCA-ScanNet")
model = AutoModelForCausalLM.from_pretrained("nkkbr/ViCA-ScanNet")
```
---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- spatial reasoning
- visuospatial cognition
- llava
- qwen
- llava-video
datasets:
- nkkbr/ViCA-322K
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA-ScanNet-7B
base_model: lmms-lab/LLaVA-Video-7B-Qwen2
---

## Usage and Full Documentation

For detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the main ViCA-7B README**:

[**nkkbr/ViCA**](https://huggingface.co/nkkbr/ViCA)