---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- spatial reasoning
- visuospatial cognition
- llava
- qwen
- llava-video
datasets:
- nkkbr/ViCA-322K
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA-ScanNet-7B
base_model: lmms-lab/LLaVA-Video-7B-Qwen2
---
## Usage and Full Documentation
For the detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the main ViCA-7B README**:
[**nkkbr/ViCA**](https://huggingface.co/nkkbr/ViCA)