Encoder-free nanoVLM is a minimal Vision-Language Model with no vision encoder. Images are resized with their aspect ratio preserved, split into patches, and fed as raw pixel patches directly into the language model through a thin embedder and projector. Because the patch grid follows the resized image, the number of image ("soft") tokens varies with image size. The language backbone is a HuggingFace causal LM (Qwen/Qwen3-1.7B by default).
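
To illustrate why the soft-token count depends on image size, here is a minimal pure-Python sketch of an aspect-preserving resize followed by patchification. The function names `resized_grid` and `patchify`, the `patch_size=16` and `max_side=512` defaults, and the round-up-to-a-whole-patch rule are all illustrative assumptions. They are not taken from the repo, whose actual preprocessing may differ.

```python
import math


def resized_grid(width, height, patch_size=16, max_side=512):
    """Return (new_width, new_height, n_tokens) for an aspect-preserving resize.

    patch_size and max_side are hypothetical values chosen for illustration.
    """
    # Shrink only if the longer side exceeds max_side; never upscale.
    scale = min(1.0, max_side / max(width, height))
    # Round each side up to a whole number of patches so the grid tiles exactly.
    cols = max(1, math.ceil(width * scale / patch_size))
    rows = max(1, math.ceil(height * scale / patch_size))
    return cols * patch_size, rows * patch_size, rows * cols


def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # One patch becomes one flat vector of raw pixel values.
            patches.append([
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ])
    return patches


# A square and a wide image yield different numbers of soft tokens.
print(resized_grid(512, 512))
print(resized_grid(1024, 512))
```

In this sketch each flattened patch would then pass through the embedder and projector to become one soft token, so a wider or taller image simply contributes more tokens to the language model's input sequence.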

Usage:

This is a custom architecture, so the model is loaded with the repo's own code (not transformers directly). Clone the repo, install its dependencies, then:

```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("ndrugov/nanoVLM-encoder-free")
```

Model size: 2B params (Safetensors). Tensor types: F32, BF16.