Encoder-free nanoVLM is a minimal Vision-Language Model with no vision encoder. Images are resized with their aspect ratio preserved, split into patches, and fed as raw pixel patches straight into the language model through a thin embedder and projector. As a result, the number of image ("soft") tokens varies with image size. The language backbone is a Hugging Face causal LM (Qwen/Qwen3-1.7B by default).
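
To make the variable token count concrete, here is a minimal sketch of how an aspect-ratio-preserving resize followed by patchification could determine the number of soft tokens. The function name `resized_grid`, the patch size of 16, the 512-pixel cap on the longer side, the round-up-to-a-whole-patch rule, and the one-token-per-patch mapping are all assumptions chosen for illustration; they are not taken from the model's actual configuration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def resized_grid(width: int, height: int, patch_size: int = 16, max_side: int = 512):
    """Return (new_width, new_height, n_tokens) for an aspect-preserving resize.

    patch_size and max_side are hypothetical values, not the model's config.
    Assumes one soft token per patch.
    """
    longest = max(width, height)
    # Only shrink: images already within max_side keep their scale.
    if longest > max_side:
        num, den = max_side, longest
    else:
        num, den = 1, 1
    # Round each side up to a whole number of patches so the image tiles evenly.
    cols = max(1, ceil_div(width * num, den * patch_size))
    rows = max(1, ceil_div(height * num, den * patch_size))
    return cols * patch_size, rows * patch_size, rows * cols


for size in [(640, 480), (256, 256), (100, 50)]:
    print(size, "->", resized_grid(*size))
```

Under these assumptions a 640x480 image is shrunk to 512x384 and yields 32 x 24 patches, while a 256x256 image keeps its size and yields 16 x 16 patches, so the two inputs occupy different numbers of positions in the language model's sequence.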

Usage:

This is a custom architecture, so the model is loaded with the repo's own code (not transformers directly). Clone the repo, install its dependencies, then:

```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("ndrugov/nanoVLM-encoder-free")
```
