Encoder-free nanoVLM is a minimal Vision-Language Model with no vision encoder. Images are resized with their aspect ratio preserved, split into patches, and fed as raw pixel patches directly into the language model through a thin embedder and projector. As a result, the number of image ("soft") tokens varies with image size. The language backbone is a HuggingFace causal LM (Qwen/Qwen3-1.7B by default).
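To illustrate why the token count depends on image size, here is a rough sketch of an aspect-ratio-preserving resize followed by patch counting. It is not the repo's actual preprocessing: the function name `soft_token_count`, the patch size of 16, the 512-pixel cap on the longer side, and the round-up-to-a-patch-multiple rule are all assumptions chosen for illustration.

```python
def soft_token_count(width, height, patch_size=16, max_side=512):
    """Estimate how many image tokens one image would produce.

    All defaults are hypothetical, not values taken from the model.
    """
    longest = max(width, height)
    if longest > max_side:
        # Shrink so the longer side equals max_side, keeping the aspect ratio.
        width = max(1, round(width * max_side / longest))
        height = max(1, round(height * max_side / longest))
    # Round each side up to a whole number of patches (assumed padding rule).
    cols = -(-width // patch_size)
    rows = -(-height // patch_size)
    # One soft token per patch.
    return rows * cols


if __name__ == "__main__":
    for size in [(512, 512), (1024, 512), (100, 60)]:
        print(size, soft_token_count(*size))
```

Under these assumptions, a square 512x512 image yields more tokens than a 1024x512 image, because the wider image is scaled down to 512x256 before patchifying.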
Usage:
This is a custom architecture, so the model is loaded with the repo's own code (not transformers directly).
Clone the repo, install its dependencies, then:
from models.vision_language_model import VisionLanguageModel
model = VisionLanguageModel.from_pretrained("ndrugov/nanoVLM-encoder-free")