Model Overview
Description
This model performs visual feature extraction. Unlike standard Vision Transformers that produce a fixed 2D grid of patch features, RADIO1D compresses an input image into a compact, variable-length 1D sequence of tokens. The number of output tokens (from 1 up to 256) can be selected by the user at inference time, providing a continuous accuracy/efficiency trade-off. For example, an image can be summarized into a single token for retrieval, or expanded to 256 tokens for fine-grained tasks such as OCR.
RADIO1D was produced by fine-tuning C-RADIOv4-H using multi-teacher agglomerative distillation from:
The encoder integrates a learnable Patch Merging block (4× sequence-length reduction, 2× channel expansion) part-way through the network for efficiency, and a lightweight Vision Transformer decoder is used only during training to project the 1D tokens back into a 2D-compatible grid for teacher alignment. At inference, only the encoder runs.
This model is ready for commercial or non-commercial use.
License/Terms of Use
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.
Deployment Geography
Global
Use Case
The embeddings generated by this model are expected to be used by a downstream application. The variable-length 1D token output makes RADIO1D especially well suited to:
- Integration into a Vision-Language Model with a user-tunable token budget per image (trade off accuracy vs. time-to-first-token).
- Image-level understanding (image classification, scene summarization, image curation).
- Composition-aware image-to-image retrieval (object presence and spatial arrangement).
- Dense processing (semantic segmentation) when a decoder is used.
Release Date
Hugging Face: 07/01/2026 via RADIO Collection of Models.
References
- RADIO1D paper: G. Heinrich, M. Ranzinger, C. McCarthy et al., "RADIO1D: Elastic Representations for Condensed Vision Modeling," ICML 2026.
- AM-RADIO (arXiv:2312.06709)
- RADIOv2.5 (arXiv:2412.07679)
- PHI-S / RADIO scaling (arXiv:2410.01680)
- C-RADIO (arXiv:2502.16025)
- C-RADIOv4 (arXiv:2601.17237)
Model Architecture
Architecture Type: Neural Network
Network Architecture: Vision Transformer with encoder-decoder for elastic 1D token generation
Number of model parameters: ~1.14B (encoder, used at inference); ~314M additional decoder parameters used only during training
The RADIO1D-H encoder is built from a ViT-H/16 backbone. Image patches (16×16 pixels) are flattened into a 1D sequence and processed by 24 transformer blocks at embedding dimension 1280, followed by a learnable Patch Merging block that groups 2×2 neighboring tokens (reducing sequence length by 4× and expanding the channel dimension by ρ=2 to 2560), followed by 8 further transformer blocks at the wider dimension. During training, a length ℓ is sampled stochastically from a triangular distribution p(x)=2−2x, and only the first ℓ encoder tokens are retained (a form of nested dropout). Earlier tokens are therefore encouraged to encode global, high-level semantics while later tokens specialize in finer details.
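The length sampling described above can be illustrated with a minimal sketch. It draws a fraction x from p(x)=2−2x by inverting the CDF F(x)=2x−x², then keeps only the first ℓ tokens. The mapping from x to an integer ℓ in 1..256, and the names `sample_fraction`, `sample_length`, and `nested_dropout`, are assumptions made for illustration; the actual training code may discretize differently.

```python
import math
import random

MAX_TOKENS = 256  # largest number of 1D tokens, per the description above


def sample_fraction(rng):
    """Draw x in [0, 1) with density p(x) = 2 - 2x using the inverse CDF."""
    u = rng.random()
    # CDF is F(x) = 2x - x^2, so solving F(x) = u gives x = 1 - sqrt(1 - u).
    return 1.0 - math.sqrt(1.0 - u)


def sample_length(rng, max_tokens=MAX_TOKENS):
    """Map the fraction to a token count in 1..max_tokens (hypothetical mapping)."""
    x = sample_fraction(rng)
    return min(max_tokens, 1 + int(x * max_tokens))


def nested_dropout(tokens, length):
    """Keep the first `length` tokens and drop the tail."""
    return tokens[:length]


rng = random.Random(0)
lengths = [sample_length(rng) for _ in range(20000)]
tokens = list(range(MAX_TOKENS))  # stand-in for the encoder's 1D tokens
kept = nested_dropout(tokens, lengths[0])

short = sum(1 for n in lengths if n <= MAX_TOKENS // 2)
print(f"mean length: {sum(lengths) / len(lengths):.1f}")
print(f"share of draws with <= 128 tokens: {short / len(lengths):.2f}")
```

Because the density is highest near zero, short prefixes are sampled more often than long ones (about three quarters of the draws fall in the lower half), so the earliest tokens are the ones present in almost every training step.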
A small ViT decoder (~314M parameters, 6 blocks with one Patch Splitting upscale) is used only during training. Its input consists of learnable query tokens, duplicated so that there is one per position of the original patch grid, with the encoder's ℓ tokens appended as additional register tokens. Cross-attention then reconstructs a 2D-compatible feature grid that is aligned to the teacher representations.
At inference, only the encoder is used and the user specifies the desired number of output tokens.
Input
Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: Two Dimensional (2D)
Other Properties Related to Input: Image resolutions up to 2048×2048 in increments of 16 pixels. Training used a mix of low-resolution images (128, 192, 224, 256, 384, 432 px) and high-resolution images (512, 768, 1024, 1152 px).
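As a rough guide to how an input resolution translates into sequence lengths inside the encoder, the sketch below applies the figures from the architecture description (16×16 patches, width 1280, 2×2 Patch Merging with ρ=2). The function name `encoder_shapes` is hypothetical, and the rounding-up of odd patch grids is an assumption, since this card does not say how such grids are handled. These are intermediate token counts, not the number of output tokens ℓ.

```python
import math

PATCH = 16       # patch side in pixels
DIM = 1280       # embedding width before Patch Merging
MERGE = 2        # Patch Merging groups 2x2 neighbouring tokens
RHO = 2          # channel expansion factor
MAX_SIDE = 2048  # largest supported image side


def encoder_shapes(height, width):
    """Return ((tokens, dim) before merging, (tokens, dim) after merging)."""
    for side in (height, width):
        if side <= 0 or side > MAX_SIDE or side % PATCH:
            raise ValueError("sides must be multiples of 16 px, up to 2048")
    rows, cols = height // PATCH, width // PATCH
    before = (rows * cols, DIM)
    # Assumption: an odd grid is padded up to the next even size before merging.
    merged_rows = math.ceil(rows / MERGE)
    merged_cols = math.ceil(cols / MERGE)
    after = (merged_rows * merged_cols, DIM * RHO)
    return before, after


print(encoder_shapes(512, 512))  # ((1024, 1280), (256, 2560))
print(encoder_shapes(224, 224))  # ((196, 1280), (49, 2560))
```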
Output
Output Type(s): Embeddings
Output Format: Tensor
Output Parameters: One Dimensional (1D) — variable-length sequence of tokens
Other Properties Related to Output: The encoder returns a sequence of prefix tokens (CLS + register tokens) followed by ℓ global 1D tokens, where ℓ is selected by the caller at inference (typical values: 1, 8, 32, 64, 128, 192, 224, 256). Tokens are ordered hierarchically: token 0 encodes the strongest global summary (e.g., 85.0% k-NN Top-1 on ImageNet-1k with a single token), and later tokens add progressively finer detail. A downstream model is required to leverage the image features.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
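The output layout described above (prefix tokens followed by the ℓ global 1D tokens) can be unpacked as in the following sketch. It is an illustration only: `split_output` and `num_prefix` are hypothetical names, the number of prefix tokens is not stated in this card, and the released interface may already perform the selection internally.

```python
def split_output(sequence, num_prefix, num_tokens):
    """Split an encoder output into prefix tokens and the first `num_tokens`
    global 1D tokens. `num_prefix` (CLS + registers) is an assumed input."""
    prefix = sequence[:num_prefix]
    global_tokens = sequence[num_prefix:]
    if not 1 <= num_tokens <= len(global_tokens):
        raise ValueError("num_tokens must be between 1 and the tokens available")
    # Tokens are ordered from coarse to fine, so a budget is a plain prefix slice.
    return prefix, global_tokens[:num_tokens]


# Toy sequence: 1 CLS token, 2 register tokens, then 8 global tokens.
sequence = ["cls", "reg0", "reg1"] + [f"t{i}" for i in range(8)]
prefix, summary = split_output(sequence, num_prefix=3, num_tokens=1)
print(prefix)   # ['cls', 'reg0', 'reg1']
print(summary)  # ['t0']
```

Because of the coarse-to-fine ordering, a smaller budget is always a prefix of a larger one, so a downstream application can store the longer sequence once and slice it per task.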
Software Integration
Runtime Engine(s):
- PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
Supported Operating System(s):
- Linux
- Linux 4 Tegra
- QNX
- Windows
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Model Version(s)
- C-RADIOv4-1D-H (~1.14B parameters; based on a ViT-H/16 backbone with learnable Patch Merging at block 24 and expansion factor ρ=2).
Links:
Training and Evaluation Datasets
Training Dataset
NV-CC-Img-Text-Dataset
- Data Modality: Image
- Image Training Data Size: 1 Million to 1 Billion Images
- Data Collection Method by dataset: Automated
- Labeling Method by dataset: Not Applicable (no labels are needed; supervision comes from teacher models via multi-teacher distillation)
- Properties: ~172M total training samples processed over 300k optimizer steps (less than one epoch over the source dataset). Global batch size of 512 low-resolution images (sampled from 128, 192, 224, 256, 384, 432 px) plus 64 high-resolution images (from 512, 768, 1024, 1152 px).
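The sample count quoted above follows directly from the batch composition and the number of steps; the short check below reproduces it (variable names are illustrative).

```python
LOW_RES_PER_STEP = 512   # low-resolution images per global batch
HIGH_RES_PER_STEP = 64   # high-resolution images per global batch
STEPS = 300_000          # optimizer steps

images_per_step = LOW_RES_PER_STEP + HIGH_RES_PER_STEP
total_samples = images_per_step * STEPS
high_res_share = HIGH_RES_PER_STEP / images_per_step

print(images_per_step)                   # 576
print(f"{total_samples / 1e6:.1f}M")     # 172.8M
print(f"{high_res_share:.1%} high-res")  # 11.1% high-res
```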
Evaluation Datasets
ADE20K
- Link: ADE20K
- Data Collection: Manually-Collected
- Labeling Method: Manually-Collected
- Training Images: 25,574
- Validation Images: 2,000
ImageNet
- Link: ImageNet
- Data Collection: Automated
- Labeling Method: Manually-Collected
- Training Images: 1,281,167
- Validation Images: 50,000
For downstream VLM evaluation, RADIO1D was paired with the Nemotron-Nano-9B-v2 LLM in the Nemotron VL framework and evaluated on TextVQA, DocVQA, InfoVQA, OCRBench, OCRBench v2 (EN/CN), AI2D, ChartQA, MMMU, SeedBench, and LongVideoBench.
ADE20K linear-probe mIoU vs. number of 1D tokens
The decoder's reconstructed 2D feature grid is used with a frozen linear probe.
| Tokens | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| mIoU | 40.23 | 46.84 | 51.14 | 54.11 | 55.22 | 55.83 | 55.60 | 55.63 | 55.68 |
For comparison, with a 1024-token feature grid C-RADIOv4-H reaches 55.20 mIoU and DINOv3-H+ reaches 54.80 mIoU on the same linear-probe protocol. RADIO1D exceeds both scores with 32 tokens (55.83 mIoU) and already edges past them with 16 tokens (55.22 mIoU).
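To pick a token budget from the ADE20K results, one option is a simple lookup over the table, sketched below. The values are copied from the table above, and `smallest_budget` is a hypothetical helper, not part of any released tooling.

```python
# ADE20K linear-probe mIoU by number of 1D tokens (from the table above).
MIOU = {1: 40.23, 2: 46.84, 4: 51.14, 8: 54.11, 16: 55.22,
        32: 55.83, 64: 55.60, 128: 55.63, 256: 55.68}


def smallest_budget(target_miou):
    """Return the smallest token count whose mIoU meets the target, or None."""
    for tokens in sorted(MIOU):
        if MIOU[tokens] >= target_miou:
            return tokens
    return None


print(smallest_budget(55.20))   # 16 (C-RADIOv4-H reference score)
print(smallest_budget(54.80))   # 16 (DINOv3-H+ reference score)
print(max(MIOU, key=MIOU.get))  # 32 (best-scoring budget in the table)
```

The curve flattens after 32 tokens, so for this benchmark larger budgets add cost without a measurable gain in mIoU.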
Vision-language modeling (Nemotron-Nano-9B-v2, 17M SFT samples)
Average accuracy across 11 multimodal benchmarks (TextVQA, DocVQA, InfoVQA, OCRBench, OCRBench v2 EN, OCRBench v2 CN, AI2D, ChartQA, MMMU, SeedBench, LongVideoBench), with H100 time-to-first-token (TTFT) measured in vLLM at 32 images and 128 LLM context tokens.
| Vision Encoder | Tokens/tile | TTFT (ms) | Avg. accuracy |
|---|---|---|---|
| C-RADIOv4-H | 256 | 468.2 | 73.09 |
| SigLIP2-SO400m | 256 | 440.5 | 72.67 |
| SigLIP2-g | 256 | 517.6 | 72.81 |
| RADIO1D-H | 1 | 327.0 | 51.53 |
| RADIO1D-H | 8 | 329.7 | 63.88 |
| RADIO1D-H | 32 | 335.1 | 68.13 |
| RADIO1D-H | 64 | 346.1 | 70.07 |
| RADIO1D-H | 128 | 373.2 | 71.64 |
| RADIO1D-H | 192 | 380.4 | 72.36 |
| RADIO1D-H | 224 | 410.2 | 73.02 |
| RADIO1D-H | 256 | 452.8 | 73.29 |
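The table can also be read as a latency budget problem: given a TTFT ceiling, choose the RADIO1D-H setting with the highest average accuracy. The sketch below does this with the figures copied from the table; `best_under` and `ttft_saving` are hypothetical helpers, and the numbers apply only to the measurement setup stated above (H100, vLLM, 32 images, 128 LLM context tokens).

```python
# (tokens per tile, TTFT in ms, average accuracy) for RADIO1D-H, from the table.
RADIO1D = [(1, 327.0, 51.53), (8, 329.7, 63.88), (32, 335.1, 68.13),
           (64, 346.1, 70.07), (128, 373.2, 71.64), (192, 380.4, 72.36),
           (224, 410.2, 73.02), (256, 452.8, 73.29)]
BASELINE_TTFT = 468.2  # C-RADIOv4-H at 256 tokens per tile


def best_under(budget_ms):
    """Most accurate RADIO1D-H row whose TTFT fits the budget, or None."""
    fitting = [row for row in RADIO1D if row[1] <= budget_ms]
    return max(fitting, key=lambda row: row[2]) if fitting else None


def ttft_saving(ttft_ms):
    """TTFT reduction relative to the C-RADIOv4-H baseline, in percent."""
    return 100 * (BASELINE_TTFT - ttft_ms) / BASELINE_TTFT


print(best_under(400))                 # (192, 380.4, 72.36)
print(round(ttft_saving(410.2), 1))    # 12.4
```

For instance, the 224-token setting is within 0.07 points of the C-RADIOv4-H average (73.02 vs. 73.09) while reducing TTFT by roughly 12%.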
Inference
Acceleration Engine: TensorRT, TensorRT-LLM
Engine: PyTorch
Test Hardware: NVIDIA Hopper (H100)
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards below.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | None |
| Bias Metric (If Measured): | None |
Explainability
| Field | Response |
|---|---|
| Intended Task/Domain: | Visual Feature Extraction (variable-length 1D token sequence) |
| Model Type: | Vision Transformer with elastic 1D token bottleneck and (training-only) decoder |
| Intended Users: | Developers of downstream vision and vision-language applications |
| Output: | Variable-length sequence of 1D image embedding tokens (1–256 tokens, user-specified at inference) |
| Describe how the model works: | The model takes an image as input, processes the image through transformer blocks (with one learnable Patch Merging downscale), and outputs a hierarchical 1D sequence in which early tokens summarize global semantics and later tokens encode finer details. The user selects how many tokens to keep at inference. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations: | This model generates image embeddings that a downstream model must be trained to consume. The model is only tested on input resolutions ranging from 128 to 2048 pixels, in increments of 16 pixels. With very few output tokens (≤ 32) accuracy degrades on tasks requiring dense reading (e.g., DocVQA, InfoVQA, OCRBench), where the full 256-token budget is recommended. The model may fail to surface fine-grained orientation cues (e.g., whether a sign points left or right) and, like other vision foundation models, may not disambiguate visually similar concepts (e.g., different breeds of dog) without downstream fine-tuning. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | ADE20K linear-probe mIoU as a function of token count, and multimodal benchmark accuracy in a Nemotron VL pipeline. |
| Potential Known Risks: | This model may not perform well on visual domains that are not represented in the training data. The generated embeddings might fail to disambiguate differences that appear evident to humans. Domain-specific evaluation is required for the target application. Aggressive token compression (≤ 32 tokens) trades fine-grained spatial fidelity for efficiency and should be evaluated for the target task before deployment. |
| Licensing: | NVIDIA Open Model License |
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | No |
| How often is dataset reviewed? | Before Every Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
| Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |
Safety
| Field | Response |
|---|---|
| Model Application Field(s): | Generation of visual embeddings |
| Describe the life critical impact (if present). | Not Applicable |
| Use Case Restrictions: | Abide by NVIDIA Open Model License Agreement |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to. |