This model is a TensorFlow port of the DINO [1] ViT-B/16 [2] model. The backbone was pre-trained with the DINO pretext task; the head layer was then trained on top of it while the backbone was kept frozen. The ImageNet-1k dataset was used for training. You can refer to this notebook to see how the porting was done.
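The frozen-backbone setup described above can be sketched in Keras as follows. This is only an illustration of the general idea, not the training code used for this model: the tiny dense `toy_backbone`, the input size, and the random data are hypothetical stand-ins for the ported ViT-B/16 backbone and ImageNet-1k.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for a pre-trained backbone (NOT the actual ViT-B/16).
backbone = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(16, activation="relu"),
    ],
    name="toy_backbone",
)
# Freeze the backbone so that only the head receives gradient updates.
backbone.trainable = False

# Classification head trained on top of the frozen features.
head = tf.keras.layers.Dense(10, name="linear_head")

model = tf.keras.Sequential([tf.keras.Input(shape=(32,)), backbone, head])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Random placeholder data standing in for a labelled image dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32)).astype("float32")
y = rng.integers(0, 10, size=(64,))

# Snapshot the weights to compare before and after training.
backbone_before = [w.copy() for w in backbone.get_weights()]
head_before = [w.copy() for w in head.get_weights()]

model.fit(x, y, epochs=2, batch_size=16, verbose=0)

backbone_after = backbone.get_weights()
head_after = head.get_weights()
```

After `fit`, the backbone weights should be identical to their snapshot while the head weights have moved, which is the behaviour the frozen-backbone setup relies on.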

References

[1] Emerging Properties in Self-Supervised Vision Transformers: https://arxiv.org/abs/2104.14294

[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929
