Add Model Card

#1
by egrace479 - opened
Files changed (1)
  1. README.md +308 -1
README.md CHANGED
@@ -1,3 +1,310 @@
1
  ---
2
- license: apache-2.0
3
  ---
1
  ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - biology
7
+ - CV
8
+ - images
9
+ - animals
10
+ - image classification
11
+ - fine-grained classification
12
+ - birds
13
+ - pets
14
+ - interpretable
15
+ - transformers
16
+ - vision transformer
17
+ - prompt tuning
18
+ - explainable AI
19
+ - interpretable machine learning
20
+ - saliency map
22
+ - dino
23
+ metrics:
24
+ - accuracy
25
+ model_description: >-
26
+ Prompt-CAM is a simple yet effective interpretable transformer for fine-grained
27
+ image classification and analysis. It injects class-specific prompts into any
28
+ pretrained Vision Transformer (ViT) to produce per-class attention maps without
29
+ requiring any architectural modifications. This allows for exploration of fine-grained
30
+ trait distinctions between different specified species.
31
  ---
32
+
33
+ # Model Card for Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis
34
+
35
+ This repository provides Prompt-CAM checkpoints trained with ViT-B DINO and DINOv2 backbones on fine-grained image classification datasets (CUB-200-2011, Oxford-IIIT Pet, Stanford Cars, Stanford Dogs, Birds-525). The checkpoints produce per-class attention maps for exploring the fine-grained traits that distinguish visually similar classes, such as closely related species.
36
+
37
+ <!-- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1). And further altered to suit Imageomics Institute needs -->
38
+
39
+ ## Model Details
40
+
41
+ ### Model Description
42
+
43
+ Prompt-CAM is a **simple yet effective interpretable transformer** that requires no architectural modifications to pretrained ViTs. It injects **class-specific prompts** into any ViT to make attention maps interpretable for fine-grained analysis. The prompts act as class queries: within the ViT's existing self-attention layers, the attention between each prompt and the image patches produces a human-interpretable heatmap highlighting the visual traits the model uses to distinguish that class.
44
+
45
+ - **Developed by:** Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, and Wei-Lun Chao
46
+ - **Model type:** Vision Transformer with class-specific prompt injection
47
+ - **License:** MIT
48
+ - **Fine-tuned from model:** [ViT-B DINO](https://dl.fbaipublicfiles.com/dino/dino_vitbase16_pretrain/dino_vitbase16_pretrain.pth) and [ViT-B DINOv2](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_pretrain.pth)
49
+
50
+ ### Model Sources
51
+
52
+ - **Repository:** [Imageomics/Prompt_CAM](https://github.com/Imageomics/Prompt_CAM)
53
+ - **Paper:** [Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (CVPR 2025)](https://doi.org/10.1109/CVPR52734.2025.00413), [Open-Access](https://openaccess.thecvf.com/content/CVPR2025/papers/Chowdhury_Prompt-CAM_Making_Vision_Transformers_Interpretable_for_Fine-Grained_Analysis_CVPR_2025_paper.pdf) <!-- https://arxiv.org/pdf/2501.09333-->
54
+ - **Demo:** Interactive Colab demo [![](https://img.shields.io/badge/Google_Colab-blue)](https://colab.research.google.com/drive/1co1P5LXSVb-g0hqv8Selfjq4WGxSpIFe?usp=sharing) and local [demo.ipynb](https://github.com/Imageomics/Prompt_CAM/blob/main/demo.ipynb)
55
+
56
+ ## Uses
57
+
58
+ ### Direct Use
59
+
60
+ Prompt-CAM can be used directly for:
61
+ - **Fine-grained image classification** — predicting the species/category of an image among a large set of visually similar classes.
62
+ - **Visual interpretability** — generating per-class attention heatmaps that highlight which image regions and traits the model uses for each class, supporting scientific understanding of what distinguishes species or categories.
63
+
64
+ ### Downstream Use
65
+
66
+ Prompt-CAM can be extended to new fine-grained datasets by following the [extension instructions](https://github.com/Imageomics/Prompt_CAM#to-add-a-new-dataset) in the repository. It is well-suited for biological image datasets where understanding discriminative traits (e.g., plumage patterns, markings) is as important as classification accuracy.
67
+
68
+ ### Out-of-Scope Use
69
+
70
+ - The model is not designed for general-purpose object detection or segmentation.
71
+ - Performance may degrade significantly on image domains far from the training distribution (e.g., applying a bird-trained model to medical images).
72
+
73
+ ## Bias, Risks, and Limitations
74
+
75
+ - Prompt-CAM inherits any biases present in the pretrained ViT backbone (DINO / DINOv2) and in the fine-tuning datasets (e.g., geographic or photographic biases in CUB-200-2011).
76
+ - Classification performance is tied to image quality; low-resolution or heavily occluded images may yield less reliable predictions and attention maps.
77
+
78
+ ### Recommendations
79
+
80
+ Users should treat attention heatmaps as model explanations to be verified with domain expertise rather than ground-truth biological annotations.
81
+
82
+ ## How to Get Started with the Model
83
+
84
+ Set up the environment (using [`env_setup.sh`](https://github.com/Imageomics/Prompt_CAM/blob/main/env_setup.sh)):
85
+
86
+ ```bash
87
+ conda create -n prompt_cam python=3.10
88
+ conda activate prompt_cam
89
+ source env_setup.sh
90
+ ```
91
+
92
+ Download a checkpoint from this repository (see the table under [Training Data](#training-data) below) and save it as `checkpoints/{backbone}/{dataset}/model.pt`. Then visualize class-specific attention maps by running:
93
+
94
+ ```bash
95
+ CUDA_VISIBLE_DEVICES=0 python visualize.py \
96
+ --config ./experiment/config/prompt_cam/dino/cub/args.yaml \
97
+ --checkpoint ./checkpoints/dino/cub/model.pt \
98
+ --vis_cls 23
99
+ ```
100
+
101
+ Output heatmaps are saved to `visualization/dino/cub/class_23/`.
102
+
103
+ For an interactive experience, see the [Colab demo](https://colab.research.google.com/drive/1co1P5LXSVb-g0hqv8Selfjq4WGxSpIFe?usp=sharing) or [demo.ipynb](https://github.com/Imageomics/Prompt_CAM/blob/main/demo.ipynb).
104
+
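The download step above can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the repository ID and file name are taken from the checkpoint URL in this card, while the helper names `checkpoint_target` and `fetch_checkpoint` are hypothetical and not part of the Prompt_CAM repository.

```python
import shutil
from pathlib import Path

# Repository ID taken from the checkpoint URLs listed in this model card.
REPO_ID = "imageomics/Prompt-CAM"


def checkpoint_target(backbone: str, dataset: str, root: str = "checkpoints") -> Path:
    """Return checkpoints/{backbone}/{dataset}/model.pt, the layout described above."""
    return Path(root) / backbone / dataset / "model.pt"


def fetch_checkpoint(filename: str, backbone: str, dataset: str,
                     root: str = "checkpoints") -> Path:
    """Download `filename` from the Hub and copy it to the expected location."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    cached = hf_hub_download(repo_id=REPO_ID, filename=filename)
    target = checkpoint_target(backbone, dataset, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, target)
    return target


# Example call (requires network access):
# fetch_checkpoint("Prompt_CAM_checkpoint_dino_cub.pt", "dino", "cub")
```

Copying from the Hub cache keeps the cached file intact, so repeated calls do not download the checkpoint again.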
105
+ ## Training Details
106
+
107
+ ### Training Data
108
+
109
+ Each checkpoint is trained on the official training split of its respective dataset.
110
+
111
+ | Backbone | Dataset | Checkpoint |
112
+ |----------|---------|------------|
113
+ | DINO (ViT-B/16) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | [Prompt_CAM_checkpoint_dino_cub.pt](https://huggingface.co/imageomics/Prompt-CAM/resolve/main/Prompt_CAM_checkpoint_dino_cub.pt) |
114
+ | DINO (ViT-B/16) | [Stanford Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) | Coming soon |
115
+ | DINO (ViT-B/16) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
116
+ | DINO (ViT-B/16) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
117
+ | DINO (ViT-B/16) | [Birds-525](https://www.kaggle.com/datasets/gpiosenka/100-bird-species) | Coming soon |
118
+ | DINOv2 (ViT-B/14) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | Coming soon |
119
+ | DINOv2 (ViT-B/14) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
120
+ | DINOv2 (ViT-B/14) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
121
+
122
+ ### Training Procedure
123
+
124
+ Only the class-specific prompt tokens are trained; the pretrained ViT backbone weights are kept frozen. The number of prompt tokens equals the number of classes in the dataset.
125
+
126
+ Dataset images are organized as:
127
+
128
+ ```
129
+ dataset_name/
130
+ ├── train/
131
+ │ ├── 001.ClassName/
132
+ │ │ ├── img1.jpg
133
+ │ │ └── ...
134
+ │ └── ...
135
+ └── val/
136
+ ├── 001.ClassName/
137
+ │ ├── img2.jpg
138
+ │ └── ...
139
+ └── ...
140
+ ```
141
+
142
+ Please see the [Data Preparation section](https://github.com/Imageomics/Prompt_CAM#data-preparation) of our GitHub repository for more details on training and validation setup, including preprocessing scripts.
143
+
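To illustrate the folder layout above, here is a small sketch that walks one split and assigns an integer label to each class folder in sorted order. This is the same convention `torchvision.datasets.ImageFolder` follows, but the repository's own loader may differ; the function name `index_split` and the second class folder `002.OtherClass` are made up for the example.

```python
import tempfile
from pathlib import Path


def index_split(split_dir):
    """Map class folders to integer labels and list (image_path, label) pairs."""
    split_dir = Path(split_dir)
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(class_names)}
    samples = [
        (img, class_to_idx[name])
        for name in class_names
        for img in sorted((split_dir / name).glob("*.jpg"))
    ]
    return class_to_idx, samples


# Build a tiny throwaway dataset that mirrors the layout and index it.
with tempfile.TemporaryDirectory() as root:
    for cls, img in [("001.ClassName", "img1.jpg"), ("002.OtherClass", "img3.jpg")]:
        folder = Path(root, "train", cls)
        folder.mkdir(parents=True)
        (folder / img).touch()
    class_to_idx, samples = index_split(Path(root, "train"))

print(class_to_idx)
```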
144
+ #### Preprocessing
145
+
146
+ | Step | Train | Val |
147
+ |------|-------|-----|
148
+ | Resize | 240 × 240 | 224 × 224 |
149
+ | Crop | RandomCrop 224 × 224 | — |
150
+ | Flip | RandomHorizontalFlip | — |
151
+ | Normalize | ImageNet Inception mean/std | ImageNet Inception mean/std |
152
+
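As a rough sketch, the preprocessing table could be expressed with `torchvision.transforms` as shown below. Two things are assumptions rather than statements from this card: that "ImageNet Inception mean/std" refers to the values `timm` exposes as `IMAGENET_INCEPTION_MEAN` and `IMAGENET_INCEPTION_STD` (0.5 per channel), and that a `ToTensor` step precedes normalization. The function name `build_transform` is hypothetical.

```python
# Values that timm exposes as IMAGENET_INCEPTION_MEAN / IMAGENET_INCEPTION_STD
# (assumed to be what the table refers to).
INCEPTION_MEAN = (0.5, 0.5, 0.5)
INCEPTION_STD = (0.5, 0.5, 0.5)


def build_transform(train: bool):
    """Return a torchvision pipeline mirroring the Train or Val column."""
    # Imported lazily so the constants can be inspected without torchvision.
    from torchvision import transforms

    if train:
        steps = [
            transforms.Resize((240, 240)),
            transforms.RandomCrop(224),
            transforms.RandomHorizontalFlip(),
        ]
    else:
        steps = [transforms.Resize((224, 224))]

    # ToTensor is an assumed step: Normalize operates on tensors, not PIL images.
    steps += [
        transforms.ToTensor(),
        transforms.Normalize(mean=INCEPTION_MEAN, std=INCEPTION_STD),
    ]
    return transforms.Compose(steps)
```

The training configuration in the repository remains the authoritative source for the exact pipeline.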
153
+ #### Training Hyperparameters
154
+
155
+ | Hyperparameter | Value |
156
+ |----------------|-------|
157
+ | Optimizer | SGD |
158
+ | Learning rate | 0.001 – 0.005 (dataset-dependent) |
159
+ | Min LR | 1e-6 |
160
+ | Momentum | 0.9 |
161
+ | Weight decay | 0.001 |
162
+ | Epochs | 100 – 130 |
163
+ | Warmup epochs | 20 |
164
+ | Warmup LR init | 1e-6 |
165
+ | Batch size | 16 |
166
+ | Drop path rate | 0.0 |
167
+ | VPT dropout | 0.0 |
168
+ | Precision | fp32 |
169
+
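To show how the warmup and minimum learning-rate values in the table interact, here is a simplified schedule sketch. The table does not name the decay shape, so the cosine decay after warmup is an assumption, as are the linear warmup and the function name `lr_at_epoch`; the actual scheduler used for training may behave differently.

```python
import math


def lr_at_epoch(epoch, base_lr=0.001, min_lr=1e-6, warmup_epochs=20,
                warmup_lr_init=1e-6, total_epochs=100):
    """Linear warmup to base_lr, then (assumed) cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_lr_init to base_lr over the warmup epochs.
        return warmup_lr_init + (base_lr - warmup_lr_init) * epoch / warmup_epochs
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = min(1.0, (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 60, 100):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```

With the defaults above, the rate starts at the warmup value, reaches the base learning rate at epoch 20, and ends at the minimum learning rate at epoch 100.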
170
+ #### Speeds, Sizes, Times
171
+
172
+ - **Hardware:** NVIDIA RTX A6000
173
+ - **Training time:** ≤ 1 hour per checkpoint
174
+ - **Checkpoint size:** ~350 MB (ViT-B backbone + prompt tokens)
175
+
176
+ ## Evaluation
177
+
178
+ To evaluate a checkpoint on the test split, run:
179
+
180
+ ```bash
181
+ CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py \
182
+ --config ./experiment/config/prompt_cam/dino/cub/args.yaml \
183
+ --gpu_num 4
184
+ ```
185
+
186
+ ### Testing Data, Factors & Metrics
187
+
188
+ #### Testing Data
189
+
190
+ Each model is evaluated on the official test (val) split of its training dataset.
191
+
192
+ | Backbone | Dataset | Checkpoint |
193
+ |----------|---------|------------|
194
+ | DINO (ViT-B/16) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | [Prompt_CAM_checkpoint_dino_cub.pt](https://huggingface.co/imageomics/Prompt-CAM/resolve/main/Prompt_CAM_checkpoint_dino_cub.pt) |
195
+ | DINO (ViT-B/16) | [Stanford Cars](https://ai.stanford.edu/~jkrause/cars/car_dataset.html) | Coming soon |
196
+ | DINO (ViT-B/16) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
197
+ | DINO (ViT-B/16) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
198
+ | DINO (ViT-B/16) | [Birds-525](https://www.kaggle.com/datasets/gpiosenka/100-bird-species) | Coming soon |
199
+ | DINOv2 (ViT-B/14) | [CUB-200-2011](https://www.vision.caltech.edu/datasets/cub_200_2011/) | Coming soon |
200
+ | DINOv2 (ViT-B/14) | [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/) | Coming soon |
201
+ | DINOv2 (ViT-B/14) | [Oxford-IIIT Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/) | Coming soon |
202
+
203
+ #### Metrics
204
+
205
+ Top-1 classification accuracy on the official test split.
206
+
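For reference, top-1 accuracy is the percentage of test images whose highest-scoring class matches the ground-truth label. A minimal sketch with made-up scores (the function name `top1_accuracy` is illustrative, not taken from the repository):

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    correct = sum(
        max(range(len(row)), key=row.__getitem__) == label
        for row, label in zip(scores, labels)
    )
    return 100.0 * correct / len(labels)


# Four toy samples over three classes; the last one is misclassified.
scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [1, 0, 2, 2]
print(top1_accuracy(scores, labels))  # 3 of 4 correct
```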
207
+ ### Results
208
+
209
+ | Backbone | Dataset | acc@1 |
210
+ |----------|---------|-------|
211
+ | DINO (ViT-B/16) | CUB-200-2011 | 73.2 |
212
+ | DINO (ViT-B/16) | Stanford Cars | 83.2 |
213
+ | DINO (ViT-B/16) | Stanford Dogs | 81.1 |
214
+ | DINO (ViT-B/16) | Oxford-IIIT Pet | 91.3 |
215
+ | DINO (ViT-B/16) | Birds-525 | 98.8 |
216
+ | DINOv2 (ViT-B/14) | CUB-200-2011 | 74.1 |
217
+ | DINOv2 (ViT-B/14) | Stanford Dogs | 81.3 |
218
+ | DINOv2 (ViT-B/14) | Oxford-IIIT Pet | 92.7 |
219
+
220
+ ## Environmental Impact
221
+
222
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://doi.org/10.48550/arXiv.1910.09700).
223
+
224
+ - **Hardware Type:** NVIDIA RTX A6000
225
+ - **Hours used:** ≤ 1 hour per checkpoint
226
+ - **Cloud Provider:** N/A (local cluster)
227
+ - **Compute Region:** United States
228
+ - **Carbon Emitted:** ~0.13 kg CO<sub>2</sub> eq. per checkpoint
229
+
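The per-checkpoint figure can be sanity-checked with the usual energy-times-intensity estimate. The card does not state which inputs were used, so both inputs here are assumptions: 300 W is the maximum board power of the RTX A6000, and 0.432 kg CO2 eq. per kWh is an assumed grid carbon intensity.

```python
def co2_kg(power_watts, hours, kg_per_kwh):
    """Estimated emissions: energy used (kWh) times grid carbon intensity."""
    return power_watts / 1000.0 * hours * kg_per_kwh


# Assumed inputs: one GPU at full board power for the 1-hour upper bound.
estimate = co2_kg(power_watts=300, hours=1.0, kg_per_kwh=0.432)
print(f"{estimate:.2f} kg CO2 eq.")
```

Under these assumptions the estimate rounds to the ~0.13 kg listed above; shorter runs or lower average power draw would give a smaller value.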
230
+ ## Technical Specifications
231
+
232
+ ### Model Architecture and Objective
233
+
234
+ Prompt-CAM adds a set of learnable class-specific prompt tokens to the input sequence of a frozen pretrained ViT. Each prompt token corresponds to one class. During the forward pass, the self-attention between each class prompt and the image patch tokens produces a spatial attention map that reveals which patches are most relevant for that class. Only the prompt tokens are optimized during training; all ViT parameters remain frozen.
235
+
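The idea of reading a heatmap off the prompt-to-patch attention can be sketched with NumPy. This is a deliberately simplified, single-head illustration and not the repository's implementation: it omits the learned query/key projections, multiple heads, the CLS token, and attention among the prompts themselves. The shapes assume ViT-B/16 at 224 × 224 input (196 patches of dimension 768) and 200 classes as in CUB-200-2011; the function name is hypothetical and the inputs are random.

```python
import numpy as np


def prompt_attention_maps(prompts, patches, grid):
    """Softmax attention of each class prompt (query) over patch tokens (keys).

    prompts: (C, d) array, one token per class
    patches: (N, d) array of patch tokens
    grid:    (H, W) with H * W == N
    returns: (C, H, W) array, one spatial map per class
    """
    d = prompts.shape[-1]
    logits = prompts @ patches.T / np.sqrt(d)       # (C, N) scaled dot products
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights.reshape(prompts.shape[0], *grid)


rng = np.random.default_rng(0)
maps = prompt_attention_maps(
    rng.normal(size=(200, 768)),   # 200 class prompts
    rng.normal(size=(196, 768)),   # 14 x 14 patch tokens
    grid=(14, 14),
)
print(maps.shape)
```

Each of the resulting maps can be upsampled to the input resolution and overlaid on the image to visualize where a given class prompt attends.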
236
+ ### Compute Infrastructure
237
+
238
+ #### Hardware
239
+
240
+ NVIDIA RTX A6000 GPU.
241
+
242
+ #### Software
243
+
244
+ - Python 3.10
245
+ - PyTorch
246
+ - timm 1.0.24
247
+
248
+ See [`env_setup.sh`](https://github.com/Imageomics/Prompt_CAM/blob/main/env_setup.sh) for the full environment.
249
+
250
+ ## Citation
251
+
252
+ **BibTeX:**
253
+
254
+ [![Paper](https://img.shields.io/badge/Paper-2501.09333-blue)](https://doi.org/10.1109/CVPR52734.2025.00413) [![Open-Access Paper](https://img.shields.io/badge/Paper-Open--Access-blue)](https://openaccess.thecvf.com/content/CVPR2025/papers/Chowdhury_Prompt-CAM_Making_Vision_Transformers_Interpretable_for_Fine-Grained_Analysis_CVPR_2025_paper.pdf)
255
+
256
+ If you find our work helpful, please consider citing our paper:
257
+
258
+ ```bibtex
259
+ @inproceedings{Chowdhury_2025_CVPR,
260
+ author = {Chowdhury, Arpita and Paul, Dipanjyoti and Mai, Zheda and Gu, Jianyang and Zhang, Ziheng and Mehrab, Kazi Sajeed and Campolongo, Elizabeth G. and Rubenstein, Daniel and Stewart, Charles V. and Karpatne, Anuj and Berger-Wolf, Tanya and Su, Yu and Chao, Wei-Lun},
261
+ title = {Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis},
262
+ booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
263
+ month = {June},
264
+ year = {2025},
265
+ pages = {4375--4385},
266
+ doi = {10.1109/CVPR52734.2025.00413}
267
+ }
268
+ ```
269
+
270
+ Model Citation:
271
+
272
+ ```bibtex
273
+ @software{Chowdhury_Prompt_CAM_2025,
274
+ author = {Chowdhury, Arpita and Paul, Dipanjyoti and Mai, Zheda and Gu, Jianyang and Zhang, Ziheng and Mehrab, Kazi Sajeed and Campolongo, Elizabeth G. and Rubenstein, Daniel and Stewart, Charles V. and Karpatne, Anuj and Berger-Wolf, Tanya and Su, Yu and Chao, Wei-Lun},
275
+ license = {MIT},
276
+ title = {{Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis}},
277
+ doi = {<doi once generated>},
278
+ url = {https://huggingface.co/imageomics/Prompt-CAM},
279
+ version = {1.0.0},
280
+ month = {June},
281
+ year = {2026}
282
+ }
283
+ ```
284
+
285
+ **APA:**
286
+
287
+ Paper:
288
+
289
+ Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K. S., Campolongo, E. G., Rubenstein, D., Stewart, C. V., Karpatne, A., Berger-Wolf, T., Su, Y., & Chao, W.-L. (2025). Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis. *2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 4375–4385. doi:10.1109/CVPR52734.2025.00413
+
290
+ Model Citation:
291
+
292
+ Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K. S., Campolongo, E. G., Rubenstein, D., Stewart, C. V., Karpatne, A., Berger-Wolf, T., Su, Y., & Chao, W.-L. (2025). Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (Version 1.0.0). https://huggingface.co/imageomics/Prompt-CAM
293
+
294
+ ## Acknowledgements
295
+
296
+ Our model builds on pretrained [DINO](https://github.com/facebookresearch/dino) and [DINOv2](https://github.com/facebookresearch/dinov2) ViT backbones. We thank the authors for their excellent work.
297
+
298
+ We also acknowledge:
299
+ - [VPT](https://github.com/KMnP/vpt)
300
+ - [PETL_VISION](https://github.com/OSU-MLB/PETL_Vision)
301
+
302
+ This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
303
+
304
+ ## Model Card Authors
305
+
306
+ Arpita Chowdhury
307
+
308
+ ## Model Card Contact
309
+
310
+ Arpita Chowdhury: [GitHub Issues](https://github.com/Imageomics/Prompt_CAM/issues), email: arpitachowdhurytonney@gmail.com