How to use BGLab/BioTrove-CLIP with OpenCLIP:

```python
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:BGLab/BioTrove-CLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:BGLab/BioTrove-CLIP')
```
| license: mit | |
| language: | |
| - en | |
| tags: | |
| - zero-shot-image-classification | |
| - OpenCLIP | |
| - clip | |
| - biology | |
| - biodiversity | |
| - agronomy | |
| - CV | |
| - images | |
| - animals | |
| - species | |
| - taxonomy | |
| - rare species | |
| - endangered species | |
| - evolutionary biology | |
| - multimodal | |
| - knowledge-guided | |
| datasets: | |
| - BGLab/BioTrove-Train | |
| base_model: | |
| - openai/clip-vit-base-patch16 | |
| - openai/clip-vit-large-patch14 | |
| pipeline_tag: zero-shot-image-classification | |
| metrics: | |
| - accuracy | |
| # Model Card for BioTrove-CLIP | |
| <!-- Banner links --> | |
| <div style="text-align:center;"> | |
| <a href="https://baskargroup.github.io/BioTrove/" target="_blank"> | |
| <img src="https://img.shields.io/badge/Project%20Page-Visit-blue" alt="Project Page" style="margin-right:10px;"> | |
| </a> | |
| <a href="https://github.com/baskargroup/BioTrove" target="_blank"> | |
| <img src="https://img.shields.io/badge/GitHub-Visit-lightgrey" alt="GitHub" style="margin-right:10px;"> | |
| </a> | |
| <a href="https://pypi.org/project/arbor-process/" target="_blank"> | |
| <img src="https://img.shields.io/badge/PyPI-arbor--process%200.1.0-orange" alt="PyPI biotrove-process 0.1.0"> | |
| </a> | |
| </div> | |
| BioTrove-CLIP is a new suite of vision-language foundation models for biodiversity. These CLIP-style foundation models were trained on [BioTrove-Train](https://huggingface.co/BGLab/BioTrove-Train), which is a large-scale dataset of `40 million` images of `33K species` of plants and animals. The models are evaluated on zero-shot image classification tasks. | |
| - **Model type:** Vision Transformer (ViT-B/16, ViT-L/14) | |
| - **License:** MIT | |
| - **Fine-tuned from model:** [OpenAI CLIP](https://github.com/mlfoundations/open_clip), [MetaCLIP](https://github.com/facebookresearch/MetaCLIP), [BioCLIP](https://github.com/Imageomics/BioCLIP) | |
| These models were developed for the benefit of the AI community as an open-source product. We therefore request that any derivative products also be open-source. | |
| ### Model Description | |
| BioTrove-CLIP is based on OpenAI's [CLIP](https://openai.com/research/clip) model. | |
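As a rough illustration of how a CLIP-style model performs zero-shot classification, the sketch below scores one image embedding against a set of text-prompt embeddings using cosine similarity followed by a softmax. The embeddings are tiny made-up vectors; in practice they would come from the model's image and text encoders (for example `encode_image` and `encode_text` in OpenCLIP). The helper names, the prompt wording, and the logit scale of 100 are illustrative assumptions, not values taken from the BioTrove-CLIP code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return one probability per text prompt for a single image embedding."""
    img = l2_normalize(image_emb)
    logits = []
    for emb in text_embs:
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the prompt logits.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-dimensional embeddings, for illustration only.
labels = ["a photo of Apis mellifera", "a photo of Quercus alba"]
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = zero_shot_probs(image_embedding, text_embeddings)
best = labels[probs.index(max(probs))]
print(best)
```

The predicted class is simply the prompt with the highest probability, so adding a new class only requires adding a new text prompt, with no retraining.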
| The models were trained on [BioTrove-Train](https://huggingface.co/BGLab/BioTrove-Train) for the following configurations: | |
| - **BioTrove-CLIP-O:** A ViT-B/16 backbone initialized from the [OpenCLIP](https://github.com/mlfoundations/open_clip) checkpoint and trained for 40 epochs. | |
| - **BioTrove-CLIP-B:** A ViT-B/16 backbone initialized from the [BioCLIP](https://github.com/Imageomics/BioCLIP) checkpoint and trained for 8 epochs. | |
| - **BioTrove-CLIP-M:** A ViT-L/14 backbone initialized from the [MetaCLIP](https://github.com/facebookresearch/MetaCLIP) checkpoint and trained for 12 epochs. | |
| To access the checkpoints of the above models, go to the `Files and versions` tab and download the weights. These weights can be used directly for zero-shot classification and fine-tuning. The checkpoint filenames are: | |
| - **BioTrove-CLIP-O:** `biotroveclip-vit-b-16-from-openai-epoch-40.pt` | |
| - **BioTrove-CLIP-B:** `biotroveclip-vit-b-16-from-bioclip-epoch-8.pt` | |
| - **BioTrove-CLIP-M:** `biotroveclip-vit-l-14-from-metaclip-epoch-12.pt` | |
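One way to keep the three variants straight in a script is a small lookup table from variant name to backbone and checkpoint filename. The filenames and backbones are the ones listed above; the `CHECKPOINTS` table, the `checkpoint_path` helper, the OpenCLIP-style architecture strings (`ViT-B-16`, `ViT-L-14`), and the `weights/` directory are illustrative assumptions.

```python
from pathlib import PurePosixPath

# Variant -> (architecture name, checkpoint filename as listed in the model card).
CHECKPOINTS = {
    "BioTrove-CLIP-O": ("ViT-B-16", "biotroveclip-vit-b-16-from-openai-epoch-40.pt"),
    "BioTrove-CLIP-B": ("ViT-B-16", "biotroveclip-vit-b-16-from-bioclip-epoch-8.pt"),
    "BioTrove-CLIP-M": ("ViT-L-14", "biotroveclip-vit-l-14-from-metaclip-epoch-12.pt"),
}


def checkpoint_path(variant, weights_dir="weights"):
    """Return (architecture, path) for a variant; weights_dir is a hypothetical local folder."""
    try:
        arch, filename = CHECKPOINTS[variant]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None
    return arch, str(PurePosixPath(weights_dir) / filename)


arch, path = checkpoint_path("BioTrove-CLIP-M")
print(arch, path)
```

The returned pair could then be handed to whichever loader you use. OpenCLIP's `create_model_and_transforms`, for instance, accepts an architecture name and a `pretrained` argument, though whether these checkpoints load there without key remapping is not confirmed here.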
| ### Model Training | |
| **See the [Model Training](https://github.com/baskargroup/BioTrove/tree/main/model_training) section on [GitHub](https://github.com/baskargroup/BioTrove) for examples of how to use BioTrove-CLIP models in zero-shot image classification tasks.** | |
| We train three models using a modified version of the [BioCLIP / OpenCLIP](https://github.com/Imageomics/bioclip/tree/main/src/training) codebase. Each model is trained on BioTrove-Train (40M), called Arboretum-40M in the original release, on 2 nodes, 8xH100 GPUs, on NYU's [Greene](https://sites.google.com/nyu.edu/nyu-hpc/hpc-systems/greene) high-performance compute cluster. We publicly release all code needed to reproduce our results on the [GitHub](https://github.com/baskargroup/Arboretum) page. | |
| We optimize our hyperparameters prior to training with [Ray](https://docs.ray.io/en/latest/index.html). Our standard training parameters are as follows: | |
| ``` | |
| --dataset-type webdataset | |
| --pretrained openai | |
| --text_type random | |
| --dataset-resampled | |
| --warmup 5000 | |
| --batch-size 4096 | |
| --accum-freq 1 | |
| --epochs 40 | |
| --workers 8 | |
| --model ViT-B-16 | |
| --lr 0.0005 | |
| --wd 0.0004 | |
| --precision bf16 | |
| --beta1 0.98 | |
| --beta2 0.99 | |
| --eps 1.0e-6 | |
| --local-loss | |
| --gather-with-grad | |
| --ddp-static-graph | |
| --grad-checkpointing | |
| ``` | |
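To make the interplay of `--lr` and `--warmup` concrete, the sketch below computes a learning rate with linear warmup followed by cosine decay, which is the default schedule in OpenCLIP's training script. The source does not state which scheduler was used, so treating it as cosine is an assumption, as are the function name and the `total_steps` value (a placeholder, since the number of optimizer steps per epoch is not given).

```python
import math


def warmup_cosine_lr(step, base_lr=0.0005, warmup_steps=5000, total_steps=100_000):
    """Learning rate at a given optimizer step (0-indexed).

    base_lr and warmup_steps mirror --lr and --warmup above;
    total_steps is a placeholder, not a value from the source.
    """
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down towards zero over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress)) * base_lr


for s in (0, 2499, 4999, 52_500, 99_999):
    print(s, f"{warmup_cosine_lr(s):.2e}")
```

With these placeholder values the rate climbs to its peak at the end of warmup, falls to half the peak midway through the decay phase, and approaches zero at the final step.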
| For more extensive documentation of the training process and the significance of each hyperparameter, we recommend consulting the [OpenCLIP](https://github.com/mlfoundations/open_clip) and [BioCLIP](https://github.com/Imageomics/BioCLIP) documentation. | |
| ### Model Validation | |
| For validating the zero-shot accuracy of our trained models and comparing to other benchmarks, we use the [VLHub](https://github.com/penfever/vlhub) repository with some slight modifications. | |
| #### Pre-Run | |
| After cloning the [GitHub](https://github.com/baskargroup/BioTrove) repository and navigating to the `BioTrove/model_validation` directory, we recommend installing the project requirements into a conda environment with `pip install -r requirements.txt`. Before running any VLHub command, also add `BioTrove/model_validation/src` to your `PYTHONPATH`: | |
| ```bash | |
| export PYTHONPATH="$PYTHONPATH:$PWD/src"; | |
| ``` | |
| #### Base Command | |
| A basic BioTrove-CLIP model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights are located at the path given by the `--resume` flag, on the ImageNet validation set and reports the results to Weights and Biases. | |
| ```bash | |
| python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb | |
| ``` | |
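If you need to run the same evaluation over several checkpoints, one option is to assemble the command in Python. The sketch below rebuilds the command shown above from its parts; every flag is copied from that command, while the `build_eval_command` helper and its argument names are hypothetical conveniences rather than part of VLHub.

```python
import shlex


def build_eval_command(weights_path, model="resnet50", imagenet_val="/imagenet/val/",
                       batch_size=32, workers=8, image_size=224, report_to="wandb"):
    """Return the evaluation command as an argument list (defaults copy the example above)."""
    return [
        "python", "src/training/main.py",
        f"--batch-size={batch_size}",
        f"--workers={workers}",
        "--imagenet-val", imagenet_val,
        f"--model={model}",
        "--zeroshot-frequency=1",
        f"--image-size={image_size}",
        "--resume", weights_path,
        "--report-to", report_to,
    ]


cmd = build_eval_command("/PATH/TO/WEIGHTS.pth")
print(shlex.join(cmd))
# To launch it, pass the list to subprocess.run(cmd, check=True)
# from the BioTrove/model_validation directory.
```

Keeping the command as a list avoids shell-quoting problems when a weights path contains spaces.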
| ### Training Links | |
| - **Main Dataset Repository:** [BioTrove](https://github.com/baskargroup/BioTrove) | |
| - **Dataset Paper:** BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity ([arXiv](https://arxiv.org/abs/2406.17720)) | |
| - **HF Dataset card:** [BioTrove-Train (40M)](https://huggingface.co/datasets/BGLab/BioTrove-Train) | |
| ### Model Limitations | |
| All `BioTrove-CLIP` models were evaluated on the challenging [CONFOUNDING-SPECIES](https://arxiv.org/abs/2306.02507) benchmark, and all of them performed at or below random chance. Improving on this is an interesting avenue for follow-up work that could further expand the models' capabilities. | |
| In general, we found that models trained on web-scraped data performed better with common | |
| names, whereas models trained on specialist datasets performed better when using scientific names. | |
| Additionally, models trained on web-scraped data excel at classifying at the highest taxonomic | |
| level (kingdom), while models begin to benefit from specialist datasets like [BioTrove-Train (40M)](https://huggingface.co/datasets/BGLab/BioTrove-Train) and | |
| [Tree-of-Life-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) at the lower taxonomic levels (order and species). From a practical standpoint, `BioTrove-CLIP` is highly accurate at the species level, and higher-level taxa can be deterministically derived from lower ones. | |
| Addressing these limitations will further enhance the applicability of models like `BioTrove-CLIP` in real-world biodiversity monitoring tasks. | |
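The point that higher-level taxa follow deterministically from a species-level prediction can be sketched with a simple lookup. The two-entry `TAXONOMY` table and the `expand_prediction` helper below are illustrative assumptions; a real pipeline would draw the lineage from the dataset's own taxonomic metadata.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")

# Tiny hand-written lineage table, for illustration only.
TAXONOMY = {
    "Apis mellifera": ("Animalia", "Arthropoda", "Insecta", "Hymenoptera", "Apidae", "Apis"),
    "Quercus alba": ("Plantae", "Tracheophyta", "Magnoliopsida", "Fagales", "Fagaceae", "Quercus"),
}


def expand_prediction(species):
    """Map a predicted species name to its full lineage, keyed by rank."""
    lineage = TAXONOMY.get(species)
    if lineage is None:
        raise KeyError(f"no lineage recorded for {species!r}")
    result = dict(zip(RANKS, lineage))
    result["species"] = species
    return result


prediction = expand_prediction("Apis mellifera")
print(prediction["kingdom"], prediction["order"])
```

Because the lookup is deterministic, accuracy at any higher rank is at least as high as species-level accuracy: a wrong species guess can still land in the correct order or kingdom.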
| ### Acknowledgements | |
| This work was supported by the AI Research Institutes program of the NSF and USDA-NIFA under the [AI Institute for Resilient Agriculture](https://aiira.iastate.edu/), Award No. 2021-67021-35329. It was also partly supported by the NSF under CPS Frontier grant CNS-1954556. We also gratefully acknowledge the support of NYU IT [High Performance Computing](https://www.nyu.edu/life/information-technology/research-computing-services/high-performance-computing.html) resources, services, and staff expertise. | |
| <!--BibTex citation --> | |
| <section class="section" id="BibTeX"> | |
| <div class="container is-max-widescreen content"> | |
| <h2 class="title">Citation</h2> | |
| If you find the models and datasets useful in your research, please consider citing our paper: | |
| <pre><code>@misc{yang2024arboretumlargemultimodaldataset, | |
| title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity}, | |
| author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and | |
| Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and | |
| Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian}, | |
| year={2024}, | |
| eprint={2406.17720}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CV}, | |
| url={https://arxiv.org/abs/2406.17720}, | |
| }</code></pre> | |
| </div> | |
| </section> | |
| <!--End BibTex citation --> | |
| --- | |
| For more details and access to the Arboretum dataset, please visit the [Project Page](https://baskargroup.github.io/Arboretum/). |