---
license: mit
language:
- en
tags:
- zero-shot-image-classification
- clip
- biology
- biodiversity
- agronomy
- CV
- images
- animals
- species
- taxonomy
- rare species
- endangered species
- evolutionary biology
- multimodal
- knowledge-guided
datasets:
- Arboretum
- imageomics/TreeOfLife-10M
- iNat21
- BIOSCAN-1M
- EOL
---

# Model Card for ArborCLIP
[Project Page](https://baskargroup.github.io/Arboretum/) | [GitHub](https://github.com/baskargroup/Arboretum) | PyPI: `arbor-process` 0.1.0
ARBORCLIP is a new suite of vision-language foundation models for biodiversity. These CLIP-style foundation models were trained on [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/), a large-scale dataset of 40 million images covering 33K species of plants and animals, and are evaluated on zero-shot image classification tasks.

- **Model type:** Vision Transformer (ViT-B/16, ViT-L/14)
- **License:** MIT
- **Fine-tuned from model:** [OpenAI CLIP](https://github.com/mlfoundations/open_clip), [MetaCLIP](https://github.com/facebookresearch/MetaCLIP), [BioCLIP](https://github.com/Imageomics/BioCLIP)

These models were developed for the benefit of the AI community as an open-source product, so we request that any derivative products also be open-source.

### Model Description

ArborCLIP is based on OpenAI's [CLIP](https://openai.com/research/clip) model. The models were trained on [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/) in the following configurations:

- **ARBORCLIP-O:** A ViT-B/16 backbone initialized from the [OpenCLIP](https://github.com/mlfoundations/open_clip) checkpoint and trained for 40 epochs.
- **ARBORCLIP-B:** A ViT-B/16 backbone initialized from the [BioCLIP](https://github.com/Imageomics/BioCLIP) checkpoint and trained for 8 epochs.
- **ARBORCLIP-M:** A ViT-L/14 backbone initialized from the [MetaCLIP](https://github.com/facebookresearch/MetaCLIP) checkpoint and trained for 12 epochs.

To access the checkpoints of the above models, go to the `Files and versions` tab and download the weights. These weights can be used directly for zero-shot classification and fine-tuning. The filenames correspond to the specific model weights:

- **ARBORCLIP-O:** `arborclip-vit-b-16-from-openai-epoch-40.pt`
- **ARBORCLIP-B:** `arborclip-vit-b-16-from-bioclip-epoch-8.pt`
- **ARBORCLIP-M:** `arborclip-vit-l-14-from-metaclip-epoch-12.pt`

### Model Training

**See the [Model Training](https://github.com/baskargroup/Arboretum?tab=readme-ov-file#model-training) section on [GitHub](https://github.com/baskargroup/Arboretum) for examples of how to use ArborCLIP models in zero-shot image classification tasks.**

We train three models using a modified version of the [BioCLIP / OpenCLIP](https://github.com/Imageomics/bioclip/tree/main/src/training) codebase. Each model is trained on Arboretum-40M on 2 nodes with 8xH100 GPUs on NYU's [Greene](https://sites.google.com/nyu.edu/nyu-hpc/hpc-systems/greene) high-performance compute cluster. We publicly release all code needed to reproduce our results on the [GitHub](https://github.com/baskargroup/Arboretum) page. We optimize our hyperparameters prior to training with [Ray](https://docs.ray.io/en/latest/index.html). Our standard training parameters are as follows:

```
--dataset-type webdataset
--pretrained openai
--text_type random
--dataset-resampled
--warmup 5000
--batch-size 4096
--accum-freq 1
--epochs 40
--workers 8
--model ViT-B-16
--lr 0.0005
--wd 0.0004
--precision bf16
--beta1 0.98
--beta2 0.99
--eps 1.0e-6
--local-loss
--gather-with-grad
--ddp-static-graph
--grad-checkpointing
```

For more extensive documentation of the training process and the significance of each hyperparameter, we recommend the [OpenCLIP](https://github.com/mlfoundations/open_clip) and [BioCLIP](https://github.com/Imageomics/BioCLIP) documentation.
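As a quick, unofficial illustration of how a downloaded checkpoint can be used for zero-shot classification (the examples linked above remain the reference), the sketch below loads a ViT-B/16 checkpoint with the `open_clip` library. The checkpoint filename is taken from the list above; the image path and candidate species names are placeholders.

```python
# Minimal zero-shot sketch using open_clip (pip install open_clip_torch).
# The image path and label strings are placeholders, not files shipped with this card.
import torch
import open_clip
from PIL import Image

# Load the downloaded ArborCLIP weights into a ViT-B/16 CLIP model.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="arborclip-vit-b-16-from-openai-epoch-40.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

# Candidate labels; using scientific names here is an illustrative choice.
labels = ["Danaus plexippus", "Apis mellifera", "Quercus alba"]
text = tokenizer([f"a photo of {name}" for name in labels])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize and compute cosine-similarity-based probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(labels, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

For the ViT-L/14 checkpoint, swap the model name for `ViT-L-14`. Whether a given training checkpoint loads directly this way depends on how it was serialized, so treat this as a starting point rather than a supported interface.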
### Model Validation

For validating the zero-shot accuracy of our trained models and comparing them to other benchmarks, we use the [VLHub](https://github.com/penfever/vlhub) repository with some slight modifications.

#### Pre-Run

After cloning the [GitHub](https://github.com/baskargroup/Arboretum) repository and navigating to the `Arboretum/model_validation` directory, we recommend installing all the project requirements into a conda environment with `pip install -r requirements.txt`. Also, before executing a command in VLHub, add `Arboretum/model_validation/src` to your PYTHONPATH:

```bash
export PYTHONPATH="$PYTHONPATH:$PWD/src";
```

#### Base Command

A basic Arboretum model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path passed via the `--resume` flag, on the ImageNet validation set, and reports the results to Weights and Biases.

```bash
python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb
```

### Training Dataset

- **Dataset Repository:** [Arboretum](https://github.com/baskargroup/Arboretum)
- **Dataset Paper:** Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity ([arXiv](https://arxiv.org/abs/2406.17720))
- **HF Dataset card:** [Arboretum](https://huggingface.co/datasets/ChihHsuan-Yang/Arboretum)

### Model Limitations

All `ArborCLIP` models were evaluated on the challenging [CONFOUNDING-SPECIES](https://arxiv.org/abs/2306.02507) benchmark; however, all of the models performed at or below random chance. This is an interesting avenue for follow-up work that could further expand the models' capabilities.

In general, we found that models trained on web-scraped data perform better with common names, whereas models trained on specialist datasets perform better with scientific names. Additionally, models trained on web-scraped data excel at classification at the highest taxonomic level (kingdom), while models begin to benefit from specialist datasets like [ARBORETUM-40M](https://baskargroup.github.io/Arboretum/) and [Tree-of-Life-10M](https://huggingface.co/datasets/imageomics/TreeOfLife-10M) at the lower taxonomic levels (order and species). From a practical standpoint, `ArborCLIP` is highly accurate at the species level, and higher-level taxa can be derived deterministically from lower ones (see the short sketch after the Acknowledgements). Addressing these limitations will further enhance the applicability of models like `ArborCLIP` in real-world biodiversity monitoring tasks.

### Acknowledgements

This work was supported by the AI Research Institutes program of the NSF and USDA-NIFA under the [AI Institute for Resilient Agriculture](https://aiira.iastate.edu/), Award No. 2021-67021-35329. It was also partly supported by the NSF under CPS Frontier grant CNS-1954556. We also gratefully acknowledge the support of NYU IT [High Performance Computing](https://www.nyu.edu/life/information-technology/research-computing-services/high-performance-computing.html) resources, services, and staff expertise.
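As a small illustration of the point above that higher-level taxa can be derived deterministically from species-level predictions, here is a hedged sketch using a hypothetical lineage lookup table. Neither the table nor the `rollup` helper is part of the released code; in practice the lineage would come from a taxonomy source such as the dataset's own metadata.

```python
# Hypothetical rollup of species-level predictions to higher taxonomic ranks.
# The lineage table below is illustrative only.
LINEAGE = {
    "Danaus plexippus": {
        "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
        "order": "Lepidoptera", "family": "Nymphalidae", "genus": "Danaus",
    },
    "Apis mellifera": {
        "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
        "order": "Hymenoptera", "family": "Apidae", "genus": "Apis",
    },
}

def rollup(species: str, rank: str) -> str:
    """Derive a higher-rank label deterministically from a species-level prediction."""
    return LINEAGE[species][rank]

print(rollup("Danaus plexippus", "order"))   # Lepidoptera
print(rollup("Apis mellifera", "family"))    # Apidae
```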

### Citation

If you find the models and datasets useful in your research, please consider citing our paper:
```
@misc{yang2024arboretumlargemultimodaldataset,
      title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity},
      author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
      year={2024},
      eprint={2406.17720},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.17720},
}
```
---

For more details and access to the Arboretum dataset, please visit the [Project Page](https://baskargroup.github.io/Arboretum/).