Model Card for BioCLIP 2

BioCLIP 2 is a foundation model for organismal biology images. It is trained on TreeOfLife-200M, starting from a CLIP model (ViT-L/14) pre-trained on LAION-2B. BioCLIP 2 yields state-of-the-art performance in recognizing a wide range of species. More importantly, it demonstrates emergent properties beyond species classification after extensive hierarchical contrastive training.

Model Details

Model Description

Foundation models trained at scale exhibit emergent properties beyond their initial training objectives. BioCLIP 2 demonstrates such emergence beyond species classification by scaling up the hierarchical contrastive training proposed by BioCLIP. The model is trained on TreeOfLife-200M (the largest and most diverse available dataset of biology images). We evaluate BioCLIP 2 on a diverse set of biological tasks. Through training at scale, BioCLIP 2 improves species classification by 18.1% over BioCLIP. More importantly, we demonstrate that BioCLIP 2 generalizes to diverse biological questions beyond species classification solely through species-level supervision. Further analysis reveals that BioCLIP 2 acquires two emergent properties through scaling up hierarchical contrastive learning: inter-species ecological alignment and intra-species variation separation.

  • Developed by: Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila M Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su
  • Model type: The model uses a ViT-L/14 Transformer as an image encoder and uses a masked self-attention Transformer as a text encoder.
  • License: MIT
  • Fine-tuned from model: CLIP pre-trained on LAION-2B, ViT-L/14 (Model weight)

Model Sources

Uses

Direct Use

The model can be used for zero-shot classification given the species names, and for few-shot classification with a small set of images serving as the support set. We also recommend using BioCLIP 2 as a visual encoder for other biological vision tasks.
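As a minimal, self-contained sketch of the few-shot setting (nearest-centroid classification over image embeddings), the snippet below averages the support embeddings per class and assigns a query image to the closest centroid. The file names and class labels are placeholders, and the exact few-shot protocol used in our experiments may differ.

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:imageomics/bioclip-2')
model.eval()

def embed(path):
    # Encode one image and L2-normalize its embedding.
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

# Placeholder support set: a few labeled images per class.
support = {
    "species_a": ["a_1.jpg", "a_2.jpg"],
    "species_b": ["b_1.jpg", "b_2.jpg"],
}

# Average the support embeddings of each class into a centroid.
names = list(support)
centroids = torch.cat([
    torch.cat([embed(p) for p in paths]).mean(0, keepdim=True)
    for paths in support.values()
])

# Assign a query image to the class with the most similar centroid.
query = embed("query.jpg")
prediction = names[(query @ centroids.T).argmax().item()]
print(prediction)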

Bias, Risks, and Limitations

BioCLIP 2 is trained on an imbalanced dataset. Specifically, the TreeOfLife-200M dataset exhibits a long-tailed distribution across taxa. Therefore, the predictions of BioCLIP 2 might be biased toward well-represented species. For more details, see the discussion in the TreeOfLife-200M dataset card.

BioCLIP 2 and TreeOfLife-200M provide great potential to improve and enhance existing conservation efforts, in particular by facilitating recognition of threatened species. Unfortunately, as with many open-source efforts that further conservation goals, there is also potential for bad actors to use these tools for malicious purposes. Although improved recognition of threatened species could make it easier for poachers to identify protected species, such tools are also a force-multiplier for monitoring illicit trade and sales of these same species. The primary risk to endangered species comes from disclosure of precise location information rather than improved classification capability. Our data does not include geo-tagged information about the organisms, minimizing vulnerabilities that could be exploited for poaching.

How to Get Started with the Model

You can use the open_clip library to load BioCLIP 2.

import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/bioclip-2')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/bioclip-2')
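As a minimal illustration, the snippet below (continuing from the code above) performs zero-shot classification over a small set of candidate species names; the image path and species list are placeholders.

import torch
from PIL import Image

model.eval()

# Placeholder candidate labels: replace with your own species names.
species = ["Haliaeetus leucocephalus", "Cardinalis cardinalis", "Cyanocitta cristata"]
text = tokenizer(species)

image = preprocess_val(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities scaled by 100, then softmax over the candidates.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(species, probs[0].tolist())))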

Training Details

Training Data

The model was trained with TreeOfLife-200M. The dataset consists of nearly 214M images covering 952K taxa. The scale of TreeOfLife-200M fosters the emergent properties of BioCLIP 2.

In addition, we used a 26M-sample subset of LAION-2B for experience replay. These samples were downloaded from the first three parquet metadata files of LAION-2B, using the first 800 tar files.

Training Procedure

Preprocessing

Standard CLIP image preprocessing is used during training.
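The exact transform is the preprocess_val returned by open_clip.create_model_and_transforms (see "How to Get Started with the Model"). For reference, a rough torchvision equivalent, assuming the standard CLIP resolution and normalization statistics, looks like:

from torchvision import transforms

# Illustrative approximation of the CLIP preprocessing pipeline; in practice,
# use the transform returned by open_clip.create_model_and_transforms.
clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])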

Training Hyperparameters

  • Training regime: bf16 mixed precision

We used the Adam optimizer with a peak learning rate of 1e-4, with 1,875 warm-up steps followed by cosine decay. The batch size for biological images was 2,816 per GPU, and that for replay data was 320 per GPU. We trained the model on 32 GPUs for 30 epochs with a weight decay of 0.2. Each input image was resized to 224 x 224 resolution.
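A rough sketch of this schedule is below; the peak learning rate and warm-up length come from the text above, while the total step count and the decay-to-zero floor are illustrative assumptions.

import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_steps=1875):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))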

Evaluation

We evaluated the model on both species classification and other biological visual tasks.

Testing Data

For species classification tasks, we tested BioCLIP 2 on the following 10 tasks:

  • NABirds: We used the 555 visual categories, with 48,640 images for testing.
  • Meta-Album: We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album.
  • IDLE-OO Camera Traps: Species identification in camera trap images is a real-world scenario that BioCLIP 2 can be applied to. We collected a class-balanced test set from five LILA-BC camera trap datasets. For more information on this test set, please visit the dataset page.
  • Rare Species: This dataset was introduced in the first BioCLIP paper. It consists of 400 species labeled Near Threatened through Extinct in the Wild by the IUCN Red List, with 30 images per species. Top-1 accuracy is reported for both zero-shot and few-shot experiments.

For biological visual tasks beyond species classification, we used:

  • FishNet: We used the original training set (75,631 images) to train a two-layer classifier on top of the extracted features to predict the feeding path and habitat labels, then tested the classifier on the 18,901 test-set images. Accuracy is reported as the metric, where a prediction counts as correct only if all 9 labels are predicted correctly (a minimal sketch of this probing setup appears after this list).
  • NeWT: We used the 164 binary classification tasks proposed in the dataset. Micro-accuracy is reported across all the samples.
  • AwA2: We used the original train-test split for attribute classification. Macro-F1 score is reported across all the attributes.
  • Herbarium19: This is a new-species discovery task, which we implement as semi-supervised clustering. Clustering accuracy is calculated over predictions on both seen and unseen classes.
  • PlantDoc: This dataset includes 2,598 images covering 13 plant species and up to 17 disease classes. We conducted the experiment in a multi-fold 1-shot learning fashion. Average accuracy over the test samples is reported.
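Below is a minimal sketch of the FishNet probing setup referenced above: a two-layer classifier trained on frozen BioCLIP 2 features, evaluated with the all-labels-correct accuracy criterion. The feature dimension, hidden size, optimizer settings, and random tensors are illustrative placeholders, not the exact configuration used in the paper.

import torch
import torch.nn as nn

# Placeholders: `feats` stands in for frozen BioCLIP 2 image embeddings
# (768-dimensional for ViT-L/14), `labels` for the 9 binary attributes.
feats = torch.randn(1024, 768)
labels = torch.randint(0, 2, (1024, 9)).float()

# Two-layer classifier trained on top of the frozen features.
probe = nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, 9))
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(feats), labels)
    loss.backward()
    optimizer.step()

# A sample counts as correct only if all 9 labels are predicted correctly.
with torch.no_grad():
    preds = (probe(feats).sigmoid() > 0.5).float()
    accuracy = (preds == labels).all(dim=1).float().mean().item()
print(accuracy)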

More details on the evaluation implementation can be found in the paper.

Results

Below we report zero-shot species classification results and results on tasks beyond species classification. For more detailed results, please check the paper.

Zero-shot species classification results (NABirds through Camera Trap cover animals; PlantNet through Med. Leaf cover plants & fungi):

| Model | NABirds | Plankton | Insects | Insects 2 | Camera Trap | PlantNet | Fungi | PlantVillage | Med. Leaf | Rare Species | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (ViT-L/14) | 66.5 | 1.3 | 9.0 | 11.7 | 29.5 | 61.7 | 7.6 | 6.5 | 25.6 | 35.2 | 25.5 |
| SigLIP | 61.7 | 2.4 | 27.3 | 20.7 | 33.7 | 81.8 | 36.9 | 28.5 | 54.5 | 47.6 | 39.5 |
| BioTrove-CLIP | 39.4 | 1.0 | 20.5 | 15.7 | 10.7 | 64.4 | 38.2 | 15.7 | 31.6 | 24.6 | 26.2 |
| BioCLIP | 58.8 | 6.1 | 34.9 | 20.5 | 31.7 | 88.2 | 40.9 | 19.0 | 38.5 | 37.1 | 37.6 |
| BioCLIP 2 | 74.9 | 3.9 | 55.3 | 27.7 | 53.9 | 96.8 | 83.8 | 25.1 | 57.8 | 76.8 | 55.6 |
Results on tasks beyond species classification (FishNet, NeWT, and AwA2 cover animals; Herbarium19 and PlantDoc cover plants):

| Model | FishNet | NeWT | AwA2 | Herbarium19 | PlantDoc | Mean |
|---|---|---|---|---|---|---|
| CLIP (ViT-L/14) | 27.9 | 83.4 | 61.6 | 18.2 | 22.3 | 42.7 |
| SigLIP | 31.9 | 83.2 | 67.3 | 18.6 | 28.2 | 45.8 |
| Supervised-IN21K | 29.4 | 75.8 | 52.7 | 14.9 | 25.1 | 39.6 |
| DINOv2 | 37.4 | 83.7 | 48.6 | 28.1 | 38.6 | 47.3 |
| BioTrove-CLIP | 22.1 | 82.5 | 45.7 | 20.4 | 37.7 | 41.7 |
| BioCLIP | 30.1 | 82.7 | 65.9 | 26.8 | 39.5 | 49.0 |
| BioCLIP 2 | 39.8 | 89.1 | 69.5 | 48.6 | 40.4 | 57.5 |

Summary

BioCLIP 2 surpasses BioCLIP by 18.0% on zero-shot species classification benchmarks. More importantly, although the model is trained to discriminate between species, it also achieves the best performance on tasks beyond species classification. Notably, BioCLIP 2 outperforms DINOv2, which is widely used across diverse visual tasks, by 10.2%.

Model Examination

Please see Section 5.4 of our paper, where we provide a formal analysis of the emergent properties of BioCLIP 2.

Technical Specifications

Compute Infrastructure

The training was performed on 32 NVIDIA H100-80GB GPUs distributed over 4 nodes on Pittsburgh Supercomputing Center's Bridges-2 Cluster. It took 10 days to complete the training of 30 epochs.

Citation

BibTeX:

@software{Gu_BioCLIP_2_model,
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  license = {MIT},
  title = {{BioCLIP 2}},
  url = {https://huggingface.co/imageomics/bioclip-2},
  version = {1.0.0},
  doi = {},
  publisher = {Hugging Face},
  year = {2025}
}

Please also cite our paper:

@article{gu2025bioclip,
  title = {{B}io{CLIP} 2: Emergent Properties from Scaling Hierarchical Contrastive Learning}, 
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  year = {2025},
  eprint={2505.23883},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23883}, 
}

Also consider citing OpenCLIP and BioCLIP:

@software{ilharco_gabriel_2021_5143773,
  author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
  title={OpenCLIP},
  year={2021},
  doi={10.5281/zenodo.5143773},
}

Original BioCLIP Model:

@software{bioclip2023,
  author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  doi = {10.57967/hf/1511},
  month = nov,
  title = {BioCLIP},
  version = {v0.1},
  year = {2023}
}

Original BioCLIP Paper:

@inproceedings{stevens2024bioclip,
  title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life}, 
  author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2024},
  pages = {19412-19424}
}

Acknowledgements

This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Model Card Authors

Jianyang Gu

Model Card Contact

[email protected]
