---
license:
- mit
language:
- en
library_name: open_clip
tags:
- biology
- CV
- images
- imageomics
- clip
- species-classification
- biological visual task
- multimodal
- animals
- species
- taxonomy
- rare species
- endangered species
- evolutionary biology
- knowledge-guided
- zero-shot-image-classification
datasets:
- imageomics/TreeOfLife-200M
- GBIF
- bioscan-ml/BIOSCAN-5M
- EOL
- FathomNet
---

<!--
Image with caption (jpg or png):
|![Figure #](https://huggingface.co/imageomics/<model-repo>/resolve/main/<filepath>)|
|:--|
|**Figure #.** [Image of <>](https://huggingface.co/imageomics/<model-repo>/raw/main/<filepath>) <caption description>.|
-->

<!--
Notes on styling:

To render LaTex in your README, wrap the code in `\\(` and `\\)`. Example: \\(\frac{1}{2}\\)

Escape underscores ("_") with a "\". Example: image\_RGB
-->

# Model Card for BioCLIP 2

BioCLIP 2 is a foundation model for organismal biology images. It is trained on [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M), starting from a [CLIP](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K) model (ViT-L/14) pre-trained on LAION-2B.
BioCLIP 2 yields state-of-the-art performance in recognizing various species. More importantly, it demonstrates emergent properties beyond species classification after extensive hierarchical contrastive training.

## Model Details

### Model Description

Foundation models trained at scale exhibit emergent properties beyond their initial training objectives.
BioCLIP 2 demonstrates such emergence beyond species classification by scaling up the hierarchical contrastive training proposed by [BioCLIP](https://imageomics.github.io/bioclip/).
The model is trained on [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M) (the largest and most diverse available dataset of biology images).
We evaluate BioCLIP 2 on a diverse set of biological tasks. Through training at scale, BioCLIP 2 improves species classification by 18.1% over BioCLIP. More importantly, we demonstrate that BioCLIP 2 generalizes to diverse biological questions beyond species classification solely through species-level supervision. Further analysis reveals that BioCLIP 2 acquires two emergent properties through scaling up hierarchical contrastive learning: inter-species ecological alignment and intra-species variation separation.

- **Developed by:** Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila M Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, and Yu Su
- **Model type:** The model uses a ViT-L/14 Transformer as the image encoder and a masked self-attention Transformer as the text encoder.
- **License:** MIT
- **Fine-tuned from model:** CLIP pre-trained on LAION-2B, ViT-L/14 ([Model weight](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K))

### Model Sources

- **Homepage:** https://imageomics.github.io/bioclip-2/
- **Repository:** [BioCLIP 2](https://github.com/Imageomics/bioclip-2)
- **Paper:** [BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning](https://doi.org/10.48550/arXiv.2505.23883)
- **Demo:** Coming soon

## Uses

### Direct Use

The model can be used for zero-shot classification when provided with candidate species names.
It can also be used for few-shot classification, with a small number of labeled images serving as the support set; a minimal sketch is given below.
We also recommend using BioCLIP 2 as a visual encoder for other biological visual tasks.
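
For the few-shot case, the sketch below illustrates one simple recipe: build a class prototype from the mean BioCLIP 2 embedding of each species' support images and assign a query image to the nearest prototype by cosine similarity. The species names and image paths are placeholders, and the exact few-shot protocol used in our evaluation is described in the paper.

```
import torch
import open_clip
from PIL import Image

# Load BioCLIP 2 and its validation preprocessing (see also the section below).
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:imageomics/bioclip-2')
model.eval()

def embed(paths):
    # Preprocess and embed a list of image paths, returning L2-normalized features.
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

# Support set: a few labeled images per species (placeholder names and paths).
support_set = {
    "Cardinalis cardinalis": ["cardinal_1.jpg", "cardinal_2.jpg"],
    "Cyanocitta cristata": ["bluejay_1.jpg", "bluejay_2.jpg"],
}
names = list(support_set)
prototypes = torch.stack([embed(paths).mean(dim=0) for paths in support_set.values()])
prototypes = prototypes / prototypes.norm(dim=-1, keepdim=True)

# Classify a query image by its nearest prototype (cosine similarity).
query = embed(["unknown_bird.jpg"])
scores = query @ prototypes.T
print(names[scores.argmax(dim=-1).item()])
```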

## Bias, Risks, and Limitations

BioCLIP 2 is trained on an imbalanced dataset. Specifically, the TreeOfLife-200M dataset exhibits a long-tailed distribution across taxa.
Therefore, the predictions of BioCLIP 2 might be biased toward well-represented species. For more details, see the [discussion in the TreeOfLife-200M dataset card](https://huggingface.co/datasets/imageomics/TreeOfLife-200M#considerations-for-using-the-data).

BioCLIP 2 and TreeOfLife-200M hold great potential to enhance existing conservation efforts, in particular by facilitating recognition of threatened species.
Unfortunately, as with many open-source efforts to further conservation goals, there is also potential for bad actors to use these tools for malicious purposes. Although improved recognition of threatened species could make it easier for poachers to identify protected species, these tools are also a force multiplier for monitoring illicit trade and sales of those same species. The primary risk to endangered species comes from disclosure of precise location information rather than improved classification capability. Our data does not include geo-tagged information for the organisms pictured, minimizing vulnerabilities that could be exploited for poaching.

<!-- 
### Recommendations

This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. 

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-->

## How to Get Started with the Model

You can use the `open_clip` library to load BioCLIP 2.

```
import open_clip

# Load the BioCLIP 2 checkpoint with its train/val preprocessing transforms and tokenizer.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:imageomics/bioclip-2')
tokenizer = open_clip.get_tokenizer('hf-hub:imageomics/bioclip-2')
```
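
Continuing from the snippet above, here is a minimal zero-shot classification example; the image path and candidate taxa are placeholders.

```
import torch
from PIL import Image

# Placeholder image path and candidate species names -- replace with your own.
image = preprocess_val(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
labels = ["Danaus plexippus", "Papilio machaon", "Vanessa atalanta"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Print the probability assigned to each candidate name.
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The exact prompt format (e.g., plain scientific names versus full taxonomic strings or templated prompts) can affect accuracy; see the paper and repository for the prompts used in our evaluation.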

## Training Details

### Training Data

The model was trained with [TreeOfLife-200M](https://huggingface.co/datasets/imageomics/TreeOfLife-200M).
The dataset consists of nearly 214M images covering 952K taxa. 
The scale of TreeOfLife-200M fosters the emergent properties of BioCLIP 2.

In addition, we used a 26M-sample subset of LAION-2B for experience replay.
These samples were downloaded according to the first three parquet metadata files of LAION-2B, and the first 4,000 resulting tar files were used.

### Training Procedure 

#### Preprocessing

Standard CLIP image preprocessing was used during training.

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

We used an Adam optimizer with a peak learning rate of 1e-4, warmed up over 1,875 steps and then decayed with a cosine schedule.
The batch size of biological images was 2,816 per GPU, and that of replay data was 320 per GPU.
We trained the model on 32 GPUs for 30 epochs, with a weight decay of 0.2.
Each input image was resized to 224 x 224 resolution.
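
For illustration only, this schedule (peak learning rate 1e-4, 1,875 warmup steps, cosine decay, weight decay 0.2) can be sketched in PyTorch as follows. This is not the exact open_clip training code; the total step count and the optimizer variant (AdamW shown here) are assumptions.

```
import math
import torch

# Values from this card: peak LR 1e-4, 1,875 warmup steps, cosine decay, weight decay 0.2.
# TOTAL_STEPS is a placeholder; in practice it is derived from the dataset size,
# per-GPU batch size, GPU count, and the 30-epoch budget.
PEAK_LR, WARMUP_STEPS, TOTAL_STEPS, WEIGHT_DECAY = 1e-4, 1_875, 100_000, 0.2

model = torch.nn.Linear(768, 768)  # stand-in for the actual CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then cosine decay to zero.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    # ... forward pass on a mixed batch of TreeOfLife-200M and LAION replay data,
    # hierarchical contrastive loss, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```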

## Evaluation

We evaluated the model on both species classification and other biological visual tasks.

### Testing Data

For species classification tasks, we tested BioCLIP 2 on the following 10 tasks:
* [NABirds](https://dl.allaboutbirds.org/nabirds): We used the 555 visual categories (48,640 images) for testing.
* [Meta-Album](https://meta-album.github.io/): We used the Plankton, Insects, Insects2, PlantNet, Fungi, PlantVillage, and Medicinal Leaf datasets from Meta-Album. 
* [IDLE-OO Camera Traps](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps): Species identification in camera trap images is a real-world scenario that BioCLIP 2 can be applied to.
We collected a class-balanced test set from five LILA-BC camera trap datasets. For more information on this test set, please visit the [dataset page](https://huggingface.co/datasets/imageomics/IDLE-OO-Camera-Traps).
* [Rare Species](https://huggingface.co/datasets/imageomics/rare-species): This dataset was introduced in the first BioCLIP paper.
It consists of 400 species labeled Near Threatened through Extinct in the Wild by the [IUCN Red List](https://www.iucnredlist.org/), with 30 images per species.
Top-1 accuracy is reported for both zero-shot and few-shot experiments.

For biological visual tasks beyond species classification, we used:
* [FishNet](https://fishnet-2023.github.io/): We used the original training set (75,631 images) to train a two-layer linear classifier on top of the extracted features to predict the feeding path and habitat labels.
Then we tested the classifier on 18,901 images from the test set. Accuracy is reported, where a prediction counts as correct only if all 9 labels are predicted correctly (see the sketch at the end of this subsection).
* [NeWT](https://github.com/visipedia/newt): We used the 164 binary classification tasks proposed in the dataset. Micro-accuracy is reported across all the samples.
* [AwA2](https://cvml.ista.ac.at/AwA2/): We used the original train-test split for attribute classification. Macro-F1 score is reported across all the attributes.
* [Herbarium19](https://www.kaggle.com/c/herbarium-2019-fgvc6): This task aims at discovering new species; we implemented it as semi-supervised clustering. Clustering accuracy is calculated for predictions on both seen and unseen classes.
* [PlantDoc](https://github.com/pratikkayal/PlantDoc-Dataset): 2,598 images of 13 plant species and up to 17 classes of diseases are included in this dataset. We conducted the experiment in a multi-fold 1-shot learning fashion. Average accuracy over the test samples is reported.

More details on the evaluation implementation can be found in the [paper](https://doi.org/10.48550/arXiv.2505.23883).
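
To illustrate the linear-probing recipe used for FishNet (and, with a different head and loss, for tasks like AwA2), the sketch below trains a small classifier on frozen BioCLIP 2 features. The hidden width, loss, learning rate, and data handling are assumptions, not the exact evaluation configuration.

```
import torch
import torch.nn as nn
import open_clip

# Load BioCLIP 2 as a frozen visual encoder.
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:imageomics/bioclip-2')
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

# Two-layer head on top of the frozen image features (768-d for this ViT-L/14 checkpoint).
# NUM_LABELS, the hidden width, the loss, and the learning rate are placeholders.
NUM_LABELS, HIDDEN = 9, 512
head = nn.Sequential(nn.Linear(768, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, NUM_LABELS))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # multi-label targets, e.g. FishNet-style binary labels

def train_step(images, targets):
    # images: preprocessed batch (B, 3, 224, 224); targets: (B, NUM_LABELS) in {0, 1}.
    with torch.no_grad():
        feats = model.encode_image(images)  # frozen features; no gradient to the backbone
    loss = criterion(head(feats), targets.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```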

### Results
We show the zero-shot classification and non-species classification task results here. For more detailed results, please check the [paper](https://doi.org/10.48550/arXiv.2505.23883).
<table cellpadding="0" cellspacing="0">
      <thead>
        <tr>
          <th rowspan="2">Model</th>
          <th colspan="5">Animals</th>
          <th colspan="4">Plants & Fungi</th>
          <th rowspan="2">Rare Species</th>
          <th rowspan="2">Mean</th>
        </tr>
        <tr>
          <th>NABirds</th>
          <th>Plankton</th>
          <th>Insects</th>
          <th>Insects 2</th>
          <th>Camera Trap</th>
          <th>PlantNet</th>
          <th>Fungi</th>
          <th>PlantVillage</th>
          <th>Med. Leaf</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>CLIP (ViT-L/14)</td>
          <td>66.5</td>
          <td>1.3</td>
          <td>9.0</td>
          <td>11.7</td>
          <td>29.5</td>
          <td>61.7</td>
          <td>7.6</td>
          <td>6.5</td>
          <td>25.6</td>
          <td>35.2</td>
          <td>25.5</td>
        </tr>
        <tr>
          <td>SigLIP</td>
          <td>61.7</td>
          <td>2.4</td>
          <td>27.3</td>
          <td>20.7</td>
          <td>33.7</td>
          <td>81.8</td>
          <td>36.9</td>
          <td><b>28.5</b></td>
          <td>54.5</td>
          <td>47.6</td>
          <td>39.5</td>
        </tr>
        <tr>
          <td>BioTrove-CLIP</td>
          <td>39.4</td>
          <td>1.0</td>
          <td>20.5</td>
          <td>15.7</td>
          <td>10.7</td>
          <td>64.4</td>
          <td>38.2</td>
          <td>15.7</td>
          <td>31.6</td>
          <td>24.6</td>
          <td>26.2</td>
        </tr>
        <tr>
          <td>BioCLIP</td>
          <td>58.8</td>
          <td><b>6.1</b></td>
          <td>34.9</td>
          <td>20.5</td>
          <td>31.7</td>
          <td>88.2</td>
          <td>40.9</td>
          <td>19.0</td>
          <td>38.5</td>
          <td>37.1</td>
          <td>37.6</td>
        </tr>
        <tr>
          <td>BioCLIP 2</td>
          <td><b>74.9</b></td>
          <td>3.9</td>
          <td><b>55.3</b></td>
          <td><b>27.7</b></td>
          <td><b>53.9</b></td>
          <td><b>96.8</b></td>
          <td><b>83.8</b></td>
          <td>25.1</td>
          <td><b>57.8</b></td>
          <td><b>76.8</b></td>
          <td><b>55.6</b></td>
        </tr>
      </tbody>
    </table>

<table cellpadding="0" cellspacing="0">
      <thead>
        <tr>
          <th rowspan="2">Model</th>
          <th colspan="3">Animals</th>
          <th colspan="2">Plants</th>
          <th rowspan="2">Mean</th>
        </tr>
        <tr>
          <th>FishNet</th>
          <th>NeWT</th>
          <th>AwA2</th>
          <th>Herbarium19</th>
          <th>PlantDoc</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>CLIP (ViT-L/14)</td>
          <td>27.9</td>
          <td>83.4</td>
          <td>61.6</td>
          <td>18.2</td>
          <td>22.3</td>
          <td>42.7</td>
        </tr>
        <tr>
          <td>SigLIP</td>
          <td>31.9</td>
          <td>83.2</td>
          <td>67.3</td>
          <td>18.6</td>
          <td>28.2</td>
          <td>45.8</td>
        </tr>
        <tr>
          <td>Supervised-IN21K</td>
          <td>29.4</td>
          <td>75.8</td>
          <td>52.7</td>
          <td>14.9</td>
          <td>25.1</td>
          <td>39.6</td>
        </tr>
        <tr>
          <td>DINOv2</td>
          <td>37.4</td>
          <td>83.7</td>
          <td>48.6</td>
          <td>28.1</td>
          <td>38.6</td>
          <td>47.3</td>
        </tr>
        <tr>
          <td>BioTrove-CLIP</td>
          <td>22.1</td>
          <td>82.5</td>
          <td>45.7</td>
          <td>20.4</td>
          <td>37.7</td>
          <td>41.7</td>
        </tr>
        <tr>
          <td>BioCLIP</td>
          <td>30.1</td>
          <td>82.7</td>
          <td>65.9</td>
          <td>26.8</td>
          <td>39.5</td>
          <td>49.0</td>
        </tr>
        <tr>
          <td>BioCLIP 2</td>
          <td><b>39.8</b></td>
          <td><b>89.1</b></td>
          <td><b>69.5</b></td>
          <td><b>48.6</b></td>
          <td><b>40.4</b></td>
          <td><b>57.5</b></td>
        </tr>
      </tbody>
    </table>

#### Summary

BioCLIP 2 surpasses BioCLIP by 18.0% on zero-shot species classification benchmarks. 
More importantly, although the model is trained to discriminate different species, it also achieves the best performance on tasks beyond species classification.
Notably, BioCLIP 2 outperforms DINOv2, which is widely used for diverse visual tasks, by 10.2%.

## Model Examination

Please check Section 5.4 of our [paper](https://doi.org/10.48550/arXiv.2505.23883), where we provide a formal analysis of the emergent properties of BioCLIP 2.

## Technical Specifications 

### Compute Infrastructure
The training was performed on 32 NVIDIA H100-80GB GPUs distributed over 4 nodes on [Pittsburgh Supercomputing Center](https://www.psc.edu/)'s Bridges-2 Cluster.
It took 10 days to complete the training of 30 epochs.


## Citation

**BibTeX:**
```
@software{Gu_BioCLIP_2_model,
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  license = {MIT},
  title = {{BioCLIP 2}},
  url = {https://huggingface.co/imageomics/bioclip-2},
  version = {1.0.0},
  doi = {},
  publisher = {Hugging Face},
  year = {2025}
}
```
Please also cite our paper:
```
@article{gu2025bioclip,
  title = {{B}io{CLIP} 2: Emergent Properties from Scaling Hierarchical Contrastive Learning}, 
  author = {Jianyang Gu and Samuel Stevens and Elizabeth G Campolongo and Matthew J Thompson and Net Zhang and Jiaman Wu and Andrei Kopanev and Zheda Mai and Alexander E. White and James Balhoff and Wasila M Dahdul and Daniel Rubenstein and Hilmar Lapp and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  year = {2025},
  eprint={2505.23883},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.23883}, 
}
```

Also consider citing OpenCLIP and BioCLIP:

```
@software{ilharco_gabriel_2021_5143773,
  author={Ilharco, Gabriel and Wortsman, Mitchell and Wightman, Ross and Gordon, Cade and Carlini, Nicholas and Taori, Rohan and Dave, Achal and Shankar, Vaishaal and Namkoong, Hongseok and Miller, John and Hajishirzi, Hannaneh and Farhadi, Ali and Schmidt, Ludwig},
  title={OpenCLIP},
  year={2021},
  doi={10.5281/zenodo.5143773},
}
```
Original BioCLIP Model:
```
@software{bioclip2023,
  author = {Samuel Stevens and Jiaman Wu and Matthew J. Thompson and Elizabeth G. Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M. Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  doi = {10.57967/hf/1511},
  month = nov,
  title = {BioCLIP},
  version = {v0.1},
  year = {2023}
}
```
Original BioCLIP Paper:
```
@inproceedings{stevens2024bioclip,
  title = {{B}io{CLIP}: A Vision Foundation Model for the Tree of Life}, 
  author = {Samuel Stevens and Jiaman Wu and Matthew J Thompson and Elizabeth G Campolongo and Chan Hee Song and David Edward Carlyn and Li Dong and Wasila M Dahdul and Charles Stewart and Tanya Berger-Wolf and Wei-Lun Chao and Yu Su},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2024},
  pages = {19412-19424}
}
```

## Acknowledgements

This work was supported by the [Imageomics Institute](https://imageomics.org), which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under [Award #2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

<!-- ## Glossary  -->

<!-- [optional] If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

<!-- ## More Information  -->

<!-- [optional] Any other relevant information that doesn't fit elsewhere. -->

## Model Card Authors

Jianyang Gu

## Model Card Contact
[[email protected]](mailto:[email protected])