Vision Transformer (large-sized model) pre-trained with MAE utilizing multiple plankton datasets

This repository provides a Vision Transformer (ViT) large-sized model pre-trained with a Masked Autoencoder (MAE) on multiple plankton datasets. The model was introduced in the paper Self-Supervised Pretraining for Fine-Grained Plankton Recognition. It is the timm library's vit_large_patch16_224 architecture, pre-trained from scratch. In the paper, this model is referred to as no-daplankton.

Intended uses & limitations

You can use the model for plankton image classification. Note, however, that this model contains only the pre-trained encoder and no classification head, so it must be fine-tuned (or combined with a separate classifier) before it can predict classes; a minimal sketch is shown below.
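One way to prepare the encoder for fine-tuning is to let timm attach a fresh classification head when the model is created. This is a minimal sketch, not the authors' training setup; the class count is a placeholder you must set for your own dataset:

import timm

num_classes = 10  # placeholder: set to the number of plankton classes in your dataset
model = timm.create_model(
    "hf_hub:Jookare/no_daplankton_vit_large_patch16_224.mae",
    pretrained=True,
    num_classes=num_classes,  # timm adds a randomly initialized Linear head on top of the encoder
)
# The new head is untrained: fine-tune on labeled data before using its predictions.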

Usage

The model can be loaded with either the timm library or Hugging Face Transformers. The two snippets below load the encoder together with its matching preprocessing; a short feature-extraction sketch follows.

# With timm
import timm
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

# Load the pre-trained encoder from the Hugging Face Hub
model = timm.create_model("hf_hub:Jookare/no_daplankton_vit_large_patch16_224.mae", pretrained=True)
# Build the preprocessing pipeline that matches the model's pretrained configuration
transform = create_transform(**resolve_data_config(model.pretrained_cfg, model=model))

# With Transformers
from transformers import AutoModel, AutoImageProcessor

# Load the same checkpoint and its image processor through the Transformers API
model = AutoModel.from_pretrained("Jookare/no_daplankton_vit_large_patch16_224.mae")
processor = AutoImageProcessor.from_pretrained("Jookare/no_daplankton_vit_large_patch16_224.mae")
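Once loaded, features can be extracted as follows. This is a minimal sketch that assumes the timm model and transform from the first snippet and a hypothetical image file plankton.png:

import torch
from PIL import Image

# Hypothetical example image; replace with a real plankton image
img = Image.open("plankton.png").convert("RGB")

model.eval()
with torch.no_grad():
    # forward_features returns the encoder tokens without any classifier head;
    # for this ViT-Large model the embedding dimension is 1024
    tokens = model.forward_features(transform(img).unsqueeze(0))
    cls_embedding = tokens[:, 0]  # the class token is a common image-level feature

The class-token embedding (or a pooled version of the patch tokens) can then be fed to a downstream classifier such as a linear probe or k-NN.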

BibTeX entry and citation info

@misc{kareinen2025selfsupervised,
  title={Self-Supervised Pretraining for Fine-Grained Plankton Recognition},
  author={Joona Kareinen and Tuomas Eerola and Kaisa Kraft and Lasse Lensu and Sanna Suikkanen and Heikki Kälviäinen},
  year={2025},
  eprint={2503.11341},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2503.11341},
}