---
license: apache-2.0
language:
- en
base_model:
- SantiagoSanchezF/BiomedBERT_mgnify_studies
pipeline_tag: text-classification
tags:
- biology
- metagenomics
- biome
- environment
datasets:
- SantiagoSanchezF/trapiche_training_dataset
---

# Model Card
The model takes textual descriptions of metagenomic studies and assigns one or more biome labels (e.g., soil, freshwater, marine) from a predefined list of environmental categories. Essentially, it reads the text, decides which biomes best match the description, and outputs those as predictions.
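As a rough sketch of how predictions might be obtained with the standard Transformers pipeline API (the `MODEL_ID` value is a placeholder; replace it with this repository's identifier, and note that the example description and 0.5 threshold are illustrative choices, not part of this card):

```python
from transformers import pipeline

MODEL_ID = "SantiagoSanchezF/your-model-id"  # placeholder: use this repo's actual ID

# top_k=None returns a score for every label; function_to_apply="sigmoid"
# matches the multi-label head, which scores each biome independently.
classifier = pipeline(
    "text-classification",
    model=MODEL_ID,
    top_k=None,
    function_to_apply="sigmoid",
)

description = (
    "Shotgun metagenomic sequencing of sediment cores collected from an "
    "estuarine mudflat to profile microbial community composition."
)

scores = classifier([description])[0]
# Keep every biome whose independent sigmoid score exceeds a chosen threshold.
predicted_biomes = [s["label"] for s in scores if s["score"] > 0.5]
print(predicted_biomes)
```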
## Model Details

### Model Description
A multi-label classification model that predicts the biome of origin of a metagenomics study. Specifically, we fine-tuned the BERT-based model SantiagoSanchezF/BiomedBERT_mgnify_studies on a dataset of textual study descriptions, each labelled with one or more of 53 biome categories. Because a single study can be associated with multiple biome labels at once, we applied a multi-label approach rather than a standard single-label setup.
The ultimate goal of this model is to facilitate automatic biome classification of metagenomic studies. By providing fast, accurate predictions, it helps researchers and data managers quickly organize new studies into their respective biome categories, streamlining large-scale metagenomics analyses.
- Developed by: SantiagoSanchezF
- Model type: Text classification (multi-label)
- Language(s) (NLP): English
- Finetuned from model: SantiagoSanchezF/BiomedBERT_mgnify_studies
## Training Details

### Training Data
The training data for this model was synthetically generated by prompting a large language model (ChatGPT o1) to produce realistic metagenomic study descriptions for each biome of interest. Distinct project titles and abstracts were created to capture diverse terminology and ecological contexts. Each synthetic record was then assigned an appropriate label reflecting its corresponding biome category. The process, including code and detailed instructions, is publicly available in [Publication].
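The generation loop might look roughly like the sketch below. The prompt wording, biome subset, and record structure are all hypothetical stand-ins (the actual procedure is described in the publication referenced above); the only real API assumed is the OpenAI Python client's chat-completions call.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative subset of the 53 biome classes; not the actual label list.
BIOMES = ["Soil", "Freshwater", "Marine"]

records = []
for biome in BIOMES:
    # Hypothetical prompt wording; the real prompts are given in the publication.
    prompt = (
        f"Write a realistic metagenomic study title and abstract for a study "
        f"sampling the '{biome}' biome. Vary terminology and ecological context."
    )
    response = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Each synthetic record carries the biome label it was generated for.
    records.append({"text": text, "labels": [biome]})
```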
### Training Procedure
A multi-label classification model was trained to predict the biome of origin for metagenomic samples by fine-tuning a BERT-based architecture. Textual descriptions of metagenomic studies were gathered, and each sample was assigned one or more labels drawn from a set of 53 biome classes defined by the GOLD environmental classification ontology. Inputs were tokenized with a maximum sequence length of 256 tokens, producing the token IDs, attention masks, and token type IDs required by the BERT model. Fine-tuning was conducted with the Trainer API in the Hugging Face Transformers library, and the model head was configured for multi-label classification with one logit per label, trained with binary cross-entropy with logits loss (BCEWithLogitsLoss), which applies a sigmoid internally.
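A minimal sketch of this setup, assuming the standard Transformers APIs (the `encode` helper and the `text`/`labels` field names are placeholders for however the dataset is actually structured):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BASE = "SantiagoSanchezF/BiomedBERT_mgnify_studies"
NUM_LABELS = 53

tokenizer = AutoTokenizer.from_pretrained(BASE)

# problem_type="multi_label_classification" makes the model compute
# BCEWithLogitsLoss over the per-label logits (sigmoid applied internally).
model = AutoModelForSequenceClassification.from_pretrained(
    BASE,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",
)

def encode(batch):
    # Produces token IDs, attention masks, and token type IDs,
    # truncated/padded to the 256-token maximum used during fine-tuning.
    enc = tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )
    # BCEWithLogitsLoss expects multi-hot float label vectors.
    enc["labels"] = [[float(v) for v in row] for row in batch["labels"]]
    return enc
```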
Training was executed for 45 epochs with an initial learning rate of 5×10⁻⁵ and a batch size of 8, and optimization was carried out with the AdamW algorithm. Early stopping was enabled with a patience of 12 epochs without improvement in the macro F2 score on the validation set.
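These hyperparameters map onto the Trainer API roughly as follows, continuing from the sketch above. This is an assumed reconstruction, not the exact training script: `train_ds` and `val_ds` are placeholder tokenized dataset splits, the `output_dir` name is arbitrary, and the macro F2 metric is computed here with scikit-learn.

```python
import numpy as np
from sklearn.metrics import fbeta_score
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Sigmoid plus a 0.5 threshold turns logits into multi-hot predictions.
    probs = 1.0 / (1.0 + np.exp(-logits))
    preds = (probs > 0.5).astype(int)
    f2 = fbeta_score(
        labels.astype(int), preds, beta=2, average="macro", zero_division=0
    )
    return {"f2_macro": f2}

args = TrainingArguments(
    output_dir="biome-classifier",   # arbitrary placeholder
    num_train_epochs=45,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    eval_strategy="epoch",           # evaluate each epoch so early stopping can act
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f2_macro",
    greater_is_better=True,
)

trainer = Trainer(
    model=model,                     # from the previous sketch
    args=args,
    train_dataset=train_ds,          # placeholder: tokenized training split
    eval_dataset=val_ds,             # placeholder: tokenized validation split
    compute_metrics=compute_metrics,
    # Stop after 12 evaluations without improvement in f2_macro.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=12)],
)
trainer.train()
```

AdamW is the Trainer's default optimizer, so no explicit optimizer configuration is needed to match the procedure described above.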