---
license: mit
language:
- en
base_model:
- LongSafari/hyenadna-large-1m-seqlen-hf
- zhihan1996/DNABERT-2-117M
- InstaDeepAI/nucleotide-transformer-v2-50m-multi-species
pipeline_tag: text-classification
tags:
- metagenomics
- taxonomic-classification
- antimicrobial-resistance
- pathogen-detection
---
# Genomic Language Models for Metagenomic Sequence Analysis
We provide genomic language models fine-tuned for the following tasks:
- **Taxonomic hierarchical classification**  
- **Anti-microbial resistance gene identification**  
- **Pathogenicity detection**
See [code](https://github.com/jhuapl-bio/microbert) for details on fine-tuning, evaluation, and implementation.
These are the official models implemented in [Evaluating the Effectiveness of Parameter-Efficient Fine-Tuning in Genomic Classification Tasks](https://www.biorxiv.org/content/10.1101/2025.08.21.671544v1).
---
## Pretrained Foundation Models
Our models are built on the following pretrained genomic foundation models (a minimal loading example appears after the lists):
### Nucleotide Transformer (NT)
- [InstaDeepAI/nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species)  
- [InstaDeepAI/nucleotide-transformer-v2-100m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-100m-multi-species)  
- [InstaDeepAI/nucleotide-transformer-v2-250m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-250m-multi-species)
### DNABERT
- [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)  
- [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)
### HyenaDNA
- [LongSafari/hyenadna-large-1m-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen-hf)  
- [LongSafari/hyenadna-medium-450k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen-hf)  
- [LongSafari/hyenadna-medium-160k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen-hf)  
- [LongSafari/hyenadna-small-32k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen-hf)
We sincerely thank the teams behind NT, DNABERT, and HyenaDNA for making their tokenizers and pre-trained models available for use :) 
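If you want to experiment with one of these base models on its own (independent of our fine-tuned checkpoints), it can be loaded from the Hugging Face Hub with the standard `transformers` auto classes. The snippet below is a minimal, illustrative sketch using the 50M Nucleotide Transformer checkpoint; the exact auto class and tokenizer behavior can differ between model families, and this code is not part of our fine-tuning pipeline.

```python
# Sketch: load a pretrained foundation model directly from the Hugging Face Hub.
# Illustrative only; independent of the fine-tuned checkpoints described below.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

base_id = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(base_id, trust_remote_code=True)
model.eval()

# Tokenize a short DNA sequence and extract the final hidden states as embeddings
inputs = tokenizer("ATGCGTACGTTAGC", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
print(hidden.shape)  # (batch, tokens, hidden_dim)
```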
---
## Available Fine-Tuned Models
We provide the following fine-tuned models, organized by task:
- `taxonomy/DNABERT-2-117M-taxonomy`  
- `taxonomy/hyenadna-large-1m-seqlen-hf-taxonomy`  
- `taxonomy/nucleotide-transformer-v2-50m-multi-species-taxonomy`  
- `amr/binary/hyenadna-small-32k-seqlen-hf`  
- `amr/binary/nucleotide-transformer-v2-100m-multi-species`  
- `amr/multiclass/DNABERT-S`  
- `amr/multiclass/hyenadna-medium-450k-seqlen-hf`  
- `amr/multiclass/nucleotide-transformer-v2-250m-multi-species`  
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-fungal`  
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeePaC-viral`  
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-bacterial`  
- `pathogenicity/hyenadna-small-32k-seqlen-hf-DeepSim-viral`  
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-fungal`  
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeePaC-viral`  
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-bacterial`  
- `pathogenicity/nucleotide-transformer-v2-50m-multi-species-DeepSim-viral`  
To use these models, download the corresponding directories from this repository and follow the installation instructions in our [code repository](https://github.com/jhuapl-bio/microbert).
Two setup modes are supported: installation from source and our pre-built [docker image](https://hub.docker.com/r/jhuaplbio/microbert-classify).
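The model directories can also be fetched programmatically with `huggingface_hub`. The snippet below is a sketch only: `your-org/your-repo` is a placeholder repository ID and the `allow_patterns` prefix is an example model path; substitute the actual repository ID and the model directory you need from the list above.

```python
# Sketch: download one fine-tuned model directory with huggingface_hub.
# "your-org/your-repo" is a placeholder; replace it with the actual repository ID,
# and adjust allow_patterns to the model directory you want.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-org/your-repo",                            # placeholder repository ID
    allow_patterns=["taxonomy/DNABERT-2-117M-taxonomy/*"],   # fetch only one model directory
    local_dir="data",                                        # where to place the files locally
)
print(local_path)
```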
Assuming you have completed the source-code setup and downloaded a model directory, the following sample code runs inference:
```python
import json
from pathlib import Path

import torch
import torch.nn.functional as F
from safetensors.torch import load_file
from transformers import AutoTokenizer

from analysis.experiment.models.hierarchical_model import HierarchicalClassificationModel
from analysis.experiment.utils.data_processor import DataProcessor

# Base directory containing the data processor, base-model tokenizer files, and trained model weights
model_dir = Path("data/LongSafari__hyenadna-large-1m-seqlen-hf")
data_processor_dir = model_dir / "data_processor"  # directory containing your data processor
metadata_path = data_processor_dir / "metadata.json"
base_model_dir = model_dir / "base_model"          # directory containing your base model files
trained_model_dir = model_dir / "model"            # directory containing your trained model files
trained_model_path = trained_model_dir / "model.safetensors"

# Load metadata describing the sequence column and label hierarchy
with open(metadata_path, "r") as f:
    metadata = json.load(f)
sequence_column = metadata["sequence_column"]
labels = metadata["labels"]

# Load the fitted data processor (label encoders, class weights, etc.)
data_processor = DataProcessor(
    sequence_column=sequence_column,
    labels=labels,
    save_file="data_processor.pkl",
)
data_processor.load_processor(data_processor_dir)
num_labels = data_processor.num_labels
class_weights = data_processor.class_weights

# Load the tokenizer from the local base-model directory
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=base_model_dir.as_posix(),
    trust_remote_code=True,
    local_files_only=True,
)

# Build the classification model around the base model and load the fine-tuned weights
model = HierarchicalClassificationModel(base_model_dir.as_posix(), num_labels, class_weights)
state_dict = load_file(trained_model_path)
model.load_state_dict(state_dict, strict=False)
model.eval()  # disable dropout for inference

# Run inference on a single sequence
sequence = "ATCG"
tokenized_input = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**tokenized_input)

# Decode the top prediction at each level of the label hierarchy
for idx, col in enumerate(labels):
    logits = outputs["logits"][idx]  # [num_classes]
    probs = F.softmax(logits, dim=-1).cpu()
    topk = torch.topk(probs, k=1, dim=-1)
    topk_index = topk.indices.numpy().ravel()
    topk_prob = topk.values.ravel()
    topk_label = data_processor.encoders[col].inverse_transform(topk_index)
    print(f"{col}: {topk_label[0]} (probability {topk_prob[0].item():.3f})")
```
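For higher throughput you will typically want to batch sequences and, if available, move the model to a GPU. The sketch below continues from the example above (it reuses `model`, `tokenizer`, `labels`, and `data_processor`) and relies only on standard `transformers`/PyTorch calls; padding and truncation behavior depends on the base model's tokenizer, so treat it as a starting point rather than a drop-in recipe.

```python
# Sketch: batched inference on GPU, continuing from the example above.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

sequences = ["ATCGATCGATCG", "GGGCCCTTTAAA"]  # replace with your reads/contigs
# padding/truncation support depends on the base model's tokenizer
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**batch)

for idx, col in enumerate(labels):
    probs = F.softmax(outputs["logits"][idx], dim=-1)  # [batch_size, num_classes]
    pred_idx = probs.argmax(dim=-1).cpu().numpy()
    pred_labels = data_processor.encoders[col].inverse_transform(pred_idx)
    print(col, list(pred_labels))
```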
---
## Authors & Contact
- Daniel Berman - [email protected]  
- Daniel Jimenez - [email protected]  
- Stanley Ta - [email protected]  
- Brian Merritt - [email protected]  
- Jeremy Ratcliff - [email protected]  
- Vijay Narayan - [email protected]  
- Molly Gallagher - [email protected]
---
## Acknowledgement
This work was supported by funding from the **U.S. Centers for Disease Control and Prevention** through the **Office of Readiness and Response** under **Contract # 75D30124C20202**.
