+---
+license: mit
+language:
+- en
+base_model:
+- LongSafari/hyenadna-large-1m-seqlen-hf
+- zhihan1996/DNABERT-2-117M
+- InstaDeepAI/nucleotide-transformer-v2-50m-multi-species
+pipeline_tag: text-classification
+tags:
+- metagenomics
+- taxonomic-classification
+- antimicrobial-resistance
+- pathogen-detection
+---
+# Genomic Language Models for Metagenomic Sequence Analysis
+We provide genomic language models fine-tuned for the following tasks:
+- **Taxonomic hierarchical classification**
+- **Antimicrobial resistance (AMR) gene identification**
+- **Pathogenicity detection**
+See the [code repository](https://github.com/jhuapl-bio/microbert) for details on fine-tuning, evaluation, and implementation.
+These are the official models for [Evaluating the Effectiveness of Parameter-Efficient Fine-Tuning in Genomic Classification Tasks](https://www.biorxiv.org/content/10.1101/2025.08.21.671544v1).
+---
+## Pretrained Foundation Models
+Our models are built upon several pretrained genomic foundation models:
+### Nucleotide Transformer (NT)
+- [InstaDeepAI/nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species)
+- [InstaDeepAI/nucleotide-transformer-v2-100m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-100m-multi-species)
+- [InstaDeepAI/nucleotide-transformer-v2-250m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-250m-multi-species)
+### DNABERT
+- [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
+- [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)
+### HyenaDNA
+- [LongSafari/hyenadna-large-1m-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen-hf)
+- [LongSafari/hyenadna-medium-450k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen-hf)
+- [LongSafari/hyenadna-medium-160k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen-hf)
+- [LongSafari/hyenadna-small-32k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen-hf)
+We sincerely thank the teams behind NT, DNABERT, and HyenaDNA for making their tokenizers and pre-trained models available for use :)
+---
+## Available Fine-Tuned Models
+The following fine-tuned models are available:
+- `taxonomy_nucleotide-transformer-v2-50m-multi-species`
+- `taxonomy_DNABERT-2-117M`
+- `taxonomy_hyenadna-large-1m-seqlen-hf`
+- `amr_nucleotide-transformer-v2-50m-multi-species`
+- `amr_DNABERT-2-117M`
+- `amr_hyenadna-large-1m-seqlen-hf`
+- `pathogenicity_nucleotide-transformer-v2-50m-multi-species`
+- `pathogenicity_DNABERT-2-117M`
+- `pathogenicity_hyenadna-large-1m-seqlen-hf`
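The directory names above appear to follow a `<task>_<base model>` pattern. As a minimal sketch under that assumption, the hypothetical helper `parse_model_name` below splits a name into its task and base model, then looks up the upstream Hugging Face repository ID from the Pretrained Foundation Models list. Neither the helper nor the `UPSTREAM_REPOS` mapping is part of the microbert code; they are illustrative only.

```python
# Hypothetical helper, not part of the microbert repository.
# Assumes names follow "<task>_<base model>", as in the list above.

KNOWN_TASKS = ("taxonomy", "amr", "pathogenicity")

# Base model name -> upstream Hugging Face repository ID (taken from this model card).
UPSTREAM_REPOS = {
    "nucleotide-transformer-v2-50m-multi-species": "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species",
    "DNABERT-2-117M": "zhihan1996/DNABERT-2-117M",
    "hyenadna-large-1m-seqlen-hf": "LongSafari/hyenadna-large-1m-seqlen-hf",
}


def parse_model_name(name: str) -> tuple[str, str, str]:
    """Split a fine-tuned model name into (task, base model, upstream repo ID)."""
    # Split on the first underscore only; base model names contain hyphens, not underscores.
    task, separator, base_model = name.partition("_")
    if not separator or task not in KNOWN_TASKS or base_model not in UPSTREAM_REPOS:
        raise ValueError(f"Unrecognized fine-tuned model name: {name!r}")
    return task, base_model, UPSTREAM_REPOS[base_model]


print(parse_model_name("amr_DNABERT-2-117M"))
```

A lookup like this can help pick the matching tokenizer and base model files for a given fine-tuned directory, though the actual directory layout is defined by the microbert repository.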
+To use these models, download the corresponding model directories from this repository.
+You must also follow the installation instructions in the [code repository](https://github.com/jhuapl-bio/microbert).
+Two setup modes are available: from source or with Docker.
+Assuming you have completed the setup from source and downloaded the model directories, the following sample code runs inference:
+```python
+import json
+from pathlib import Path
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer
+from safetensors.torch import load_file
+from analysis.experiment.utils.data_processor import DataProcessor
+from analysis.experiment.models.hierarchical_model import (
+    HierarchicalClassificationModel,
+)
+# Replace with base directory containing all data processor, base model tokenizers, and trained model weights files
+model_dir = Path('data/LongSafari__hyenadna-large-1m-seqlen-hf')
+data_processor_dir = model_dir / "data_processor" # replace with directory containing your data processor
+metadata_path = data_processor_dir / "metadata.json"
+base_model_dir = model_dir / "base_model" # replace with directory containing your base model files
+trained_model_dir = model_dir / "model" # replace with directory containing your trained model files
+trained_model_path = trained_model_dir / "model.safetensors"
+# Load metadata
+with open(metadata_path, "r") as f:
+    metadata = json.load(f)
+sequence_column = metadata["sequence_column"]
+labels = metadata["labels"]
+data_processor_filename = 'data_processor.pkl'
+# load data processor
+data_processor = DataProcessor(
+    sequence_column=sequence_column,
+    labels=labels,
+    save_file=data_processor_filename,
+)
+data_processor.load_processor(data_processor_dir)
+# Get metadata-driven values
+num_labels = data_processor.num_labels
+class_weights = data_processor.class_weights
+# Load tokenizer from the local base model directory (local_files_only=True skips the Hugging Face Hub)
+tokenizer = AutoTokenizer.from_pretrained(
+    pretrained_model_name_or_path=base_model_dir.as_posix(),
+    trust_remote_code=True,
+    local_files_only=True,
+)
+# Load fine-tuned model weights
+model = HierarchicalClassificationModel(base_model_dir.as_posix(), num_labels, class_weights)
+state_dict = load_file(trained_model_path)
+model.load_state_dict(state_dict, strict=False)
+model.eval()  # disable dropout and other training-only behavior
+# Example input; replace with your own DNA sequence
+sequence = "ATCG"
+# Run inference
+tokenized_input = tokenizer(
+    sequence,
+    return_tensors="pt",  # return results as PyTorch tensors
+)
+with torch.no_grad():
+    outputs = model(**tokenized_input)
+# Report the most likely class for each label column
+for idx, col in enumerate(labels):
+    logits = outputs['logits'][idx]  # [num_classes]
+    probs = F.softmax(logits, dim=-1).cpu()
+    topk = torch.topk(probs, k=1, dim=-1)
+    topk_index = topk.indices.numpy().ravel()
+    topk_prob = topk.values.numpy().ravel()
+    topk_label = data_processor.encoders[col].inverse_transform(topk_index)
+    print(f"{col}: {topk_label[0]} (p={topk_prob[0]:.3f})")
+```
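The final loop in the sample above applies a softmax to each label's logits and keeps the highest-probability class. The sketch below reproduces that step in plain Python on made-up logits, so it can be run without the model or PyTorch. The function names `softmax` and `top_k`, the logit values, and the class names are all illustrative assumptions and not part of microbert.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, class_names, k=1):
    """Return the k most probable (class name, probability) pairs, best first."""
    if len(logits) != len(class_names):
        raise ValueError("logits and class_names must have the same length")
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for a single label column with three hypothetical classes.
example_logits = [0.2, 2.5, -1.0]
example_classes = ["class_a", "class_b", "class_c"]

best_label, best_prob = top_k(example_logits, example_classes, k=1)[0]
print(f"{best_label}: p={best_prob:.3f}")
```

Raising `k` returns a ranked shortlist instead of a single call, which can be useful when the top probabilities are close.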
+---
+## Authors & Contact
+- Daniel Berman
+- Daniel Jimenez
+- Stanley Ta
+- Brian Merritt
+- Jeremy Ratcliff
+- Vijay Narayan
+- Molly Gallaghar
+---
+## Acknowledgement
+This work was supported by funding from the **U.S. Centers for Disease Control and Prevention** through the **Office of Readiness and Response** under **Contract # 75D30124C20202**.