---
license: mit
language:
- en
base_model:
- LongSafari/hyenadna-large-1m-seqlen-hf
- zhihan1996/DNABERT-2-117M
- InstaDeepAI/nucleotide-transformer-v2-50m-multi-species
pipeline_tag: text-classification
tags:
- metagenomics
- taxonomic-classification
- antimicrobial-resistance
- pathogen-detection
---

# Genomic Language Models for Metagenomic Sequence Analysis

We provide genomic language models fine-tuned for the following tasks:

- **Taxonomic hierarchical classification**
- **Antimicrobial resistance gene identification**
- **Pathogenicity detection**

See the [code repository](https://github.com/jhuapl-bio/microbert) for details on fine-tuning, evaluation, and implementation.

These are the official models implemented in [Evaluating the Effectiveness of Parameter-Efficient Fine-Tuning in Genomic Classification Tasks](https://www.biorxiv.org/content/10.1101/2025.08.21.671544v1).

---

## Pretrained Foundation Models

Our models are built upon several pretrained genomic foundation models:

### Nucleotide Transformer (NT)
- [InstaDeepAI/nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species)
- [InstaDeepAI/nucleotide-transformer-v2-100m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-100m-multi-species)
- [InstaDeepAI/nucleotide-transformer-v2-250m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-250m-multi-species)

### DNABERT
- [zhihan1996/DNABERT-2-117M](https://huggingface.co/zhihan1996/DNABERT-2-117M)
- [zhihan1996/DNABERT-S](https://huggingface.co/zhihan1996/DNABERT-S)

### HyenaDNA
- [LongSafari/hyenadna-large-1m-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-large-1m-seqlen-hf)
- [LongSafari/hyenadna-medium-450k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-450k-seqlen-hf)
- [LongSafari/hyenadna-medium-160k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-medium-160k-seqlen-hf)
- [LongSafari/hyenadna-small-32k-seqlen-hf](https://huggingface.co/LongSafari/hyenadna-small-32k-seqlen-hf)

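All of the checkpoints listed above can be pulled directly from the Hugging Face Hub with the standard `transformers` auto classes. As a minimal, hedged sketch (the exact auto class and output structure vary by model family, and most of these checkpoints ship custom modeling code, hence `trust_remote_code=True`), loading the 50M Nucleotide Transformer looks roughly like this:

```python
# Minimal sketch: load a pretrained backbone and its tokenizer from the Hugging Face Hub.
# The checkpoint name is one of the base models listed above; the others load the same way.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
backbone = AutoModelForMaskedLM.from_pretrained(checkpoint, trust_remote_code=True)

# Tokenize a short DNA sequence and pull per-token embeddings from the final layer
inputs = tokenizer("ATGCGTACGTTAGCTAGCTA", return_tensors="pt")
with torch.no_grad():
    outputs = backbone(**inputs, output_hidden_states=True)
embeddings = outputs.hidden_states[-1]
print(embeddings.shape)  # (batch, tokens, hidden_dim)
```
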
We sincerely thank the teams behind NT, DNABERT, and HyenaDNA for making their tokenizers and pretrained models available for use :)

---

## Available Fine-Tuned Models

We provide the following fine-tuned models:

- `taxonomy_nucleotide-transformer-v2-50m-multi-species`
- `taxonomy_DNABERT-2-117M`
- `taxonomy_hyenadna-large-1m-seqlen-hf`
- `amr_nucleotide-transformer-v2-50m-multi-species`
- `amr_DNABERT-2-117M`
- `amr_hyenadna-large-1m-seqlen-hf`
- `pathogenicity_nucleotide-transformer-v2-50m-multi-species`
- `pathogenicity_DNABERT-2-117M`
- `pathogenicity_hyenadna-large-1m-seqlen-hf`

To use these models, download the corresponding model directories from this repository (a download sketch is shown below).
You must also follow the installation instructions in the [code repository](https://github.com/jhuapl-bio/microbert).
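
One way to fetch a model directory, shown here as a hedged sketch, is `huggingface_hub.snapshot_download`, which mirrors files from a Hub repository into a local folder. The repository ID and folder pattern below are placeholders; substitute this repository's actual ID and the directory of the model you want from the list above:

```python
# Hedged sketch: download one fine-tuned model directory from the Hugging Face Hub.
# "your-org/your-repo" and the folder pattern are placeholders, not the actual names.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-org/your-repo",  # replace with this model repository's ID
    allow_patterns=["taxonomy_hyenadna-large-1m-seqlen-hf/*"],  # hypothetical folder name
    local_dir="data",  # files are mirrored under ./data/
)
print(local_path)
```
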
There are two setup modes: from source code and from Docker.
Assuming you have completed the source-code setup and downloaded the model directories, here is sample code to run inference:

```python
import json
from pathlib import Path

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from safetensors.torch import load_file

from analysis.experiment.utils.data_processor import DataProcessor
from analysis.experiment.models.hierarchical_model import (
    HierarchicalClassificationModel,
)

# Replace with the base directory containing the data processor, base model
# tokenizer, and trained model weight files
model_dir = Path("data/LongSafari__hyenadna-large-1m-seqlen-hf")
data_processor_dir = model_dir / "data_processor"  # directory containing the data processor
metadata_path = data_processor_dir / "metadata.json"
base_model_dir = model_dir / "base_model"  # directory containing the base model files
trained_model_dir = model_dir / "model"  # directory containing the fine-tuned model files
trained_model_path = trained_model_dir / "model.safetensors"

# Load metadata (sequence column name and label columns)
with open(metadata_path, "r") as f:
    metadata = json.load(f)

sequence_column = metadata["sequence_column"]
labels = metadata["labels"]
data_processor_filename = "data_processor.pkl"

# Load the data processor (label encoders and class weights)
data_processor = DataProcessor(
    sequence_column=sequence_column,
    labels=labels,
    save_file=data_processor_filename,
)
data_processor.load_processor(data_processor_dir)

# Get metadata-driven values
num_labels = data_processor.num_labels
class_weights = data_processor.class_weights

# Load tokenizer from the local base model directory
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=base_model_dir.as_posix(),
    trust_remote_code=True,
    local_files_only=True,
)

# Load fine-tuned model weights and switch to evaluation mode
model = HierarchicalClassificationModel(base_model_dir.as_posix(), num_labels, class_weights)
state_dict = load_file(trained_model_path)
model.load_state_dict(state_dict, strict=False)
model.eval()

# Run inference on a single sequence
sequence = "ATCG"
tokenized_input = tokenizer(
    sequence,
    return_tensors="pt",  # return results as PyTorch tensors
)
with torch.no_grad():
    outputs = model(**tokenized_input)

# Decode the top prediction for each label (one classification head per label)
for idx, col in enumerate(labels):
    logits = outputs["logits"][idx]  # [num_classes]
    probs = F.softmax(logits, dim=-1).cpu()
    topk = torch.topk(probs, k=1, dim=-1)
    topk_index = topk.indices.numpy().ravel()
    topk_prob = topk.values.numpy().ravel()
    topk_label = data_processor.encoders[col].inverse_transform(topk_index)
    print(f"{col}: {topk_label[0]} (probability {topk_prob[0]:.3f})")
```
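
As a follow-up usage sketch (not part of the official pipeline), the objects created above can be reused to classify many reads, for example from a FASTA file. The tiny parser below is a stand-in for whatever I/O you already use; the loop assumes the `tokenizer`, `model`, `labels`, and `data_processor` from the snippet above, and `reads.fasta` is a hypothetical input file:

```python
# Hedged usage sketch: loop the fine-tuned model over reads from a FASTA file.
# Assumes tokenizer, model, labels, and data_processor from the snippet above.
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file (minimal stand-in parser)."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

for header, seq in read_fasta("reads.fasta"):  # replace with your input file
    tokenized = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**tokenized)
    predictions = {}
    for idx, col in enumerate(labels):
        probs = F.softmax(outputs["logits"][idx], dim=-1).cpu()
        top = torch.topk(probs, k=1, dim=-1)
        predictions[col] = data_processor.encoders[col].inverse_transform(
            top.indices.numpy().ravel()
        )[0]
    print(header, predictions)
```
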
---

## Authors & Contact

- Daniel Berman — [email protected]
- Daniel Jimenez — [email protected]
- Stanley Ta — [email protected]
- Brian Merritt — [email protected]
- Jeremy Ratcliff — [email protected]
- Vijay Narayan — [email protected]
- Molly Gallaghar — [email protected]

---

## Acknowledgement

This work was supported by funding from the **U.S. Centers for Disease Control and Prevention** through the **Office of Readiness and Response** under **Contract # 75D30124C20202**.