Ahmed Moustafa committed
Commit 40a4b1e · 0 Parent(s)

Initial commit

Files changed (4)
  1. .gitattributes +35 -0
  2. README.md +147 -0
  3. config.json +31 -0
  4. deeptaxa_april_2025.pt +3 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,147 @@
+ ---
+ language: en
+ tags:
+ - bioinformatics
+ - microbiology
+ - microbiome
+ - taxonomy-classification
+ - deep-learning
+ - 16s-rrna
+ datasets:
+ - systems-genomics-lab/greengenes
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ license: mit
+ model-index:
+ - name: DeepTaxa Hybrid CNN-BERT (April 2025)
+   results:
+   - task:
+       type: classification
+       name: Hierarchical Taxonomy Classification
+     dataset:
+       type: systems-genomics-lab/greengenes
+       name: Greengenes (2024-09 Validation Split)
+       split: validation
+     metrics:
+     - type: accuracy
+       value: 0.9999258655200534
+       name: Domain Accuracy
+     - type: accuracy
+       value: 0.9992339437072182
+       name: Phylum Accuracy
+     - type: accuracy
+       value: 0.9988879828008006
+       name: Class Accuracy
+     - type: accuracy
+       value: 0.9971581782687128
+       name: Order Accuracy
+     - type: accuracy
+       value: 0.9950824128302074
+       name: Family Accuracy
+     - type: accuracy
+       value: 0.9833444535053253
+       name: Genus Accuracy
+     - type: accuracy
+       value: 0.9528751822472632
+       name: Species Accuracy
+ ---
+
+ # DeepTaxa: Hybrid CNN-BERT Model (April 2025)
+
+ **DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.
+
+ ## Model Details
+ - **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
+ - **Tokenizer**: `zhihan1996/DNABERT-2-117M`
+ - **Training Data**: Greengenes dataset (2024-09 split)
+ - **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
+ - **Total Parameters**: 72,635,154
+ - **Max Sequence Length**: 512
+ - **Dropout Probability**: 0.2
+ - **License**: MIT
+ - **Version**: April 2025
+ - **File**: `deeptaxa_april_2025.pt`
+
+ ## Usage
+
+ ### Download the Model
+ To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:
+
+ - **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
+ - **Command Line (wget)**:
+   ```bash
+   wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
+   ```
+ - **Command Line (git clone)**:
+   ```bash
+   git clone https://huggingface.co/systems-genomics-lab/deeptaxa
+   cd deeptaxa
+   # The model file is now in the current directory (requires Git LFS; without it only a small pointer file is fetched)
+   ```
+
+ ### Run Predictions
+ Once downloaded, use the model with the DeepTaxa CLI:
+ ```bash
+ python -m deeptaxa.cli predict \
+   --fasta-file /path/to/sequences.fna.gz \
+   --checkpoint deeptaxa_april_2025.pt
+ ```
+
+ Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).
+
+ ## Training Details
+ - **Dataset**: 161,866 training sequences, 40,467 validation sequences from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
+ - **Hyperparameters**:
+   - Learning Rate: 0.0001
+   - Batch Size: 16
+   - Epochs: 10
+   - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
+   - Focal Loss Gamma: 2.0
+   - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
+ - **Training Time**: ~21 minutes (1,254 seconds) on NVIDIA A40 GPU
+ - **Timestamp**: Trained on 2025-04-04
+
+ ## Performance
+ Validation metrics (on 40,467 sequences):
+ | Level    | Accuracy | Precision | Recall | F1-Score |
+ |----------|----------|-----------|--------|----------|
+ | Domain   | 99.99%   | 99.99%    | 99.99% | 99.99%   |
+ | Phylum   | 99.92%   | 99.92%    | 99.92% | 99.92%   |
+ | Class    | 99.89%   | 99.85%    | 99.89% | 99.87%   |
+ | Order    | 99.72%   | 99.64%    | 99.72% | 99.67%   |
+ | Family   | 99.51%   | 99.32%    | 99.51% | 99.40%   |
+ | Genus    | 98.33%   | 97.89%    | 98.33% | 98.01%   |
+ | Species  | 95.29%   | 94.34%    | 95.29% | 94.56%   |
+ - **Training Loss**: 0.283
+ - **Validation Loss**: 0.606
+
+
+ ## Intended Use
+ - Taxonomic classification of 16S rRNA gene sequences in microbiome research and microbial ecology.
+
+ ## Limitations
+ - A GPU is recommended (the model was trained on an NVIDIA A40).
+ - Species-level predictions are the least reliable (95.29% accuracy, 94.34% precision), likely because of the large label space (10,547 classes).
+
+ ## Citation
+ If you use this model in your research, please cite:
+ ```bibtex
+ @software{DeepTaxa,
+   author = {{Systems Genomics Lab}},
+   title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
+   year = {2025},
+   publisher = {GitHub},
+   url = {https://github.com/systems-genomics-lab/deeptaxa},
+ }
+ ```
+
+ ## Contact
+ Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.
+
+ ## Acknowledgements
+ - **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
+ - **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
+ - **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.
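In the README's Performance table, Recall equals Accuracy at every taxonomic level. That is the pattern one would expect if precision, recall, and F1 are support-weighted averages over classes, because support-weighted recall always reduces to plain accuracy. The README does not state the averaging mode, so this is an assumption. A minimal pure-Python sketch on made-up labels (the helper names and the toy data are hypothetical, not DeepTaxa output):

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights proportional to class support."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)


# Toy labels for illustration only.
y_true = ["Bacteria", "Bacteria", "Bacteria", "Archaea", "Archaea"]
y_pred = ["Bacteria", "Bacteria", "Archaea", "Archaea", "Bacteria"]

print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

The two numbers agree up to floating-point rounding, since each class's support cancels and the sum collapses to correct predictions divided by total samples.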
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "version": "deeptaxa.v1.0.0",
+   "model_type": "hybridcnnbert",
+   "tokenizer_name": "zhihan1996/DNABERT-2-117M",
+   "max_length": 512,
+   "dropout_prob": 0.2,
+   "total_parameters": 72635154,
+   "taxonomic_levels": {
+     "domain": 2,
+     "phylum": 106,
+     "class": 244,
+     "order": 630,
+     "family": 1353,
+     "genus": 4798,
+     "species": 10547
+   },
+   "training_hyperparameters": {
+     "learning_rate": 0.0001,
+     "batch_size": 16,
+     "epochs": 10,
+     "focal_gamma": 2.0,
+     "level_weights": [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0],
+     "optimizer": "AdamW",
+     "optimizer_params": {
+       "lr": 0.0001,
+       "betas": [0.9, 0.999],
+       "weight_decay": 0.01
+     }
+   },
+   "training_date": "2025-04-04"
+ }
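The `focal_gamma` and `level_weights` entries suggest a focal loss computed per taxonomic level and then combined across the seven ranks. The sketch below assumes the standard focal loss form, FL(p_t) = -(1 - p_t)^gamma * log(p_t), and a plain weighted sum across levels. Neither is confirmed by this commit, and `focal_loss` and `combined_loss` are hypothetical names rather than DeepTaxa functions.

```python
import math

LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]
LEVEL_WEIGHTS = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]  # from config.json
FOCAL_GAMMA = 2.0  # from config.json


def focal_loss(p_true, gamma=FOCAL_GAMMA):
    """Focal loss for one sample, given the probability assigned to its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def combined_loss(p_true_per_level, weights=LEVEL_WEIGHTS, gamma=FOCAL_GAMMA):
    """Assumed combination: weighted sum of the per-level focal losses."""
    return sum(w * focal_loss(p, gamma) for w, p in zip(weights, p_true_per_level))


# Made-up probabilities of the true label at each level, for illustration only.
probs = [0.99, 0.98, 0.97, 0.95, 0.93, 0.85, 0.60]
for level, w, p in zip(LEVELS, LEVEL_WEIGHTS, probs):
    print(f"{level:8s} weight={w:.1f} focal={focal_loss(p):.5f}")
print(f"total={combined_loss(probs):.5f}")
```

With gamma = 2.0, the factor (1 - p_t)^gamma scales the cross-entropy of a sample predicted at p_t = 0.99 by 10^-4, so confidently correct samples contribute little. Under the weighted-sum assumption, the increasing weights would give the harder, finer-grained levels more influence on the total.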
deeptaxa_april_2025.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8529452ac7a964b5be1d2dfbc775b59900701fdfa418ae632533815b059d83e5
+ size 871288250
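The three added lines are a Git LFS pointer rather than the weights themselves: `oid` is the SHA-256 digest of the real file and `size` is its length in bytes. After downloading `deeptaxa_april_2025.pt`, a check along the following lines could confirm that the file is intact. This is a minimal sketch using only the Python standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names.

```python
import hashlib
import os

# Pointer text copied from this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8529452ac7a964b5be1d2dfbc775b59900701fdfa418ae632533815b059d83e5
size 871288250
"""


def parse_lfs_pointer(text):
    """Return (sha256_hex, size_in_bytes) parsed from Git LFS pointer text."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Check a local file's size and SHA-256 digest against the pointer."""
    digest, size = parse_lfs_pointer(pointer_text)
    if os.path.getsize(path) != size:
        return False  # cheap check first; avoids hashing a truncated download
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == digest


if __name__ == "__main__":
    path = "deeptaxa_april_2025.pt"  # assumed download location
    if os.path.exists(path):
        print("OK" if matches_pointer(path, POINTER) else "MISMATCH")
```

Reading in 1 MiB chunks keeps memory use flat even though the checkpoint is roughly 871 MB.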