Initial commit

Files changed (4)

.gitattributes +35 -0
README.md +147 -0
config.json +31 -0
deeptaxa_april_2025.pt +3 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
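
The `*.pt` rule above is why `deeptaxa_april_2025.pt` appears later in this commit as a three-line Git LFS pointer rather than as binary content. A rough way to check which of the four committed files these rules route to LFS is sketched below. It assumes Python's `fnmatch` is a close enough stand-in for gitattributes matching on simple `*.ext` patterns (it does not implement full gitattributes semantics), and it copies only a handful of the 35 patterns for illustration.

```python
from fnmatch import fnmatch

# A subset of the patterns from .gitattributes (illustrative, not the full list).
LFS_PATTERNS = ["*.bin", "*.ckpt", "*.h5", "*.pt", "*.pth", "*.safetensors", "*.zip"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the file name match any of the LFS patterns?"""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


# The four files listed under "Files changed" in this commit.
committed = [".gitattributes", "README.md", "config.json", "deeptaxa_april_2025.pt"]
tracked = [name for name in committed if is_lfs_tracked(name)]
print(tracked)
```

For an authoritative answer on a real checkout, `git check-attr filter -- <path>` reports the filter Git itself would apply.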

README.md ADDED

	@@ -0,0 +1,147 @@

+---
+language: en
+tags:
+  - bioinformatics
+  - microbiology
+  - microbiome
+  - taxonomy-classification
+  - deep-learning
+  - 16s-rrna
+datasets:
+  - systems-genomics-lab/greengenes
+metrics:
+  - accuracy
+  - precision
+  - recall
+  - f1
+license: mit
+model-index:
+  - name: DeepTaxa Hybrid CNN-BERT (April 2025)
+    results:
+      - task:
+          type: classification
+          name: Hierarchical Taxonomy Classification
+        dataset:
+          type: systems-genomics-lab/greengenes
+          name: Greengenes (2024-09 Validation Split)
+          split: validation
+        metrics:
+          - type: accuracy
+            value: 0.9999258655200534
+            name: Domain Accuracy
+          - type: accuracy
+            value: 0.9992339437072182
+            name: Phylum Accuracy
+          - type: accuracy
+            value: 0.9988879828008006
+            name: Class Accuracy
+          - type: accuracy
+            value: 0.9971581782687128
+            name: Order Accuracy
+          - type: accuracy
+            value: 0.9950824128302074
+            name: Family Accuracy
+          - type: accuracy
+            value: 0.9833444535053253
+            name: Genus Accuracy
+          - type: accuracy
+            value: 0.9528751822472632
+            name: Species Accuracy
+---
+# DeepTaxa: Hybrid CNN-BERT Model (April 2025)
+**DeepTaxa** is a deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences. This repository hosts the pre-trained hybrid CNN-BERT model, combining convolutional neural networks (CNNs) and BERT for high-accuracy predictions across seven taxonomic levels: domain, phylum, class, order, family, genus, and species.
+## Model Details
+- **Architecture**: HybridCNNBERTClassifier (CNN + BERT)
+- **Tokenizer**: `zhihan1996/DNABERT-2-117M`
+- **Training Data**: Greengenes dataset (2024-09 split)
+- **Levels Predicted**: 7 (Domain: 2 labels, Phylum: 106, Class: 244, Order: 630, Family: 1353, Genus: 4798, Species: 10547)
+- **Total Parameters**: 72,635,154
+- **Max Sequence Length**: 512
+- **Dropout Probability**: 0.2
+- **License**: MIT
+- **Version**: April 2025
+- **File**: `deeptaxa_april_2025.pt`
+## Usage
+### Download the Model
+To get started, download the pre-trained model file `deeptaxa_april_2025.pt` from this repository:
+- **Manual Download**: Visit [https://huggingface.co/systems-genomics-lab/deeptaxa](https://huggingface.co/systems-genomics-lab/deeptaxa), click on the "Files and versions" tab, and download `deeptaxa_april_2025.pt` (871 MB).
+- **Command Line (wget)**:
+  ```bash
+  wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa_april_2025.pt
+  ```
+- **Command Line (git clone)**:
+  ```bash
+  git lfs install  # needed so the clone fetches the 871 MB checkpoint, not just its LFS pointer
+  git clone https://huggingface.co/systems-genomics-lab/deeptaxa
+  cd deeptaxa
+  # The model file is now in the current directory
+  ```
+### Run Predictions
+Once downloaded, use the model with the DeepTaxa CLI:
+```bash
+python -m deeptaxa.cli predict \
+  --fasta-file /path/to/sequences.fna.gz \
+  --checkpoint deeptaxa_april_2025.pt
+```
+Full instructions are available on the [GitHub repository](https://github.com/systems-genomics-lab/deeptaxa).
+## Training Details
+- **Dataset**: 161,866 training sequences, 40,467 validation sequences from [Greengenes](https://huggingface.co/datasets/systems-genomics-lab/greengenes) (`gg_2024_09_training.fna.gz`, `gg_2024_09_training.tsv.gz`)
+- **Hyperparameters**:
+  - Learning Rate: 0.0001
+  - Batch Size: 16
+  - Epochs: 10
+  - Optimizer: AdamW (lr=0.0001, betas=[0.9, 0.999], weight_decay=0.01)
+  - Focal Loss Gamma: 2.0
+  - Level Weights: [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
+- **Training Time**: ~21 minutes (1,254 seconds) on an NVIDIA A40 GPU
+- **Timestamp**: Trained on 2025-04-04
+## Performance
+Validation metrics (on 40,467 sequences):
+| Level    | Accuracy | Precision | Recall | F1-Score |
+|----------|----------|-----------|--------|----------|
+| Domain   | 99.99%   | 99.99%    | 99.99% | 99.99%   |
+| Phylum   | 99.92%   | 99.92%    | 99.92% | 99.92%   |
+| Class    | 99.89%   | 99.85%    | 99.89% | 99.87%   |
+| Order    | 99.72%   | 99.64%    | 99.72% | 99.67%   |
+| Family   | 99.51%   | 99.32%    | 99.51% | 99.40%   |
+| Genus    | 98.33%   | 97.89%    | 98.33% | 98.01%   |
+| Species  | 95.29%   | 94.34%    | 95.29% | 94.56%   |
+- **Training Loss**: 0.283
+- **Validation Loss**: 0.606
+## Intended Use
+- Taxonomy classification in microbiome research and microbial ecology.
+## Limitations
+- A GPU is recommended (the model was trained on an NVIDIA A40).
+- Species-level results are weaker than at higher ranks (95.29% accuracy, 94.34% precision), likely because of the large label space (10,547 classes).
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@software{DeepTaxa,
+  author = {{Systems Genomics Lab}},
+  title = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
+  year = {2025},
+  publisher = {GitHub},
+  url = {https://github.com/systems-genomics-lab/deeptaxa},
+}
+```
+## Contact
+Open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues) for support.
+## Acknowledgements
+- **[Dr. Olaitan I. Awe](https://github.com/laitanawe)** and the Omics Codeathon team for their mentorship and contributions.
+- **[Hugging Face](https://huggingface.co/)** for providing a platform to host datasets and models.
+- **The High-Performance Computing Team of [the School of Sciences and Engineering (SSE)](https://sse.aucegypt.edu/) at [the American University in Cairo (AUC)](https://www.aucegypt.edu/)** for their support and for granting access to GPU resources that enabled this work.

config.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "version": "deeptaxa.v1.0.0",
+  "model_type": "hybridcnnbert",
+  "tokenizer_name": "zhihan1996/DNABERT-2-117M",
+  "max_length": 512,
+  "dropout_prob": 0.2,
+  "total_parameters": 72635154,
+  "taxonomic_levels": {
+    "domain": 2,
+    "phylum": 106,
+    "class": 244,
+    "order": 630,
+    "family": 1353,
+    "genus": 4798,
+    "species": 10547
+  },
+  "training_hyperparameters": {
+    "learning_rate": 0.0001,
+    "batch_size": 16,
+    "epochs": 10,
+    "focal_gamma": 2.0,
+    "level_weights": [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0],
+    "optimizer": "AdamW",
+    "optimizer_params": {
+      "lr": 0.0001,
+      "betas": [0.9, 0.999],
+      "weight_decay": 0.01
+    }
+  },
+  "training_date": "2025-04-04"
+}
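
The `focal_gamma` and `level_weights` fields suggest a per-level focal loss combined across the seven taxonomic levels. The commit does not show how DeepTaxa actually computes or combines these losses, so the sketch below is only an illustration: it uses the standard focal loss formulation, `-(1 - p)^gamma * log(p)`, and a plain weighted sum, and it assumes the entries of `level_weights` follow the order of `taxonomic_levels` (domain through species). The function names and the example probabilities are hypothetical.

```python
import json
import math

# Values copied from config.json; only the fields used below are included.
CONFIG = json.loads("""
{
  "taxonomic_levels": {"domain": 2, "phylum": 106, "class": 244, "order": 630,
                       "family": 1353, "genus": 4798, "species": 10547},
  "training_hyperparameters": {
    "focal_gamma": 2.0,
    "level_weights": [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]
  }
}
""")

levels = list(CONFIG["taxonomic_levels"])
hp = CONFIG["training_hyperparameters"]
gamma = hp["focal_gamma"]
# Assumption: weights are listed in the same order as the levels.
weights = dict(zip(levels, hp["level_weights"]))


def focal_loss(p_true: float, gamma: float) -> float:
    """Standard focal loss for one example, given the probability of the true label."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def weighted_total(p_true_by_level: dict) -> float:
    """Weighted sum of per-level focal losses (hypothetical combination rule)."""
    return sum(weights[lvl] * focal_loss(p, gamma) for lvl, p in p_true_by_level.items())


# Hypothetical probabilities assigned to the correct label at each level.
example = {lvl: 0.9 for lvl in levels}
example["species"] = 0.5

total_outputs = sum(CONFIG["taxonomic_levels"].values())
print(total_outputs, round(weighted_total(example), 4))
```

With `gamma = 0` the focal term reduces to ordinary cross-entropy; larger values shrink the contribution of examples the model already classifies confidently. Under the ordering assumption, the species level carries five times the weight of the domain level.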

deeptaxa_april_2025.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8529452ac7a964b5be1d2dfbc775b59900701fdfa418ae632533815b059d83e5
+size 871288250
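
The pointer above records the SHA-256 digest and byte size of the real checkpoint, which makes it possible to check a download for truncation or corruption. The following is a minimal sketch of such a check; the helper names and the local path are assumptions made for illustration and are not part of DeepTaxa.

```python
import hashlib
import os

# Pointer contents copied from this commit (diff "+" prefixes removed).
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8529452ac7a964b5be1d2dfbc775b59900701fdfa418ae632533815b059d83e5
size 871288250
"""


def parse_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and hash agree with the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so the 871 MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_pointer(POINTER)
local_path = "deeptaxa_april_2025.pt"  # assumed download location
if os.path.exists(local_path):
    print("checkpoint OK" if matches_pointer(local_path, pointer) else "checkpoint mismatch")
```

A file of only about 130 bytes at that path usually means the clone was made without Git LFS and the pointer itself was checked out in place of the model.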