---
library_name: transformers
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: dna_model
results: []
---
# DNA Language Model (Char-level, Human-only)
This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer with a vocabulary of `A`, `C`, `G`, `T`, plus a special end-of-text token, and is trained to predict the next base over 1024-base sequences.
---
## 🧬 Model Summary
* **Objective**: Next-token prediction over human genomic sequences
* **Tokenization**: Character-level (A, C, G, T)
* **Training data**: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* **Sequence length**: 1024 tokens
* **Final Validation Loss**: 1.0299 nats/token
* **Final Validation Accuracy**: 53.24%
> This model outperforms classical compressors like GeCo on human DNA entropy, achieving \~1.486 bits per base.
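The bits-per-base figure follows directly from the validation loss (cross-entropy reported in nats). A minimal conversion, assuming the loss is the natural-log cross-entropy as reported by the Trainer:

```python
import math

val_loss_nats = 1.0299                      # final validation cross-entropy (nats/token)
bits_per_base = val_loss_nats / math.log(2)  # convert nats -> bits
print(round(bits_per_base, 3))               # ~1.486 bits per base
```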
---
## 🔧 Tokenizer
The tokenizer is a minimal GPT-2-style vocabulary:
```json
{
"<|endoftext|>": 0,
"A": 1,
"C": 2,
"G": 3,
"T": 4
}
```
* Implemented via `GPT2TokenizerFast`
* Merges file is empty (no BPE applied)
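A minimal loading sketch, assuming the tokenizer ships with the model repository (the repo id below is a placeholder, not the actual model path):

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual model path.
tok = AutoTokenizer.from_pretrained("path/to/dna_model")

ids = tok("ACGT")["input_ids"]
print(ids)  # expected [1, 2, 3, 4] given the vocabulary above (no merges applied)
```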
---
## 📊 Dataset Preprocessing
* The original dataset is cleaned to keep only `A`, `C`, `G`, `T`
* Sequences are chunked into segments of length 1024
* Very short chunks (< 200 bp) are discarded
* A 10% validation split is held out from the training set (a sketch of this pipeline follows below)
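
A minimal sketch of the preprocessing described above (illustrative only; the `preprocess` helper mirrors the bullets, not the original script):

```python
import re

def preprocess(sequences, chunk_len=1024, min_len=200):
    """Illustrative cleaning/chunking sketch; not the original preprocessing code."""
    chunks = []
    for seq in sequences:
        # Keep only A, C, G, T (drop Ns and any other characters)
        seq = re.sub(r"[^ACGT]", "", seq.upper())
        # Chunk into segments of length 1024
        for i in range(0, len(seq), chunk_len):
            chunk = seq[i:i + chunk_len]
            if len(chunk) >= min_len:  # discard very short chunks (< 200 bp)
                chunks.append(chunk)
    return chunks
```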
---
## 🚀 Intended Uses
This model can be used for:
* DNA sequence generation (see the sampling sketch after this list)
* Genomic representation learning
* Predictive modeling for base-level structure
* Downstream fine-tuning for biological classification tasks
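
A minimal sampling sketch, assuming the model and tokenizer load from the same (placeholder) repo id used above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "path/to/dna_model"            # placeholder repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tok("ACGTACGT", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,   # sample 64 additional bases
    do_sample=True,      # stochastic sampling rather than greedy decoding
)
print(tok.decode(out[0], skip_special_tokens=True))
```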
### Limitations
* Trained only on human genome; not suitable for other species
* No reverse-complement modeling
* No masked language modeling objective
---
## 🏋️ Training Details
### Hyperparameters
* learning\_rate: 0.0003
* train\_batch\_size: 64
* eval\_batch\_size: 8
* total\_train\_batch\_size: 256 (across 4 GPUs)
* total\_eval\_batch\_size: 32
* optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
* lr\_scheduler: Linear with 1000 warmup steps
* epochs: 10.0
* mixed\_precision: Native AMP
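
A `TrainingArguments` sketch reconstructing the settings above (an approximation of the run configuration, not the exact original script):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dna_model",
    learning_rate=3e-4,
    per_device_train_batch_size=64,   # 4 GPUs x 64 = 256 total train batch size
    per_device_eval_batch_size=8,     # 4 GPUs x 8  = 32 total eval batch size
    num_train_epochs=10.0,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,                        # native AMP mixed precision
    # AdamW with betas=(0.9, 0.999) and eps=1e-8 is the default optimizer
)
```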
### Hardware
* Multi-GPU training (4 devices)
* Transformers 4.52.0.dev0
* PyTorch 2.3.0+cu121
---
## 📈 Training Results
| Step | Epoch | Training Loss | Validation Loss | Validation Accuracy |
| ----- | ----- | ------------- | --------------- | ------------------- |
| 5000 | 0.69 | 1.1252 | 1.1206 | 0.4745 |
| 10000 | 1.38 | 1.0835 | 1.0814 | 0.4991 |
| 15000 | 2.07 | 1.0641 | 1.0639 | 0.5103 |
| 20000 | 2.76 | 1.0563 | 1.0547 | 0.5163 |
| 25000 | 3.45 | 1.0504 | 1.0486 | 0.5204 |
| 30000 | 4.14 | 1.0439 | 1.0439 | 0.5233 |
| 35000 | 4.84 | 1.0425 | 1.0407 | 0.5254 |
| 40000 | 5.52 | 1.0365 | 1.0380 | 0.5271 |
| 45000 | 6.22 | 1.0325 | 1.0361 | 0.5284 |
| 50000 | 6.91 | 1.0322 | 1.0341 | 0.5296 |
| 55000 | 7.60 | 1.0307 | 1.0328 | 0.5305 |
| 60000 | 8.29 | 1.0267 | 1.0316 | 0.5313 |
| 65000 | 8.98 | 1.0273 | 1.0306 | 0.5320 |
| 70000 | 9.67 | 1.0270 | 1.0299 | 0.5324 |
---
## 🔗 References
* Tokenizer inspired by GPT-2 minimal vocab
* Dataset: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* PyTorch: [https://pytorch.org/](https://pytorch.org/)
---
## 📄 Citation
This model is part of ongoing research. A formal citation will be added when the associated paper is published.
If you use this model in academic work, please check back for updates.