Update README.md

results: []
---

# DNA Language Model (Char-level, Human-only)

This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer with a vocabulary of `A`, `C`, `G`, `T` plus a special end-of-text token, and is trained to predict the next base over 1024-base sequences.

---

## 🧬 Model Summary

* **Objective**: Next-token prediction over human genomic sequences
* **Tokenization**: Character-level (A, C, G, T)
* **Training data**: [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* **Sequence length**: 1024 tokens
* **Final Validation Loss**: 1.0299 nats/token
* **Final Validation Accuracy**: 53.24%

> The model reaches ~1.486 bits per base on human DNA (1.0299 nats/token ÷ ln 2), outperforming classical DNA compressors such as GeCo.

---

## 🔧 Tokenizer

The tokenizer is a minimal GPT-2-style vocabulary:

```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```

* Implemented via `GPT2TokenizerFast`
* Merges file is empty (no BPE applied)
* Saved to the `dna_tokenizer/` directory for reuse
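
The snippet below is a minimal sketch of how such a tokenizer could be rebuilt from the vocabulary above. It is not the original training code; the file layout is assumed from the `dna_tokenizer/` description in this card.

```python
# Hypothetical reconstruction of the character-level tokenizer described above.
# Assumes a 5-entry vocab.json and an empty merges.txt (no BPE merges).
import json
from pathlib import Path

from transformers import GPT2TokenizerFast

tok_dir = Path("dna_tokenizer")  # directory name mentioned in this card
tok_dir.mkdir(exist_ok=True)

vocab = {"<|endoftext|>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
(tok_dir / "vocab.json").write_text(json.dumps(vocab))
(tok_dir / "merges.txt").write_text("")  # empty: no merges are applied

tokenizer = GPT2TokenizerFast(
    vocab_file=str(tok_dir / "vocab.json"),
    merges_file=str(tok_dir / "merges.txt"),
    unk_token="<|endoftext|>",
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

print(tokenizer("ACGT")["input_ids"])  # expected: [1, 2, 3, 4]
tokenizer.save_pretrained(str(tok_dir))
```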

---

## 📊 Dataset Preprocessing

* Original dataset is cleaned to keep only `A`, `C`, `G`, `T`
* Sequences are chunked into segments of length 1024
* Very short chunks (< 200 bp) are discarded
* Resulting split sizes are saved as plain text in `processed_dna_data/`

If no validation set is provided, a 10% split is made from the training set.
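
As an illustration only (not the original preprocessing script), the rules above could be implemented roughly as follows; the function name and toy input are made up for the example.

```python
# Illustrative sketch of the preprocessing rules listed above (not the original script):
# keep only A/C/G/T, chunk into 1024-base segments, drop chunks shorter than 200 bp.
import re


def preprocess(sequence: str, chunk_len: int = 1024, min_len: int = 200) -> list[str]:
    cleaned = re.sub(r"[^ACGT]", "", sequence.upper())  # strip non-ACGT characters
    chunks = [cleaned[i:i + chunk_len] for i in range(0, len(cleaned), chunk_len)]
    return [c for c in chunks if len(c) >= min_len]  # discard very short chunks


chunks = preprocess("ACGTN" * 1000)  # toy input with N bases that get removed
split = int(0.9 * len(chunks))       # 10% held out when no validation set exists
train, valid = chunks[:split], chunks[split:]
```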

---

## 🚀 Intended Uses

This model can be used for:

* DNA sequence generation (see the example below)
* Genomic representation learning
* Predictive modeling for base-level structure
* Downstream fine-tuning for biological classification tasks
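
A minimal generation sketch is shown below, assuming the checkpoint and tokenizer are loaded from this repository; `MODEL_ID` is a placeholder to replace with the actual model id or a local path.

```python
# Minimal generation sketch; MODEL_ID is a placeholder, not the published model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-model"  # replace with the model id or a local checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer("ACGTACGT", return_tensors="pt")  # seed sequence
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```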

### Limitations

* Trained only on the human genome; not suitable for other species
* No reverse-complement modeling
* No masked language modeling objective

---

## 🏋️ Training Details

### Hyperparameters

* learning_rate: 0.0003
* train_batch_size: 64
* eval_batch_size: 8
* total_train_batch_size: 256 (across 4 GPUs)
* total_eval_batch_size: 32
* seed: 42
* optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
* lr_scheduler: linear with 1000 warmup steps
* epochs: 10.0
* mixed_precision: Native AMP
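
For reference, the values above roughly correspond to the following `TrainingArguments`. This is an approximation, since the original training script is not included in this repo, and `output_dir` is a made-up name.

```python
# Approximate TrainingArguments matching the hyperparameters above (not the original script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dna-char-lm",        # hypothetical output directory
    learning_rate=3e-4,
    per_device_train_batch_size=64,  # 64 per device x 4 GPUs = 256 effective
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    seed=42,
    fp16=True,                       # "Native AMP" mixed precision
)
```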

### Hardware & Frameworks

* Multi-GPU training (4 devices)
* Transformers 4.52.0.dev0
* PyTorch 2.3.0+cu121
* Datasets 3.0.0
* Tokenizers 0.21.1

---

## 📈 Training Results

| Step  | Epoch | Training Loss | Validation Loss | Accuracy |
|-------|-------|---------------|-----------------|----------|
| 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
| 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
| 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
| 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
| 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
| 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
| 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
| 40000 | 5.52  | 1.0365        | 1.0380          | 0.5271   |
| 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
| 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
| 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
| 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
| 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
| 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |

---

## 🔗 References

* Tokenizer inspired by GPT-2 minimal vocab
* Dataset: [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* PyTorch: [https://pytorch.org/](https://pytorch.org/)

---

## 📄 Citation

This model is part of ongoing research. A formal citation will be added when the associated paper is published. If you use this model in academic work, please check back for updates.