---
library_name: transformers
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: dna_model
  results: []
---

# DNA Language Model (Char-level, Human-only)

This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer with a vocabulary of `A`, `C`, `G`, `T`, and a special end-of-text token, and is trained to predict the next base in 1024-base sequences.

---

## 🧬 Model Summary

* **Objective**: Next-token prediction over human genomic sequences
* **Tokenization**: Character-level (A, C, G, T)
* **Training data**: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* **Sequence length**: 1024 tokens
* **Final Validation Loss**: 1.0299 nats/token
* **Final Validation Accuracy**: 53.24%

> This model outperforms classical compressors like GeCo on human DNA entropy, achieving \~1.486 bits per base.
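
For reference, the reported loss in nats per token converts to bits per base as follows (a quick sanity check of the figure above, not an additional evaluation):

```python
import math

val_loss_nats = 1.0299                        # final validation loss (nats/token)
bits_per_base = val_loss_nats / math.log(2)   # convert natural log to log base 2
print(f"{bits_per_base:.3f} bits per base")   # ~1.486
```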

---

## 🔧 Tokenizer

The tokenizer uses a minimal GPT-2-style vocabulary:

```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```

* Implemented via `GPT2TokenizerFast` (see the loading sketch below)
* Merges file is empty (no BPE applied)
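
As a sketch, the tokenizer can be loaded and applied with `AutoTokenizer` (the repository id below is a placeholder; substitute the actual model repository):

```python
from transformers import AutoTokenizer

# Placeholder repo id; replace with the published model repository.
tokenizer = AutoTokenizer.from_pretrained("your-username/dna_model")

ids = tokenizer("ACGTACGT")["input_ids"]
print(ids)                        # e.g. [1, 2, 3, 4, 1, 2, 3, 4]
print(tokenizer.decode(ids))      # "ACGTACGT"
```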

---

## 📊 Dataset Preprocessing

* Original dataset is cleaned to keep only `A`, `C`, `G`, `T`
* Sequences are chunked into segments of length 1024
* Very short chunks (<200bp) are discarded
* A 10% validation split is held out from the training set (see the sketch below)
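
A minimal sketch of this pipeline with the `datasets` library, assuming the raw sequences live in a `sequence` column (the column name and the split seed are assumptions, not the exact preprocessing script):

```python
import re
from datasets import load_dataset

raw = load_dataset("simecek/Human_DNA_v0", split="train")

def clean_and_chunk(batch, chunk_len=1024, min_len=200):
    """Keep only A/C/G/T, cut into 1024-base chunks, drop chunks under 200 bp."""
    chunks = []
    for seq in batch["sequence"]:                 # assumed column name
        seq = re.sub(r"[^ACGT]", "", seq.upper())
        pieces = [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
        chunks.extend(p for p in pieces if len(p) >= min_len)
    return {"text": chunks}

chunked = raw.map(clean_and_chunk, batched=True, remove_columns=raw.column_names)

# Hold out 10% of the chunks for validation (seed is illustrative).
splits = chunked.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```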

---

## 🚀 Intended Uses

This model can be used for:

* DNA sequence generation (see the example below)
* Genomic representation learning
* Predictive modeling for base-level structure
* Downstream fine-tuning for biological classification tasks
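
For example, sequence generation might look like the following sketch (the repository id is a placeholder and the sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/dna_model"        # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

prompt = "ACGTGGCA"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```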

### Limitations

* Trained only on human genome; not suitable for other species
* No reverse-complement modeling
* No masked language modeling objective

---

## 🏋️ Training Details

### Hyperparameters

* learning\_rate: 0.0003
* train\_batch\_size: 64
* eval\_batch\_size: 8
* total\_train\_batch\_size: 256 (across 4 GPUs)
* total\_eval\_batch\_size: 32
* optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
* lr\_scheduler: Linear with 1000 warmup steps
* epochs: 10.0
* mixed\_precision: Native AMP (a configuration sketch follows this list)
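
These settings correspond roughly to the `TrainingArguments` below (a reconstruction under the assumption that the standard `Trainer` was used; the output directory and checkpointing interval are placeholders):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dna_model",              # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=64,      # 64 x 4 GPUs = 256 total
    per_device_eval_batch_size=8,        # 8 x 4 GPUs = 32 total
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,                           # native AMP
    eval_strategy="steps",               # evaluation every 5000 steps, as in the results table
    eval_steps=5000,
    save_steps=5000,                     # placeholder checkpointing interval
)
```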

### Hardware

* Multi-GPU training (4 devices)
* Transformers 4.52.0.dev0
* PyTorch 2.3.0+cu121

---

## 📈 Training Results

| Step  | Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ----- | ------------- | --------------- | -------- |
| 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
| 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
| 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
| 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
| 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
| 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
| 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
| 40000 | 5.52  | 1.0365        | 1.0380          | 0.5271   |
| 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
| 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
| 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
| 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
| 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
| 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |

---

## 🔗 References

* Tokenizer: minimal GPT-2-style vocabulary (character-level, no BPE merges)
* Dataset: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* PyTorch: [https://pytorch.org/](https://pytorch.org/)

---

## 📄 Citation

This model is part of ongoing research. A formal citation will be added when the associated paper is published.

If you use this model in academic work, please check back for updates.