---
library_name: transformers
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: dna_model
  results: []
---

# DNA Language Model (Char-level, Human-only)

This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer with a vocabulary of `A`, `C`, `G`, `T`, and a special end-of-text token, and is trained to predict the next base in 1024-base sequences.

---

## 🧬 Model Summary

* **Objective**: Next-token prediction over human genomic sequences
* **Tokenization**: Character-level (A, C, G, T)
* **Training data**: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* **Sequence length**: 1024 tokens
* **Final Validation Loss**: 1.0299 nats/token
* **Final Validation Accuracy**: 53.24%

> This model outperforms classical compressors like GeCo on human DNA entropy, achieving \~1.486 bits per base.
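
For reference, the reported loss in nats per token converts to bits per base as follows (a quick sanity check of the figure above, not an additional evaluation):

```python
import math

val_loss_nats = 1.0299                        # final validation loss (nats/token)
bits_per_base = val_loss_nats / math.log(2)   # convert natural log to log base 2
print(f"{bits_per_base:.3f} bits per base")   # ~1.486
```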

---

## 🔧 Tokenizer

The tokenizer uses a minimal GPT-2-style vocabulary:

```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```

* Implemented via `GPT2TokenizerFast` (see the loading sketch below)
* Merges file is empty (no BPE applied)
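
As a sketch, the tokenizer can be loaded and applied with `AutoTokenizer` (the repository id below is a placeholder; substitute the actual model repository):

```python
from transformers import AutoTokenizer

# Placeholder repo id; replace with the published model repository.
tokenizer = AutoTokenizer.from_pretrained("your-username/dna_model")

ids = tokenizer("ACGTACGT")["input_ids"]
print(ids)                        # e.g. [1, 2, 3, 4, 1, 2, 3, 4]
print(tokenizer.decode(ids))      # "ACGTACGT"
```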

---

## 📊 Dataset Preprocessing

* Original dataset is cleaned to keep only `A`, `C`, `G`, `T`
* Sequences are chunked into segments of length 1024
* Very short chunks (<200bp) are discarded
* A 10% validation split is held out from the training set (see the sketch below)
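
A minimal sketch of this pipeline with the `datasets` library, assuming the raw sequences live in a `sequence` column (the column name and the split seed are assumptions, not the exact preprocessing script):

```python
import re
from datasets import load_dataset

raw = load_dataset("simecek/Human_DNA_v0", split="train")

def clean_and_chunk(batch, chunk_len=1024, min_len=200):
    """Keep only A/C/G/T, cut into 1024-base chunks, drop chunks under 200 bp."""
    chunks = []
    for seq in batch["sequence"]:                 # assumed column name
        seq = re.sub(r"[^ACGT]", "", seq.upper())
        pieces = [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
        chunks.extend(p for p in pieces if len(p) >= min_len)
    return {"text": chunks}

chunked = raw.map(clean_and_chunk, batched=True, remove_columns=raw.column_names)

# Hold out 10% of the chunks for validation (seed is illustrative).
splits = chunked.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```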

---

## 🚀 Intended Uses

This model can be used for:

* DNA sequence generation (see the example below)
* Genomic representation learning
* Predictive modeling for base-level structure
* Downstream fine-tuning for biological classification tasks
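
For example, sequence generation might look like the following sketch (the repository id is a placeholder and the sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/dna_model"        # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

prompt = "ACGTGGCA"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```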

### Limitations

* Trained only on human genome; not suitable for other species
* No reverse-complement modeling
* No masked language modeling objective

---

## 🏋️ Training Details

### Hyperparameters

* learning\_rate: 0.0003
* train\_batch\_size: 64
* eval\_batch\_size: 8
* total\_train\_batch\_size: 256 (across 4 GPUs)
* total\_eval\_batch\_size: 32
* optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
* lr\_scheduler: Linear with 1000 warmup steps
* epochs: 10.0
* mixed\_precision: Native AMP (a configuration sketch follows this list)
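
These settings correspond roughly to the `TrainingArguments` below (a reconstruction under the assumption that the standard `Trainer` was used; the output directory and checkpointing interval are placeholders):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dna_model",              # placeholder
    learning_rate=3e-4,
    per_device_train_batch_size=64,      # 64 x 4 GPUs = 256 total
    per_device_eval_batch_size=8,        # 8 x 4 GPUs = 32 total
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,                           # native AMP
    eval_strategy="steps",               # evaluation every 5000 steps, as in the results table
    eval_steps=5000,
    save_steps=5000,                     # placeholder checkpointing interval
)
```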

### Hardware

* Multi-GPU training (4 devices)
* Transformers 4.52.0.dev0
* PyTorch 2.3.0+cu121

---

## 📈 Training Results

| Step  | Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ----- | ------------- | --------------- | -------- |
| 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
| 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
| 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
| 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
| 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
| 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
| 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
| 40000 | 5.52  | 1.0365        | 1.0380          | 0.5271   |
| 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
| 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
| 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
| 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
| 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
| 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |

---

## 🔗 References

* Tokenizer: minimal GPT-2-style vocabulary (character-level, no BPE merges)
* Dataset: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* PyTorch: [https://pytorch.org/](https://pytorch.org/)

---

## 📄 Citation

This model is part of ongoing research. A formal citation will be added when the associated paper is published.

If you use this model in academic work, please check back for updates.