DNA Language Model (Char-level, Human-only)
This model is a character-level GPT-style language model trained exclusively on human DNA. It uses a custom tokenizer with a vocabulary of A, C, G, T, and a special end-of-text token, and is trained to predict the next base in 1024-base sequences.
🧬 Model Summary
- Objective: Next-token prediction over human genomic sequences
- Tokenization: Character-level (A, C, G, T)
- Training data: simecek/Human_DNA_v0
- Sequence length: 1024 tokens
- Final Validation Loss: 1.0299 nats/token
- Final Validation Accuracy: 53.24%
At this validation loss, the model encodes human DNA at ~1.486 bits per base, outperforming classical genomic compressors such as GeCo.
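For reference, the conversion from the reported loss (nats per token) to bits per base is a straightforward change of logarithm base; a quick check, not part of the training code:

```python
import math

val_loss_nats = 1.0299                       # final validation cross-entropy (nats/token)
bits_per_base = val_loss_nats / math.log(2)  # nats -> bits
print(f"{bits_per_base:.3f} bits per base")  # ~1.486
```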
🔧 Tokenizer
The tokenizer is a minimal GPT-2-style vocabulary:
```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```
- Implemented via `GPT2TokenizerFast` (see the loading sketch below)
- The merges file is empty (no BPE merges are applied)
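A minimal sketch of loading the tokenizer and encoding a sequence. The repository id `your-org/dna-gpt-char` is a placeholder (the card does not state the repo name), and the expected ids assume the vocabulary shown above:

```python
from transformers import GPT2TokenizerFast

# Placeholder repo id; substitute the actual model repository.
tokenizer = GPT2TokenizerFast.from_pretrained("your-org/dna-gpt-char")

ids = tokenizer("ACGTACGT")["input_ids"]
print(ids)  # expected [1, 2, 3, 4, 1, 2, 3, 4] given the vocabulary above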
📊 Dataset Preprocessing
- The original dataset is cleaned to keep only A, C, G, T
- Sequences are chunked into segments of length 1024
- Chunks shorter than 200 bp are discarded
- A 10% validation split is held out from the training set (see the preprocessing sketch below)
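A sketch of the cleaning and chunking steps described above, assuming the dataset exposes a text column named `sequence` (the actual column name may differ):

```python
import re
from datasets import load_dataset

CHUNK_LEN = 1024  # bases per training example
MIN_LEN = 200     # discard chunks shorter than this

def clean_and_chunk(seq: str):
    # Keep only A, C, G, T; drop everything else (e.g. N, non-base characters).
    seq = re.sub(r"[^ACGT]", "", seq.upper())
    chunks = [seq[i:i + CHUNK_LEN] for i in range(0, len(seq), CHUNK_LEN)]
    return [c for c in chunks if len(c) >= MIN_LEN]

ds = load_dataset("simecek/Human_DNA_v0", split="train")
chunked = [c for row in ds for c in clean_and_chunk(row["sequence"])]  # column name assumed
```

The 10% validation split can then be taken from the chunked examples, e.g. with `Dataset.train_test_split(test_size=0.1)`; the exact split method used for this model is not specified in the card.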
🚀 Intended Uses
This model can be used for:
- DNA sequence generation (see the sketch after this list)
- Genomic representation learning
- Predictive modeling for base-level structure
- Downstream fine-tuning for biological classification tasks
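A minimal generation sketch; the repository id is a placeholder and the sampling settings are illustrative rather than recommended:

```python
import torch
from transformers import AutoModelForCausalLM, GPT2TokenizerFast

repo = "your-org/dna-gpt-char"  # placeholder repository id
tokenizer = GPT2TokenizerFast.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("ACGTGCA", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```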
⚠️ Limitations
- Trained only on the human genome; not suitable for other species
- No reverse-complement modeling
- No masked language modeling objective
🏋️ Training Details
Hyperparameters
- learning_rate: 0.0003
- train_batch_size: 64
- eval_batch_size: 8
- total_train_batch_size: 256 (across 4 GPUs)
- total_eval_batch_size: 32
- optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- lr_scheduler: Linear with 1000 warmup steps
- epochs: 10.0
- mixed_precision: Native AMP
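The settings above map roughly onto the following Transformers `TrainingArguments` (a sketch: the output path is a placeholder, and AdamW with the listed betas/epsilon is the library default):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dna-gpt-char",       # placeholder output path
    learning_rate=3e-4,
    per_device_train_batch_size=64,  # 4 GPUs -> total train batch size 256
    per_device_eval_batch_size=8,    # 4 GPUs -> total eval batch size 32
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,                       # native AMP mixed precision
)
```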
Hardware
- Multi-GPU training (4 devices)
- Transformers 4.52.0.dev0
- PyTorch 2.3.0+cu121
📈 Training Results
| Step | Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|---|
| 5000 | 0.69 | 1.1252 | 1.1206 | 0.4745 |
| 10000 | 1.38 | 1.0835 | 1.0814 | 0.4991 |
| 15000 | 2.07 | 1.0641 | 1.0639 | 0.5103 |
| 20000 | 2.76 | 1.0563 | 1.0547 | 0.5163 |
| 25000 | 3.45 | 1.0504 | 1.0486 | 0.5204 |
| 30000 | 4.14 | 1.0439 | 1.0439 | 0.5233 |
| 35000 | 4.84 | 1.0425 | 1.0407 | 0.5254 |
| 40000 | 5.52 | 1.0365 | 1.0380 | 0.5271 |
| 45000 | 6.22 | 1.0325 | 1.0361 | 0.5284 |
| 50000 | 6.91 | 1.0322 | 1.0341 | 0.5296 |
| 55000 | 7.60 | 1.0307 | 1.0328 | 0.5305 |
| 60000 | 8.29 | 1.0267 | 1.0316 | 0.5313 |
| 65000 | 8.98 | 1.0273 | 1.0306 | 0.5320 |
| 70000 | 9.67 | 1.0270 | 1.0299 | 0.5324 |
🔗 References
- Tokenizer inspired by GPT-2 minimal vocab
- Dataset: simecek/Human_DNA_v0
- Transformers: https://github.com/huggingface/transformers
- PyTorch: https://pytorch.org/
📄 Citation
This model is part of ongoing research. A formal citation will be added when the associated paper is published.
If you use this model in academic work, please check back for updates.