DNA Language Model (Char-level, Human-only)
This model is a character-level GPT-style language model trained exclusively on human DNA. It uses a custom tokenizer with a vocabulary of A, C, G, T, and a special end-of-text token, and is trained to predict the next base in 1024-base sequences.
🧬 Model Summary
- Objective: Next-token prediction over human genomic sequences
- Tokenization: Character-level (A, C, G, T)
- Training data: simecek/Human_DNA_v0
- Sequence length: 1024 tokens
- Final Validation Loss: 1.0299 nats/token
- Final Validation Accuracy: 53.24%
At this validation loss, the model encodes human DNA at ~1.486 bits per base, outperforming classical genomic compressors such as GeCo.
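For reference, the conversion from the reported loss (nats per token) to bits per base is a straightforward change of logarithm base; a quick check, not part of the training code:

```python
import math

val_loss_nats = 1.0299                       # final validation cross-entropy (nats/token)
bits_per_base = val_loss_nats / math.log(2)  # nats -> bits
print(f"{bits_per_base:.3f} bits per base")  # ~1.486
```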
🔧 Tokenizer
The tokenizer is a minimal GPT-2-style vocabulary:
```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```
- Implemented via `GPT2TokenizerFast` (see the loading sketch below)
- The merges file is empty (no BPE merges are applied)
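A minimal sketch of loading the tokenizer and encoding a sequence. The repository id `your-org/dna-gpt-char` is a placeholder (the card does not state the repo name), and the expected ids assume the vocabulary shown above:

```python
from transformers import GPT2TokenizerFast

# Placeholder repo id; substitute the actual model repository.
tokenizer = GPT2TokenizerFast.from_pretrained("your-org/dna-gpt-char")

ids = tokenizer("ACGTACGT")["input_ids"]
print(ids)  # expected [1, 2, 3, 4, 1, 2, 3, 4] given the vocabulary above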
📊 Dataset Preprocessing
- The original dataset is cleaned to keep only A, C, G, T
- Sequences are chunked into segments of length 1024
- Chunks shorter than 200 bp are discarded
- A 10% validation split is held out from the training set (see the preprocessing sketch below)
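A sketch of the cleaning and chunking steps described above, assuming the dataset exposes a text column named `sequence` (the actual column name may differ):

```python
import re
from datasets import load_dataset

CHUNK_LEN = 1024  # bases per training example
MIN_LEN = 200     # discard chunks shorter than this

def clean_and_chunk(seq: str):
    # Keep only A, C, G, T; drop everything else (e.g. N, non-base characters).
    seq = re.sub(r"[^ACGT]", "", seq.upper())
    chunks = [seq[i:i + CHUNK_LEN] for i in range(0, len(seq), CHUNK_LEN)]
    return [c for c in chunks if len(c) >= MIN_LEN]

ds = load_dataset("simecek/Human_DNA_v0", split="train")
chunked = [c for row in ds for c in clean_and_chunk(row["sequence"])]  # column name assumed
```

The 10% validation split can then be taken from the chunked examples, e.g. with `Dataset.train_test_split(test_size=0.1)`; the exact split method used for this model is not specified in the card.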
🚀 Intended Uses
This model can be used for:
- DNA sequence generation (see the sketch after this list)
- Genomic representation learning
- Predictive modeling for base-level structure
- Downstream fine-tuning for biological classification tasks
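A minimal generation sketch; the repository id is a placeholder and the sampling settings are illustrative rather than recommended:

```python
import torch
from transformers import AutoModelForCausalLM, GPT2TokenizerFast

repo = "your-org/dna-gpt-char"  # placeholder repository id
tokenizer = GPT2TokenizerFast.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("ACGTGCA", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```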
⚠️ Limitations
- Trained only on the human genome; not suitable for other species
- No reverse-complement modeling
- No masked language modeling objective
🏋️ Training Details
Hyperparameters
- learning_rate: 0.0003
- train_batch_size: 64
- eval_batch_size: 8
- total_train_batch_size: 256 (across 4 GPUs)
- total_eval_batch_size: 32
- optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- lr_scheduler: Linear with 1000 warmup steps
- epochs: 10.0
- mixed_precision: Native AMP
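The settings above map roughly onto the following Transformers `TrainingArguments` (a sketch: the output path is a placeholder, and AdamW with the listed betas/epsilon is the library default):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="dna-gpt-char",       # placeholder output path
    learning_rate=3e-4,
    per_device_train_batch_size=64,  # 4 GPUs -> total train batch size 256
    per_device_eval_batch_size=8,    # 4 GPUs -> total eval batch size 32
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,                       # native AMP mixed precision
)
```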
Hardware
- Multi-GPU training (4 devices)
- Transformers 4.52.0.dev0
- PyTorch 2.3.0+cu121
📈 Training Results
| Step | Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|---|
| 5000 | 0.69 | 1.1252 | 1.1206 | 0.4745 |
| 10000 | 1.38 | 1.0835 | 1.0814 | 0.4991 |
| 15000 | 2.07 | 1.0641 | 1.0639 | 0.5103 |
| 20000 | 2.76 | 1.0563 | 1.0547 | 0.5163 |
| 25000 | 3.45 | 1.0504 | 1.0486 | 0.5204 |
| 30000 | 4.14 | 1.0439 | 1.0439 | 0.5233 |
| 35000 | 4.84 | 1.0425 | 1.0407 | 0.5254 |
| 40000 | 5.52 | 1.0365 | 1.0380 | 0.5271 |
| 45000 | 6.22 | 1.0325 | 1.0361 | 0.5284 |
| 50000 | 6.91 | 1.0322 | 1.0341 | 0.5296 |
| 55000 | 7.60 | 1.0307 | 1.0328 | 0.5305 |
| 60000 | 8.29 | 1.0267 | 1.0316 | 0.5313 |
| 65000 | 8.98 | 1.0273 | 1.0306 | 0.5320 |
| 70000 | 9.67 | 1.0270 | 1.0299 | 0.5324 |
🔗 References
- Tokenizer inspired by GPT-2 minimal vocab
- Dataset: simecek/Human_DNA_v0
- Transformers: https://github.com/huggingface/transformers
- PyTorch: https://pytorch.org/
📄 Citation
This model is part of ongoing research. A formal citation will be added when the associated paper is published.
If you use this model in academic work, please check back for updates.