DNA Language Model (Char-level, Human-only)

This model is a character-level GPT-style language model trained exclusively on human DNA. It uses a custom tokenizer whose vocabulary consists of A, C, G, T, and a special end-of-text token, and it is trained to predict the next base in 1024-base sequences.


🧬 Model Summary

  • Objective: Next-token prediction over human genomic sequences
  • Tokenization: Character-level (A, C, G, T)
  • Training data: simecek/Human_DNA_v0
  • Sequence length: 1024 tokens
  • Final Validation Loss: 1.0299 nats/token
  • Final Validation Accuracy: 53.24%
  • Model size: ~85.8M parameters (F32)

This model outperforms classical compressors such as GeCo on human DNA, achieving ~1.486 bits per base (equivalent to the 1.0299 nats/token validation loss).
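
As a quick sanity check, the bits-per-base figure follows directly from the reported validation loss:

```python
import math

loss_nats = 1.0299                       # final validation loss in nats/token
bits_per_base = loss_nats / math.log(2)  # convert nats to bits
print(round(bits_per_base, 3))           # -> 1.486
```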


🔧 Tokenizer

The tokenizer uses a minimal GPT-2-style vocabulary:

```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```
  • Implemented via GPT2TokenizerFast
  • Merges file is empty (no BPE applied)
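
The following minimal sketch shows how such a tokenizer can be rebuilt locally with GPT2TokenizerFast (the file names here are illustrative; the actual files ship with the model repository):

```python
import json
from transformers import GPT2TokenizerFast

# Write the 5-token vocabulary and a merges file containing only the version header (no BPE rules).
vocab = {"<|endoftext|>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
with open("merges.txt", "w") as f:
    f.write("#version: 0.2\n")

tokenizer = GPT2TokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")
print(tokenizer("ACGT")["input_ids"])  # expected: [1, 2, 3, 4]
```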

📊 Dataset Preprocessing

  • Original dataset is cleaned to keep only A, C, G, T
  • Sequences are chunked into segments of length 1024
  • Very short chunks (<200bp) are discarded
  • A 10% validation split is held out from the training set (sketched below)
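
A rough sketch of this pipeline with the 🤗 Datasets library (the text column name "sequence" and the exact cleaning rules are assumptions):

```python
import re
from datasets import load_dataset

raw = load_dataset("simecek/Human_DNA_v0", split="train")

def clean_and_chunk(example):
    # Keep only A/C/G/T, cut into 1024-base chunks, and drop chunks shorter than 200 bp.
    seq = re.sub(r"[^ACGT]", "", example["sequence"].upper())
    chunks = [seq[i:i + 1024] for i in range(0, len(seq), 1024)]
    return {"chunks": [c for c in chunks if len(c) >= 200]}

chunked = raw.map(clean_and_chunk, remove_columns=raw.column_names)
flat = [c for row in chunked for c in row["chunks"]]

# Hold out 10% of the chunks for validation.
split = int(0.9 * len(flat))
train_chunks, val_chunks = flat[:split], flat[split:]
```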

🚀 Intended Uses

This model can be used for:

  • DNA sequence generation (see the example below)
  • Genomic representation learning
  • Predictive modeling for base-level structure
  • Downstream fine-tuning for biological classification tasks
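
For example, sampling new sequences with the standard transformers generation API (the repository id below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/dna-char-gpt"  # placeholder: replace with this model's actual repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("ACGTGGCTA", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```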

Limitations

  • Trained only on the human genome; not suitable for other species
  • No reverse-complement modeling
  • No masked language modeling objective

🏋️ Training Details

Hyperparameters

  • learning_rate: 0.0003
  • train_batch_size: 64
  • eval_batch_size: 8
  • total_train_batch_size: 256 (across 4 GPUs)
  • total_eval_batch_size: 32
  • optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • lr_scheduler: Linear with 1000 warmup steps
  • epochs: 10.0
  • mixed_precision: Native AMP
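
These settings correspond roughly to the following transformers TrainingArguments (the output directory is a placeholder; no gradient accumulation is needed, since 64 per device x 4 GPUs already gives the effective batch size of 256):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dna-char-gpt",        # placeholder output directory
    learning_rate=3e-4,
    per_device_train_batch_size=64,   # 64 x 4 GPUs = 256 effective
    per_device_eval_batch_size=8,     # 8 x 4 GPUs = 32 effective
    num_train_epochs=10.0,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,                        # native AMP mixed precision
)
```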

Hardware

  • Multi-GPU training (4 devices)
  • Transformers 4.52.0.dev0
  • PyTorch 2.3.0+cu121

📈 Training Results

| Step  | Epoch | Training Loss | Validation Loss | Accuracy |
|-------|-------|---------------|-----------------|----------|
| 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
| 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
| 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
| 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
| 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
| 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
| 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
| 40000 | 5.52  | 1.0365        | 1.0380          | 0.5271   |
| 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
| 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
| 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
| 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
| 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
| 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |

📄 Citation

This model is part of ongoing research. A formal citation will be added when the associated paper is published.

If you use this model in academic work, please check back for updates.
