---
library_name: transformers
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: dna_model
  results: []
---

# DNA Language Model (Char-level, Human-only)

This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer with a vocabulary of `A`, `C`, `G`, `T`, plus a special end-of-text token, and is trained to predict the next base in 1024-base sequences.

---

## 🧬 Model Summary

* **Objective**: Next-token prediction over human genomic sequences
* **Tokenization**: Character-level (A, C, G, T)
* **Training data**: [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* **Sequence length**: 1024 tokens
* **Final validation loss**: 1.0299 nats/token
* **Final validation accuracy**: 53.24%

> At the final validation loss of 1.0299 nats/token, the model reaches roughly 1.0299 / ln 2 ≈ 1.486 bits per base, outperforming classical DNA compressors such as GeCo on human DNA.

---

## 🔧 Tokenizer

The tokenizer is a minimal GPT-2-style vocabulary:

```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```

* Implemented via `GPT2TokenizerFast`
* The merges file is empty (no BPE merges), so every base is a single token

---

## 📊 Dataset Preprocessing

* The original dataset is cleaned to keep only `A`, `C`, `G`, `T`
* Sequences are chunked into segments of length 1024
* Very short chunks (< 200 bp) are discarded
* A 10% validation split is held out from the training set

A hedged sketch of this pipeline is given at the end of this card.

---

## 🚀 Intended Uses

This model can be used for:

* DNA sequence generation (see the generation sketch at the end of this card)
* Genomic representation learning
* Predictive modeling of base-level structure
* Downstream fine-tuning for biological classification tasks

### Limitations

* Trained only on the human genome; not suitable for other species
* No reverse-complement modeling
* No masked language modeling objective

---

## 🏋️ Training Details

### Hyperparameters

* learning_rate: 0.0003
* train_batch_size: 64
* eval_batch_size: 8
* total_train_batch_size: 256 (across 4 GPUs)
* total_eval_batch_size: 32
* optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
* lr_scheduler: linear with 1000 warmup steps
* epochs: 10.0
* mixed_precision: native AMP

### Hardware & Software

* Multi-GPU training (4 devices)
* Transformers 4.52.0.dev0
* PyTorch 2.3.0+cu121

---

## 📈 Training Results

| Step  | Epoch | Training Loss | Validation Loss | Accuracy |
| ----- | ----- | ------------- | --------------- | -------- |
| 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
| 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
| 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
| 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
| 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
| 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
| 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
| 40000 | 5.52  | 1.0365        | 1.0380          | 0.5271   |
| 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
| 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
| 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
| 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
| 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
| 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |

---

## 🔗 References

* Tokenizer: minimal GPT-2-style vocabulary
* Dataset: [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* PyTorch: [https://pytorch.org/](https://pytorch.org/)

---

## 📄 Citation

This model is part of ongoing research. A formal citation will be added when the associated paper is published. If you use this model in academic work, please check back for updates.
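---

## 🧪 Generation Sketch

As listed under Intended Uses, the most direct use is sampling DNA continuations. The snippet below is a minimal sketch using the standard `transformers` generation API; the repository id `dna_model` is a placeholder, and the primer and sampling parameters are illustrative rather than the settings used during training.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dna_model"  # placeholder: replace with the actual Hub repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Seed the model with a short primer and sample the next bases.
primer = "ACGTGGCTAACGT"
inputs = tokenizer(primer, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,  # number of bases to generate
        do_sample=True,      # stochastic sampling rather than greedy decoding
        top_k=4,             # the vocabulary is only A, C, G, T (+ end-of-text)
        temperature=1.0,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```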
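---

## 🧹 Preprocessing Sketch

The preprocessing steps described above can be reproduced roughly as follows. This is a minimal sketch, assuming the 🤗 `datasets` library and a raw-text column named `text` in [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0); the actual training script may differ.

```python
import re
from datasets import load_dataset

CHUNK_LEN = 1024  # training sequence length
MIN_LEN = 200     # chunks shorter than this are discarded

def clean_and_chunk(sequence):
    """Keep only A/C/G/T, then split into fixed-length chunks."""
    cleaned = re.sub(r"[^ACGT]", "", sequence.upper())
    chunks = [cleaned[i:i + CHUNK_LEN] for i in range(0, len(cleaned), CHUNK_LEN)]
    return [c for c in chunks if len(c) >= MIN_LEN]

raw = load_dataset("simecek/Human_DNA_v0", split="train")

def chunk_batch(batch):
    # A batched map may return more rows than it receives, so one long
    # sequence can expand into many 1024-base training examples.
    out = []
    for seq in batch["text"]:  # column name is an assumption
        out.extend(clean_and_chunk(seq))
    return {"sequence": out}

chunked = raw.map(chunk_batch, batched=True, remove_columns=raw.column_names)

# Hold out 10% of the chunks as a validation set.
split = chunked.train_test_split(test_size=0.1, seed=42)
train_ds, val_ds = split["train"], split["test"]
```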