Update README.md

results: []
---

# DNA Language Model (Char-level, Human-only)

This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer with a vocabulary of `A`, `C`, `G`, `T` plus a special end-of-text token, and is trained to predict the next base over 1024-base sequences.

---

## 🧬 Model Summary

* **Objective**: Next-token prediction over human genomic sequences
* **Tokenization**: Character-level (A, C, G, T)
* **Training data**: [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* **Sequence length**: 1024 tokens
* **Final Validation Loss**: 1.0299 nats/token
* **Final Validation Accuracy**: 53.24%

> The model reaches ~1.486 bits per base on human DNA (1.0299 nats/token ÷ ln 2), outperforming classical DNA compressors such as GeCo.

---

## 🔧 Tokenizer

The tokenizer is a minimal GPT-2-style vocabulary:

```json
{
  "<|endoftext|>": 0,
  "A": 1,
  "C": 2,
  "G": 3,
  "T": 4
}
```

* Implemented via `GPT2TokenizerFast`
* Merges file is empty (no BPE applied)
* Saved to the `dna_tokenizer/` directory for reuse
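
The snippet below is a minimal sketch of how such a tokenizer could be rebuilt from the vocabulary above. It is not the original training code; the file layout is assumed from the `dna_tokenizer/` description in this card.

```python
# Hypothetical reconstruction of the character-level tokenizer described above.
# Assumes a 5-entry vocab.json and an empty merges.txt (no BPE merges).
import json
from pathlib import Path

from transformers import GPT2TokenizerFast

tok_dir = Path("dna_tokenizer")  # directory name mentioned in this card
tok_dir.mkdir(exist_ok=True)

vocab = {"<|endoftext|>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
(tok_dir / "vocab.json").write_text(json.dumps(vocab))
(tok_dir / "merges.txt").write_text("")  # empty: no merges are applied

tokenizer = GPT2TokenizerFast(
    vocab_file=str(tok_dir / "vocab.json"),
    merges_file=str(tok_dir / "merges.txt"),
    unk_token="<|endoftext|>",
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)

print(tokenizer("ACGT")["input_ids"])  # expected: [1, 2, 3, 4]
tokenizer.save_pretrained(str(tok_dir))
```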

---

## 📊 Dataset Preprocessing

* Original dataset is cleaned to keep only `A`, `C`, `G`, `T`
* Sequences are chunked into segments of length 1024
* Very short chunks (< 200 bp) are discarded
* Resulting split sizes are saved as plain text in `processed_dna_data/`

If no validation set is provided, a 10% split is made from the training set.
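
As an illustration only (not the original preprocessing script), the rules above could be implemented roughly as follows; the function name and toy input are made up for the example.

```python
# Illustrative sketch of the preprocessing rules listed above (not the original script):
# keep only A/C/G/T, chunk into 1024-base segments, drop chunks shorter than 200 bp.
import re


def preprocess(sequence: str, chunk_len: int = 1024, min_len: int = 200) -> list[str]:
    cleaned = re.sub(r"[^ACGT]", "", sequence.upper())  # strip non-ACGT characters
    chunks = [cleaned[i:i + chunk_len] for i in range(0, len(cleaned), chunk_len)]
    return [c for c in chunks if len(c) >= min_len]  # discard very short chunks


chunks = preprocess("ACGTN" * 1000)  # toy input with N bases that get removed
split = int(0.9 * len(chunks))       # 10% held out when no validation set exists
train, valid = chunks[:split], chunks[split:]
```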

---

## 🚀 Intended Uses

This model can be used for:

* DNA sequence generation (see the example below)
* Genomic representation learning
* Predictive modeling for base-level structure
* Downstream fine-tuning for biological classification tasks
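
A minimal generation sketch is shown below, assuming the checkpoint and tokenizer are loaded from this repository; `MODEL_ID` is a placeholder to replace with the actual model id or a local path.

```python
# Minimal generation sketch; MODEL_ID is a placeholder, not the published model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/this-model"  # replace with the model id or a local checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer("ACGTACGT", return_tensors="pt")  # seed sequence
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```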

### Limitations

* Trained only on the human genome; not suitable for other species
* No reverse-complement modeling
* No masked language modeling objective

---

## 🏋️ Training Details

### Hyperparameters

* learning_rate: 0.0003
* train_batch_size: 64
* eval_batch_size: 8
* total_train_batch_size: 256 (across 4 GPUs)
* total_eval_batch_size: 32
* seed: 42
* optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
* lr_scheduler: linear with 1000 warmup steps
* epochs: 10.0
* mixed_precision: Native AMP
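
For reference, the values above roughly correspond to the following `TrainingArguments`. This is an approximation, since the original training script is not included in this repo, and `output_dir` is a made-up name.

```python
# Approximate TrainingArguments matching the hyperparameters above (not the original script).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dna-char-lm",        # hypothetical output directory
    learning_rate=3e-4,
    per_device_train_batch_size=64,  # 64 per device x 4 GPUs = 256 effective
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    seed=42,
    fp16=True,                       # "Native AMP" mixed precision
)
```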

### Hardware & Frameworks

* Multi-GPU training (4 devices)
* Transformers 4.52.0.dev0
* PyTorch 2.3.0+cu121
* Datasets 3.0.0
* Tokenizers 0.21.1

---

## 📈 Training Results

| Step  | Epoch | Training Loss | Validation Loss | Accuracy |
|-------|-------|---------------|-----------------|----------|
| 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
| 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
| 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
| 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
| 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
| 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
| 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
| 40000 | 5.52  | 1.0365        | 1.0380          | 0.5271   |
| 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
| 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
| 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
| 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
| 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
| 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |

---

## 🔗 References

* Tokenizer inspired by GPT-2 minimal vocab
* Dataset: [simecek/Human_DNA_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
* Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* PyTorch: [https://pytorch.org/](https://pytorch.org/)

---

## 📄 Citation

This model is part of ongoing research. A formal citation will be added when the associated paper is published. If you use this model in academic work, please check back for updates.