vesteinn committed
Commit bf21269 · verified · 1 Parent(s): baf3660

Update README.md

Files changed (1): README.md +109 -51
README.md CHANGED
@@ -9,70 +9,128 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # dna_model

- This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 1.0299
- - Accuracy: 0.5324

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 0.0003
- - train_batch_size: 64
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - total_train_batch_size: 256
- - total_eval_batch_size: 32
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 1000
- - num_epochs: 10.0
- - mixed_precision_training: Native AMP

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss | Accuracy |
- |:-------------:|:------:|:-----:|:---------------:|:--------:|
- | 1.1252 | 0.6908 | 5000 | 1.1206 | 0.4745 |
- | 1.0835 | 1.3816 | 10000 | 1.0814 | 0.4991 |
- | 1.0641 | 2.0724 | 15000 | 1.0639 | 0.5103 |
- | 1.0563 | 2.7632 | 20000 | 1.0547 | 0.5163 |
- | 1.0504 | 3.4540 | 25000 | 1.0486 | 0.5204 |
- | 1.0439 | 4.1448 | 30000 | 1.0439 | 0.5233 |
- | 1.0425 | 4.8356 | 35000 | 1.0407 | 0.5254 |
- | 1.0365 | 5.5264 | 40000 | 1.0380 | 0.5271 |
- | 1.0325 | 6.2172 | 45000 | 1.0361 | 0.5284 |
- | 1.0322 | 6.9080 | 50000 | 1.0341 | 0.5296 |
- | 1.0307 | 7.5988 | 55000 | 1.0328 | 0.5305 |
- | 1.0267 | 8.2896 | 60000 | 1.0316 | 0.5313 |
- | 1.0273 | 8.9804 | 65000 | 1.0306 | 0.5320 |
- | 1.027 | 9.6712 | 70000 | 1.0299 | 0.5324 |

- ### Framework versions

- - Transformers 4.52.0.dev0
- - Pytorch 2.3.0+cu121
- - Datasets 3.0.0
- - Tokenizers 0.21.1
+ # DNA Language Model (Char-level, Human-only)
+
+ This model is a character-level GPT-style language model trained exclusively on **human DNA**. It uses a custom tokenizer whose vocabulary is `A`, `C`, `G`, `T`, plus a special end-of-text token, and it is trained to predict the next base over 1024-base sequences.
+
+ ---
+
+ ## 🧬 Model Summary
+
+ * **Objective**: Next-token prediction over human genomic sequences
+ * **Tokenization**: Character-level (A, C, G, T)
+ * **Training data**: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
+ * **Sequence length**: 1024 tokens
+ * **Final Validation Loss**: 1.0299 nats/token
+ * **Final Validation Accuracy**: 53.24%
+
+ > At 1.0299 nats/token (≈1.486 bits per base), the model outperforms classical DNA compressors such as GeCo on human DNA.
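+
+ For reference, the bits-per-base figure is simply the validation cross-entropy converted from nats to bits:
+
+ ```python
+ import math
+
+ loss_nats = 1.0299                       # final validation loss (nats/token)
+ bits_per_base = loss_nats / math.log(2)  # convert nats -> bits
+ print(round(bits_per_base, 3))           # 1.486
+ ```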
+
+ ---
+
+ ## 🔧 Tokenizer
+
+ The tokenizer uses a minimal GPT-2-style vocabulary:
+
+ ```json
+ {
+   "<|endoftext|>": 0,
+   "A": 1,
+   "C": 2,
+   "G": 3,
+   "T": 4
+ }
+ ```
+
+ * Implemented via `GPT2TokenizerFast`
+ * Merges file is empty (no BPE applied)
+ * Saved to the `dna_tokenizer/` directory for reuse
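+
+ As a sketch (not the exact build script), a tokenizer with this vocabulary can be reconstructed and reloaded via `GPT2TokenizerFast`; the file names here are illustrative:
+
+ ```python
+ import json
+ from transformers import GPT2TokenizerFast
+
+ # Five-symbol vocabulary and an empty merges file (no BPE merge rules).
+ vocab = {"<|endoftext|>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
+ with open("vocab.json", "w") as f:
+     json.dump(vocab, f)
+ with open("merges.txt", "w") as f:
+     f.write("#version: 0.2\n")  # header only, no merges
+
+ tokenizer = GPT2TokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")
+ tokenizer.save_pretrained("dna_tokenizer")
+
+ print(tokenizer("ACGT")["input_ids"])  # expected: [1, 2, 3, 4]
+ ```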
+
+ ---
+
+ ## 📊 Dataset Preprocessing
+
+ * The original dataset is cleaned to keep only `A`, `C`, `G`, `T`
+ * Sequences are chunked into segments of length 1024
+ * Very short chunks (<200 bp) are discarded
+ * Resulting split sizes are saved as plain text in `processed_dna_data/`
+
+ If no validation set is provided, a 10% split is made from the training set. A sketch of this pipeline follows.
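+
+ The sketch below is illustrative only; `clean_sequence` and `chunk_sequence` are not names from the actual training code:
+
+ ```python
+ import re
+
+ def clean_sequence(seq: str) -> str:
+     """Keep only A/C/G/T, uppercased; drop Ns and other symbols."""
+     return re.sub(r"[^ACGT]", "", seq.upper())
+
+ def chunk_sequence(seq: str, chunk_len: int = 1024, min_len: int = 200):
+     """Yield fixed-length chunks, discarding very short remainders."""
+     for start in range(0, len(seq), chunk_len):
+         chunk = seq[start:start + chunk_len]
+         if len(chunk) >= min_len:
+             yield chunk
+
+ raw = "ACGTNNNACGT" * 400                        # toy input
+ chunks = list(chunk_sequence(clean_sequence(raw)))
+ print(len(chunks), len(chunks[0]))               # 3 1024
+ ```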
+
+ ---
+
+ ## 🚀 Intended Uses
+
+ This model can be used for:
+
+ * DNA sequence generation (see the sketch below)
+ * Genomic representation learning
+ * Predictive modeling for base-level structure
+ * Downstream fine-tuning for biological classification tasks
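+
+ As a usage sketch for generation (the repository id below is a placeholder, and the sampling settings are illustrative rather than recommended values):
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL_ID = "<this-repo-id>"  # placeholder: replace with this model's Hub id or a local path
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
+ model.eval()
+
+ # Prime with a short DNA context and sample a continuation base by base.
+ inputs = tokenizer("ACGTGGT", return_tensors="pt")
+ with torch.no_grad():
+     out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=4)
+ print(tokenizer.decode(out[0]))
+ ```
+
+ The decoded output is a raw base string (the prompt followed by the sampled continuation).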
+
+ ### Limitations
+
+ * Trained only on the human genome; not suitable for other species
+ * No reverse-complement modeling
+ * No masked language modeling objective
+
+ ---
+
+ ## 🏋️ Training Details
+
+ ### Hyperparameters
+
+ * learning\_rate: 0.0003
+ * train\_batch\_size: 64 (per device)
+ * eval\_batch\_size: 8 (per device)
+ * total\_train\_batch\_size: 256 (across 4 GPUs)
+ * total\_eval\_batch\_size: 32
+ * seed: 42
+ * optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
+ * lr\_scheduler: linear with 1000 warmup steps
+ * epochs: 10.0
+ * mixed\_precision: Native AMP
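+
+ For reference, these settings map approximately onto `transformers.TrainingArguments` as follows. This is a reconstruction rather than the authors' exact script; the output directory is a placeholder and the per-device sizes are inferred from the totals above:
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="dna_model",          # placeholder
+     learning_rate=3e-4,
+     per_device_train_batch_size=64,  # x4 GPUs -> total train batch size 256
+     per_device_eval_batch_size=8,    # x4 GPUs -> total eval batch size 32
+     num_train_epochs=10.0,
+     lr_scheduler_type="linear",
+     warmup_steps=1000,
+     optim="adamw_torch",             # AdamW with betas=(0.9, 0.999), eps=1e-08 (defaults)
+     seed=42,
+     fp16=True,                       # native AMP mixed precision
+ )
+ ```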
+
+ ### Hardware
+
+ * Multi-GPU training (4 devices)
+ * Transformers 4.52.0.dev0
+ * PyTorch 2.3.0+cu121
+
+ ---
+
+ ## 📈 Training Results
+
+ | Step  | Epoch | Training Loss | Validation Loss | Accuracy |
+ | ----- | ----- | ------------- | --------------- | -------- |
+ | 5000  | 0.69  | 1.1252        | 1.1206          | 0.4745   |
+ | 10000 | 1.38  | 1.0835        | 1.0814          | 0.4991   |
+ | 15000 | 2.07  | 1.0641        | 1.0639          | 0.5103   |
+ | 20000 | 2.76  | 1.0563        | 1.0547          | 0.5163   |
+ | 25000 | 3.45  | 1.0504        | 1.0486          | 0.5204   |
+ | 30000 | 4.14  | 1.0439        | 1.0439          | 0.5233   |
+ | 35000 | 4.84  | 1.0425        | 1.0407          | 0.5254   |
+ | 40000 | 5.53  | 1.0365        | 1.0380          | 0.5271   |
+ | 45000 | 6.22  | 1.0325        | 1.0361          | 0.5284   |
+ | 50000 | 6.91  | 1.0322        | 1.0341          | 0.5296   |
+ | 55000 | 7.60  | 1.0307        | 1.0328          | 0.5305   |
+ | 60000 | 8.29  | 1.0267        | 1.0316          | 0.5313   |
+ | 65000 | 8.98  | 1.0273        | 1.0306          | 0.5320   |
+ | 70000 | 9.67  | 1.0270        | 1.0299          | 0.5324   |
+
+ ---
+
+ ## 🔗 References
+
+ * Tokenizer inspired by the minimal GPT-2 vocabulary
+ * Dataset: [simecek/Human\_DNA\_v0](https://huggingface.co/datasets/simecek/Human_DNA_v0)
+ * Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
+ * PyTorch: [https://pytorch.org/](https://pytorch.org/)
+
+ ---
+
+ ## 📄 Citation
+
+ This model is part of ongoing research. A formal citation will be added when the associated paper is published.
+
+ If you use this model in academic work, please check back for updates.
136