vesteinn committed (verified)
Commit 2016e4e · Parent: bf21269

Update README.md

Files changed (1):
  1. README.md +1 -4
README.md CHANGED
@@ -44,7 +44,6 @@ The tokenizer is a minimal GPT-2-style vocabulary:
 
 * Implemented via `GPT2TokenizerFast`
 * Merges file is empty (no BPE applied)
-* Saved to the `dna_tokenizer/` directory for reuse
 
 ---
 
@@ -53,9 +52,7 @@ The tokenizer is a minimal GPT-2-style vocabulary:
 * Original dataset is cleaned to keep only `A`, `C`, `G`, `T`
 * Sequences are chunked into segments of length 1024
 * Very short chunks (<200bp) are discarded
-* Resulting split sizes are saved as plain text in `processed_dna_data/`
-
-If no validation set is provided, a 10% split is made from the training set.
+* A 10% validation split is made from the training set.
 
 ---
 
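The two tokenizer bullets kept above amount to a character-level GPT-2 tokenizer. Below is a minimal sketch of how such a tokenizer can be constructed; the vocabulary contents, the token IDs, and the `vocab.json`/`merges.txt` file names are illustrative assumptions, and only the use of `GPT2TokenizerFast` with an empty merges file comes from the README.

```python
import json
from transformers import GPT2TokenizerFast

# Assumed character-level vocabulary: the four bases plus an end-of-text
# token. The actual vocabulary and IDs in the repo may differ.
vocab = {"<|endoftext|>": 0, "A": 1, "C": 2, "G": 3, "T": 4}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

# An empty merges file (version header only) means no BPE merges exist,
# so every base is tokenized as a single character.
with open("merges.txt", "w") as f:
    f.write("#version: 0.2\n")

tokenizer = GPT2TokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")
print(tokenizer("ACGT")["input_ids"])  # [1, 2, 3, 4] with the vocab above
```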
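The preprocessing bullets (keep only `A`, `C`, `G`, `T`; chunk to length 1024; drop chunks under 200 bp; hold out 10% of the training set for validation) can be sketched as follows. This is a hypothetical outline, not the repository's code: the function names, the regex cleanup, and the seeded shuffle are assumptions.

```python
import random
import re

def clean_and_chunk(sequences, chunk_len=1024, min_len=200):
    """Keep only A/C/G/T, cut into fixed-length segments, drop short chunks."""
    chunks = []
    for seq in sequences:
        # Remove every character that is not a canonical base (N's, gaps, ...).
        clean = re.sub(r"[^ACGT]", "", seq.upper())
        # Chunk into segments of length 1024; the final segment may be shorter.
        for i in range(0, len(clean), chunk_len):
            chunk = clean[i:i + chunk_len]
            if len(chunk) >= min_len:  # discard very short chunks (<200 bp)
                chunks.append(chunk)
    return chunks

def train_val_split(chunks, val_frac=0.10, seed=0):
    """Hold out a fraction of the training chunks as a validation set."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]

# Example: ten noisy sequences yield 80 clean chunks, split 72 train / 8 val.
train, val = train_val_split(clean_and_chunk(["ACGTN" * 2000] * 10))
```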