Commit 66f2804 · verified · 1 Parent(s): 2c15ae4

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -49,7 +49,7 @@ pretty_name: “ayayay - ukrainized aya tokenizer”
  1. +118,985 new Cyrillic BPE merge from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) trained on full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus). Keeping only sub-words that appear ≥ 4 000 times.
  2. Just the tail end of the Aya vocab (IDs > 150 000) and the 25K Cyrillic tokens already present in Aya were overwritten, keeping the rest of the vocabulary intact.
  3. Unchanged tokens preserve their IDs, enabling direct reuse of Aya-Expanse embedding.
- 4. Special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
+ 4. Vocab size, Special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
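
Items 3 and 4 can be illustrated with a small check: if the vocabulary size is unchanged and a token keeps its ID, the embedding row at that ID can in principle be reused as-is. The following is only a minimal sketch under stated assumptions: the toy dictionaries stand in for real token-to-ID mappings, and `compare_vocabs` and the sample tokens are hypothetical names, not part of the repository.

```python
def compare_vocabs(base_vocab, new_vocab):
    """Compare two token -> id mappings (hypothetical helper, for illustration only)."""
    # Invert both mappings so each ID can be looked up directly.
    base_by_id = {idx: tok for tok, idx in base_vocab.items()}
    new_by_id = {idx: tok for tok, idx in new_vocab.items()}

    # IDs whose token is identical in both vocabularies: their embedding rows
    # could be carried over unchanged.
    reusable = sorted(i for i, tok in base_by_id.items() if new_by_id.get(i) == tok)
    # IDs whose token differs: these rows would need new values.
    overwritten = sorted(i for i, tok in base_by_id.items() if new_by_id.get(i) != tok)

    return {
        "same_size": len(base_vocab) == len(new_vocab),
        "reusable_ids": reusable,
        "overwritten_ids": overwritten,
    }


# Toy data (assumed, not taken from either tokenizer).
base = {"<s>": 0, "the": 1, "при": 2, "zzz": 3}
new = {"<s>": 0, "the": 1, "привіт": 2, "добре": 3}

report = compare_vocabs(base, new)
print(report)
```

With real tokenizers loaded through Hugging Face `transformers`, the two arguments could be the dictionaries returned by each tokenizer's `get_vocab()` method; which embedding rows are actually safe to copy remains for the user to verify.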
 
  ## Simple example
  ```python