---
inference: false
library_name: transformers
base_model: CohereLabs/aya-expanse-32b
language:
- uk
- crh
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ar
- el
- fa
- pl
- id
- cs
- he
- hi
- nl
- ro
- ru
- tr
- vi
datasets:
- lang-uk/malyuk
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- aya-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "ayayay - ukrainized aya tokenizer"
---
# Ayayay: Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer

Ayayay is the first tokenizer to place Ukrainian at the center of a multilingual vocabulary, retaining as much original-tokenizer compatibility as possible through careful (partially manual) token remapping.

Feature Overview:

1. +118,985 new Cyrillic BPE merges from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on the full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus), keeping only sub-words that appear ≥ 4,000 times.
2. Only the tail end of the Aya vocabulary (IDs > 150,000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
3. Unchanged tokens preserve their IDs, enabling direct reuse of the Aya-Expanse embeddings (see the sanity check after this list).
4. The special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.

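If you want to verify these compatibility claims yourself, a minimal check along the following lines should work (standard `transformers` calls only; the Aya-Expanse download may require a Hugging Face login):

```python
from transformers import AutoTokenizer

aya = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
ayayay = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay_tokenizer")

# The special-token set should match one-for-one.
assert aya.all_special_tokens == ayayay.all_special_tokens

# Tokens that were not overwritten keep their IDs, so plain ASCII text
# is expected to encode identically under both tokenizers.
ids_aya = aya("hello world", add_special_tokens=False).input_ids
ids_new = ayayay("hello world", add_special_tokens=False).input_ids
print(ids_aya == ids_new)  # expected: True
```
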
## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay_tokenizer"
)
# "Всі красиві зберігають оптимізм" ~ "All the beautiful ones stay optimistic"
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
```

## Metrics

Thanks to [@Sofetory](https://huggingface.co/Sofetory) for evaluating the new tokenizer.

Corpus samples used for the evaluation:

| corpus | texts | words |
|---|---|---|
| lang-uk/malyuk (uk) | 100k | 22,898,164 |
| allenai/c4 (en) | 100k | 36,170,971 |
| allenai/c4 (es, fr, it, de) | 400k | 198,173,216 |
| QIRIM/crh_monocorpus (Cyrillic) | 94 | 1,868,259 |
| allenai/c4 (ru) | 100k | 42,557,519 |
| allenai/c4 (bg) | 100k | 44,627,199 |
| allenai/c4 (be) | 100k | 43,153,645 |

Token counts and tokens-per-word (t/w) per tokenizer:

| tokenizer | uk tokens | t/w | en tokens | t/w | es/fr/it/de tokens | t/w | crh tokens | t/w | ru tokens | t/w | bg tokens | t/w | be tokens | t/w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google/gemma-3-12b-it | 57,388,402 | 2.506 | 47,285,432 | 1.307 | 354,241,840 | 1.788 | 6,240,944 | 3.341 | 95,520,817 | 2.245 | 103,950,626 | 2.329 | 131,398,147 | 3.045 |
| Qwen/Qwen3-8B | 84,408,084 | 3.686 | 46,884,593 | 1.296 | 395,581,536 | 1.996 | 7,956,741 | 4.259 | 116,115,062 | 2.728 | 132,597,427 | 2.971 | 173,571,099 | 4.022 |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 | 2.499 | 46,085,724 | 1.274 | 382,143,751 | 1.928 | 7,386,873 | 3.954 | 104,974,733 | 2.467 | 119,123,733 | 2.669 | 150,189,294 | 3.48 |
| microsoft/Phi-4-mini-instruct | 59,447,036 | 2.596 | 45,423,925 | 1.256 | 335,188,687 | 1.691 | 5,995,822 | 3.209 | 91,824,464 | 2.158 | 102,472,523 | 2.296 | 119,587,038 | 2.771 |
| CohereLabs/aya-expanse-8b | 50,973,632 | 2.226 | 47,364,187 | 1.309 | 353,221,932 | 1.782 | 6,614,719 | 3.541 | 93,089,697 | 2.187 | 112,612,668 | 2.523 | 141,262,943 | 3.273 |
| ayayay_tokenizer | 37,094,157 | 1.62 🤩 | 48,288,882 | 1.335 | 372,587,959 | 1.88 | 4,238,587 | 2.269 | 107,331,167 | 2.522 | 114,292,191 | 2.561 | 133,618,186 | 3.096 |

Observations:

- Ukrainian (malyuk): a significant 27% improvement over the Aya-Expanse baseline; the absolute leader in Ukrainian tokenization.
- English (c4): tokens-per-word rises by less than 4% compared with the baseline.
- es/fr/it/de (c4): the tokenizer retains strong multilingual capabilities.
- Crimean Tatar (QIRIM, Cyrillic): a significant improvement over the original Aya and the other tokenizers.
- Russian (c4): efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.
- Other Cyrillic languages (c4 bg, be) perform well after the token replacement; Belarusian improves especially noticeably.

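Tokens-per-word here is the total token count divided by the total word count of each corpus sample. The exact counting script is not part of this repo, so the sketch below is only an illustration; in particular, whitespace word-splitting is an assumption:

```python
from transformers import AutoTokenizer

def toks_per_word(tokenizer, texts):
    """Total tokens divided by total whitespace-separated words."""
    n_tokens = sum(
        len(tokenizer(t, add_special_tokens=False).input_ids) for t in texts
    )
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

tok = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay_tokenizer")
# The sentence from the simple example: 4 tokens / 4 words = 1.0
print(toks_per_word(tok, ["Всі красиві зберігають оптимізм"]))
```
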
## Contents

- **`tokenizer.json`** Byte-level tokenizer spec (vocab, merges, model settings).

- **`tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.

- **`malyuk_qirim_tokenizer.json`** Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100:1 ratio), with `min_frequency=4_000`.

- **`merge_info.json`** Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json); see the inspection sketch after this list.

- **`tokenizer_config.json`** Configuration metadata.

- **`special_tokens_map.json`** Special-token mapping (identical to Aya).

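The schema of `merge_info.json` is easiest to discover by loading the file directly; a minimal inspection sketch that assumes nothing about its field names:

```python
import json

# Peek at merge_info.json without assuming its schema.
with open("merge_info.json", encoding="utf-8") as f:
    info = json.load(f)

if isinstance(info, dict):
    for key, value in info.items():
        preview = value[:5] if isinstance(value, list) else value
        print(key, type(value).__name__, preview)
else:  # e.g. a bare list of ID mappings
    print(type(info).__name__, info[:5])
```
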
## Initialisation of embeddings for new tokens in Aya-Expanse models

Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest (and often effective) alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
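
Because the new tokens overwrite existing IDs, the vocabulary size is unchanged and no embedding resize is needed; only the overwritten rows have to be re-initialised. A minimal sketch of the random route, assuming `merge_info.json` exposes the list of replaced token IDs (the `replaced_aya_ids` key below is hypothetical; check the file's actual schema first):

```python
import json

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("CohereLabs/aya-expanse-8b")

# Hypothetical key -- inspect merge_info.json for the real field name.
with open("merge_info.json", encoding="utf-8") as f:
    replaced_ids = json.load(f)["replaced_aya_ids"]

emb = model.get_input_embeddings().weight
with torch.no_grad():
    mean, std = emb.mean(dim=0), emb.std(dim=0)
    for idx in replaced_ids:
        # Draw each overwritten row from the existing embedding distribution.
        emb[idx] = mean + std * torch.randn_like(mean)

model.tie_weights()  # keep a tied LM head consistent, if applicable
```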

## Acknowledgement

Metrics evaluation results were provided by [@Sofetory](https://huggingface.co/Sofetory).

## Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9163,
  author       = {Bohdan Didenko},
  title        = {Post \#9163 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9163}},
  month        = jun,
  year         = {2025},
  note         = {[Online; accessed 8 June 2025]}
}
```