---
inference: false
library_name: transformers
base_model: CohereLabs/aya-expanse-32b
language:
- uk
- crh
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ar
- el
- fa
- pl
- id
- cs
- he
- hi
- nl
- ro
- ru
- tr
- vi
datasets:
- lang-uk/malyuk
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- aya-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "ayayay - ukrainized aya tokenizer"
---

# Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer

Ayayay is the first tokenizer to place Ukrainian at the center of a multilingual vocabulary—retaining as much original tokenizer compatibility as possible through careful (partially manual) token remapping.

Feature Overview:

1. +118,985 new Cyrillic BPE merges from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on the full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus), keeping only sub-words that appear ≥ 4,000 times.
2. Only the tail end of the Aya vocab (IDs > 150,000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
3. Unchanged tokens preserve their IDs, enabling direct reuse of the Aya-Expanse embeddings.
4. The special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.

## Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay_tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
```

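Because unchanged tokens keep their original IDs (point 3 above), you can check how much of the vocabulary remains directly reusable against the original Aya-Expanse tokenizer. A minimal sketch, assuming you have access to the gated `CohereLabs/aya-expanse-8b` repository:

```python
from transformers import AutoTokenizer

# Both tokenizers expose a token -> id map via get_vocab().
aya = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
ayayay = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay_tokenizer")

aya_vocab = aya.get_vocab()
ayayay_vocab = ayayay.get_vocab()

# Tokens that survived the remapping keep their Aya IDs, so their
# embedding rows can be copied over unchanged.
shared = sum(1 for tok, idx in ayayay_vocab.items() if aya_vocab.get(tok) == idx)
print(f"{shared} of {len(ayayay_vocab)} tokens keep their original Aya-Expanse ID")
```
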
## Metrics

Many thanks to [@Sofetory](https://huggingface.co/Sofetory) for evaluating the new tokenizer.

||lang-uk/malyuk |100k texts|allenai/c4 (en)|100k texts|allenai/c4 (es, fr, it, de)|400k texts|QIRIM/crh_monocorpus (Cyrillic)|94 texts|allenai/c4 (ru)|100k texts|allenai/c4 (bg)|100k texts|allenai/c4 (be)|100k texts|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|words count <td colspan=2>22,898,164 |36,170,971 | |198,173,216 | |1,868,259 | |42,557,519 | |44,627,199 | |43,153,645 | |
||||||||||||||||
|tokenizers |tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|
|google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 |
|Qwen/Qwen3-8B |84,408,084 |3.686 |46,884,593 |1.296 |395,581,536 |1.996 |7,956,741 |4.259 |116,115,062 |2.728 |132,597,427 |2.971 |173,571,099 |4.022 |
|meta-llama/Llama-3.1-8B-Instruct|57,226,997 |2.499 |46,085,724 |1.274 |382,143,751 |1.928 |7,386,873 |3.954 |104,974,733 |2.467 |119,123,733 |2.669 |150,189,294 |3.48 |
|microsoft/Phi-4-mini-instruct |59,447,036 |2.596 |45,423,925 |1.256 |335,188,687 |1.691 |5,995,822 |3.209 |91,824,464 |2.158 |102,472,523 |2.296 |119,587,038 |2.771 |
|CohereLabs/aya-expanse-8b |50,973,632 |2.226 |47,364,187 |1.309 |353,221,932 |1.782 |6,614,719 |3.541 |93,089,697 |2.187 |112,612,668 |2.523 |141,262,943 |3.273 |
|ayayay_tokenizer |37,094,157 |1.62🤩 |48,288,882 |1.335 |372,587,959 |1.88 |4,238,587 |2.269 |107,331,167 |2.522 |114,292,191 |2.561 |133,618,186 |3.096 |
|Comments <td colspan=2>Significant 27% improvement over the Aya-Expanse baseline; absolute leader in Ukrainian tokenization.<td colspan=2>Tokens-per-word for English rises by less than 4% compared with the baseline.<td colspan=2>The tokenizer retains strong multilingual capabilities.<td colspan=2>Shows significant improvement on QIRIM Cyrillic versus the original Aya and other tokenizers.<td colspan=2>Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.<td colspan=4>Other Cyrillic languages, such as Bulgarian and Belarusian, perform well after the token replacement; Belarusian improves especially noticeably.|

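For reference, ratios like those above can be approximated with a short script. The sketch below streams a 100k-text sample of Malyuk and counts whitespace-separated words; it assumes the corpus exposes a `text` field, and whitespace splitting is only an approximation of the word counts used in the table.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay_tokenizer")

# Stream a 100k-text sample of Malyuk and measure tokens per whitespace word.
ds = load_dataset("lang-uk/malyuk", split="train", streaming=True)

tokens = words = 0
for i, row in enumerate(ds):
    if i >= 100_000:
        break
    text = row["text"]
    tokens += len(tokenizer(text, add_special_tokens=False).input_ids)
    words += len(text.split())

print(f"toks/word on the Malyuk sample: {tokens / words:.3f}")
```
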
## Contents

- **`tokenizer.json`** Byte-level tokenizer spec (vocab, merges, model settings).

- **`tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.

- **`malyuk_qirim_tokenizer.json`** Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000 (see the loading sketch below).

- **`merge_info.json`** Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).

- **`tokenizer_config.json`** Configuration metadata.

- **`special_tokens_map.json`** Mapping of special tokens (same as in Aya).

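For corpus-linguistic work you can pull these files directly from the repo and inspect them. A minimal sketch using `huggingface_hub` and `tokenizers`; the internal structure of `merge_info.json` is not documented here, so the script only prints its top-level layout:

```python
import json

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

repo = "transhumanist-already-exists/ayayay_tokenizer"

# Standalone Malyuk/QIRIM tokenizer that supplied the new Cyrillic merges.
malyuk_path = hf_hub_download(repo_id=repo, filename="malyuk_qirim_tokenizer.json")
malyuk_tok = Tokenizer.from_file(malyuk_path)
print("malyuk_qirim vocab size:", malyuk_tok.get_vocab_size())

# merge_info.json records which Aya IDs were replaced and which Malyuk IDs were added.
info_path = hf_hub_download(repo_id=repo, filename="merge_info.json")
with open(info_path, encoding="utf-8") as f:
    merge_info = json.load(f)
print("merge_info top-level structure:", type(merge_info), list(merge_info)[:5])
```
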
## Initialisation of embeddings for new tokens in Aya-Expanse models

Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest—and often effective—alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.

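A minimal sketch of the random-initialisation route, assuming the IDs of the overwritten tokens are read from `merge_info.json` (the `replaced_aya_ids` key below is illustrative, not the file's actual field name, and the 8B model is used only for brevity):

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "transhumanist-already-exists/ayayay_tokenizer"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained("CohereLabs/aya-expanse-8b", torch_dtype=torch.bfloat16)

# Hypothetical field name: the list of Aya token IDs whose meaning changed.
with open(hf_hub_download(repo_id=repo, filename="merge_info.json"), encoding="utf-8") as f:
    replaced_ids = json.load(f)["replaced_aya_ids"]

emb = model.get_input_embeddings().weight
mean, std = emb.mean(dim=0), emb.std(dim=0)

# Re-initialise only the overwritten rows; every unchanged token keeps its trained embedding.
with torch.no_grad():
    for idx in replaced_ids:
        emb[idx] = torch.normal(mean, std)

model.save_pretrained("aya-expanse-8b-ayayay-init")
tokenizer.save_pretrained("aya-expanse-8b-ayayay-init")
```

Warming up only these re-initialised rows (with the rest of the model frozen or on a lower learning rate) then follows the usual recipe.
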
## Acknowledgement

Metrics evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).

## Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9163,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9163 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9163}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 8 June 2025]"
}
```