|
--- |
|
inference: false |
|
library_name: transformers |
|
base_model: CohereLabs/aya-expanse-32b |
|
language: |
|
- uk |
|
- crh |
|
- en |
|
- fr |
|
- de |
|
- es |
|
- it |
|
- pt |
|
- ja |
|
- ko |
|
- zh |
|
- ar |
|
- el |
|
- fa |
|
- pl |
|
- id |
|
- cs |
|
- he |
|
- hi |
|
- nl |
|
- ro |
|
- ru |
|
- tr |
|
- vi |
|
datasets: |
|
- lang-uk/malyuk |
|
- QIRIM/crh_monocorpus |
|
multilinguality: |
|
- multilingual
|
tags: |
|
- aya-tokenizer |
|
- ukraine |
|
- corpus-linguistics |
|
pretty_name: "ayayay - ukrainianized aya tokenizer"
|
--- |
|
# Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer |
|
|
|
<img src="ayayay.png" width="400" style="margin-left: auto; margin-right: auto; display: block;"/>
|
|
|
#### Ayayay is the first tokenizer that makes Ukrainian the core language of a multilingual vocabulary, while retaining as much compatibility as possible with the original Aya-Expanse tokenizer through careful (partially manual) token remapping.
|
|
|
## Feature Overview
|
|
|
1. +118,985 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on the full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus), keeping only sub-words that appear ≥ 4,000 times.
|
2. Only the tail end of the Aya vocab (IDs > 150,000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
|
3. Unchanged tokens preserve their IDs, enabling direct reuse of the Aya-Expanse embeddings (see the check after this list).
|
4. Vocab size, special-token set, pre/post-tokenization logic, and output formatting match Aya-Expanse one-for-one.
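
Because point 3 is what makes embedding reuse possible, it is easy to sanity-check. A minimal sketch (repo ids as used in this card; `get_vocab()` returns a token → ID mapping):

```python
from transformers import AutoTokenizer

aya = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
ayayay = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay-tokenizer")

aya_vocab = aya.get_vocab()        # token -> id
ayayay_vocab = ayayay.get_vocab()  # same size, by design

# Tokens kept from Aya should map to the same id, so their
# Aya-Expanse embedding rows can be reused as-is.
shared = set(aya_vocab) & set(ayayay_vocab)
preserved = sum(aya_vocab[t] == ayayay_vocab[t] for t in shared)
print(f"{preserved:,} of {len(shared):,} shared tokens keep their original ID")
```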
|
|
|
## Simple example |
|
```python |
from transformers import AutoTokenizer
|
tokenizer = AutoTokenizer.from_pretrained( |
|
"transhumanist-already-exists/ayayay-tokenizer" |
|
) |
|
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False) |
|
print(toks.input_ids) # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻 |
|
``` |
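
For comparison, you can run the same sentence through the original Aya-Expanse tokenizer; per the metrics below, it averages roughly 2.2 tokens per Ukrainian word versus about 1.6 for ayayay (a quick sketch; the exact count is not reproduced here):

```python
from transformers import AutoTokenizer

aya = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-32b")
aya_toks = aya("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(aya_toks.input_ids))  # expect noticeably more than 4 tokens
```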
|
|
|
## Metrics |
|
|
|
Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory). |
|
||lang-uk/malyuk |100k texts|allenai/c4(en)| 100k texts|allenai/c4(es, fr, it, de) | 400k texts |QIRIM/crh_monocorpus(Cyrillic) | 94 texts |allenai/c4(ru) | 100k texts|allenai/c4(bg) | 100k texts|allenai/c4(be)| 100k texts| |
|
|--------------------------------|-------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|-----------------------------------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------| |
|
|words count <td colspan=2>22,898,164 <td colspan=2>36,170,971 <td colspan=2>198,173,216 <td colspan=2>1,868,259 <td colspan=2>42,557,519 <td colspan=2>44,627,199 <td colspan=2>43,153,645 | |
|
|||||||||||||||| |
|
|tokenizers |tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word| |
|
|google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 | |
|
|Qwen/Qwen3-8B |84,408,084 |3.686 |46,884,593 |1.296 |395,581,536 |1.996 |7,956,741 |4.259 |116,115,062 |2.728 |132,597,427 |2.971 |173,571,099 |4.022 | |
|
|meta-llama/Llama-3.1-8B-Instruct|57,226,997 |2.499 |46,085,724 |1.274 |382,143,751 |1.928 |7,386,873 |3.954 |104,974,733 |2.467 |119,123,733 |2.669 |150,189,294 |3.48 | |
|
|microsoft/Phi-4-mini-instruct |59,447,036 |2.596 |45,423,925 |**1.256** |335,188,687 |**1.691** |5,995,822 |3.209 |91,824,464 |**2.158** |102,472,523 |**2.296** |119,587,038 |**2.771** |

|CohereLabs/aya-expanse-8b |50,973,632 |2.226 |47,364,187 |1.309 |353,221,932 |1.782 |6,614,719 |3.541 |93,089,697 |2.187 |112,612,668 |2.523 |141,262,943 |3.273 |
|
|**ayayay-tokenizer (Ours)** |37,094,157 |**1.62**🤩 |48,288,882 |1.335 |372,587,959 |1.88 |4,238,587 |**2.269** |107,331,167 |2.522 |114,292,191 |2.561 |133,618,186 |3.096 | |
|
|Comments <td colspan=2> Significant 27 % improvement over the Aya-Expanse baseline; absolute leader in Ukrainian tokenization.<td colspan=2>Tokens-per-word for English rises by less than 4 % compared with the baseline.<td colspan=2>The ayayay tokenizer retains strong multilingual capability.<td colspan=2>Shows a significant improvement on QIRIM Cyrillic versus the original Aya and other tokenizers.<td colspan=2>Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.<td colspan=4>Other Cyrillic languages, such as Bulgarian and Belarusian, perform well after the token replacement; Belarusian improves especially noticeably.|
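
The tokens-per-word figures above can be approximated with a short script. A sketch, assuming the metric is total token count divided by whitespace-split word count, and that Malyuk rows expose a `text` field (the column name and loading details are assumptions; check the dataset schema):

```python
from itertools import islice

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay-tokenizer")
ds = load_dataset("lang-uk/malyuk", split="train", streaming=True)

tokens = words = 0
for row in islice(ds, 1_000):  # small sample; the table above uses 100k texts
    text = row["text"]  # assumed column name
    tokens += len(tokenizer(text, add_special_tokens=False).input_ids)
    words += len(text.split())

print(f"toks/word ≈ {tokens / words:.3f}")
```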
|
|
|
|
|
## Contents |
|
|
|
- [tokenizer.json](tokenizer.json): Byte-level tokenizer spec (vocab, merges, model settings).
|
|
|
- [tokenizer_utf8.json](tokenizer_utf8.json): Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection. |
|
|
|
- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000. |
|
|
|
- [merge_info.json](merge_info.json): Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json) (see the inspection sketch after this list).
|
|
|
- [tokenizer_config.json](tokenizer_config.json): Configuration metadata. |
|
|
|
- [special_tokens_map.json](special_tokens_map.json): Mapping of special tokens (identical to Aya's).
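
To see exactly which IDs were swapped, merge_info.json can be downloaded and inspected. A sketch (its internal key names are not documented in this card, so the snippet only lists the top-level structure):

```python
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="transhumanist-already-exists/ayayay_tokenizer",
    filename="merge_info.json",
)
with open(path, encoding="utf-8") as f:
    merge_info = json.load(f)

# Discover the actual structure before relying on any key names.
print(list(merge_info)[:10] if isinstance(merge_info, dict) else merge_info[:10])
```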
|
|
|
## Initialization of embeddings for new tokens in Aya-Expanse models

Some tokens are identical to those in the original Aya-Expanse tokenizer, so their embedding rows can be copied over directly. For the newly added tokens, you can initialize embeddings with tools such as [FOCUS](https://github.com/konstantinjdobler/focus/tree/main) and [ZeTT](https://github.com/bminixhofer/zett). The simplest alternative, and often an effective one, is to initialize the new embeddings randomly and train them with a warm-up schedule; a minimal sketch follows.
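
The sketch below implements that random-initialization route, assuming a token counts as "new" whenever its (token, ID) pair differs from the original Aya vocabulary; no resize is needed because the vocab sizes match one-for-one:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("CohereLabs/aya-expanse-8b")
old_vocab = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b").get_vocab()
new_vocab = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay-tokenizer"
).get_vocab()

# Rows whose token changed under the remap need fresh embeddings.
new_ids = [i for t, i in new_vocab.items() if old_vocab.get(t) != i]

emb = model.get_input_embeddings().weight
with torch.no_grad():
    # Random init scaled to the existing embedding distribution;
    # train these rows with a warm-up schedule afterwards.
    emb[new_ids] = torch.randn(
        len(new_ids), emb.shape[1], dtype=emb.dtype, device=emb.device
    ) * emb.std()
```

If the checkpoint does not tie input and output embeddings, repeat the same re-initialization for `model.get_output_embeddings().weight`.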
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex |
|
@misc{zaduha2025post9164, |
|
author = "{Bohdan Didenko}", |
|
title = "{Post \#9164 on Telegram Channel Zaduha}", |
|
howpublished = "\url{https://t.me/zaduha/9164}", |
|
  month = jun,
|
year = {2025}, |
|
note = "[Online; accessed 8 June 2025]" |
|
} |
|
``` |