---
inference: false
library_name: transformers
base_model: CohereLabs/aya-expanse-32b
language:
  - uk
  - crh
  - en
  - fr
  - de
  - es
  - it
  - pt
  - ja
  - ko
  - zh
  - ar
  - el
  - fa
  - pl
  - id
  - cs
  - he
  - hi
  - nl
  - ro
  - ru
  - tr
  - vi
datasets:
  - lang-uk/malyuk
  - QIRIM/crh_monocorpus
multilinguality:
  - multilingual
tags:
  - aya-tokenizer
  - ukraine
  - corpus-linguistics
pretty_name: "Ayayay - Ukrainianized Aya Tokenizer"
---
# Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer

<img src="ayayay.png" width="400" style="display:block; margin-left:auto; margin-right:auto;"/>

#### Ayayay is the first tokenizer that makes Ukrainian the core language of a multilingual vocabulary, while retaining as much compatibility as possible with the original Aya-Expanse tokenizer through careful (partially manual) token remapping.

## Feature Overview

1. +118,985 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on the full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus); only sub-words that appear ≥ 4,000 times are kept.
2. Only the tail of the Aya vocab (IDs > 150,000) and the ~25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
3. Unchanged tokens keep their original IDs, enabling direct reuse of the Aya-Expanse embeddings.
4. Vocabulary size, special-token set, pre-/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.
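The overwrite strategy above can be sketched in plain Python. This is an illustrative toy (plain `{token: id}` dicts, a hypothetical `remap_vocab` helper), not the actual build script; the real replaced-ID bookkeeping lives in merge_info.json, and real tokenizers also carry merge rules:

```python
def is_cyrillic(token: str) -> bool:
    """True if the token contains at least one Cyrillic character."""
    return any("\u0400" <= ch <= "\u04FF" for ch in token)

def remap_vocab(aya_vocab: dict[str, int], new_tokens: list[str],
                tail_start: int = 150_000) -> dict[str, int]:
    """Overwrite the vocab tail (id > tail_start) and existing Cyrillic
    tokens with new Ukrainian BPE tokens; every other token keeps its id."""
    # IDs that are free to be reassigned.
    free_ids = sorted(
        tok_id for tok, tok_id in aya_vocab.items()
        if tok_id > tail_start or is_cyrillic(tok)
    )
    # Tokens that survive untouched, with their original ids.
    merged = {tok: tok_id for tok, tok_id in aya_vocab.items()
              if tok_id <= tail_start and not is_cyrillic(tok)}
    # Assign the freed ids to the new tokens, in order.
    for tok_id, tok in zip(free_ids, new_tokens):
        merged[tok] = tok_id
    return merged

# Toy demonstration: "hello" keeps id 10; ids 20 and 150001 are reassigned.
aya = {"hello": 10, "привет": 20, "tail_junk": 150_001}
merged = remap_vocab(aya, ["перемога", "земля"])
```

Because unchanged tokens never move, an embedding matrix indexed by the old IDs stays valid for them, which is exactly what point 3 relies on.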

## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay_tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids) # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
```

## Metrics
Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).
| | lang-uk/malyuk | 100k texts | allenai/c4 (en) | 100k texts | allenai/c4 (es, fr, it, de) | 400k texts | QIRIM/crh_monocorpus (Cyrillic) | 94 texts | allenai/c4 (ru) | 100k texts | allenai/c4 (bg) | 100k texts | allenai/c4 (be) | 100k texts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|words count <td colspan=2>22,898,164 <td colspan=2>36,170,971 <td colspan=2>198,173,216 <td colspan=2>1,868,259 <td colspan=2>42,557,519 <td colspan=2>44,627,199 <td colspan=2>43,153,645 |
|||||||||||||||
|tokenizers|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|tokens|toks/word|
|google/gemma-3-12b-it|57,388,402|2.506|47,285,432|1.307|354,241,840|1.788|6,240,944|3.341|95,520,817|2.245|103,950,626|2.329|131,398,147|3.045|
|Qwen/Qwen3-8B|84,408,084|3.686|46,884,593|1.296|395,581,536|1.996|7,956,741|4.259|116,115,062|2.728|132,597,427|2.971|173,571,099|4.022|
|meta-llama/Llama-3.1-8B-Instruct|57,226,997|2.499|46,085,724|1.274|382,143,751|1.928|7,386,873|3.954|104,974,733|2.467|119,123,733|2.669|150,189,294|3.48|
|microsoft/Phi-4-mini-instruct|59,447,036|2.596|45,423,925|**1.256**|335,188,687|**1.691**|5,995,822|3.209|91,824,464|**2.158**|102,472,523|**2.296**|119,587,038|**2.771**|
|CohereLabs/aya-expanse-8b|50,973,632|2.226|47,364,187|1.309|353,221,932|1.782|6,614,719|3.541|93,089,697|2.187|112,612,668|2.523|141,262,943|3.273|
|**ayayay-tokenizer (ours)**|37,094,157|**1.62** 🤩|48,288,882|1.335|372,587,959|1.88|4,238,587|**2.269**|107,331,167|2.522|114,292,191|2.561|133,618,186|3.096|
|Comments <td colspan=2>Significant 27% improvement over the Aya-Expanse baseline; the absolute leader in Ukrainian tokenisation. <td colspan=2>Tokens-per-word for English rises by less than 4% compared with the baseline. <td colspan=2>The Ayayay tokenizer retains strong multilingual capabilities. <td colspan=2>Significant improvement on QIRIM Cyrillic versus the original Aya and the other tokenizers. <td colspan=2>Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen. <td colspan=4>Other Cyrillic languages, such as Bulgarian and Belarusian, perform well after the token replacement; Belarusian improves especially noticeably. |
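The toks/word figures above are plain ratios: total tokens emitted over total words in the sample. A minimal sketch of how such a number can be computed (whitespace word-splitting is an assumption here; the evaluators' exact word-counting method may differ):

```python
from typing import Callable, Iterable, List

def tokens_per_word(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Total tokens emitted by `tokenize` divided by the whitespace word count."""
    texts = list(texts)
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_tokens / total_words

# Toy tokenizer: one token per non-space character,
# so "ab cd" -> 4 tokens over 2 words.
ratio = tokens_per_word(["ab cd"], lambda t: [ch for ch in t if ch != " "])
```

With a real tokenizer you would pass something like `lambda t: tokenizer(t, add_special_tokens=False).input_ids` in place of the toy lambda; a lower ratio means fewer tokens per word, i.e. a more efficient vocabulary for that language.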


## Contents

- [tokenizer.json](tokenizer.json): Byte-level tokenizer spec (vocab, merges, model settings).

- [tokenizer_utf8.json](tokenizer_utf8.json): Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.

- [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json): Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.

- [merge_info.json](merge_info.json): Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).

- [tokenizer_config.json](tokenizer_config.json): Configuration metadata.

- [special_tokens_map.json](special_tokens_map.json): Mapping of special tokens (identical to Aya).
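The `min_frequency = 4_000` cut-off mentioned above simply means a sub-word candidate must occur at least 4,000 times in the training corpus to enter the vocabulary. Conceptually (a toy filter, not the `tokenizers`-library training loop, shown at a tiny scale):

```python
from collections import Counter
from typing import Iterable, Set

def frequent_subwords(corpus_tokens: Iterable[str],
                      min_frequency: int = 4_000) -> Set[str]:
    """Keep only sub-word candidates whose corpus frequency meets the cut-off."""
    counts = Counter(corpus_tokens)
    return {tok for tok, n in counts.items() if n >= min_frequency}

# Tiny-scale demonstration with a cut-off of 2: only "та" survives.
kept = frequent_subwords(["та", "та", "що"], min_frequency=2)
```

In the real training run this thresholding happens inside the BPE trainer while merges are learned, which is why the released vocabulary contains no sub-words rarer than the cut-off.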

## Initialisation of embeddings for new tokens in Aya-Expanse models
Some tokens are identical to those in the original Aya-Expanse tokenizer and can reuse their existing embeddings. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest, and often effective, alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
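A minimal sketch of the random-initialisation option, using a NumPy array as a stand-in for the model's embedding matrix (the `init_new_rows` helper and the mean/std-matched sampling are illustrative assumptions; Focus or Zett would replace the sampling step with a smarter transfer):

```python
import numpy as np

def init_new_rows(embeddings: np.ndarray, new_token_ids: list,
                  rng: np.random.Generator) -> np.ndarray:
    """Re-initialise the rows of newly added tokens by sampling from a normal
    distribution matched to the existing embeddings' mean and std."""
    out = embeddings.copy()
    mu, sigma = embeddings.mean(), embeddings.std()
    for tok_id in new_token_ids:
        out[tok_id] = rng.normal(mu, sigma, size=embeddings.shape[1])
    return out

# Toy 5-token, 4-dim embedding table; tokens 3 and 4 are "new".
rng = np.random.default_rng(0)
emb = rng.normal(0.0, 0.02, size=(5, 4)).astype(np.float32)
new_emb = init_new_rows(emb, [3, 4], rng)
```

In practice you would first call `model.resize_token_embeddings(len(tokenizer))` on the Aya-Expanse model, then overwrite only the rows listed as replaced in merge_info.json, leaving the untouched-token rows exactly as the pretrained checkpoint provides them.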

## Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9164,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9164 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9164}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 8 June 2025]"
}
```