---
inference: false
library_name: transformers
base_model: CohereLabs/aya-expanse-32b
language:
- uk
- crh
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ar
- el
- fa
- pl
- id
- cs
- he
- hi
- nl
- ro
- ru
- tr
- vi
datasets:
- lang-uk/malyuk
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- aya-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "ayayay - ukrainized aya tokenizer"
---
# Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer
#### Ayayay is the first tokenizer that makes Ukrainian the core language in a multilingual vocabulary — while retaining as much compatibility with the original Aya-Expanse tokenizer as possible through careful (partially manual) token remapping.
## Feature overview
1. +118,985 new Cyrillic BPE tokens from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json), a tokenizer trained on the full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus), keeping only sub-words that appear ≥ 4,000 times.
2. Only the tail of the Aya vocabulary (IDs > 150,000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
3. Unchanged tokens preserve their IDs, enabling direct reuse of the Aya-Expanse embeddings for those tokens.
4. Vocabulary size, special-token set, pre/post-tokenization logic, and output formatting match Aya-Expanse one-for-one.
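The ID preservation described in items 2 and 3 is what makes embedding reuse possible. Here is a minimal sketch of the idea, using toy vocabularies and plain Python lists in place of a real embedding matrix. The names `reuse_embeddings`, `old_vocab`, and `new_vocab` are hypothetical and not part of this repository, and the random re-initialization of overwritten rows is an assumption, not the authors' method.

```python
import random


def reuse_embeddings(old_vocab, new_vocab, old_emb, seed=0):
    """Copy embedding rows for IDs whose token string is unchanged.

    old_vocab / new_vocab map token string -> ID and are assumed to cover
    the same ID range (the vocabulary size is unchanged). old_emb is a list
    of rows, one per ID.
    """
    rng = random.Random(seed)
    id_to_old_token = {idx: tok for tok, idx in old_vocab.items()}
    dim = len(old_emb[0])
    new_emb = [None] * len(old_emb)
    kept = []
    for tok, idx in new_vocab.items():
        if id_to_old_token.get(idx) == tok:
            # Same token under the same ID: the old row can be reused as-is.
            new_emb[idx] = list(old_emb[idx])
            kept.append(idx)
        else:
            # Overwritten slot: re-initialize (hypothetical choice of init).
            new_emb[idx] = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    return new_emb, sorted(kept)


# Toy data: IDs 0-2 are untouched, ID 3 was overwritten with a new token.
old_vocab = {"<pad>": 0, "the": 1, "ing": 2, "при": 3}
new_vocab = {"<pad>": 0, "the": 1, "ing": 2, "оптимізм": 3}
old_emb = [[float(4 * i + j) for j in range(4)] for i in range(4)]

new_emb, kept_ids = reuse_embeddings(old_vocab, new_vocab, old_emb)
print(kept_ids)  # [0, 1, 2]
```

With real tokenizers, the two dictionaries could be obtained from `tokenizer.get_vocab()`; how the overwritten rows should be initialized before further training is left open here.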
## Simple example
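```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay-tokenizer"
)
# Encode without special tokens so only the text tokens are counted
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
```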
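To put the four-token result above in context, one common way to compare tokenizers is the average number of tokens per whitespace-separated word. The sketch below is illustrative only: `tokens_per_word` is a hypothetical helper, the two stand-in tokenizers are not real BPE, and this is not necessarily the measure used in the evaluation reported in this card.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words if n_words else 0.0


# Stand-in tokenizers for illustration only (not real BPE).
def word_level(text):
    return text.split()


def char_level(text):
    return [ch for ch in text if not ch.isspace()]


sample = ["Всі красиві зберігають оптимізм"]
print(tokens_per_word(sample, word_level))  # 1.0
print(tokens_per_word(sample, char_level))
```

To apply this to a real tokenizer, pass a callable such as `lambda t: tokenizer(t, add_special_tokens=False).input_ids`. For the sentence above, four tokens over four words gives 1.0 tokens per word.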
## Metrics
Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).
| | lang-uk/malyuk | 100k texts | allenai/c4 (en) | 100k texts | allenai/c4 (es, fr, it, de) | 400k texts | QIRIM/crh_monocorpus (Cyrillic) | 94 texts | allenai/c4 (ru) | 100k texts | allenai/c4 (bg) | 100k texts | allenai/c4 (be) | 100k texts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|words count