---
inference: false
library_name: transformers
base_model: CohereLabs/aya-expanse-32b
language:
- uk
- crh
- en
- fr
- de
- es
- it
- pt
- ja
- ko
- zh
- ar
- el
- fa
- pl
- id
- cs
- he
- hi
- nl
- ro
- ru
- tr
- vi
datasets:
- lang-uk/malyuk
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- aya-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "ayayay - ukrainized aya tokenizer"
---

# Ayayay — Malyuk-powered Ukrainianization for the Aya-Expanse Tokenizer

Ayayay is the first tokenizer to place Ukrainian at the center of a multilingual vocabulary—retaining as much original tokenizer compatibility as possible through careful (partially manual) token remapping.

Feature Overview:

1. +118,985 new Cyrillic BPE merges from [malyuk_qirim_tokenizer.json](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json), trained on the full [Malyuk Ukrainian corpus](https://huggingface.co/datasets/lang-uk/malyuk/tree/main) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus), keeping only sub-words that appear ≥ 4,000 times.
2. Only the tail end of the Aya vocab (IDs > 150,000) and the 25K Cyrillic tokens already present in Aya were overwritten; the rest of the vocabulary is intact.
3. Unchanged tokens preserve their IDs, enabling direct reuse of the Aya-Expanse embeddings.
4. The special-token set, pre/post-tokenisation logic, and output formatting match Aya-Expanse one-for-one.

## Simple example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/ayayay_tokenizer"
)
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # [123903, 175118, 167580, 196099] - only 4 tokens 💪🏻
```

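Because unchanged tokens keep their original IDs (point 3 above), you can check how much of the vocabulary remains directly reusable against the original Aya-Expanse tokenizer. A minimal sketch, assuming you have access to the gated `CohereLabs/aya-expanse-8b` repository:

```python
from transformers import AutoTokenizer

# Both tokenizers expose a token -> id map via get_vocab().
aya = AutoTokenizer.from_pretrained("CohereLabs/aya-expanse-8b")
ayayay = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay_tokenizer")

aya_vocab = aya.get_vocab()
ayayay_vocab = ayayay.get_vocab()

# Tokens that survived the remapping keep their Aya IDs, so their
# embedding rows can be copied over unchanged.
shared = sum(1 for tok, idx in ayayay_vocab.items() if aya_vocab.get(tok) == idx)
print(f"{shared} of {len(ayayay_vocab)} tokens keep their original Aya-Expanse ID")
```
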
## Metrics

Many thanks to [@Sofetory](https://huggingface.co/Sofetory) for evaluating the new tokenizer.

||lang-uk/malyuk |100k texts|allenai/c4 (en)|100k texts|allenai/c4 (es, fr, it, de)|400k texts|QIRIM/crh_monocorpus (Cyrillic)|94 texts|allenai/c4 (ru)|100k texts|allenai/c4 (bg)|100k texts|allenai/c4 (be)|100k texts|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|words count <td colspan=2>22,898,164 |36,170,971 | |198,173,216 | |1,868,259 | |42,557,519 | |44,627,199 | |43,153,645 | |
||||||||||||||||
|tokenizers |tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|
|google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 |
|Qwen/Qwen3-8B |84,408,084 |3.686 |46,884,593 |1.296 |395,581,536 |1.996 |7,956,741 |4.259 |116,115,062 |2.728 |132,597,427 |2.971 |173,571,099 |4.022 |
|meta-llama/Llama-3.1-8B-Instruct|57,226,997 |2.499 |46,085,724 |1.274 |382,143,751 |1.928 |7,386,873 |3.954 |104,974,733 |2.467 |119,123,733 |2.669 |150,189,294 |3.48 |
|microsoft/Phi-4-mini-instruct |59,447,036 |2.596 |45,423,925 |1.256 |335,188,687 |1.691 |5,995,822 |3.209 |91,824,464 |2.158 |102,472,523 |2.296 |119,587,038 |2.771 |
|CohereLabs/aya-expanse-8b |50,973,632 |2.226 |47,364,187 |1.309 |353,221,932 |1.782 |6,614,719 |3.541 |93,089,697 |2.187 |112,612,668 |2.523 |141,262,943 |3.273 |
|ayayay_tokenizer |37,094,157 |1.62🤩 |48,288,882 |1.335 |372,587,959 |1.88 |4,238,587 |2.269 |107,331,167 |2.522 |114,292,191 |2.561 |133,618,186 |3.096 |
|Comments <td colspan=2>Significant 27% improvement over the Aya-Expanse baseline; absolute leader in Ukrainian tokenization.<td colspan=2>Tokens-per-word for English rises by less than 4% compared with the baseline.<td colspan=2>The tokenizer retains strong multilingual capabilities.<td colspan=2>Shows significant improvement on QIRIM Cyrillic versus the original Aya and other tokenizers.<td colspan=2>Russian efficiency drops, owing to the Ukrainian-centric changes, but still beats Qwen.<td colspan=4>Other Cyrillic languages, such as Bulgarian and Belarusian, perform well after the token replacement; Belarusian improves especially noticeably.|

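For reference, ratios like those above can be approximated with a short script. The sketch below streams a 100k-text sample of Malyuk and counts whitespace-separated words; it assumes the corpus exposes a `text` field, and whitespace splitting is only an approximation of the word counts used in the table.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/ayayay_tokenizer")

# Stream a 100k-text sample of Malyuk and measure tokens per whitespace word.
ds = load_dataset("lang-uk/malyuk", split="train", streaming=True)

tokens = words = 0
for i, row in enumerate(ds):
    if i >= 100_000:
        break
    text = row["text"]
    tokens += len(tokenizer(text, add_special_tokens=False).input_ids)
    words += len(text.split())

print(f"toks/word on the Malyuk sample: {tokens / words:.3f}")
```
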
## Contents

- **`tokenizer.json`** Byte-level tokenizer spec (vocab, merges, model settings).

- **`tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.

- **`malyuk_qirim_tokenizer.json`** Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000 (see the loading sketch below).

- **`merge_info.json`** Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).

- **`tokenizer_config.json`** Configuration metadata.

- **`special_tokens_map.json`** Mapping of special tokens (same as in Aya).

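For corpus-linguistic work you can pull these files directly from the repo and inspect them. A minimal sketch using `huggingface_hub` and `tokenizers`; the internal structure of `merge_info.json` is not documented here, so the script only prints its top-level layout:

```python
import json

from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

repo = "transhumanist-already-exists/ayayay_tokenizer"

# Standalone Malyuk/QIRIM tokenizer that supplied the new Cyrillic merges.
malyuk_path = hf_hub_download(repo_id=repo, filename="malyuk_qirim_tokenizer.json")
malyuk_tok = Tokenizer.from_file(malyuk_path)
print("malyuk_qirim vocab size:", malyuk_tok.get_vocab_size())

# merge_info.json records which Aya IDs were replaced and which Malyuk IDs were added.
info_path = hf_hub_download(repo_id=repo, filename="merge_info.json")
with open(info_path, encoding="utf-8") as f:
    merge_info = json.load(f)
print("merge_info top-level structure:", type(merge_info), list(merge_info)[:5])
```
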
## Initialisation of embeddings for new tokens in Aya-Expanse models

Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest—and often effective—alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.

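A minimal sketch of the random-initialisation route, assuming the IDs of the overwritten tokens are read from `merge_info.json` (the `replaced_aya_ids` key below is illustrative, not the file's actual field name, and the 8B model is used only for brevity):

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "transhumanist-already-exists/ayayay_tokenizer"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained("CohereLabs/aya-expanse-8b", torch_dtype=torch.bfloat16)

# Hypothetical field name: the list of Aya token IDs whose meaning changed.
with open(hf_hub_download(repo_id=repo, filename="merge_info.json"), encoding="utf-8") as f:
    replaced_ids = json.load(f)["replaced_aya_ids"]

emb = model.get_input_embeddings().weight
mean, std = emb.mean(dim=0), emb.std(dim=0)

# Re-initialise only the overwritten rows; every unchanged token keeps its trained embedding.
with torch.no_grad():
    for idx in replaced_ids:
        emb[idx] = torch.normal(mean, std)

model.save_pretrained("aya-expanse-8b-ayayay-init")
tokenizer.save_pretrained("aya-expanse-8b-ayayay-init")
```

Warming up only these re-initialised rows (with the rest of the model frozen or on a lower learning rate) then follows the usual recipe.
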
## Acknowledgement

Metrics evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).

## Citation

**BibTeX:**

```bibtex
@misc{zaduha2025post9163,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9163 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9163}",
  month        = jun,
  year         = {2025},
  note         = "[Online; accessed 8 June 2025]"
}
```