Commit 50b53cc · verified · Parent: 4241aed

Update README.md

Files changed (1):
  1. README.md +7 -7
README.md CHANGED
@@ -65,7 +65,7 @@ print(toks.input_ids) # [123903, 175118, 167580, 196099] - only 4 tokens 💪
  Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.co/Sofetory).
  ||lang-uk/malyuk |100k texts|allenai/c4(en)| 100k texts|allenai/c4(es, fr, it, de) | 400k texts |QIRIM/crh_monocorpus(Cyrillic) | 94 texts |allenai/c4(ru) | 100k texts|allenai/c4(bg) | 100k texts|allenai/c4(be)| 100k texts|
  |--------------------------------|-------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|-----------------------------------------------------------------------------------------------|---------|--------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------------------|---------|
- |words count <td colspan=2>22,898,164 |36,170,971 | |198,173,216 | |1,868,259 | |42,557,519 | |44,627,199 | |43,153,645 | |
+ |words count <td colspan=2>22,898,164 <td colspan=2>36,170,971 <td colspan=2>198,173,216 <td colspan=2>1,868,259 <td colspan=2>42,557,519 <td colspan=2>44,627,199 <td colspan=2>43,153,645 |
  ||||||||||||||||
  |tokenizers |tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|tokens |toks/word|
  |google/gemma-3-12b-it |57,388,402 |2.506 |47,285,432 |1.307 |354,241,840 |1.788 |6,240,944 |3.341 |95,520,817 |2.245 |103,950,626 |2.329 |131,398,147 |3.045 |
@@ -79,17 +79,17 @@ Acknowledgement: evaluation results provided by [@Sofetory](https://huggingface.
 
  ## Contents
 
- - **`tokenizer.json`** Byte‐level tokenizer spec (vocab, merges, model settings).
+ - [tokenizer.json](tokenizer.json) Byte‐level tokenizer spec (vocab, merges, model settings).
 
- - **`tokenizer_utf8.json`** Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
+ - [tokenizer_utf8.json](tokenizer_utf8.json) Human-readable dump: UTF-8-decoded sub-tokens and merge rules, for corpus-linguistic inspection.
 
- - **`malyuk_qirim_tokenizer.json`** Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
+ - [malyuk_qirim_tokenizer.json](malyuk_qirim_tokenizer.json) Aya-style tokenizer trained on the full Malyuk Ukrainian corpus plus Cyrillic QIRIM (100 : 1 ratio), with min_frequency = 4_000.
 
- - **`merge_info.json`** Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).
+ - [merge_info.json](merge_info.json) Lists the replaced Aya token IDs and the IDs of the added Malyuk tokens in [malyuk_qirim_tokenizer](https://huggingface.co/transhumanist-already-exists/ayayay_tokenizer/blob/main/malyuk_qirim_tokenizer.json).
 
- - **`tokenizer_config.json`** Configuration metadata.
+ - [tokenizer_config.json](tokenizer_config.json) Configuration metadata.
 
- - **`special_tokens_map.json`** Mapping of special token (The same with Aya).
+ - [special_tokens_map.json](special_tokens_map.json) Mapping of special tokens (same as in Aya).
 
  ## Initialisation of embeddings for new tokens in Aya-Expanse models
  Some tokens are identical to those in the original Aya-Expanse tokenizer. For the newly added tokens, you can initialise embeddings with tools such as [Focus](https://github.com/konstantinjdobler/focus/tree/main) and [Zett](https://github.com/bminixhofer/zett). The simplest—and often effective—alternative is to initialise the new embeddings randomly and train them with a warm-up schedule.
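For reference, the files listed under Contents load with the standard Hugging Face APIs. A minimal sketch, assuming the usual repo layout (the repo id is taken from the merge_info.json link above; the full README may already show equivalent usage):

```python
# Minimal loading sketch for the files listed under "Contents".
# Assumes the standard Hugging Face layout; repo id taken from the merge_info.json link.
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from transformers import AutoTokenizer

REPO = "transhumanist-already-exists/ayayay_tokenizer"

# Main tokenizer: tokenizer.json + tokenizer_config.json + special_tokens_map.json.
toks = AutoTokenizer.from_pretrained(REPO)
print(toks("Привіт, світ!").input_ids)

# Stand-alone Aya-style Malyuk/QIRIM tokenizer.
path = hf_hub_download(repo_id=REPO, filename="malyuk_qirim_tokenizer.json")
malyuk = Tokenizer.from_file(path)
print(malyuk.encode("Привіт, світ!").ids)
```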
 
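A minimal sketch of the random-initialisation baseline, assuming an Aya-Expanse checkpoint and that merge_info.json exposes the affected token IDs (the key name below is hypothetical); Focus or Zett would replace the initialisation step:

```python
# Sketch of the random-initialisation baseline (not the Focus/Zett route).
# Assumptions: the Aya-Expanse checkpoint id, and that merge_info.json stores the
# affected token IDs under a key named "added_token_ids" (hypothetical).
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM

REPO = "transhumanist-already-exists/ayayay_tokenizer"
model = AutoModelForCausalLM.from_pretrained("CohereForAI/aya-expanse-8b")  # assumed base model

with open(hf_hub_download(repo_id=REPO, filename="merge_info.json")) as f:
    new_ids = torch.tensor(json.load(f)["added_token_ids"])  # hypothetical key

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    # Vocab size is unchanged: the new tokens reuse the IDs of the replaced Aya tokens,
    # so only those rows are re-initialised from the statistics of the untouched rows.
    keep = torch.ones(emb.shape[0], dtype=torch.bool)
    keep[new_ids] = False
    mean, std = emb[keep].mean(), emb[keep].std()
    emb[new_ids] = torch.randn(len(new_ids), emb.shape[1], dtype=emb.dtype, device=emb.device) * std + mean
# Train the re-initialised rows with a warm-up schedule, as suggested above.
```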