---
license: apache-2.0
datasets:
- Vikhrmodels/GrandMaster-PRO-MAX
- Den4ikAI/ru_sberquad_long_answers
- HuggingFaceH4/ultrachat_200k
- IlyaGusev/gpt_roleplay_realm
- loim/characters_dialogs
- OpenAssistant/oasst1
- OpenAssistant/oasst2
language:
- ru
- en
pipeline_tag: token-classification
tags:
- bpe
- tokenizer
- tokipona
---
<details>
<summary>🇷🇺 Русский...</summary>
# **Русско-английский BPE-токенизатор**
Оптимизированный токенизатор для трехъязычных текстов с расширенной поддержкой русской лексики и эффективной обработкой английского и токипона.
## **Ключевые характеристики**
- **Формат**: BPE (Byte-Pair Encoding)
- **Размер словаря**: 12 288 токенов
- **Языки**: Русский + Английский + Токипона (просто потому что могу и это ничего не стоит)
- **Специальные токены**:
  - `<|endoftext|>`
  - `<|padding|>`
  - `<|mask|>`
  - `<|user|>`
  - `<|assistant|>`
  - `<|system|>`
  - `<|end|>`
  - `<|en|>`
  - `<|ru|>`
  - `<|tok|>`
  - `<|`
  - `|>`
</details>
<details>
<summary>🇬🇧 English...</summary>
# **Russian-English BPE tokenizer**
A tokenizer optimized for trilingual text, with extended support for Russian vocabulary and efficient processing of English and Toki Pona.
## **Key Features**
- **Format**: BPE (Byte-Pair Encoding)
- **Vocabulary size**: 12,288 tokens
- **Languages**: Russian + English + Toki Pona (just because I can and it costs nothing)
- **Special tokens**:
  - `<|endoftext|>`
  - `<|padding|>`
  - `<|mask|>`
  - `<|user|>`
  - `<|assistant|>`
  - `<|system|>`
  - `<|end|>`
  - `<|en|>`
  - `<|ru|>`
  - `<|tok|>`
  - `<|`
  - `|>`
</details>
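
The special-token list above can be checked against a loaded vocabulary. The following is a minimal sketch, not the author's tooling: it assumes the tokenizer is published as `loim/whiff-tokenizer-12k` (the repo id that appears in the test tables) and that it loads through `transformers.AutoTokenizer`. The helper names `load_tokenizer` and `missing_special_tokens` are hypothetical.

```python
# Special tokens as listed in the model card.
SPECIAL_TOKENS = [
    "<|endoftext|>", "<|padding|>", "<|mask|>",
    "<|user|>", "<|assistant|>", "<|system|>", "<|end|>",
    "<|en|>", "<|ru|>", "<|tok|>", "<|", "|>",
]


def load_tokenizer(repo_id="loim/whiff-tokenizer-12k"):
    """Load the tokenizer from the Hub.

    Assumption: `transformers` is installed and `repo_id` is correct.
    The import is lazy so the rest of the sketch works without it.
    """
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id)


def missing_special_tokens(vocab, expected=SPECIAL_TOKENS):
    """Return the expected special tokens that are absent from `vocab`.

    `vocab` is a token -> id mapping, e.g. the result of `tokenizer.get_vocab()`.
    """
    return [token for token in expected if token not in vocab]


# Possible usage (requires network access, so it is left commented out):
#   tokenizer = load_tokenizer()
#   print(missing_special_tokens(tokenizer.get_vocab()))
```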
---
<details>
<summary>🧪 Tests...</summary>
### English text (27,741,474 chars, 4,613,167 words)
| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | 5639822 | 1.22 | 128000 | 60979 | 47.6 | 4.9 | 1 | 17.8162 | 3.7699 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 5705024 | 1.24 | 146213 | 61580 | 42.1 | 4.9 | 1 | 17.6528 | 4.2012 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5708987 | 1.24 | 151643 | 60135 | 39.7 | 4.9 | 1 | 19.3785 | 3.9194 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5708988 | 1.24 | 151643 | 60136 | 39.7 | 4.9 | 1 | 18.9563 | 1.6886 | 16384 |
| IlyaGusev/saiga_nemo_12b | 5806480 | 1.26 | 131072 | 56865 | 43.4 | 4.8 | 1 | 18.4329 | 3.1752 | 1024000 |
| openai-community/gpt2 | 5836927 | 1.27 | 50257 | 45466 | 90.5 | 4.8 | 1 | 16.6623 | 2.2766 | 1024 |
| facebook/opt-125m | 5836928 | 1.27 | 50265 | 45467 | 90.5 | 4.8 | 1 | 19.4051 | 3.7256 | 1E+030 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 14.5142 | 3.0903 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.081 | 4.5032 | 1E+030 |
| IlyaGusev/saiga_yandexgpt_8b | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.7957 | 3.6403 | 32768 |
| loim/whiff-tokenizer-12k | 6271746 | 1.36 | 12288 | 9611 | 78.2 | 4.4 | 1 | 41.6606 | 1.5217 | 65536 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 6655231 | 1.44 | 32000 | 24919 | 77.9 | 4.2 | 1 | 43.1161 | 5.5738 | 2048 |
| ai-forever/ruGPT-3.5-13B | 7154363 | 1.55 | 50257 | 12582 | 25.0 | 3.9 | 0 | 15.711 | 11.2961 | 2048 |
| loim/whiff-tokenizer-8k | 7369398 | 1.6 | 8192 | 7456 | 91.0 | 3.8 | 1 | 32.1512 | 1.6195 | 32768 |
| ai-forever/rugpt3small_based_on_gpt2 | 7749641 | 1.68 | 50257 | 10938 | 21.8 | 3.6 | 0 | 16.4294 | 8.9582 | 2048 |
### Russian text (16,315,296 chars, 2,185,925 words)
| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 9.6723 | 1.4114 | 16384 |
| IlyaGusev/saiga_yandexgpt_8b | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.1863 | 1.8007 | 32768 |
| yandex/YandexGPT-5-Lite-8B-instruct | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.3878 | 4.8323 | 1E+030 |
| ai-forever/ruGPT-3.5-13B | 3693945 | 1.69 | 50257 | 43208 | 86.0 | 4.4 | 0 | 16.1615 | 3.9659 | 2048 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 3732533 | 1.71 | 146213 | 52564 | 36.0 | 4.4 | 1 | 16.5792 | 2.4271 | 131072 |
| ai-forever/rugpt3small_based_on_gpt2 | 3801887 | 1.74 | 50257 | 42820 | 85.2 | 4.3 | 0 | 17.1418 | 2.9581 | 2048 |
| loim/whiff-tokenizer-12k | 4070967 | 1.86 | 12288 | 9306 | 75.7 | 4.0 | 1 | 35.0603 | 1.3202 | 65536 |
| deepseek-ai/DeepSeek-V3 | 4806676 | 2.2 | 128000 | 21621 | 16.9 | 3.4 | 1 | 15.8833 | 2.2505 | 131072 |
| IlyaGusev/saiga_nemo_12b | 4926095 | 2.25 | 131072 | 21901 | 16.7 | 3.3 | 1 | 15.2355 | 3.6558 | 1024000 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5411283 | 2.48 | 151643 | 20458 | 13.5 | 3.0 | 1 | 14.6061 | 1.9548 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5411284 | 2.48 | 151643 | 20459 | 13.5 | 3.0 | 1 | 16.4851 | 1.5277 | 16384 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 5986567 | 2.74 | 32000 | 13454 | 42.0 | 2.7 | 1 | 20.6121 | 1.9489 | 2048 |
| loim/whiff-tokenizer-8k | 6090683 | 2.79 | 8192 | 5749 | 70.2 | 2.7 | 1 | 24.6047 | 1.4503 | 32768 |
| openai-community/gpt2 | 16931837 | 7.75 | 50257 | 13818 | 27.5 | 1.0 | 1 | 19.4 | 6.16 | 1024 |
| facebook/opt-125m | 16931838 | 7.75 | 50265 | 13819 | 27.5 | 1.0 | 1 | 22.1165 | 4.2726 | 1E+030 |
### Toki Pona text (3,663,780 chars, 831,463 words)
| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| loim/whiff-tokenizer-12k | 1144322 | 1.38 | 12288 | 2927 | 23.8 | 3.2 | 1 | 4.145 | 0.2371 | 65536 |
| IlyaGusev/saiga_nemo_12b | 1332599 | 1.6 | 131072 | 8428 | 6.4 | 2.7 | 1 | 2.7613 | 0.7956 | 1024000 |
| deepseek-ai/DeepSeek-V3 | 1343359 | 1.62 | 128000 | 8870 | 6.9 | 2.7 | 1 | 2.6998 | 0.4471 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 1396348 | 1.68 | 146213 | 7546 | 5.2 | 2.6 | 1 | 2.3745 | 2.2573 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 1393944 | 1.68 | 151643 | 7931 | 5.2 | 2.6 | 1 | 2.181 | 0.3505 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1393945 | 1.68 | 151643 | 7932 | 5.2 | 2.6 | 1 | 2.6367 | 0.3489 | 16384 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.2853 | 1.3855 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.359 | 1.2527 | 1E+030 |
| IlyaGusev/saiga_yandexgpt_8b | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.5027 | 2.1723 | 32768 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1536792 | 1.85 | 32000 | 6322 | 19.8 | 2.4 | 1 | 4.2253 | 0.6623 | 2048 |
| openai-community/gpt2 | 1550846 | 1.87 | 50257 | 6680 | 13.3 | 2.4 | 1 | 2.7572 | 0.7449 | 1024 |
| facebook/opt-125m | 1550847 | 1.87 | 50265 | 6681 | 13.3 | 2.4 | 1 | 2.4144 | 0.6391 | 1E+030 |
| ai-forever/ruGPT-3.5-13B | 1828262 | 2.2 | 50257 | 3881 | 7.7 | 2.0 | 0 | 2.1597 | 0.7194 | 2048 |
| ai-forever/rugpt3small_based_on_gpt2 | 1925501 | 2.32 | 50257 | 3697 | 7.4 | 1.9 | 0 | 1.9954 | 0.8262 | 2048 |
| loim/whiff-tokenizer-8k | 2123707 | 2.55 | 8192 | 2709 | 33.1 | 1.7 | 1 | 2.4541 | 0.3799 | 32768 |
</details>
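
The figures in the test tables are consistent with simple ratios: Compression matches tokens divided by words (for DeepSeek-V3 on English, 5639822 / 4613167 ≈ 1.22), Avg Token Length matches characters divided by tokens (27741474 / 5639822 ≈ 4.9), and Vocab Usage % matches vocabulary entries used divided by vocabulary size. The sketch below shows one way such metrics could be computed. It is not the benchmark script behind the tables: the function name `tokenizer_metrics` is hypothetical, and counting words by whitespace splitting is an assumption.

```python
def tokenizer_metrics(text, encode, decode, vocab_size):
    """Compute table-style metrics for one tokenizer on one text.

    `encode` maps a string to a list of token ids; `decode` maps ids back
    to a string. Words are counted by whitespace splitting (an assumption).
    """
    ids = encode(text)
    n_words = len(text.split())
    vocab_used = len(set(ids))
    return {
        "tokens": len(ids),
        "tokens_per_word": round(len(ids) / n_words, 2),
        "vocab_used": vocab_used,
        "vocab_usage_pct": round(100 * vocab_used / vocab_size, 1),
        "avg_token_length": round(len(text) / len(ids), 1),
        # 1 if decoding the ids reproduces the input exactly, else 0.
        "perfect_detokenization": int(decode(ids) == text),
    }


# Toy character-level "tokenizer", used only to exercise the function.
def char_encode(text):
    return [ord(ch) for ch in text]


def char_decode(ids):
    return "".join(chr(i) for i in ids)


metrics = tokenizer_metrics("toki pona li pona", char_encode, char_decode, 256)
print(metrics)
```

With a Hugging Face tokenizer object `tok`, the callables could be `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode`. Whether the published numbers were produced with or without special tokens is not stated in the tables.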