File size: 7,731 Bytes
83f51f4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5c5ae09
83f51f4
 
5c5ae09
83f51f4
 
 
 
 
 
 
 
 
 
 
5c5ae09
83f51f4
 
 
 
 
5c5ae09
83f51f4
 
 
 
 
 
5c5ae09
83f51f4
 
 
 
 
 
 
 
 
 
 
 
 
5c5ae09
83f51f4
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
---
license: apache-2.0
datasets:
- Vikhrmodels/GrandMaster-PRO-MAX
- Den4ikAI/ru_sberquad_long_answers
- HuggingFaceH4/ultrachat_200k
- IlyaGusev/gpt_roleplay_realm
- loim/characters_dialogs
- OpenAssistant/oasst1
- OpenAssistant/oasst2
language:
- ru
- en
pipeline_tag: token-classification
tags:
- bpe
- tokenizer
- tokipona
---

<details>
<summary>🇷🇺 Русский...</summary>

# **Русско-английский BPE-токенизатор**  
Оптимизированный токенизатор для трехязычных текстов с расширенной поддержкой русской лексики и эффективной обработкой английского и токипона.

## **Ключевые характеристики**
- **Формат**: BPE (Byte-Pair Encoding)
- **Размер словаря**: 12 288 токенов
- **Языки**: Русский + Английский + Токипона (просто потому что могу и это ничего не стоит)
- **Специальные токены**:  
`<|endoftext|>`  
`<|padding|>`  
`<|mask|>`  
`<|user|>`  
`<|assistant|>`  
`<|system|>`  
`<|end|>`  
`<|en|>`  
`<|ru|>`  
`<|tok|>`  
`<|`  
`|>`  

</details>


<details>
<summary>🇬🇧 English...</summary>

# **Russian-English BPE tokenizer**  
Optimized tokenizer for trilingual texts with extended support for Russian vocabulary and efficient processing of English and Toki pona.

## **Key Features**
- **Format**: BPE (Byte-Pair Encoding)
- **Dictionary size**: 12 288 tokens
- **Languages**: Russian + English + Toki pona (just because I can and it costs nothing)
- **Special tokens**:
`<|endoftext|>`  
`<|padding|>`  
`<|mask|>`  
`<|user|>`  
`<|assistant|>`  
`<|system|>`  
`<|end|>`  
`<|en|>`  
`<|ru|>`  
`<|tok|>`  
`<|`  
`|>`  

</details>

---

<details>
<summary>🧪 Tests...</summary>  
  
### English text (27741474 chars, 4613167 words)
| Tokenizer | Tokens | Compression | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | 5639822 | 1.22 | 128000 | 60979 | 47.6 | 4.9 | 1 | 17.8162 | 3.7699 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 5705024 | 1.24 | 146213 | 61580 | 42.1 | 4.9 | 1 | 17.6528 | 4.2012 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5708987 | 1.24 | 151643 | 60135 | 39.7 | 4.9 | 1 | 19.3785 | 3.9194 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5708988 | 1.24 | 151643 | 60136 | 39.7 | 4.9 | 1 | 18.9563 | 1.6886 | 16384 |
| IlyaGusev/saiga_nemo_12b | 5806480 | 1.26 | 131072 | 56865 | 43.4 | 4.8 | 1 | 18.4329 | 3.1752 | 1024000 |
| openai-community/gpt2 | 5836927 | 1.27 | 50257 | 45466 | 90.5 | 4.8 | 1 | 16.6623 | 2.2766 | 1024 |
| facebook/opt-125m | 5836928 | 1.27 | 50265 | 45467 | 90.5 | 4.8 | 1 | 19.4051 | 3.7256 | 1E+030 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 14.5142 | 3.0903 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.081 | 4.5032 | 1E+030 |
| IlyaGusev/saiga_yandexgpt_8b | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.7957 | 3.6403 | 32768 |
| loim/whiff-tokenizer-12k | 6271746 | 1.36 | 12288 | 9611 | 78.2 | 4.4 | 1 | 41.6606 | 1.5217 | 65536 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 6655231 | 1.44 | 32000 | 24919 | 77.9 | 4.2 | 1 | 43.1161 | 5.5738 | 2048 |
| ai-forever/ruGPT-3.5-13B | 7154363 | 1.55 | 50257 | 12582 | 25.0 | 3.9 | 0 | 15.711 | 11.2961 | 2048 |
| loim/whiff-tokenizer-8k | 7369398 | 1.6 | 8192 | 7456 | 91.0 | 3.8 | 1 | 32.1512 | 1.6195 | 32768 |
| ai-forever/rugpt3small_based_on_gpt2 | 7749641 | 1.68 | 50257 | 10938 | 21.8 | 3.6 | 0 | 16.4294 | 8.9582 | 2048 |

### Russian text (16315296 chars, 2185925 words)
| Tokenizer | Tokens | Compression | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 9.6723 | 1.4114 | 16384 |
| IlyaGusev/saiga_yandexgpt_8b | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.1863 | 1.8007 | 32768 |
| yandex/YandexGPT-5-Lite-8B-instruct | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.3878 | 4.8323 | 1E+030 |
| ai-forever/ruGPT-3.5-13B | 3693945 | 1.69 | 50257 | 43208 | 86.0 | 4.4 | 0 | 16.1615 | 3.9659 | 2048 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 3732533 | 1.71 | 146213 | 52564 | 36.0 | 4.4 | 1 | 16.5792 | 2.4271 | 131072 |
| ai-forever/rugpt3small_based_on_gpt2 | 3801887 | 1.74 | 50257 | 42820 | 85.2 | 4.3 | 0 | 17.1418 | 2.9581 | 2048 |
| loim/whiff-tokenizer-12k | 4070967 | 1.86 | 12288 | 9306 | 75.7 | 4.0 | 1 | 35.0603 | 1.3202 | 65536 |
| deepseek-ai/DeepSeek-V3 | 4806676 | 2.2 | 128000 | 21621 | 16.9 | 3.4 | 1 | 15.8833 | 2.2505 | 131072 |
| IlyaGusev/saiga_nemo_12b | 4926095 | 2.25 | 131072 | 21901 | 16.7 | 3.3 | 1 | 15.2355 | 3.6558 | 1024000 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5411283 | 2.48 | 151643 | 20458 | 13.5 | 3.0 | 1 | 14.6061 | 1.9548 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5411284 | 2.48 | 151643 | 20459 | 13.5 | 3.0 | 1 | 16.4851 | 1.5277 | 16384 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 5986567 | 2.74 | 32000 | 13454 | 42.0 | 2.7 | 1 | 20.6121 | 1.9489 | 2048 |
| loim/whiff-tokenizer-8k | 6090683 | 2.79 | 8192 | 5749 | 70.2 | 2.7 | 1 | 24.6047 | 1.4503 | 32768 |
| openai-community/gpt2 | 16931837 | 7.75 | 50257 | 13818 | 27.5 | 1.0 | 1 | 19.4 | 6.16 | 1024 |
| facebook/opt-125m | 16931838 | 7.75 | 50265 | 13819 | 27.5 | 1.0 | 1 | 22.1165 | 4.2726 | 1E+030 |

### Toki pona text (3663780 chars, 831463 words)
| Tokenizer | Tokens | Compression | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| loim/whiff-tokenizer-12k | 1144322 | 1.38 | 12288 | 2927 | 23.8 | 3.2 | 1 | 4.145 | 0.2371 | 65536 |
| IlyaGusev/saiga_nemo_12b | 1332599 | 1.6 | 131072 | 8428 | 6.4 | 2.7 | 1 | 2.7613 | 0.7956 | 1024000 |
| deepseek-ai/DeepSeek-V3 | 1343359 | 1.62 | 128000 | 8870 | 6.9 | 2.7 | 1 | 2.6998 | 0.4471 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 1396348 | 1.68 | 146213 | 7546 | 5.2 | 2.6 | 1 | 2.3745 | 2.2573 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 1393944 | 1.68 | 151643 | 7931 | 5.2 | 2.6 | 1 | 2.181 | 0.3505 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1393945 | 1.68 | 151643 | 7932 | 5.2 | 2.6 | 1 | 2.6367 | 0.3489 | 16384 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.2853 | 1.3855 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.359 | 1.2527 | 1E+030 |
| IlyaGusev/saiga_yandexgpt_8b | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.5027 | 2.1723 | 32768 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1536792 | 1.85 | 32000 | 6322 | 19.8 | 2.4 | 1 | 4.2253 | 0.6623 | 2048 |
| openai-community/gpt2 | 1550846 | 1.87 | 50257 | 6680 | 13.3 | 2.4 | 1 | 2.7572 | 0.7449 | 1024 |
| facebook/opt-125m | 1550847 | 1.87 | 50265 | 6681 | 13.3 | 2.4 | 1 | 2.4144 | 0.6391 | 1E+030 |
| ai-forever/ruGPT-3.5-13B | 1828262 | 2.2 | 50257 | 3881 | 7.7 | 2.0 | 0 | 2.1597 | 0.7194 | 2048 |
| ai-forever/rugpt3small_based_on_gpt2 | 1925501 | 2.32 | 50257 | 3697 | 7.4 | 1.9 | 0 | 1.9954 | 0.8262 | 2048 |
| loim/whiff-tokenizer-8k | 2123707 | 2.55 | 8192 | 2709 | 33.1 | 1.7 | 1 | 2.4541 | 0.3799 | 32768 |

</details>