# Russian-English BPE tokenizer

A BPE tokenizer optimized for trilingual text, with extended coverage of Russian vocabulary and efficient handling of English and Toki Pona.

## Key Features

- Format: BPE (Byte-Pair Encoding)
- Vocabulary size: 12,288 tokens
- Languages: Russian + English + Toki Pona (just because I can and it costs nothing)
- Special tokens (see the usage sketch below):
  - `<|endoftext|>`
  - `<|padding|>`
  - `<|mask|>`
  - `<|user|>`
  - `<|assistant|>`
  - `<|system|>`
  - `<|end|>`
  - `<|en|>`
  - `<|ru|>`
  - `<|tok|>`
  - `<|`
  - `|>`
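
Below is a minimal loading and round-trip sketch. It assumes the tokenizer ships in the standard Hugging Face `tokenizers` format, so it loads through `AutoTokenizer`; the way the language tags are combined with text here is an illustrative assumption, not a documented convention.

```python
from transformers import AutoTokenizer

# Load from the Hugging Face Hub (assumes the standard `tokenizers` format).
tokenizer = AutoTokenizer.from_pretrained("loim/whiff-tokenizer-12k")

# Illustrative trilingual input using the language tags; the exact tagging
# convention (tag placed before the text it labels) is an assumption.
text = "<|ru|>Привет, мир!<|en|>Hello, world!<|tok|>toki!"

ids = tokenizer.encode(text, add_special_tokens=False)
print(len(ids), tokenizer.convert_ids_to_tokens(ids)[:8])

# The benchmarks below report perfect detokenization for this tokenizer,
# so decoding should reproduce the input exactly.
decoded = tokenizer.decode(ids)
assert decoded == text
```

The `<|user|>`, `<|assistant|>`, `<|system|>`, and `<|end|>` tokens suggest a chat markup along the lines of `<|system|>…<|end|><|user|>…<|end|><|assistant|>`, but no chat template is documented here, so treat that layout as a guess.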
## 🧪 Tests
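
In each table below: Tokens is the total number of tokens produced for the corpus; Compression is tokens per word (lower is better); Vocab Used is the number of distinct vocabulary entries that actually occur; Avg Token Length is characters per token; Perfect Detokenization marks whether `decode(encode(text))` reproduced the input exactly; a Max Length of 1e30 is the `transformers` sentinel for a tokenizer that declares no length limit. The following sketch shows how such metrics can be computed; it is a reconstruction under those definitions, not the author's actual benchmark script.

```python
import time
from transformers import AutoTokenizer

def benchmark(model_id: str, text: str) -> dict:
    """Compute the table columns for one tokenizer on one corpus."""
    tok = AutoTokenizer.from_pretrained(model_id)

    start = time.perf_counter()
    ids = tok.encode(text, add_special_tokens=False)
    tokenize_s = time.perf_counter() - start

    start = time.perf_counter()
    decoded = tok.decode(ids)
    detokenize_s = time.perf_counter() - start

    words = len(text.split())
    used = len(set(ids))
    return {
        "tokens": len(ids),
        "compression": round(len(ids) / words, 2),           # tokens per word
        "vocab_size": tok.vocab_size,
        "vocab_used": used,                                   # distinct ids seen
        "vocab_usage_pct": round(100 * used / tok.vocab_size, 1),
        "avg_token_length": round(len(text) / len(ids), 1),   # chars per token
        "perfect_detokenization": decoded == text,
        "tokenization_s": tokenize_s,
        "detokenization_s": detokenize_s,
        "max_length": tok.model_max_length,
    }
```

For example, `benchmark("loim/whiff-tokenizer-12k", corpus_text)` would yield one row of the tables below.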
### English text (27,741,474 chars, 4,613,167 words)

| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage (%) | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|:---|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|
| deepseek-ai/DeepSeek-V3 | 5639822 | 1.22 | 128000 | 60979 | 47.6 | 4.9 | ✅ | 17.8162 | 3.7699 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 5705024 | 1.24 | 146213 | 61580 | 42.1 | 4.9 | ✅ | 17.6528 | 4.2012 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5708987 | 1.24 | 151643 | 60135 | 39.7 | 4.9 | ✅ | 19.3785 | 3.9194 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5708988 | 1.24 | 151643 | 60136 | 39.7 | 4.9 | ✅ | 18.9563 | 1.6886 | 16384 |
| IlyaGusev/saiga_nemo_12b | 5806480 | 1.26 | 131072 | 56865 | 43.4 | 4.8 | ✅ | 18.4329 | 3.1752 | 1024000 |
| openai-community/gpt2 | 5836927 | 1.27 | 50257 | 45466 | 90.5 | 4.8 | ✅ | 16.6623 | 2.2766 | 1024 |
| facebook/opt-125m | 5836928 | 1.27 | 50265 | 45467 | 90.5 | 4.8 | ✅ | 19.4051 | 3.7256 | 1e30 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 5984540 | 1.30 | 129024 | 51435 | 39.9 | 4.6 | ✅ | 14.5142 | 3.0903 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 5984540 | 1.30 | 129024 | 51435 | 39.9 | 4.6 | ✅ | 15.0810 | 4.5032 | 1e30 |
| IlyaGusev/saiga_yandexgpt_8b | 5984540 | 1.30 | 129024 | 51435 | 39.9 | 4.6 | ✅ | 15.7957 | 3.6403 | 32768 |
| **loim/whiff-tokenizer-12k** | 6271746 | 1.36 | 12288 | 9611 | 78.2 | 4.4 | ✅ | 41.6606 | 1.5217 | 65536 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 6655231 | 1.44 | 32000 | 24919 | 77.9 | 4.2 | ✅ | 43.1161 | 5.5738 | 2048 |
| ai-forever/ruGPT-3.5-13B | 7154363 | 1.55 | 50257 | 12582 | 25.0 | 3.9 | ❌ | 15.7110 | 11.2961 | 2048 |
| **loim/whiff-tokenizer-8k** | 7369398 | 1.60 | 8192 | 7456 | 91.0 | 3.8 | ✅ | 32.1512 | 1.6195 | 32768 |
| ai-forever/rugpt3small_based_on_gpt2 | 7749641 | 1.68 | 50257 | 10938 | 21.8 | 3.6 | ❌ | 16.4294 | 8.9582 | 2048 |
### Russian text (16,315,296 chars, 2,185,925 words)

| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage (%) | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|:---|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | ✅ | 9.6723 | 1.4114 | 16384 |
| IlyaGusev/saiga_yandexgpt_8b | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | ✅ | 10.1863 | 1.8007 | 32768 |
| yandex/YandexGPT-5-Lite-8B-instruct | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | ✅ | 10.3878 | 4.8323 | 1e30 |
| ai-forever/ruGPT-3.5-13B | 3693945 | 1.69 | 50257 | 43208 | 86.0 | 4.4 | ❌ | 16.1615 | 3.9659 | 2048 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 3732533 | 1.71 | 146213 | 52564 | 36.0 | 4.4 | ✅ | 16.5792 | 2.4271 | 131072 |
| ai-forever/rugpt3small_based_on_gpt2 | 3801887 | 1.74 | 50257 | 42820 | 85.2 | 4.3 | ❌ | 17.1418 | 2.9581 | 2048 |
| **loim/whiff-tokenizer-12k** | 4070967 | 1.86 | 12288 | 9306 | 75.7 | 4.0 | ✅ | 35.0603 | 1.3202 | 65536 |
| deepseek-ai/DeepSeek-V3 | 4806676 | 2.20 | 128000 | 21621 | 16.9 | 3.4 | ✅ | 15.8833 | 2.2505 | 131072 |
| IlyaGusev/saiga_nemo_12b | 4926095 | 2.25 | 131072 | 21901 | 16.7 | 3.3 | ✅ | 15.2355 | 3.6558 | 1024000 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5411283 | 2.48 | 151643 | 20458 | 13.5 | 3.0 | ✅ | 14.6061 | 1.9548 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5411284 | 2.48 | 151643 | 20459 | 13.5 | 3.0 | ✅ | 16.4851 | 1.5277 | 16384 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 5986567 | 2.74 | 32000 | 13454 | 42.0 | 2.7 | ✅ | 20.6121 | 1.9489 | 2048 |
| **loim/whiff-tokenizer-8k** | 6090683 | 2.79 | 8192 | 5749 | 70.2 | 2.7 | ✅ | 24.6047 | 1.4503 | 32768 |
| openai-community/gpt2 | 16931837 | 7.75 | 50257 | 13818 | 27.5 | 1.0 | ✅ | 19.4000 | 6.1600 | 1024 |
| facebook/opt-125m | 16931838 | 7.75 | 50265 | 13819 | 27.5 | 1.0 | ✅ | 22.1165 | 4.2726 | 1e30 |
### Toki Pona text (3,663,780 chars, 831,463 words)

| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage (%) | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|:---|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|
| **loim/whiff-tokenizer-12k** | 1144322 | 1.38 | 12288 | 2927 | 23.8 | 3.2 | ✅ | 4.1450 | 0.2371 | 65536 |
| IlyaGusev/saiga_nemo_12b | 1332599 | 1.60 | 131072 | 8428 | 6.4 | 2.7 | ✅ | 2.7613 | 0.7956 | 1024000 |
| deepseek-ai/DeepSeek-V3 | 1343359 | 1.62 | 128000 | 8870 | 6.9 | 2.7 | ✅ | 2.6998 | 0.4471 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 1393944 | 1.68 | 151643 | 7931 | 5.2 | 2.6 | ✅ | 2.1810 | 0.3505 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1393945 | 1.68 | 151643 | 7932 | 5.2 | 2.6 | ✅ | 2.6367 | 0.3489 | 16384 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 1396348 | 1.68 | 146213 | 7546 | 5.2 | 2.6 | ✅ | 2.3745 | 2.2573 | 131072 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | ✅ | 2.2853 | 1.3855 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | ✅ | 2.3590 | 1.2527 | 1e30 |
| IlyaGusev/saiga_yandexgpt_8b | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | ✅ | 2.5027 | 2.1723 | 32768 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1536792 | 1.85 | 32000 | 6322 | 19.8 | 2.4 | ✅ | 4.2253 | 0.6623 | 2048 |
| openai-community/gpt2 | 1550846 | 1.87 | 50257 | 6680 | 13.3 | 2.4 | ✅ | 2.7572 | 0.7449 | 1024 |
| facebook/opt-125m | 1550847 | 1.87 | 50265 | 6681 | 13.3 | 2.4 | ✅ | 2.4144 | 0.6391 | 1e30 |
| ai-forever/ruGPT-3.5-13B | 1828262 | 2.20 | 50257 | 3881 | 7.7 | 2.0 | ❌ | 2.1597 | 0.7194 | 2048 |
| ai-forever/rugpt3small_based_on_gpt2 | 1925501 | 2.32 | 50257 | 3697 | 7.4 | 1.9 | ❌ | 1.9954 | 0.8262 | 2048 |
| **loim/whiff-tokenizer-8k** | 2123707 | 2.55 | 8192 | 2709 | 33.1 | 1.7 | ✅ | 2.4541 | 0.3799 | 32768 |