---
license: apache-2.0
datasets:
- Vikhrmodels/GrandMaster-PRO-MAX
- Den4ikAI/ru_sberquad_long_answers
- HuggingFaceH4/ultrachat_200k
- IlyaGusev/gpt_roleplay_realm
- loim/characters_dialogs
- OpenAssistant/oasst1
- OpenAssistant/oasst2
language:
- ru
- en
pipeline_tag: token-classification
tags:
- bpe
- tokenizer
- tokipona
---

# **Russian-English BPE Tokenizer**
An optimized tokenizer for trilingual text, with extended coverage of Russian vocabulary and efficient handling of English and Toki Pona.

## **Key Features**
- **Format**: BPE (Byte-Pair Encoding)
- **Vocabulary size**: 12,288 tokens
- **Languages**: Russian + English + Toki Pona (just because I can and it costs nothing)
- **Special tokens**: `<|endoftext|>`, `<|padding|>`, `<|mask|>`, `<|user|>`, `<|assistant|>`, `<|system|>`, `<|end|>`, `<|en|>`, `<|ru|>`, `<|tok|>`, `<|`, `|>`
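The core of BPE training behind the format listed above is simple: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new vocabulary entry. A minimal pure-Python sketch of that loop follows; it is illustrative only, not the training script used for this tokenizer, and the toy corpus and `num_merges` are invented for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with its concatenation."""
    merged = " ".join(pair)
    joined = "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from a list of words."""
    # Start from individual characters separated by spaces.
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(best, words)
    return merges

# Toy corpus; the real vocabulary was trained on the datasets listed in the metadata.
merges = learn_bpe(["pona", "pona", "toki", "tomo", "pon"], num_merges=3)
```

A production trainer (e.g. the Hugging Face `tokenizers` library) adds byte-level fallback, pre-tokenization, and the special tokens above on top of this same merge loop.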

---

<details>
<summary>🧪 Tests...</summary>

### English text (27,741,474 chars, 4,613,167 words)
| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | 5639822 | 1.22 | 128000 | 60979 | 47.6 | 4.9 | 1 | 17.8162 | 3.7699 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 5705024 | 1.24 | 146213 | 61580 | 42.1 | 4.9 | 1 | 17.6528 | 4.2012 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5708987 | 1.24 | 151643 | 60135 | 39.7 | 4.9 | 1 | 19.3785 | 3.9194 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5708988 | 1.24 | 151643 | 60136 | 39.7 | 4.9 | 1 | 18.9563 | 1.6886 | 16384 |
| IlyaGusev/saiga_nemo_12b | 5806480 | 1.26 | 131072 | 56865 | 43.4 | 4.8 | 1 | 18.4329 | 3.1752 | 1024000 |
| openai-community/gpt2 | 5836927 | 1.27 | 50257 | 45466 | 90.5 | 4.8 | 1 | 16.6623 | 2.2766 | 1024 |
| facebook/opt-125m | 5836928 | 1.27 | 50265 | 45467 | 90.5 | 4.8 | 1 | 19.4051 | 3.7256 | 1e30 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 5984540 | 1.30 | 129024 | 51435 | 39.9 | 4.6 | 1 | 14.5142 | 3.0903 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 5984540 | 1.30 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.081 | 4.5032 | 1e30 |
| IlyaGusev/saiga_yandexgpt_8b | 5984540 | 1.30 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.7957 | 3.6403 | 32768 |
| loim/ru_en_tok_mini_bpe_12k | 6271746 | 1.36 | 12288 | 9611 | 78.2 | 4.4 | 1 | 41.6606 | 1.5217 | 65536 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 6655231 | 1.44 | 32000 | 24919 | 77.9 | 4.2 | 1 | 43.1161 | 5.5738 | 2048 |
| ai-forever/ruGPT-3.5-13B | 7154363 | 1.55 | 50257 | 12582 | 25.0 | 3.9 | 0 | 15.711 | 11.2961 | 2048 |
| loim/ru_en_mini_bpe_8k | 7369398 | 1.60 | 8192 | 7456 | 91.0 | 3.8 | 1 | 32.1512 | 1.6195 | 32768 |
| ai-forever/rugpt3small_based_on_gpt2 | 7749641 | 1.68 | 50257 | 10938 | 21.8 | 3.6 | 0 | 16.4294 | 8.9582 | 2048 |

### Russian text (16,315,296 chars, 2,185,925 words)
| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 9.6723 | 1.4114 | 16384 |
| IlyaGusev/saiga_yandexgpt_8b | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.1863 | 1.8007 | 32768 |
| yandex/YandexGPT-5-Lite-8B-instruct | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.3878 | 4.8323 | 1e30 |
| ai-forever/ruGPT-3.5-13B | 3693945 | 1.69 | 50257 | 43208 | 86.0 | 4.4 | 0 | 16.1615 | 3.9659 | 2048 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 3732533 | 1.71 | 146213 | 52564 | 36.0 | 4.4 | 1 | 16.5792 | 2.4271 | 131072 |
| ai-forever/rugpt3small_based_on_gpt2 | 3801887 | 1.74 | 50257 | 42820 | 85.2 | 4.3 | 0 | 17.1418 | 2.9581 | 2048 |
| loim/ru_en_tok_mini_bpe_12k | 4070967 | 1.86 | 12288 | 9306 | 75.7 | 4.0 | 1 | 35.0603 | 1.3202 | 65536 |
| deepseek-ai/DeepSeek-V3 | 4806676 | 2.20 | 128000 | 21621 | 16.9 | 3.4 | 1 | 15.8833 | 2.2505 | 131072 |
| IlyaGusev/saiga_nemo_12b | 4926095 | 2.25 | 131072 | 21901 | 16.7 | 3.3 | 1 | 15.2355 | 3.6558 | 1024000 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5411283 | 2.48 | 151643 | 20458 | 13.5 | 3.0 | 1 | 14.6061 | 1.9548 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5411284 | 2.48 | 151643 | 20459 | 13.5 | 3.0 | 1 | 16.4851 | 1.5277 | 16384 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 5986567 | 2.74 | 32000 | 13454 | 42.0 | 2.7 | 1 | 20.6121 | 1.9489 | 2048 |
| loim/ru_en_mini_bpe_8k | 6090683 | 2.79 | 8192 | 5749 | 70.2 | 2.7 | 1 | 24.6047 | 1.4503 | 32768 |
| openai-community/gpt2 | 16931837 | 7.75 | 50257 | 13818 | 27.5 | 1.0 | 1 | 19.4 | 6.16 | 1024 |
| facebook/opt-125m | 16931838 | 7.75 | 50265 | 13819 | 27.5 | 1.0 | 1 | 22.1165 | 4.2726 | 1e30 |

### Toki Pona text (3,663,780 chars, 831,463 words)
| Tokenizer | Tokens | Compression (tokens/word) | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length (chars) | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| loim/ru_en_tok_mini_bpe_12k | 1144322 | 1.38 | 12288 | 2927 | 23.8 | 3.2 | 1 | 4.145 | 0.2371 | 65536 |
| IlyaGusev/saiga_nemo_12b | 1332599 | 1.60 | 131072 | 8428 | 6.4 | 2.7 | 1 | 2.7613 | 0.7956 | 1024000 |
| deepseek-ai/DeepSeek-V3 | 1343359 | 1.62 | 128000 | 8870 | 6.9 | 2.7 | 1 | 2.6998 | 0.4471 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 1393944 | 1.68 | 151643 | 7931 | 5.2 | 2.6 | 1 | 2.181 | 0.3505 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1393945 | 1.68 | 151643 | 7932 | 5.2 | 2.6 | 1 | 2.6367 | 0.3489 | 16384 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 1396348 | 1.68 | 146213 | 7546 | 5.2 | 2.6 | 1 | 2.3745 | 2.2573 | 131072 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.2853 | 1.3855 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.359 | 1.2527 | 1e30 |
| IlyaGusev/saiga_yandexgpt_8b | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.5027 | 2.1723 | 32768 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1536792 | 1.85 | 32000 | 6322 | 19.8 | 2.4 | 1 | 4.2253 | 0.6623 | 2048 |
| openai-community/gpt2 | 1550846 | 1.87 | 50257 | 6680 | 13.3 | 2.4 | 1 | 2.7572 | 0.7449 | 1024 |
| facebook/opt-125m | 1550847 | 1.87 | 50265 | 6681 | 13.3 | 2.4 | 1 | 2.4144 | 0.6391 | 1e30 |
| ai-forever/ruGPT-3.5-13B | 1828262 | 2.20 | 50257 | 3881 | 7.7 | 2.0 | 0 | 2.1597 | 0.7194 | 2048 |
| ai-forever/rugpt3small_based_on_gpt2 | 1925501 | 2.32 | 50257 | 3697 | 7.4 | 1.9 | 0 | 1.9954 | 0.8262 | 2048 |
| loim/ru_en_mini_bpe_8k | 2123707 | 2.55 | 8192 | 2709 | 33.1 | 1.7 | 1 | 2.4541 | 0.3799 | 32768 |

</details>
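The table columns can be reproduced with a small harness along the following lines. This is a sketch, not the actual benchmark script: it assumes Compression is tokens per whitespace-separated word and Avg Token Length is characters per token (both consistent with the reported numbers), and the `benchmark`/`toy_encode` names are invented for illustration.

```python
import time

def benchmark(encode, decode, text):
    """Compute the per-tokenizer metrics reported in the tables above.

    `encode`: text -> list of token ids; `decode`: ids -> text.
    Compression is taken as tokens per whitespace-separated word,
    Avg Token Length as characters per token (assumed definitions).
    """
    words = text.split()
    t0 = time.perf_counter()
    ids = encode(text)
    t1 = time.perf_counter()
    restored = decode(ids)
    t2 = time.perf_counter()
    return {
        "tokens": len(ids),
        "compression": round(len(ids) / len(words), 2),
        "vocab_used": len(set(ids)),
        "avg_token_length": round(len(text) / len(ids), 1),
        "perfect_detokenization": int(restored == text),
        "tokenization_time_s": t1 - t0,
        "detokenization_time_s": t2 - t1,
    }

# Toy word-level "tokenizer" just to exercise the harness; a real run
# would plug in an actual tokenizer's encode/decode here.
vocab = {}
def toy_encode(text):
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def toy_decode(ids):
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv[i] for i in ids)

stats = benchmark(toy_encode, toy_decode, "toki pona li pona")
```

Vocab Usage % then follows as `vocab_used / vocab_size * 100` for the tokenizer under test.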