loim committed · commit 83f51f4 · verified · 1 parent: 7b35f02

Update README.md

Files changed (1): README.md (+136 −3)
---
license: apache-2.0
datasets:
- Vikhrmodels/GrandMaster-PRO-MAX
- Den4ikAI/ru_sberquad_long_answers
- HuggingFaceH4/ultrachat_200k
- IlyaGusev/gpt_roleplay_realm
- loim/characters_dialogs
- OpenAssistant/oasst1
- OpenAssistant/oasst2
language:
- ru
- en
pipeline_tag: token-classification
tags:
- bpe
- tokenizer
- tokipona
---

<details>
<summary>🇷🇺 Русский...</summary>

# **Russian-English BPE tokenizer**
An optimized tokenizer for trilingual text, with extended coverage of Russian vocabulary and efficient handling of English and Toki Pona.

## **Key characteristics**
- **Format**: BPE (Byte-Pair Encoding)
- **Vocabulary size**: 12,288 tokens
- **Languages**: Russian + English + Toki Pona (simply because I can, and it costs nothing)
- **Special tokens**:
  `<|endoftext|>`
  `<|padding|>`
  `<|mask|>`
  `<|user|>`
  `<|assistant|>`
  `<|system|>`
  `<|end|>`
  `<|en|>`
  `<|ru|>`
  `<|tok|>`
  `<|`
  `|>`

</details>


<details>
<summary>🇬🇧 English...</summary>

# **Russian-English BPE tokenizer**
An optimized tokenizer for trilingual text, with extended support for Russian vocabulary and efficient processing of English and Toki Pona.

## **Key Features**
- **Format**: BPE (Byte-Pair Encoding)
- **Vocabulary size**: 12,288 tokens
- **Languages**: Russian + English + Toki Pona (just because I can and it costs nothing)
- **Special tokens**:
  `<|endoftext|>`
  `<|padding|>`
  `<|mask|>`
  `<|user|>`
  `<|assistant|>`
  `<|system|>`
  `<|end|>`
  `<|en|>`
  `<|ru|>`
  `<|tok|>`
  `<|`
  `|>`

</details>
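
The role and language tokens above suggest a chat-style prompt layout. As an illustration only — this card does not publish an official chat template, so the ordering and the `format_chat` helper below are assumptions — a prompt could be assembled like this:

```python
# Hypothetical prompt assembly from the special tokens listed above.
# The turn ordering and this helper are assumptions, not a documented format.
def format_chat(system: str, user: str, lang: str = "<|ru|>") -> str:
    """Wrap a system message and one user turn with the card's special tokens."""
    return (
        f"<|system|>{system}<|end|>"
        f"<|user|>{lang}{user}<|end|>"
        f"<|assistant|>"
    )

prompt = format_chat("You are a helpful assistant.", "Привет!")
print(prompt)
# → <|system|>You are a helpful assistant.<|end|><|user|><|ru|>Привет!<|end|><|assistant|>
```

The language tokens (`<|en|>`, `<|ru|>`, `<|tok|>`) would let a downstream model condition on the expected language of the turn.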

---

<details>
<summary>🧪 Tests...</summary>

### English text (27741474 chars, 4613167 words)
| Tokenizer | Tokens | Compression | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | 5639822 | 1.22 | 128000 | 60979 | 47.6 | 4.9 | 1 | 17.8162 | 3.7699 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 5705024 | 1.24 | 146213 | 61580 | 42.1 | 4.9 | 1 | 17.6528 | 4.2012 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5708987 | 1.24 | 151643 | 60135 | 39.7 | 4.9 | 1 | 19.3785 | 3.9194 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5708988 | 1.24 | 151643 | 60136 | 39.7 | 4.9 | 1 | 18.9563 | 1.6886 | 16384 |
| IlyaGusev/saiga_nemo_12b | 5806480 | 1.26 | 131072 | 56865 | 43.4 | 4.8 | 1 | 18.4329 | 3.1752 | 1024000 |
| openai-community/gpt2 | 5836927 | 1.27 | 50257 | 45466 | 90.5 | 4.8 | 1 | 16.6623 | 2.2766 | 1024 |
| facebook/opt-125m | 5836928 | 1.27 | 50265 | 45467 | 90.5 | 4.8 | 1 | 19.4051 | 3.7256 | 1E+030 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 14.5142 | 3.0903 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.081 | 4.5032 | 1E+030 |
| IlyaGusev/saiga_yandexgpt_8b | 5984540 | 1.3 | 129024 | 51435 | 39.9 | 4.6 | 1 | 15.7957 | 3.6403 | 32768 |
| loim/ru_en_tok_mini_bpe_12k | 6271746 | 1.36 | 12288 | 9611 | 78.2 | 4.4 | 1 | 41.6606 | 1.5217 | 65536 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 6655231 | 1.44 | 32000 | 24919 | 77.9 | 4.2 | 1 | 43.1161 | 5.5738 | 2048 |
| ai-forever/ruGPT-3.5-13B | 7154363 | 1.55 | 50257 | 12582 | 25.0 | 3.9 | 0 | 15.711 | 11.2961 | 2048 |
| loim/ru_en_mini_bpe_8k | 7369398 | 1.6 | 8192 | 7456 | 91.0 | 3.8 | 1 | 32.1512 | 1.6195 | 32768 |
| ai-forever/rugpt3small_based_on_gpt2 | 7749641 | 1.68 | 50257 | 10938 | 21.8 | 3.6 | 0 | 16.4294 | 8.9582 | 2048 |

### Russian text (16315296 chars, 2185925 words)
| Tokenizer | Tokens | Compression | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 9.6723 | 1.4114 | 16384 |
| IlyaGusev/saiga_yandexgpt_8b | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.1863 | 1.8007 | 32768 |
| yandex/YandexGPT-5-Lite-8B-instruct | 3475768 | 1.59 | 129024 | 67971 | 52.7 | 4.7 | 1 | 10.3878 | 4.8323 | 1E+030 |
| ai-forever/ruGPT-3.5-13B | 3693945 | 1.69 | 50257 | 43208 | 86.0 | 4.4 | 0 | 16.1615 | 3.9659 | 2048 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 3732533 | 1.71 | 146213 | 52564 | 36.0 | 4.4 | 1 | 16.5792 | 2.4271 | 131072 |
| ai-forever/rugpt3small_based_on_gpt2 | 3801887 | 1.74 | 50257 | 42820 | 85.2 | 4.3 | 0 | 17.1418 | 2.9581 | 2048 |
| loim/ru_en_tok_mini_bpe_12k | 4070967 | 1.86 | 12288 | 9306 | 75.7 | 4.0 | 1 | 35.0603 | 1.3202 | 65536 |
| deepseek-ai/DeepSeek-V3 | 4806676 | 2.2 | 128000 | 21621 | 16.9 | 3.4 | 1 | 15.8833 | 2.2505 | 131072 |
| IlyaGusev/saiga_nemo_12b | 4926095 | 2.25 | 131072 | 21901 | 16.7 | 3.3 | 1 | 15.2355 | 3.6558 | 1024000 |
| Gensyn/Qwen2.5-1.5B-Instruct | 5411283 | 2.48 | 151643 | 20458 | 13.5 | 3.0 | 1 | 14.6061 | 1.9548 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 5411284 | 2.48 | 151643 | 20459 | 13.5 | 3.0 | 1 | 16.4851 | 1.5277 | 16384 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 5986567 | 2.74 | 32000 | 13454 | 42.0 | 2.7 | 1 | 20.6121 | 1.9489 | 2048 |
| loim/ru_en_mini_bpe_8k | 6090683 | 2.79 | 8192 | 5749 | 70.2 | 2.7 | 1 | 24.6047 | 1.4503 | 32768 |
| openai-community/gpt2 | 16931837 | 7.75 | 50257 | 13818 | 27.5 | 1.0 | 1 | 19.4 | 6.16 | 1024 |
| facebook/opt-125m | 16931838 | 7.75 | 50265 | 13819 | 27.5 | 1.0 | 1 | 22.1165 | 4.2726 | 1E+030 |

### Toki pona text (3663780 chars, 831463 words)
| Tokenizer | Tokens | Compression | Vocab Size | Vocab Used | Vocab Usage % | Avg Token Length | Perfect Detokenization | Tokenization Time (s) | Detokenization Time (s) | Max Length |
|---|---|---|---|---|---|---|---|---|---|---|
| loim/ru_en_tok_mini_bpe_12k | 1144322 | 1.38 | 12288 | 2927 | 23.8 | 3.2 | 1 | 4.145 | 0.2371 | 65536 |
| IlyaGusev/saiga_nemo_12b | 1332599 | 1.6 | 131072 | 8428 | 6.4 | 2.7 | 1 | 2.7613 | 0.7956 | 1024000 |
| deepseek-ai/DeepSeek-V3 | 1343359 | 1.62 | 128000 | 8870 | 6.9 | 2.7 | 1 | 2.6998 | 0.4471 | 131072 |
| RefalMachine/RuadaptQwen3-32B-Instruct | 1396348 | 1.68 | 146213 | 7546 | 5.2 | 2.6 | 1 | 2.3745 | 2.2573 | 131072 |
| Gensyn/Qwen2.5-1.5B-Instruct | 1393944 | 1.68 | 151643 | 7931 | 5.2 | 2.6 | 1 | 2.181 | 0.3505 | 131072 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1393945 | 1.68 | 151643 | 7932 | 5.2 | 2.6 | 1 | 2.6367 | 0.3489 | 16384 |
| Vikhrmodels/Vikhr-YandexGPT-5-Lite-8B-it | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.2853 | 1.3855 | 16384 |
| yandex/YandexGPT-5-Lite-8B-instruct | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.359 | 1.2527 | 1E+030 |
| IlyaGusev/saiga_yandexgpt_8b | 1481531 | 1.78 | 129024 | 7306 | 5.7 | 2.5 | 1 | 2.5027 | 2.1723 | 32768 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1536792 | 1.85 | 32000 | 6322 | 19.8 | 2.4 | 1 | 4.2253 | 0.6623 | 2048 |
| openai-community/gpt2 | 1550846 | 1.87 | 50257 | 6680 | 13.3 | 2.4 | 1 | 2.7572 | 0.7449 | 1024 |
| facebook/opt-125m | 1550847 | 1.87 | 50265 | 6681 | 13.3 | 2.4 | 1 | 2.4144 | 0.6391 | 1E+030 |
| ai-forever/ruGPT-3.5-13B | 1828262 | 2.2 | 50257 | 3881 | 7.7 | 2.0 | 0 | 2.1597 | 0.7194 | 2048 |
| ai-forever/rugpt3small_based_on_gpt2 | 1925501 | 2.32 | 50257 | 3697 | 7.4 | 1.9 | 0 | 1.9954 | 0.8262 | 2048 |
| loim/ru_en_mini_bpe_8k | 2123707 | 2.55 | 8192 | 2709 | 33.1 | 1.7 | 1 | 2.4541 | 0.3799 | 32768 |

</details>
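
Cross-checking the rows suggests how the derived columns are computed (my inference from the numbers; the card does not publish the benchmark script): Compression appears to be tokens per word, Vocab Usage % is distinct tokens used over vocabulary size, and Avg Token Length is characters per token. A minimal sketch:

```python
# Inferred derivation of the table's ratio columns — an assumption
# verified only by recomputing rows, not the author's actual script.
def metrics(n_chars: int, n_words: int, n_tokens: int,
            vocab_used: int, vocab_size: int) -> dict:
    return {
        "compression": round(n_tokens / n_words, 2),      # tokens per word
        "vocab_usage_pct": round(100 * vocab_used / vocab_size, 1),
        "avg_token_len": round(n_chars / n_tokens, 1),    # chars per token
    }

# English-text row for deepseek-ai/DeepSeek-V3 from the table above:
print(metrics(27741474, 4613167, 5639822, 60979, 128000))
# → {'compression': 1.22, 'vocab_usage_pct': 47.6, 'avg_token_len': 4.9}
```

The same formulas reproduce the Russian and Toki pona rows, e.g. loim/ru_en_tok_mini_bpe_12k on the Russian text gives 1.86 / 75.7 / 4.0, matching the table.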