Fill-Mask
Transformers
Safetensors
Russian
English
modernbert
TatonkaHF committed on
Commit 797f7ff · verified · 1 Parent(s): 3d74599

Update README.md

Files changed (1)
  1. README.md +27 -0
README.md CHANGED
@@ -21,6 +21,33 @@ RuModernBERT was pre-trained on approximately 2 trillion tokens of Russian, Engl
  | [deepvk/RuModernBERT-small](https://huggingface.co/deepvk/RuModernBERT-small) | 35M | 384 | 12 | 50368 | 8192 | Masked LM |
  | deepvk/RuModernBERT-base [this] | 150M | 768 | 22 | 50368 | 8192 | Masked LM |

+ ## Notice ⚠️
+
+ The patched tokenizer is provided under the [patched-tokenizer](https://huggingface.co/deepvk/RuModernBERT-base/tree/patched-tokenizer) revision.
+
+ <details>
+ <summary>Details</summary>
+
+ We observed that several Russian lowercase letters were split into multiple subword tokens. This is problematic for tasks such as Named Entity Recognition (NER), where the first token of a word should be a semantically meaningful unit.
+
+ To address this, we release a patched revision of the tokenizer with a minimal but targeted change: six common Russian lowercase letters *(а, е, и, н, о, т)* are now encoded as single tokens. These tokens were assigned to [unusedX] slots in the vocabulary, and the corresponding BPE merges were added to ensure single-token encoding during inference. To preserve compatibility with the pretrained model, each new token was initialized with the embedding of its uppercase counterpart in both tok_embedding and lm_head. To prevent duplicate vectors and maintain robustness, a small amount of Gaussian noise (gamma = 1e-3) was added during initialization.
+
+ We evaluated the patched model on 20 tasks from the RuMTEB benchmark and observed no statistically significant differences in performance compared to the original version. If your task is sensitive to tokenization granularity, as in NER, we recommend using this updated revision.
+
+ Usage example:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ model_id = "deepvk/RuModernBERT-base"
+
+ # You can specify the revision
+ revision = "patched-tokenizer"
+ tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
+ model = AutoModelForMaskedLM.from_pretrained(model_id, revision=revision, attn_implementation="flash_attention_2")
+ ```
+
+ </details>

  ## Usage
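As background for the splitting issue this commit addresses: a byte-level BPE tokenizer (which, I assume here, is the style of tokenizer used) starts from UTF-8 bytes, and every Cyrillic letter occupies two bytes, so a letter without a dedicated merge rule is emitted as two byte-level tokens. A minimal check of the byte lengths, independent of the actual RuModernBERT vocabulary:

```python
# Each of the six patched Russian lowercase letters occupies two bytes
# in UTF-8, so a byte-level BPE with no merge spanning both bytes would
# split the letter into two subword tokens.
letters = "аеинот"
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in letters}
print(byte_lengths)  # every letter maps to 2
```

This is why the patch adds explicit merges: they let the two bytes of each letter collapse into one token.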
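The initialization step described in the patch notes — copying the uppercase counterpart's embedding into the repurposed [unusedX] slot, then adding small Gaussian noise — can be sketched as follows. This is an illustration only: the dimensionality and vectors are toy stand-ins, not the model's actual weights.

```python
import random

random.seed(0)
hidden = 8  # toy dimensionality; RuModernBERT-base uses 768

# Stand-in embedding row for the uppercase letter token.
upper_vec = [random.gauss(0.0, 1.0) for _ in range(hidden)]

# New lowercase token: copy the uppercase embedding, then add small
# Gaussian noise (gamma = 1e-3) so the two rows are not exact duplicates.
gamma = 1e-3
new_vec = [w + random.gauss(0.0, gamma) for w in upper_vec]
```

Per the commit notes, the same initialization would be applied to the corresponding rows of both tok_embedding and lm_head.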