
Tokenizer splits single-letter words on bytes

#1 opened by nshmyrevgmail
#!/usr/bin/env python3

from transformers import AutoTokenizer, AutoModel

device = 'cpu'
model = AutoModel.from_pretrained("deepvk/RuModernBERT-small").to(device)
tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small")
model.eval()

text = "а и у"
# Decode each input id separately to see the individual tokens
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])

The result of this code is:

['[CLS]', '�', '�', ' ', '�', '�', ' у', '[SEP]']

It splits а and и into separate bytes. Is this intentional tokenization, or a bug in transformers?

transformers version 4.51.3

deepvk (org)

Hello!
It's not a bug: this behavior is expected with byte-level BPE, where a single character can be split into multiple byte tokens. We agree that this is inconvenient to work with in some cases, so to fix it we released a revision with a patched tokenizer in which common Russian letters are single tokens.
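
For example, to see what is actually happening, you can look at the raw token strings instead of decoding each id separately (a minimal sketch against the original, unpatched tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small")

text = "а и у"
enc = tokenizer(text)

# With byte-level BPE, a character that has no merged vocabulary entry
# is represented by its individual UTF-8 bytes, so decoding the ids one
# at a time prints '�' for each partial byte.
print(enc.tokens())

# Decoding the whole sequence at once reassembles the bytes correctly.
print(tokenizer.decode(enc['input_ids']))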

You can try the patched revision like this:

from transformers import AutoTokenizer, AutoModel

model = AutoModel.from_pretrained("deepvk/RuModernBERT-small", revision="patched-tokenizer")
tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small", revision="patched-tokenizer")

text = "а и у"
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
# Output: ['[CLS]', 'а', ' ', 'и', ' у', '[SEP]']
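
If you want a quick end-to-end check, the patched revision can also be passed to a fill-mask pipeline; this is just a sketch, and the exact predictions will depend on the model:

from transformers import pipeline

fill = pipeline(
    "fill-mask",
    model="deepvk/RuModernBERT-small",
    revision="patched-tokenizer",
)

# Build the prompt with the tokenizer's own mask token
masked = f"Привет, {fill.tokenizer.mask_token}!"
for pred in fill(masked, top_k=3):
    print(pred['token_str'], pred['score'])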

Great, thank you!

SpirinEgor changed discussion status to closed
