Tokenizer splits single-letter words into bytes
#1 opened by nshmyrevgmail
#!/usr/bin/env python3
from transformers import AutoTokenizer, AutoModel
device = 'cpu'
model = AutoModel.from_pretrained("RuModernBERT-small").to(device)
tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")
model.eval()
text = "а и у"
# Decode each input id individually to see how the text was split into tokens
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
For this code, the result is:
['[CLS]', '�', '�', ' ', '�', '�', ' у', '[SEP]']
It splits а and и into separate bytes. Is this intentional tokenization, or a bug in transformers?
transformers version 4.51.3
Hello!
It's clearly not a bug: this behavior is possible with BPE, where a single character can be split into multiple tokens. We agree this is not ideal to work with in some cases. To fix it, we released a revision with a patched tokenizer in which common Russian letters are single tokens.
You can try it like this:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained("deepvk/RuModernBERT-small", revision="patched-tokenizer")
tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small", revision="patched-tokenizer")
text = "а и у"
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
# Output: ['[CLS]', 'а', ' ', 'и', ' у', '[SEP]']
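For context, a minimal sketch of why the base tokenizer can split these letters, assuming it is a byte-level BPE (which the byte-wise split above suggests): Cyrillic letters take two bytes in UTF-8, and if a letter's byte pair was never merged during training, each byte becomes its own token and decodes to '�' on its own.
# Cyrillic "а" (U+0430) occupies two bytes in UTF-8, so a byte-level BPE
# without a merge for this pair emits two single-byte tokens, each of which
# decodes to '�' by itself.
char = "а"                         # Cyrillic "а", not Latin "a"
print(char.encode("utf-8"))        # b'\xd0\xb0'
print(len(char.encode("utf-8")))   # 2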
Great, thank you!
SpirinEgor changed discussion status to closed