Tokenizer splits single-letter words into bytes
#1 opened by nshmyrevgmail
#!/usr/bin/env python3
from transformers import AutoTokenizer, AutoModel
device = 'cpu'
model = AutoModel.from_pretrained("RuModernBERT-small").to(device)
tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")
model.eval()
text = "а и у"
# Decode each input id individually to see how the text was split into tokens
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
For this code, the result is:
['[CLS]', '�', '�', ' ', '�', '�', ' у', '[SEP]']
It splits а and и into separate bytes. Is this intentional tokenization, or a bug in transformers?
transformers version 4.51.3
Hello!
It's clearly not a bug: this behavior is possible with BPE, where a single character can be split into multiple tokens. We agree this is not ideal to work with in some cases. To fix it, we released a revision with a patched tokenizer in which common Russian letters are single tokens.
You can try it like this:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained("deepvk/RuModernBERT-small", revision="patched-tokenizer")
tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-small", revision="patched-tokenizer")
text = "а и у"
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
# Output: ['[CLS]', 'а', ' ', 'и', ' у', '[SEP]']
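For context, a minimal sketch of why the base tokenizer can split these letters, assuming it is a byte-level BPE (which the byte-wise split above suggests): Cyrillic letters take two bytes in UTF-8, and if a letter's byte pair was never merged during training, each byte becomes its own token and decodes to '�' on its own.
# Cyrillic "а" (U+0430) occupies two bytes in UTF-8, so a byte-level BPE
# without a merge for this pair emits two single-byte tokens, each of which
# decodes to '�' by itself.
char = "а"                         # Cyrillic "а", not Latin "a"
print(char.encode("utf-8"))        # b'\xd0\xb0'
print(len(char.encode("utf-8")))   # 2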
Great, thank you!
SpirinEgor changed discussion status to closed