The tokenizer vocab contains mostly English words and Latin script rather than Arabic

#6
by issam9 - opened

Hi,
It seems that the tokenizer was not trained on text that is mainly Arabic script. When applied to Arabic text, the output is heavily over-segmented, and the model performs a lot worse on my task compared to other Arabic models. When I checked the vocab.txt file, it seems to contain mostly English tokens.
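
For reference, this is a minimal sketch of the kind of check I mean (the repo id below is a placeholder, not the actual model): tokenize a short Arabic sentence and see how many pieces it becomes, then count how many vocabulary entries contain Arabic script at all.

```python
# Minimal sketch, assuming a Hugging Face tokenizer on the Hub;
# "model-name" is a placeholder repo id, not the actual model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-name")  # placeholder repo id

# Tokenize a short Arabic sentence; with poor Arabic coverage the text is
# split into many single-character or subword pieces.
text = "النموذج يقسم النص العربي إلى قطع كثيرة"
tokens = tokenizer.tokenize(text)
print(tokens)
print(f"{len(tokens)} tokens for {len(text.split())} words")

# Estimate how much of the vocabulary contains Arabic script at all.
vocab = tokenizer.get_vocab()
arabic = sum(any('\u0600' <= ch <= '\u06FF' for ch in tok) for tok in vocab)
print(f"{arabic} / {len(vocab)} vocab entries contain Arabic characters")
```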
