The tokenizer vocab contains mostly English words and Latin script rather than Arabic

#6
by issam9 - opened

Hi,
It seems that the tokenizer was not trained on text that is mainly Arabic script. When applied to Arabic text, the output is heavily over-segmented, and the model performs a lot worse on my task compared to other Arabic models. When I checked the vocab.txt file, it seems to contain mostly English tokens.
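
For reference, this is a minimal sketch of the kind of check I mean (the repo id below is a placeholder, not the actual model): tokenize a short Arabic sentence and see how many pieces it becomes, then count how many vocabulary entries contain Arabic script at all.

```python
# Minimal sketch, assuming a Hugging Face tokenizer on the Hub;
# "model-name" is a placeholder repo id, not the actual model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model-name")  # placeholder repo id

# Tokenize a short Arabic sentence; with poor Arabic coverage the text is
# split into many single-character or subword pieces.
text = "النموذج يقسم النص العربي إلى قطع كثيرة"
tokens = tokenizer.tokenize(text)
print(tokens)
print(f"{len(tokens)} tokens for {len(text.split())} words")

# Estimate how much of the vocabulary contains Arabic script at all.
vocab = tokenizer.get_vocab()
arabic = sum(any('\u0600' <= ch <= '\u06FF' for ch in tok) for tok in vocab)
print(f"{arabic} / {len(vocab)} vocab entries contain Arabic characters")
```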
