Language support?
Hello,
Was the model trained exclusively on English?
@maximegmd Hi, this model is trained on the olmOCR dataset, which contains multilingual data. The base model, Qwen2.5-VL, is also multilingual. I would expect the model to handle documents in languages other than English.
The olmOCR dataset should not contain multilingual data. They mention non-English text was filtered during dataset acquisition.
"Using the Lingua package (Emond, 2025), we identify and filter out documents that were not in English."