Language support?

#1
by maximegmd - opened

Hello,

Was the model trained exclusively on English?

@maximegmd Hi, this model is trained on the olmOCR dataset, which contains multilingual data. The base model, Qwen2.5-VL, is also multilingual, so I would expect the model to handle more than just English documents.

The olmOCR dataset should not contain multilingual data. The olmOCR authors mention that non-English text was filtered out during dataset acquisition:

"Using the Lingua package (Emond, 2025), we identify and filter out documents that were not in English."
