reducto/RolmOCR · Language support?

maximegmd

Apr 5

Hello,

Was the model trained exclusively on English?

yifeihu

Apr 5

@maximegmd Hi, this model is trained on the olmOCR dataset, which contains multilingual data. The base model, Qwen2.5-VL, is also multilingual. I would expect the model to handle more than just English documents.

andrei-41

Apr 25

The olmOCR dataset should not contain multilingual data. The authors mention that non-English documents were filtered out during dataset acquisition:

"Using the Lingua package (Emond, 2025), we identify and filter out documents that were not in English."