AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Published September 16, 2025


Introduction

Darija, the Moroccan Arabic dialect, is incredibly rich in visual content, from social media posts to handwritten notes. However, the lack of specialized OCR tools for Darija has been a significant barrier for developers and organizations working with Moroccan content. In this post, we'll explore how we built AtlasOCR, the first open-source Darija OCR model, a 3B-parameter model created by fine-tuning a Vision Language Model (VLM). We'll dive deep into the technical implementation, from data curation to model training, and share our findings on achieving state-of-the-art performance.

What you'll learn:

  • How to use AtlasOCR
  • VLMs, their architecture, and how they’re useful for OCR
  • Our approach to curating Darija-specific training data
  • Training strategies for Qwen2.5-VL 3B using Unsloth and QLoRA
  • Comprehensive ablation studies and their results

How to use AtlasOCR

You can use AtlasOCR on your own images by uploading them to the AtlasOCR Hugging Face Space, which runs on ZeroGPU.

[Figure: the AtlasOCR Hugging Face Space]

Or you can run the model locally with the following code:

from unsloth import FastVisionModel
from PIL import Image
from transformers import TextStreamer

image = Image.open("path/to/your/image.jpg")

# Load the model and processor (set load_in_4bit=True to reduce memory usage)
model, processor = FastVisionModel.from_pretrained(
    "atlasia/AtlasOCR",
    device_map="cuda:0",
    load_in_4bit=False,
    use_gradient_checkpointing="unsloth",
)
FastVisionModel.for_inference(model)  # switch Unsloth to inference mode

prompt = ("Below is the image of one page of a document written in arabic. "
          "Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate.")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": prompt}
    ]}
]

# Build the chat-formatted prompt, then tokenize it together with the image
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# Stream the extracted text token by token as it is generated
text_streamer = TextStreamer(processor, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer,
                   use_cache=True, temperature=1.5, min_p=0.1)

Background

Why is this important?

The utility of OCR in Darija extends far beyond academic interest:

  • Digital preservation: Converting historical documents and manuscripts
  • Social media analysis: Understanding public discourse and sentiment
  • Accessibility: Making visual content accessible to screen readers
  • Research: Enabling large-scale text analysis of Moroccan content

Vision Language Models

Vision-language models (VLMs) take images and text as input and generate text as output. They excel at zero-shot generalization and support a wide range of applications such as visual question answering, document understanding, image captioning, and interactive dialogue about images. These models have opened up new possibilities for OCR, especially for under-resourced languages like Darija. By combining visual understanding with language modeling, VLMs can interpret text in images more effectively than traditional OCR systems.

For our project, fine-tuning a VLM allowed us to take advantage of its ability to capture both the visual layout of text and the linguistic context of Darija. This made it possible to accurately recognize and extract text from a diverse range of real-world images.

Architecture in a nutshell

A VLM has three main components:

  • Vision encoder: converts an image/video into a vector embedding that encapsulates visual properties like color, shapes, etc.
  • Modality Projection module: aligns the visual features from the previous embeddings with the representation space of the language model.
  • Language model: takes the aligned embeddings, integrates them with any text input, and generates meaningful outputs in natural language.

The figure below is a simplified architecture of a Vision Language Model (VLM), composed of a vision encoder, a modality projection module, and a language model. (Source: Hugging Face Blog)

[Figure: simplified VLM architecture (Source: Hugging Face Blog)]
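To make the data flow concrete, here is a minimal, hypothetical sketch of this three-component pipeline in PyTorch. All module names are illustrative; real VLMs such as Qwen2.5-VL use far more sophisticated encoders, projectors, and token-merging schemes.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy three-component VLM: vision encoder -> projector -> language model."""

    def __init__(self, vision_encoder, language_model, vision_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder              # image -> patch embeddings
        self.projector = nn.Linear(vision_dim, llm_dim)   # modality projection module
        self.language_model = language_model              # autoregressive text decoder

    def forward(self, pixel_values, text_embeddings):
        visual = self.vision_encoder(pixel_values)           # (B, num_patches, vision_dim)
        visual = self.projector(visual)                      # align with the LLM's space
        fused = torch.cat([visual, text_embeddings], dim=1)  # prepend image tokens to text
        return self.language_model(inputs_embeds=fused)      # generate conditioned on both
```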

Data Curation

Building a large-scale OCR dataset for Moroccan Darija meant tackling one big challenge: real-world variety. Our goal was to capture this diversity while ensuring enough scale and quality to train robust models.

To get there, we combined two complementary approaches: synthetic generation with our own library, OCRSmith, and carefully curated real-world data from books, documents, and online sources.

OCRSmith: a toolkit for synthetic OCR data generation

Creating high-quality annotations for Darija is time-consuming and expensive. Synthetic data offered us a way to move fast while maintaining diversity. With OCRSmith, our open-source toolkit, we could simulate real-world conditions (fonts, layouts, backgrounds, distortions) and instantly generate tens of thousands of labeled images, complete with bounding boxes and metadata.
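The core idea is simple: render known text onto varied canvases so the label comes for free. Below is a generic illustration of that idea using Pillow; this is not OCRSmith's actual API, and the font path and sample sentence are placeholders. A production pipeline for Arabic script also needs proper right-to-left shaping (for example via Pillow's libraqm support).

```python
from PIL import Image, ImageDraw, ImageFont

def render_sample(text, font_path, size=(800, 200)):
    """Render `text` onto a blank canvas and return (image, ground-truth label)."""
    image = Image.new("RGB", size, color="white")  # plain background; vary in practice
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, 40)       # rotate fonts for diversity
    draw.text((size[0] - 20, 80), text, font=font,
              fill="black", anchor="ra")           # right-aligned for RTL text
    return image, text

# Hypothetical usage: the font file and sentence are placeholders.
sample_image, label = render_sample("سلام، لاباس؟", "fonts/arabic.ttf")
```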

Here are a few examples of synthetic Darija text generated with OCRSmith:

image/png

Real-World Data Sources

Synthetic data gave us scale, but real images gave us authenticity. We curated a wide mix of Darija text from different contexts:

Scanned Books

While books in Darija are rare, we tracked down two valuable sources: العَرَبِيَّةُ الدَّارِجَةُ by Mohammed El-Madlaoui El-Mounabhi and علشان الصغيرة والصغير by Farouk ElMarrakchi. From these, we extracted around 700 pages of high-quality Darija text, later enriched with pseudo-labels generated by Gemini 2.0 Flash.

[Figure: sample scanned book page]

Social Media Images

Platforms like LinkedIn turned out to be surprisingly rich in Darija content. We collected poster-style PDFs, many with educational material, and converted them into images for OCR training.

[Figure: sample social media poster]

Educational Documents

Darija appears frequently in Moroccan study materials, especially in driving license exams. These sources weren’t always clean scans—some were faded or cluttered—but with careful cropping and preprocessing, we recovered valuable text samples.

[Figure: sample educational document]

Cookbooks

Recipes written in Darija offered another unique source. We scanned Moroccan cookbooks, cropped out decorative elements, and enhanced contrast so that ingredient lists and instructions were clear and consistent for OCR.

[Figure: sample cookbook page]

Dataset Overview

By combining synthetic and real-world sources, we built the first large-scale Darija OCR dataset:

| Split | Samples | Total Words |
|------------|--------|-------------|
| Train | 26,162 | 9.5M |
| Validation | 3,930 | 1.2M |
| Total | 30,092 | 10.7M |

In terms of composition, about 86% of the dataset is synthetic, while 14% comes from real-world sources. This hybrid design gave us the best of both worlds: scale from synthetic data and authenticity from real-world images.

Training

Model Selection

Choosing the right base model was crucial for building a performant Darija OCR system under limited resources. We ran a benchmark evaluation on a manually curated set of 55 real-world images, comparing several open-source vision-language models.

While larger models exist, we focused on compact architectures (2B–3B parameters) to keep both training and inference efficient and accessible.

The results showed that Qwen2.5-VL 3B consistently outperformed the alternatives in handling Darija text across varied domains.

Training Strategy

To fine-tune Qwen2.5‑VL 3B for Moroccan Darija OCR, we adopted a parameter-efficient approach combining QLoRA and Unsloth, optimizing both performance and resource usage.

QLoRA (Quantized Low-Rank Adaptation) enables fine-tuning large models by quantizing them to 4-bit precision and introducing low-rank adapters. This method significantly reduces memory requirements (by up to 80%) while maintaining performance comparable to full fine-tuning.

Unsloth is a high-performance framework designed for efficient LLM fine-tuning. It accelerates training by up to 5x and reduces memory usage by 60% through optimized GPU kernels and memory management techniques.
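Putting the two together, the setup looks roughly like the sketch below. This is a minimal illustration based on Unsloth's public vision API, not our exact training script; the hyperparameters mirror the best configuration from the ablations that follow.

```python
from unsloth import FastVisionModel

# Load the base model with 4-bit quantized weights (the "Q" in QLoRA)
model, processor = FastVisionModel.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Attach trainable low-rank adapters on top of the frozen quantized weights
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,    # unfreezing vision layers helped (see ablations)
    finetune_language_layers=True,
    r=128,                          # LoRA rank
    lora_alpha=128,                 # LoRA scaling factor
    lora_dropout=0.05,
)
```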

Ablation Studies

To enhance the performance of Qwen2.5-VL 3B for Moroccan Darija OCR tasks, we conducted a series of ablation studies. These experiments aimed to tune key hyperparameters, assess the impact of different configurations, and identify the optimal settings for our training.

LoRA Hyperparameters (r, alpha, dropout)

We explored various combinations of the rank (r), scaling factor (alpha), and dropout rate to determine their effect on model performance.

| r | alpha | dropout | Min Eval Loss |
|-----|-------|---------|---------------|
| 32 | 32 | 0.00 | 0.2442 |
| 32 | 32 | 0.05 | 0.2456 |
| 64 | 64 | 0.05 | 0.2251 |
| 128 | 128 | 0.05 | 0.2132 |

Increasing both the rank (r) and scaling factor (alpha) improved performance, suggesting that allocating more parameters to the low-rank adapters enhances the model's capacity to learn task-specific features. Notably, setting dropout to 0 did not negatively impact performance, indicating that dropout may not be essential for this fine-tuning scenario.

Quantization Impact

We compared the effects of using 4-bit and 16-bit precision on model performance.

| Precision | Min Eval Loss |
|-----------|---------------|
| 4-bit | 0.2132 |
| 16-bit | 0.2124 |

The minimal difference in evaluation loss between 4-bit and 16-bit precision indicates that 4-bit quantization does not compromise performance. Given the significant reduction in memory usage and computational requirements, we opted for 4-bit precision to optimize resource utilization.

Batch Size and Learning Rate

The goal of the following experiments was to find the best combination of learning rate and batch size, for both r = α = 16 and r = α = 128.

| Batch Size | Learning Rate | Min Eval Loss |
|------------|---------------|---------------|
| 16 | 2e-4 | 0.3161 |
| 16 | 6e-4 | **0.2326** |
| 16 | 2e-3 | 4.0958 |
| 64 | 2e-4 | 0.3479 |
| 128 | 2e-4 | 0.4086 |
| 128 | 8e-4 | 0.2733 |
| 128 | 2e-3 | 0.2350 |
| 512 | 2e-4 | 0.5725 |

Table 1: Validation performance when fine-tuning Qwen2.5-VL 3B with LoRA (r = α = 16). The best result is highlighted in bold.

| Batch Size | Learning Rate | Min Eval Loss |
|------------|---------------|---------------|
| 16 | 6e-5 | 0.2652 |
| 16 | 2e-4 | **0.2132** |
| 128 | 2e-4 | 0.2456 |
| 128 | 8e-4 | 0.2165 |
| 128 | 2e-3 | 8.2561 |

Table 2: Validation performance when fine-tuning Qwen2.5-VL 3B with LoRA (r = α = 128). The best result is highlighted in bold.

Increasing the learning rate generally facilitated faster convergence (Tables 1 and 2), provided it did not lead to divergence (as happened with lr = 2e-3). Increasing gradient accumulation also helps, but requires recalibrating the learning rate. We used gradient accumulation instead of increasing the per-device batch size to optimize memory usage.

When increasing the rank and alpha values (see Table 2), it was necessary to adjust the learning rate and batch size to maintain optimal performance.
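As an illustration of the batch-size mechanics: with trainers from the trl/transformers ecosystem (which Unsloth plugs into), the effective batch size is the product of the per-device batch size and the accumulation steps. The values below are illustrative, not our exact training configuration.

```python
from trl import SFTConfig

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
config = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=4,   # what fits in GPU memory at once
    gradient_accumulation_steps=32,  # accumulate to an effective batch size of 128
    learning_rate=2e-4,              # recalibrate when the effective batch size changes
)
```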

Vision Layer Freezing

We evaluated the impact of freezing the vision layers during fine-tuning.

| Finetune Vision Layers | Min Eval Loss |
|------------------------|---------------|
| Yes | 0.2155 |
| No | 0.3173 |

Allowing the vision layers to be fine-tuned resulted in better performance, suggesting that adapting these layers to the specific characteristics of Moroccan Darija text enhances the model's recognition capabilities.

RSLoRA

Rank-Stabilized LoRA (rsLoRA) is an enhancement of the original LoRA method, designed to correct limitations in how LoRA scales with adapter rank. rsLoRA modifies the scaling factor in LoRA adapters, ensuring stable activations and gradients as the adapter rank increases, thereby unlocking better performance at higher ranks. For more details, check this blog post: https://huggingface.co/blog/damjan-k/rslora
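Concretely, standard LoRA scales each adapter update by a factor of α/r, whereas rsLoRA scales it by α/√r, which keeps the adapter's contribution from vanishing as the rank r grows.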

| RSLoRA Enabled | Min Eval Loss |
|----------------|---------------|
| No | 0.2132 |
| Yes | 8.2561 |

Enabling RSLoRA led to a significant degradation in performance, indicating that this method did not contribute positively to the fine-tuning process for our specific task. Further analysis is required to understand the underlying reasons for this outcome.

Evaluation

Benchmark

Benchmark Curation

To measure the real-world performance of AtlasOCR, we built AtlasOCRBench, a comprehensive evaluation benchmark tailored specifically for Darija. It brings together two key data sources:

  • Scanned Darija books — high-quality, real-world printed text
  • Synthetic data from OCRSmith — clean, controlled samples designed to test specific OCR challenges.

[Figure: AtlasOCRBench data composition]

To create the benchmark dataset, we adopted a two-step pseudo-labeling process (a sketch of step 1 follows the list):

  1. Pseudo-labeling with Gemini API
    For the scanned book images, we used Gemini 2.0 Flash to produce the first draft of extracted text. Our prompt was carefully crafted to prioritize human readability over preserving the original text block layout:

    Extract the text from the provided image without translating it.
    Make sure the output is formatted in a human-readable format; this is more important than just preserving the placement of text blocks as they are.
    Output only the extracted text and nothing else.

  2. Human annotation
    Using Argilla for collaborative editing, we reviewed, corrected, and standardized the text to ensure high-quality ground truth.
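For reference, step 1 can be reproduced in a few lines with the google-generativeai client. This is a minimal sketch rather than our exact pipeline; the API key and file path are placeholders.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash")

prompt = (
    "Extract the text from the provided image without translating it.\n"
    "Make sure the output is formatted in a human-readable format; this is more "
    "important than just preserving the placement of text blocks as they are.\n"
    "Output only the extracted text and nothing else."
)

page = Image.open("path/to/scanned_page.png")      # placeholder path
response = model.generate_content([prompt, page])  # multimodal request
pseudo_label = response.text                       # first-draft transcription
```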

Benchmark Composition

Our benchmark contains 251 samples in total, including 55 sourced from scanned books. This dataset covers a wide range of text types and difficulty levels, ensuring that evaluation results reflect realistic OCR challenges.

Limitations

At present, our benchmark is limited in the variety of its data sources. We plan to enhance its diversity in future iterations of this work.

Evaluation Metrics

To measure AtlasOCR’s accuracy and compare it with other models, we evaluated the model using Character Error Rate (CER) and Word Error Rate (WER), two standard metrics in OCR research.

  • Character Error Rate (CER) CER measures the number of character-level edits (insertions, deletions, substitutions) needed to transform the OCR output into the ground truth, normalized by the length of the ground truth text. CER ignores word boundaries and focuses purely on character accuracy.

  • Word Error Rate (WER) WER measures the number of word-level edits required to match the prediction with the ground truth, normalized by the number of words in the ground truth. WER is useful for understanding how often an entire word is misrecognized, but for Darija it can be misleading: even a single character difference can mark a word as “wrong,” inflating the error rate.

While both provide valuable insights, CER is particularly well-suited for Darija, and here’s why:

  • Darija lacks a standardized spelling system, the same word may appear with different orthographic variants.
  • Evaluating at the word level (as WER does) would over-penalize small spelling variations that don’t affect meaning.
  • CER is more sensitive to minor recognition errors such as missing letters or incorrect diacritics.

Methodology

Our evaluation process ensures fairness and consistency before calculating CER and WER:

  1. Text Normalization

    • Remove Arabic diacritics (harakat), which are optional in most modern text and inconsistently used in Darija materials.
    • Replace line breaks with spaces and strip extra whitespace.
    • This step ensures that superficial differences (like formatting) don’t affect evaluation.
  2. Metric Calculation

    • CER: Remove spaces, compare at character level.
    • WER: Tokenize by spaces, compare at word level.

In practice, CER is our primary evaluation metric for AtlasOCR, as it better reflects the true difficulty of OCR in a language without fixed spelling conventions.
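To make the pipeline concrete, here is a minimal sketch of the normalization and metric computation described above. The Unicode range used for harakat is a common choice and may differ in small details from our exact implementation.

```python
import re

HARAKAT = re.compile(r"[\u064B-\u0652]")  # Arabic diacritics: fathatan .. sukun

def normalize(text: str) -> str:
    text = HARAKAT.sub("", text)              # 1a. strip optional diacritics
    text = text.replace("\n", " ")            # 1b. line breaks -> spaces
    return re.sub(r"\s+", " ", text).strip()  # 1c. collapse extra whitespace

def edit_distance(a, b) -> int:
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (or match)
        prev = curr
    return prev[-1]

def cer(pred: str, ref: str) -> float:
    # CER: remove spaces, compare at character level
    p, r = normalize(pred).replace(" ", ""), normalize(ref).replace(" ", "")
    return edit_distance(p, r) / max(len(r), 1)

def wer(pred: str, ref: str) -> float:
    # WER: tokenize by spaces, compare at word level
    p, r = normalize(pred).split(), normalize(ref).split()
    return edit_distance(p, r) / max(len(r), 1)
```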

Results

Leaderboard

We evaluated AtlasOCR on two major benchmarks: KITAB-Bench and our own AtlasOCRBench. By comparing AtlasOCR’s performance on both KITAB-Bench and AtlasOCRBench, we provide a comprehensive assessment of its capabilities on standard Arabic OCR tasks as well as the unique challenges present in Moroccan Darija and real-world documents.

AtlasOCRBench

[Figure: AtlasOCRBench leaderboard results]

KITAB-Bench

KITAB-Bench is a large-scale, multi-domain benchmark for Arabic OCR and document understanding, covering over 8,800 samples across diverse domains such as printed and handwritten text, tables, charts, and complex layouts.

[Figure: KITAB-Bench leaderboard results]

Even with a primary training focus on Darija, AtlasOCR shows remarkable generalization to the standard Arabic KITAB-Bench, where it competes with larger models like Gemma3 (12B) and Qwen2.5-VL (7B). This indicates that AtlasOCR is not limited to its specialized domain but is fundamentally robust for broader Arabic OCR tasks.

Diverse Examples

Our model demonstrates robust performance across various scenarios:

Each example below compares the outputs of AtlasOCR, Qari-OCR-v0.3-VL-2B-Instruct, and Qwen2.5-VL-3B-Instruct on the same source image (source images omitted here):

**Example 1**
- AtlasOCR: الصبر للجفاف الجاكارندا كاتصبر للجفاف شّووية، فاش كاتنضج، كاتقدر تصبر لثوبات ديال 4 تال 6 سيمانات بلامة، ولكن فاش كاتكون باقة صغيرة، كايخصها تتسقى عللدوام باش تجدّر مزيان. لاكان السقي مستامر وقت الجفاف، الشجرة كاتكبر صحيبة وكاتتورّب بكترة . الارض اللي كاتفضل الجاكارندا تقدر تتغرس فݣاع الارض اللي تصفيتها مزيانة، لا ماكانتش مدكوكهة ولّا مشّيعة ما ولّا فايتة الحدّ فالطين ولّا الجير ولّا السيليكا.
- Qari-OCR-v0.3-VL-2B-Instruct: الصبر للجفافالجاكاراندا كاتصر للجفاف شوية، فاش كاتنضح، كاتقدر تصر لنوبات ديال 4 تال 6 سيمانات بلا ما، ولكن فاش كانكون باقة صغيرة، كايخصها تتسقى عللدوام باش تجدّر مزيان، لا كان السقي مستامّر وقت الجفاف، الشجرة كانكبر صحيحة وكاتنوّرّ بكترة .الارض اللي كاتفضلالجاكاراندا تقدر تنغرس فكّاع الارض اللي تصفيتها مزيانة، لا ماكاتش مدكوة وَّا مشبّعة ما ولّا فايتة الحدّ فالطين ولّا الجير ولّا السيليكا .
- Qwen2.5-VL-3B-Instruct: الصبر للجفاف الجاكارندا كاتصبر للجفاف شوويه، فاش كانتضج، كأنقدر تصبر لنوبات ديل 4 تال 6 سيمانات بلاماء، ولكن فاش كانتون باقة صغيرة، كايخصها تستقي عللدوام باش تجدّر مزيان. لاكان السقي مستامر وقت الجفاف، الشجرة كاتكرت صحيحة وكاتنور بكترة. الارض اللي كاتفضل الجاكارندا تقدر تتغرس فَنَاع الارض اللي تصفيتها مزيانة، لا ماكانتش مدكوكة ولا مشبعة ما لا فايتة الحدّ فالطين ولّا الجير ولا السيلينا.

**Example 2**
- AtlasOCR: هادو تيّورات (pipes) اللي كايندازو ب آباق يحميو الجدع ديال الشجرة من الضرز.
- Qari-OCR-v0.3-VL-2B-Instruct: هادو نيووات م (pipes) اللّي كابتدازو باق يحميو الجدع ديال الفنجرةَ من الشّورِ.
- Qwen2.5-VL-3B-Instruct: هذا نبروات (pipes) اللي كايدنزاو تابش يحميو الجدع ديل الشجرة من الضرر.

**Example 3**
- AtlasOCR: الشلوح و العرب ديال شمال افريقية، كان عندهم دور كبير فوصول الزنبوع لاوروبا من الاندلس، يّام كانو حاكمين طرف من صبانيا- ديال دابا- من القرن 8 تالقرن 15. بزّاف من هاد الاندلسيين كانو احسنن الغْلاما فعلم النبات فالعصور الوسطى.
- Qari-OCR-v0.3-VL-2B-Instruct: الشلوح و العرب ديال شمال افريقية، كان عندهم دور كبير فوصول الزنبوع لاوروبا من الاندلس، يّام كانو حاكمين طرف من صبانيا- ديال دابا- من القرن 8 .تاللقرن 15.بزّاف من هاد الاندلسيين كانوا احسن الغُلاما فعلم النبات فالعصور الوسطى.
- Qwen2.5-VL-3B-Instruct: الشلوح و العرب ديال شمال افريقية، كان عندهم دور كبير فوصول الزنبوع لاوروبا من الاندلس، يام كانو حاكمين طرف من صبيانيا - ديل دابا - من القرن 8 تاللقرن 15. بزاف من هاد الاندلسيين كانو احسن الغلاما فعلم النبات فالعصور الوسطى.

**Example 4**
- AtlasOCR: شوف تيجان الشجر الوافية اللي كانو مستّفين فشارع الحسن التاني... كان الضل وطرقان البيكالات، و كان الواحد يقدر يتمشى ولّا يبيدل وهو مضلل و مضرݣ من حرّ الشمس.
- Qari-OCR-v0.3-VL-2B-Instruct: شوف تيجان الشجر الوافية اللي كانو مستّفين فشارع الحسن الثاني ... كان الضل و طرقان البيكالات، و كان الواحد يقدر يتمشى ولّا يبيدل وهو مضلل و مضرّك من حزّ الشمس.
- Qwen2.5-VL-3B-Instruct: See the tree of the wavy tree that they have planted on the street. The secondحسن... It was the road and the bicycle path, and the first was the only one who could stand under the sun and be hidden from the sun.

**Example 5**
- AtlasOCR: جاري البركاك بعض مرات كتلف مع هذ القوم، مثلا مؤخرا تلقيت واحد السيد، لاباس لاباس، قلت ليه اش خبار فلان، قال ليا مبقيتش كنهضر معاه، قلت ليه لاش؟ قال ليا كيوصل الهضرة، بركاك ،قلت ليه فحياش؟ قال ليا "حنا سوّلاناه على جارو فين خدام؟ زداق قالها ليك نتا فالبلاصة؟ شوفت فيه وقلت ليه راه اصلا انا للي سولتني، يعني راه خصك دبز مع راسك حيتاش نتا كتوصل الهضرة بلا مكتشعر وزيدون دبا شكون بركاك نتا ولا هو! وانا ننشد ليه هذ الابيات المرتجلة "نعيب التبركيك ، والعيب في البركاكة، وما لزمننا عيب سوى في البركاكة"
- Qari-OCR-v0.3-VL-2B-Instruct: جاري البركاكبعض مرات كتلف مع هذ القوم، مثلا مؤخرا تلقيت واحد السيد، لاباس لاباس، قلت ليه اش خبار فلان، قال ليا مبقيتش كنهضر معاه، قلت ليه لاش؟ قال ليا كيوصل الهضرة، بركاك ،قلت ليه فحياش؟ قال ليا "حنا سوّلاناه على جارو فين خدام؟ زداق قالها ليك نتا فالبلاصة؟ شوفت فيه وقلت ليه راه اصلا انا للي سولتني، يعني راه خصك دبز مع راسك حيتاش نتا كتوصل الهضرة بلا مكتشعر وزيدون دبا شكلون بركاك نتا ولا هو!وانا نشد ليه هذ الابيات المرتجلة "نعيب التبركيك ، والعيب في البركاكة، وما لزمننا عيب سوى في البركاكة".
- Qwen2.5-VL-3B-Instruct: بعض مرات كتلف مع هذا القوم، مثلا مؤخرا تلقيت واحد السيد، لاباس لاباس، قلت ليه اش خبار فلان، قال ليها مبقيتش كنهضر معاه، قلت ليه لاش؟ قال ليها كيوصل الهضرة، بركاك، قلت ليه فحياش؟ قال ليها "حنا سولاناه على جارو فين خدام؟ زداق قالها ليك نتا فالبلاصة؟ شوفت فيه وقلت ليه راه اصلا اننا للي سولنتي، يعني راه خصك دبز مع راسك حيتاش نتا كتوصل الهضرة بلا مكتشر وزيدون دبا شكون بركاك نتا ولا هو! وانا ننشد ليه هذ الابيات المرتجلة "نعييب التبركيك، والعيب في البركاكا، وما لزمننا عيب سوى في البركاكا".

**Example 6**
- AtlasOCR: العربي حْنbيپي العربي بَرْجِاجْ كيڤلّي الدريّات من الرّاس لْصّاݣ. كُل نّهار، صيف ؤشْتوآ، كيشرّي الجْوْرْنال وكيجْلس يتيفهوى فالشارع. قْهوة مهرّسة أْ جويرُو، أ نوْبة نوْبة كيهز عينيه ويخطف شويفة. خْتى حاجة ما غابية عليه، تبازك الله. طايز مع الطْيًور، أ كيعزف الدْجاج البلدي و الدجاج الرومي أخْتى لكزوّازي. هادي زوينة، هادي سْمينة، هادي مغْزبة، هادي مْحْببّة دايْراها سْبة، هادي مهْبطّة السقطة، هادي كريشتها غربيّانة بلا غطا. هادي عْزْبة بايْرة، هادي فصالة ما دايْرة، هادي سْبحانْ من خلقها، هادي غْادة
- Qari-OCR-v0.3-VL-2B-Instruct: الثعلب العربيحُبِيبِي العربي بْرُجَاجْ كَيْفْلِي الذِّرِيَاتِ مَنْ الرَّاسَّ لْصََاجْ . كُلُّ نُهَارٌ، صِيفٌ وَشْتُوَا، كَيْشْرِي الْجُورْنَالْ وَكَيْجِلْسٌ يَقْهُوَى فَالشَّارِعْ . قُهْوَةَ مَهْرْسَةَ أُجْوِيرُو، أَنُوبَةَ نُوبَةَ كَيْهْرُ عِينِيهْ وَيُخْطَفْ شُوِيفَةَ . حَتَّى حَاجَةَ مَا غَابِيبَةَ عُلِيهْ، تُبَارِكَ اللَّهُ . طَائِرٌ مُعَ الطَّيُورُ، أَكَيْعَرْفْ الدِّجَاجْ الْبُلْدِي وَالدِّجَاجْ الرُّومِي أُحْتَى لْكَرْوَازِي . هَادِي زَوِينَةَ ، هَادِي سُمِينَةَ ، هَادِي مُقْرْبَةَ ، هَادِي مُحْتَجَبَةَ دَايْرَاهَا سُبْهَةَ ، هَادِي مُهْبِطَةَ السُّمْطَةَ ، هَادِي كُرِيشْتَهَا عُرْيَانَةَ بْلَا غَطَا ، هَادِي ****عُزْبَةَ بَايْرَةَ ، هَادِي فَصَالَةَ مَا دَايْرَةَ ، هَادِي سُبْحَانَ مَنْ خُلْقُهَا ، هَادِي غَادَةَ
- Qwen2.5-VL-3B-Instruct: العَرَبِيْ حَبَبي العَرَبِيْ بَرْجَاجْ كَفَّلِي الْدِّرَائِثْ مِنْ الرَّاسْ لَصَاجْ. كُلْ نَهَارْ، صِيفٌ وشَتَواْ كَيْشْرِي الجُورْنَالْ وَكَيْجِلسْ يَبْقَهْوَى فالشَّارِعْ. قَهْوَةٌ مَهْرَسَةٌ أَجُورُوْ، أَنْوَابَةٌ نُوبَةٌ كَهَرْ عِينِيْهْ وَيَحْطَفْ شَوْفِيْهْ. حَتَّى حَاجَةٌ مَا غَابِيَةٌ عَلِيْهْ، تَبَارَكَ اللَّهْ. طَايِرْ مَعْ الطَّيُورْ، أَكْتَعْرَفْ الدَّجَاجْ البَلْدِيْ وَ الدَّجَاجْ الرُّومِيْ أَحْتَى لَكْرْوَازِيْ. هَادِيْ زَوِيْنَةْ، هَادِيْ شَمِيدَةْ، هَادِيْ مَغْزَبَةْ، هَادِيْ مَحْتَجَبَةْ دَايْرَاهَا سَبَبَةْ، هَادِيْ مَهْبَطَةْ السَّمْقَةْ، هَادِيْ كَرِيشْتَهَا عَرْبِيَانَةْ بَلاْ غَطَاْ، هَادِيْ عَزْبَةْ بَايرَةْ، هَادِيْ فَصَالَةْ مَا دَايْرَةْ، هَادِيْ سَبْحَانْ مَنْ خَلْقَهَاْ، هَادِيْ غَادَةْ

**Example 7**
- AtlasOCR: الإسلام مدكور معا الزيتون فالقسم ديال الله، بفضل المنافع الكثيرة للصحة اللي فيه
- Qari-OCR-v0.3-VL-2B-Instruct: الإسلام مذكور معا الزيتون فالقسم ديال الله، بفضل المنافع الكتيرة للصحة اللي فيه
- Qwen2.5-VL-3B-Instruct: الإسلام مذكور معًا الزيتون فالقسم دينال الله، بفضل المنافع الكثير للصحة اللي فيه

Conclusion

Limitations

  • Diacritics Handling: The system is primarily trained and evaluated on undiacritized text, and may not accurately recognize or reconstruct diacritics when present.
  • Complex Layouts: Although robust to many layouts, performance may degrade on highly complex or non-standard document structures.

Future Work

  • Dataset Expansion: Incorporate more handwritten notes, documents with full diacritics, and a wider variety of real-world document types to improve robustness and generalization.
  • Compact Model Development: Develop a smaller, more efficient version of AtlasOCR (targeting <3B parameters) for easier deployment on mobile and edge devices.
  • Layout Understanding: Improve the model’s ability to parse and extract information from documents with complex or unusual layouts.

Key Takeaways

  • AtlasOCR is the first open-source OCR model for Moroccan Darija, built on top of Qwen2.5-VL 3B.
  • A mix of synthetic and real data turned out to be the winning recipe: OCRSmith gave us scale, while scanned books, social media posts, and documents brought authenticity.
  • With QLoRA + Unsloth, we managed to fine-tune a 3B model efficiently on limited hardware, keeping training practical without losing accuracy.
  • Our experiments showed what works (and what doesn’t): higher LoRA ranks and unfreezing vision layers helped, while RSLoRA wasn’t useful in this case.
  • For Darija, character-level accuracy (CER) is a better measure than word-level, since spelling isn’t standardized.
  • We released AtlasOCRBench, the first benchmark tailored to Darija OCR, with both synthetic and human-validated data.
  • Even though it’s trained for Darija, AtlasOCR generalizes well to Arabic OCR tasks, performing close to much larger models.

Acknowledgments

Special thanks to Khaoula Alaoui Belghiti and Zaid Chiech for their invaluable help with annotation and the release of the project.

Call to action

AtlasIA is a Moroccan AI community that builds open-source AI models and datasets for Moroccan dialects.

🤝 To support AtlasIA, you can donate on one of the following platforms:

Get involved:


This blog post is part of our series on Moroccan Open Source AI. Check out our other posts on our website www.atlasia.ma.
