---
library_name: peft
base_model:
- unsloth/Qwen2-VL-2B-Instruct-unsloth-bnb-4bit
pipeline_tag: image-text-to-text
tags:
- ocr
- urdu
- qwen2vl
---

# Qaari 0.1 Urdu: OCR Model for Urdu Language

## Model Description

Qaari 0.1 Urdu is a fine-tuned version of [Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B) optimized for Optical Character Recognition (OCR) of Urdu text. On the evaluation reported in this card, it reaches a Word Error Rate (WER) of 0.048, compared with 0.352 for Tesseract and 1.823 for the base model.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/mTNZl3lvqcsboWdRkkEWk.png)


## Key Features

- **Specialized for Urdu OCR**: Optimized for recognizing Urdu script with high accuracy
- **Superior Performance**: Achieves 97.35% reduction in Word Error Rate compared to the base model
- **High Accuracy**: 0.048 WER and 0.029 CER, with a BLEU score of 0.916
- **Balanced Output Length**: Near-perfect length ratio of 0.978 (ideal is 1.0)

## Performance Metrics

| Model | WER ↓ | CER ↓ | BLEU ↑ | Length Ratio |
|-------|-------|-------|--------|--------------|
| **Qaari 0.1 Urdu** | **0.048** | **0.029** | **0.916** | **0.978** |
| Tesseract | 0.352 | 0.227 | 0.518 | 0.770 |
| Qwen Base | 1.823 | 1.739 | 0.009 | 1.288 |
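For readers unfamiliar with these metrics, the following is a minimal sketch of how WER, CER, and a length ratio are commonly computed from a reference transcription and a model output. It is not the evaluation script used for this card. The function names are illustrative, and the length-ratio definition (output word count divided by reference word count) is an assumption, since the card does not define it.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def length_ratio(reference, hypothesis):
    """Assumed definition: output word count / reference word count."""
    return len(hypothesis.split()) / len(reference.split())


if __name__ == "__main__":
    reference = "یہ ایک کتاب ہے"
    hypothesis = "یہ ایک کتب ہے"  # one misrecognized word
    print("WER:", wer(reference, hypothesis))
    print("CER:", cer(reference, hypothesis))
    print("Length ratio:", length_ratio(reference, hypothesis))
```

Because insertions count as errors, WER and CER can exceed 1.0 when a model emits far more text than the reference contains, which is consistent with the base model's scores above 1 and its length ratio of 1.288.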


<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/2WTrdDg0MZ9MyDmY1zH97.png" width="600px"/>
<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/ywEziefOQh58NHH4AaOIE.png" width="600px"/>

### Improvement Percentages

| Comparison | WER Improvement | CER Improvement | BLEU Improvement |
|------------|-----------------|-----------------|------------------|
| vs. Qwen Base | 97.35% | 98.32% | 91.55% |
| vs. Tesseract | 86.25% | 87.11% | 82.60% |
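The WER and CER columns above appear to be relative reductions, that is, (baseline − Qaari) / baseline. This is an assumption, and the sketch below applies it to the rounded values from the Performance Metrics table. The results differ slightly from the reported percentages (for example, roughly 97.37% rather than 97.35% for WER against the base model), presumably because the reported figures were computed from unrounded scores. The formula behind the BLEU column is not stated in this card, so it is not reproduced here.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Rounded scores copied from the Performance Metrics table.
scores = {
    "Qaari 0.1 Urdu": {"WER": 0.048, "CER": 0.029},
    "Tesseract": {"WER": 0.352, "CER": 0.227},
    "Qwen Base": {"WER": 1.823, "CER": 1.739},
}

if __name__ == "__main__":
    qaari = scores["Qaari 0.1 Urdu"]
    for baseline_name in ("Qwen Base", "Tesseract"):
        for metric in ("WER", "CER"):
            reduction = relative_reduction(scores[baseline_name][metric], qaari[metric])
            print(f"{metric} vs. {baseline_name}: {reduction:.2f}% lower")
```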


<img src="https://cdn-uploads.huggingface.co/production/uploads/630535e0c7fed54edfaa1a75/_736kX6GQMkOgf22BVPgS.png" width="600px"/>


## Supported Fonts
The model was fine-tuned on the following fonts:
- AlQalam Taj Nastaleeq Regular
- Alvi Nastaleeq Regular
- Gandhara Suls Regular
- Jameel Noori Nastaleeq Regular
- NotoNastaliqUrdu-Regular

## Supported Font Sizes
The model has been tested and optimized for the following font sizes:
- 14pt
- 16pt
- 18pt
- 20pt
- 24pt
- 32pt
- 40pt


## Usage


[Try Qaari - Google Colab](https://colab.research.google.com/github/Oddadmix/notebooks/blob/main/Qaari_0_1_Urdu.ipynb)

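You can load this model using the `transformers` and `qwen_vl_utils` libraries. Install the dependencies first (the leading `!` is notebook syntax; the version specifier is quoted so the shell does not treat `>` as a redirect):
```
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```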

```python
import os

from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the fine-tuned model and its processor.
model_name = "oddadmix/Qaari-0.1-Urdu-OCR-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# Load the page to transcribe. "page.png" is a placeholder path;
# replace it with your own image.
image = Image.open("page.png")

# Save a temporary copy so it can be referenced as a file:// URI.
src = "image.png"
image.save(src)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat-formatted prompt and the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on.
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

os.remove(src)  # delete the temporary copy
print(output_text)
```


## Limitations

- Performance may degrade when using fonts not included in the fine-tuning dataset
- Font sizes outside the supported range may result in suboptimal recognition
- The model may not handle complex ligatures in non-Nastaleeq scripts effectively
- Performance on digital-only displays has not been fully optimized
- Low-resolution prints or scans might reduce recognition quality
- Custom font modifications or non-standard Nastaleeq variants might not be recognized as expected

## Training Details

This model was fine-tuned from Qwen2-VL-2B using a dataset of Urdu text images with paired transcriptions. The training process focused on optimizing for accurate Urdu character recognition and natural language understanding.

### Training Dataset

- **Dataset Type**: Paired Urdu text images with ground truth transcriptions
- **Size**: 10,000 samples
- **Source**: Synthetic dataset

### Training Configuration

- **Base Model**: Qwen/Qwen2-VL-2B
- **Hardware**: A6000 GPU
- **Training Time**: 24 hours

## Citation

If you use this model in your research, please cite:

```
@misc{qaari-0.1-urdu,
  author = {Ahmed Wasfy},
  title = {Qaari 0.1 Urdu: OCR Model for Urdu Language},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/oddadmix/Qaari-0.1-Urdu-OCR-VL-2B-Instruct}}
}
```

## License

This model is subject to the [license terms](https://huggingface.co/Qwen/Qwen2-VL-2B/blob/main/LICENSE) of the base Qwen2-VL-2B model.