---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
tags:
- OCR
- Pdf
- Doc
- Image
- text-generation-inference
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
---

![22.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_L1v41LZYfOQCLLwHtAEy.png)

# **docscopeOCR-7B-050425-exp**

> The **docscopeOCR-7B-050425-exp** model is a fine-tuned version of **Qwen/Qwen2.5-VL-7B-Instruct**, optimized for **Document-Level Optical Character Recognition (OCR)**, **long-context vision-language understanding**, and **accurate image-to-text conversion with mathematical LaTeX formatting**. Built on top of the Qwen2.5-VL architecture, this model improves document comprehension, structured data extraction, and visual reasoning across diverse input formats compared to the base model.

# Key Enhancements

* **Advanced Document-Level OCR**: Capable of extracting structured content from complex, multi-page documents such as invoices, academic papers, forms, and scanned reports.

* **Enhanced Long-Context Vision-Language Understanding**: Designed to handle dense document layouts, long sequences of embedded text, tables, and diagrams with coherent cross-reference understanding.

* **Strong Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.

* **Long Video Understanding (20+ Minutes)**: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multi-modal reasoning.

* **Visually-Grounded Device Interaction**: Enables mobile/robotic device operation via visual inputs and text-based instructions using contextual understanding and decision-making logic.

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/docscopeOCR-7B-050425-exp", torch_dtype="auto", device_map="auto"
)

# The processor handles chat templating, image preprocessing, and tokenization
processor = AutoProcessor.from_pretrained("prithivMLmods/docscopeOCR-7B-050425-exp")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and preprocess the image/video inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
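
For document OCR specifically, only the text prompt needs to change. The snippet below is a minimal sketch reusing the Quick Start setup above; the prompt wording and the local file path are illustrative examples, not settings prescribed by the model.

```python
# Hypothetical local scan; any image path or URL accepted by qwen_vl_utils works.
ocr_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/page_scan.png"},
            {
                "type": "text",
                "text": (
                    "Extract all text from this document. Preserve headings, "
                    "tables, and reading order, and format mathematical "
                    "expressions as LaTeX."
                ),
            },
        ],
    }
]
# Feed ocr_messages through apply_chat_template / process_vision_info exactly as
# above; dense pages may need a larger budget, e.g. max_new_tokens=1024.
```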

# Training Details

| Parameter               | Value                                               |
|-------------------------|-----------------------------------------------------|
| **Dataset Size**        | 274,209 samples (Modular Combination of Datasets)   |
| **Model Architecture**  | `Qwen2_5_VLForConditionalGeneration`                |
| **Hardware**            | 2 × NVIDIA A100 SXM (32 vCPUs)                      |
| **Total Disk**          | 170,000 MB                                          |
| **Training Time**       | 9,020 seconds (~2.51 hours)                         |
| **Learning Rate**       | 1e-5                                                |
| **Scheduler**           | Linear Decay                                        |
| **Warmup Steps**        | 750                                                 |
| **Precision**           | bfloat16                                            |

> [!note]
> The image-text responses for the open datasets will be updated soon.
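
The hyperparameters in the table map directly onto a standard Hugging Face fine-tuning configuration. The sketch below is illustrative only: it assumes `transformers.TrainingArguments` (the card does not state the training framework), and the batch size, epoch count, and output directory are placeholders rather than the actual settings.

```python
from transformers import TrainingArguments

# Illustrative configuration mirroring the reported hyperparameters.
# Batch size, epoch count, and output_dir are placeholders, not from the card.
training_args = TrainingArguments(
    output_dir="docscopeOCR-7B-050425-exp",
    learning_rate=1e-5,             # reported learning rate
    lr_scheduler_type="linear",     # reported scheduler: linear decay
    warmup_steps=750,               # reported warmup steps
    bf16=True,                      # reported precision: bfloat16
    per_device_train_batch_size=1,  # placeholder
    num_train_epochs=1,             # placeholder
)
```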

# Intended Use

This model is intended for:

* High-fidelity OCR from documents, forms, receipts, and printed or scanned materials.
* Image and document-based question answering for educational and enterprise applications.
* Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
* Retrieval and summarization from long documents, slides, and multi-modal inputs (a multi-page sketch follows this list).
* Multilingual OCR and structured content extraction for global use cases.
* Robotic or mobile automation with vision-guided contextual interaction.
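
Multi-page documents can be passed as several image entries within a single user turn. This is a minimal sketch under the same Quick Start setup; the page file names are hypothetical.

```python
# Each page is a separate image entry; qwen_vl_utils keeps them in order.
multipage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/page_1.png"},
            {"type": "image", "image": "file:///path/to/page_2.png"},
            {
                "type": "text",
                "text": "Summarize this document and list any tables it contains.",
            },
        ],
    }
]
# Preprocess and generate exactly as in the Quick Start example.
```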

# Limitations

* May show degraded performance on extremely low-quality or occluded images.
* Not optimized for real-time applications on low-resource or edge devices due to computational demands.
* Variable accuracy on uncommon or low-resource languages/scripts.
* Long video processing may require substantial memory and is not optimized for streaming applications.
* Visual token settings affect performance; suboptimal configurations can impact results (see the sketch after this list).
* In rare cases, outputs may contain hallucinated or contextually misaligned information.
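
The visual token budget can be adjusted through the processor, as documented for the base Qwen2.5-VL models. The values below are illustrative starting points, not tuned settings for this checkpoint.

```python
from transformers import AutoProcessor

# Constrain the number of visual tokens per image by bounding the pixel count;
# 256*28*28 and 1280*28*28 are illustrative bounds, not tuned for this model.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/docscopeOCR-7B-050425-exp",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```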


# References

- [DocVLM: Make Your VLM an Efficient Reader](https://arxiv.org/pdf/2412.08746v1)
- [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/pdf/2309.00071)
- [Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution](https://arxiv.org/pdf/2409.12191)
- [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966)
- [A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy](https://arxiv.org/pdf/2412.02210)