---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
tags:
- text-generation-inference
- OCR
- Pdf
- Doc
- Image
---

![11.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/06COvqws8RSPQVm51EQgh.png)

# **coreOCR-7B-050325-preview**

> The **coreOCR-7B-050325-preview** model is a fine-tuned version of **Qwen/Qwen2-VL-7B-Instruct**, optimized for **Document-Level Optical Character Recognition (OCR)**, **long-context vision-language understanding**, and **accurate image-to-text conversion with mathematical LaTeX formatting**. Built for high-fidelity visual-textual comprehension, it improves document parsing, structured data extraction, and complex visual reasoning.

# Key Enhancements

* **Advanced Document-Level OCR**: Accurately processes and extracts structured text from complex, multi-page documents including invoices, forms, and research papers.

* **Enhanced Long-Context Vision-Language Understanding**: Supports long-text retrieval and reasoning from documents and multimedia inputs, including dense text blocks, diagrams, and math content.

* **SoTA Understanding Across Image Resolutions**: Achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.

* **Video Comprehension for 20+ Minute Videos**: Capable of high-quality video-based question answering, dialogue generation, and content summarization from long video sequences (a minimal video-input sketch follows this list).

* **Device Control via Visual Commands**: With complex reasoning and perception capabilities, it can be integrated with devices like mobile phones or robots for visually grounded automation.
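
As noted in the video bullet above, long-video input goes through the same chat-template pipeline shown in the Quick Start section below; only the message content changes. The sketch here is a minimal illustration: the file path, `fps`, and `max_pixels` values are placeholders, and `qwen_vl_utils` additionally needs a video backend such as `torchvision` or `decord` installed.

```python
from qwen_vl_utils import process_vision_info

# Hypothetical local clip; fps and max_pixels bound how many frames and visual
# tokens are sampled from the video before it reaches the model.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/lecture.mp4",  # placeholder path
                "fps": 1.0,
                "max_pixels": 360 * 420,
            },
            {"type": "text", "text": "Summarize the key points of this video."},
        ],
    }
]

# Frames are extracted here; apply_chat_template, the processor call, and
# generate then proceed exactly as in the image example below.
image_inputs, video_inputs = process_vision_info(messages)
```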

# Quick Start with Transformers

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the checkpoint in its native precision and spread it across available devices.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)

# The processor handles image preprocessing and chat templating.
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
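
For the model's primary OCR use case, only the user prompt needs to change; everything else in the snippet above stays the same. The image URL and instruction below are illustrative placeholders, not prompts taken from the training setup.

```python
# Ask for a faithful transcription rather than a description. The URL is a
# placeholder for any scanned page or photographed document.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/scanned_invoice.png"},
            {
                "type": "text",
                "text": (
                    "Transcribe all text in this document in reading order. "
                    "Preserve tables as Markdown and render equations in LaTeX."
                ),
            },
        ],
    }
]
```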

# Training Details

| Parameter               | Value                                              |
|-------------------------|----------------------------------------------------|
| **Dataset Size**        | 274,209 samples (modular combination of the listed datasets) |
| **Model Architecture**  | `Qwen2VLForConditionalGeneration`                  |
| **Hardware**            | 2 × NVIDIA A100 SXM (32 vCPUs)                     |
| **Total Disk**          | 160,000 MB (~160 GB)                               |
| **Training Time**       | 10,390 seconds (~2.88 hours)                       |
| **Learning Rate**       | 1e-5                                               |
| **Scheduler**           | Linear Decay                                       |
| **Warmup Steps**        | 700                                                |
| **Precision**           | bfloat16                                           |

> [!note]
> The image-text responses for the open datasets will be updated soon.

# Intended Use

This model is intended for:

* Document analysis and OCR from scanned images, PDFs, and camera input.
* Image-based question answering (e.g., educational content, diagrams, receipts).
* Math problem solving and LaTeX text generation from handwritten or printed math content.
* Long-context vision-text applications such as multi-slide document retrieval and dense information extraction.
* Multilingual OCR workflows for cross-lingual business documents and global data digitization.
* AI agents for mobile/robotic interaction through visual context.

# Limitations

* Performance may degrade on extremely noisy or low-resolution images.
* Not suitable for real-time inference on edge devices due to model size and memory demands.
* While multilingual, performance on low-resource or rare scripts may vary.
* Not optimized for high-speed processing of video streams in constrained environments.
* Contextual understanding depends on visual tokenization parameters; improper configuration may affect output quality (see the processor-configuration sketch after this list).
* Outputs may occasionally include hallucinations or incomplete answers in long-context queries.
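
One way to manage the visual-tokenization concern above is the pixel budget that Qwen2-VL-style processors expose: each image is resized so that its pixel count falls between `min_pixels` and `max_pixels` before it is converted into visual tokens. The values below are the upstream Qwen2-VL example defaults, used here as illustrative settings rather than tuned recommendations for this checkpoint.

```python
from transformers import AutoProcessor

# Bound the number of visual tokens per image. Larger budgets preserve more
# detail (useful for dense documents) at the cost of memory and latency; these
# values are upstream example defaults, not tuned settings for this model.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```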

# References

- **DocVLM: Make Your VLM an Efficient Reader** 
  [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)

- **YaRN: Efficient Context Window Extension of Large Language Models**  
  [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)

- **Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution**  
  [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)

- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**  
  [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

- **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)