---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
tags:
- text-generation-inference
- OCR
- Pdf
- Doc
- Image
---

# **coreOCR-7B-050325-preview**
> The **coreOCR-7B-050325-preview** model is a fine-tuned version of **Qwen/Qwen2-VL-7B-Instruct**, optimized for **document-level optical character recognition (OCR)**, **long-context vision-language understanding**, and **accurate image-to-text conversion with mathematical LaTeX formatting**. Built for high-fidelity visual-textual comprehension, it improves document parsing, structured data extraction, and complex visual reasoning.
# Key Enhancements
* **Advanced Document-Level OCR**: Accurately processes and extracts structured text from complex, multi-page documents including invoices, forms, and research papers.
* **Enhanced Long-Context Vision-Language Understanding**: Supports long-text retrieval and reasoning from documents and multimedia inputs, including dense text blocks, diagrams, and math content.
* **SoTA Understanding Across Image Resolutions**: Achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.
* **Video Comprehension of 20+ Minutes**: Capable of high-quality video-based question answering, dialogue generation, and content summarization from long video sequences (see the video sketch after this list).
* **Device Control via Visual Commands**: With complex reasoning and perception capabilities, it can be integrated with devices like mobile phones or robots for visually grounded automation.
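As a minimal sketch of video input, the message format accepts a `"video"` entry alongside text. This assumes `model` and `processor` are loaded as in the Quick Start section below; the file path and `fps` value are illustrative assumptions, not model defaults:

```python
from qwen_vl_utils import process_vision_info

# Ask the model about a local video clip. Assumes `model` and `processor`
# from the Quick Start below; path and fps are illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the content of this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```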
# Quick Start with Transformers
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")

# A single-turn conversation with one image and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
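For document OCR specifically, only the image and instruction change; the rest of the pipeline above is reused as-is. A minimal sketch, where the file path and prompt wording are illustrative assumptions:

```python
# Reuses `model` and `processor` from the Quick Start above.
# The file path and prompt wording below are illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/scanned_page.png"},
            {
                "type": "text",
                "text": "Extract all text from this document, preserving the "
                        "reading order and formatting any equations in LaTeX.",
            },
        ],
    }
]
```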
# Training Details
| Parameter | Value |
|-------------------------|----------------------------------------------------|
| **Dataset Size** | 274,209 samples (Modular Combination of Datasets) |
| **Model Architecture** | `Qwen2VLForConditionalGeneration` |
| **Hardware** | 2 × NVIDIA A100 SXM (with 32 vCPUs) |
| **Total Disk**          | 160,000 MB (~160 GB)                                |
| **Training Time**       | 10,390 seconds (~2.9 hours)                         |
| **Learning Rate** | 1e-5 |
| **Scheduler** | Linear Decay |
| **Warmup Steps** | 700 |
| **Precision** | bfloat16 |
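As a hedged sketch, these hyperparameters would map onto Hugging Face `TrainingArguments` roughly as follows. Only the learning rate, scheduler, warmup steps, and precision come from the table; the batch size, epoch count, and output path are assumptions:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters as TrainingArguments.
training_args = TrainingArguments(
    output_dir="./coreOCR-7B-050325-preview",  # assumed
    learning_rate=1e-5,                        # from the table
    lr_scheduler_type="linear",                # "Linear Decay"
    warmup_steps=700,                          # from the table
    bf16=True,                                 # bfloat16 precision
    per_device_train_batch_size=4,             # assumed
    num_train_epochs=1,                        # assumed
)
```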
> [!note]
> The image-text responses for the open datasets will be updated soon.
# Intended Use
This model is intended for:
* Document analysis and OCR from scanned images, PDFs, and camera input (see the PDF sketch after this list).
* Image-based question answering (e.g., educational content, diagrams, receipts).
* Math problem solving and LaTeX text generation from handwritten or printed math content.
* Long-context vision-text applications such as multi-slide document retrieval and dense information extraction.
* Multilingual OCR workflows for cross-lingual business documents and global data digitization.
* AI agents for mobile/robotic interaction through visual context.
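For PDFs, one common pattern is to rasterize pages to images first and run each page through the model. A minimal sketch, assuming the third-party `pdf2image` package (which wraps Poppler) and the `model`/`processor` from the Quick Start; the file path and DPI are illustrative:

```python
from pdf2image import convert_from_path  # assumed dependency; requires Poppler
from qwen_vl_utils import process_vision_info

# Rasterize each PDF page, then OCR page by page.
# Assumes `model` and `processor` from the Quick Start above.
pages = convert_from_path("/path/to/document.pdf", dpi=200)  # illustrative
for i, page in enumerate(pages):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": page},  # PIL.Image is accepted
                {"type": "text", "text": "Extract all text from this page."},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to("cuda")
    out = model.generate(**inputs, max_new_tokens=1024)
    trimmed = [o[len(inp):] for inp, o in zip(inputs.input_ids, out)]
    print(f"--- page {i + 1} ---")
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```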
# Limitations
* Performance may degrade on extremely noisy or low-resolution images.
* Not suitable for real-time inference on edge devices due to model size and memory demands.
* While multilingual, performance on low-resource or rare scripts may vary.
* Not optimized for high-speed processing of video streams in constrained environments.
* Contextual understanding depends on visual tokenization parameters; improper configuration may affect output quality (see the processor sketch after this list).
* Outputs may occasionally include hallucinations or incomplete answers in long-context queries.
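The visual token budget can be bounded at processor load time via the `min_pixels` / `max_pixels` arguments that the Qwen2-VL processor exposes, trading inference cost against effective resolution. The specific values below are illustrative assumptions, not tuned defaults:

```python
from transformers import AutoProcessor

# Bound the number of visual tokens per image by constraining the pixel
# budget; each visual token corresponds to a 28x28 pixel patch.
min_pixels = 256 * 28 * 28   # at least ~256 visual tokens per image
max_pixels = 1280 * 28 * 28  # at most ~1280 visual tokens per image
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```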
# References
- **DocVLM: Make Your VLM an Efficient Reader**
[https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
- **YaRN: Efficient Context Window Extension of Large Language Models**
[https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
- **Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution**
[https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
[https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
- **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
[https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210) |