prithivMLmods committed · verified · Commit b8584cc · Parent(s): d63da39

Update README.md

Files changed (1): README.md (+101 −5)
README.md CHANGED
 
---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
tags:
- OCR
- Pdf
- Doc
- Image
- text-generation-inference
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
---

![22.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_L1v41LZYfOQCLLwHtAEy.png)

# **docscopeOCR-7B-050425-exp**

> The **docscopeOCR-7B-050425-exp** model is a fine-tuned version of **Qwen/Qwen2.5-VL-7B-Instruct**, optimized for **document-level optical character recognition (OCR)**, **long-context vision-language understanding**, and **accurate image-to-text conversion with mathematical LaTeX formatting**. Built on the Qwen2.5-VL architecture, it significantly improves document comprehension, structured data extraction, and visual reasoning across diverse input formats.

# Key Enhancements

* **Advanced Document-Level OCR**: Extracts structured content from complex, multi-page documents such as invoices, academic papers, forms, and scanned reports.

* **Enhanced Long-Context Vision-Language Understanding**: Handles dense document layouts, long runs of embedded text, tables, and diagrams with coherent cross-reference understanding.

* **State-of-the-Art Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.

* **Video Understanding up to 20+ Minutes**: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multimodal reasoning.

* **Visually Grounded Device Interaction**: Enables mobile or robotic device operation from visual inputs and text instructions, using contextual understanding and decision-making logic.

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/docscopeOCR-7B-050425-exp", torch_dtype="auto", device_map="auto"
)

# The processor handles image preprocessing and chat-template formatting.
processor = AutoProcessor.from_pretrained("prithivMLmods/docscopeOCR-7B-050425-exp")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template, then collect the image/video inputs it references.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the new text is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
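The Quick Start uses a generic captioning prompt; for document OCR, the same pipeline applies with an OCR-oriented prompt and a larger generation budget. A minimal sketch follows, reusing `model` and `processor` from above. The file path and prompt wording are illustrative assumptions, not part of the model card.

```python
# Document-OCR sketch reusing `model` and `processor` from the Quick Start.
# The image path and prompt wording are hypothetical.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/my_scan.png"},  # hypothetical scan
            {
                "type": "text",
                "text": "Extract all text from this document in reading order. "
                        "Format any mathematical expressions as LaTeX.",
            },
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

# Dense pages need far more than 128 new tokens; raise the budget accordingly.
generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```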
## Training Details

| Parameter | Value |

> [!note]
> The open dataset image-text response will be updated soon.

# Intended Use

This model is intended for:

* High-fidelity OCR of documents, forms, receipts, and printed or scanned materials.
* Image- and document-based question answering for educational and enterprise applications.
* Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
* Retrieval and summarization from long documents, slides, and multimodal inputs (see the multi-page sketch after this list).
* Multilingual OCR and structured content extraction for global use cases.
* Robotic or mobile automation with vision-guided contextual interaction.
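Long-document use cases can pass multiple pages as separate image entries in a single message; generation then proceeds exactly as in the Quick Start. A minimal sketch, with hypothetical page files:

```python
# Multi-page input sketch: each page of a document is its own image entry.
# File paths are hypothetical; the rest of the pipeline is unchanged.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/page_1.png"},
            {"type": "image", "image": "file:///path/to/page_2.png"},
            {
                "type": "text",
                "text": "Summarize this two-page document, preserving key figures and totals.",
            },
        ],
    }
]
```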

# Limitations

* May show degraded performance on extremely low-quality or occluded images.
* Not optimized for real-time use on low-resource or edge devices because of its computational demands.
* Accuracy varies on uncommon or low-resource languages and scripts.
* Long-video processing can require substantial memory and is not optimized for streaming applications.
* Visual token settings affect performance; suboptimal configurations can degrade results (a tuning sketch follows this list).
* In rare cases, outputs may contain hallucinated or contextually misaligned information.
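On the visual-token point above: Qwen2.5-VL-based processors accept `min_pixels`/`max_pixels` bounds that control how many visual tokens each image consumes. A minimal tuning sketch follows; the values are the ranges commonly shown in the Qwen2.5-VL documentation, used here for illustration rather than as tuned recommendations for this model.

```python
from transformers import AutoProcessor

# Each visual token corresponds to a 28x28 pixel patch, so these bounds cap
# the resolution an image is resized to before encoding. Illustrative values.
min_pixels = 256 * 28 * 28   # floor: ~256 visual tokens per image
max_pixels = 1280 * 28 * 28  # ceiling: ~1280 visual tokens per image

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/docscopeOCR-7B-050425-exp",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

A lower ceiling saves memory at the cost of fine detail such as small fonts; dense scans generally benefit from a higher `max_pixels`.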
## References

- **DocVLM: Make Your VLM an Efficient Reader**

- [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)

- **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
  [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)