coreOCR-7B-050325-preview
The coreOCR-7B-050325-preview model is a fine-tuned version of Qwen/Qwen2-VL-7B, optimized for Document-Level Optical Character Recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting. Designed with a focus on high-fidelity visual-textual comprehension, this model enhances document parsing, structured data extraction, and complex visual reasoning.
Key Enhancements
- Advanced Document-Level OCR: accurately processes and extracts structured text from complex, multi-page documents, including invoices, forms, and research papers.
- Enhanced Long-Context Vision-Language Understanding: supports long-text retrieval and reasoning over documents and multimedia inputs, including dense text blocks, diagrams, and math content.
- SoTA Understanding Across Image Resolutions: achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.
- Comprehension of Videos Longer than 20 Minutes: supports high-quality video-based question answering, dialogue generation, and content summarization over long video sequences (see the example payload after this list).
- Device Control via Visual Commands: with its reasoning and perception capabilities, the model can be integrated with devices such as mobile phones or robots for visually grounded automation.
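Video inputs use the same message schema as images in the Quick Start below. Here is a minimal sketch of a video question-answering payload; the file path, max_pixels, and fps values are illustrative assumptions, not recommended settings:

# Hypothetical video QA message; pair it with the Quick Start pipeline below.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder path
                "max_pixels": 360 * 420,  # cap per-frame resolution
                "fps": 1.0,               # sample one frame per second
            },
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]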
Quick Start with Transformers
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor; device_map="auto" places weights on the
# available GPU(s) and torch_dtype="auto" selects the checkpoint's precision.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")

# A single-turn conversation with one image and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens from each sequence before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
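Because the model is tuned for document OCR, the same pipeline can be pointed at a scanned page. The sketch below shows an OCR-oriented payload; the image path and the instruction wording are illustrative assumptions, not a required prompt format:

# Hypothetical OCR prompt; swap in your own document image.
ocr_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/scanned_page.png"},  # placeholder path
            {
                "type": "text",
                "text": "Transcribe all text in this document and render any "
                        "mathematical expressions as LaTeX.",
            },
        ],
    }
]

Reuse apply_chat_template, process_vision_info, and model.generate exactly as above, with ocr_messages in place of messages.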
Training Details
| Parameter | Value |
|---|---|
| Dataset Size | 274,209 samples (modular combination of datasets) |
| Model Architecture | Qwen2VLForConditionalGeneration |
| Hardware | 2 × NVIDIA A100 SXM (with 32 vCPUs) |
| Total Disk | 160,000 MB |
| Training Time | 10,390 seconds (~2.88 hours) |
| Learning Rate | 1e-5 |
| Scheduler | Linear Decay |
| Warmup Steps | 700 |
| Precision | bfloat16 |
The open image-text dataset and its response samples will be published soon.
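As a rough illustration of how these hyperparameters map onto a standard fine-tuning setup, here is a minimal sketch using Hugging Face TrainingArguments. Only the learning rate, scheduler, warmup steps, and precision come from the table; the batch size, epoch count, and output path are placeholder assumptions:

from transformers import TrainingArguments

# Hypothetical reconstruction of the run configuration from the table above.
training_args = TrainingArguments(
    output_dir="coreOCR-7B-050325-preview",  # placeholder path
    learning_rate=1e-5,             # from the table
    lr_scheduler_type="linear",     # "Linear Decay"
    warmup_steps=700,               # from the table
    bf16=True,                      # bfloat16 precision
    per_device_train_batch_size=4,  # assumption
    num_train_epochs=1,             # assumption
)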
Intended Use
This model is intended for:
- Document analysis and OCR from scanned images, PDFs, and camera input (a PDF batching sketch follows this list).
- Image-based question answering (e.g., educational content, diagrams, receipts).
- Math problem solving and LaTeX text generation from handwritten or printed math content.
- Long-context vision-text applications such as multi-slide document retrieval and dense information extraction.
- Multilingual OCR workflows for cross-lingual business documents and global data digitization.
- AI agents for mobile/robotic interaction through visual context.
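For the PDF case in the first item, one common pattern is to rasterize each page and run the pages through the Quick Start pipeline one at a time. A minimal sketch, assuming the pdf2image package (which requires Poppler) is installed and that a hypothetical run_ocr helper wraps the generation code shown earlier:

from pdf2image import convert_from_path  # assumption: pdf2image + Poppler available

def run_ocr(page_image):
    # Placeholder: build a messages list around page_image and run the
    # Quick Start generation code, returning the decoded text.
    ...

# Rasterize the PDF at 200 DPI and OCR it page by page.
pages = convert_from_path("document.pdf", dpi=200)
full_text = "\n\n".join(run_ocr(page) for page in pages)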
Limitations
- Performance may degrade on extremely noisy or low-resolution images.
- Not suitable for real-time inference on edge devices due to model size and memory demands.
- While multilingual, performance on low-resource or rare scripts may vary.
- Not optimized for high-speed processing of video streams in constrained environments.
- Contextual understanding depends on visual tokenization parameters; improper configuration may degrade output quality (see the processor configuration sketch after this list).
- Outputs may occasionally include hallucinations or incomplete answers in long-context queries.
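On the visual-tokenization point, Qwen2-VL-style processors expose min_pixels and max_pixels to bound the resolution an image is resized to, and therefore the number of visual tokens it produces. A small configuration sketch; the specific limits below are illustrative values, not tuned recommendations:

from transformers import AutoProcessor

# Trade visual detail against context length: images are resized so their
# pixel count falls between these bounds before being tokenized.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)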
References
- DocVLM: Make Your VLM an Efficient Reader. https://arxiv.org/pdf/2412.08746v1
- YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/pdf/2309.00071
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. https://arxiv.org/pdf/2409.12191
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. https://arxiv.org/pdf/2308.12966
- A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy. https://arxiv.org/pdf/2412.02210