---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
---
## Training Details

| Parameter | Value |
|---|---|
| Dataset Size | 274,209 samples (modular combination of the datasets above) |
| Model Architecture | Qwen2_5_VLForConditionalGeneration |
| Hardware | 2 × NVIDIA A100 SXM (32 vCPUs) |
| Total Disk | 170,000 MB |
| Training Time | 9,020 seconds (~2.51 hours) |
| Learning Rate | 1e-5 |
| Scheduler | Linear Decay |
| Warmup Steps | 750 |
| Precision | bfloat16 |
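The schedule in the table (linear warmup over 750 steps, then linear decay from a peak of 1e-5) can be sketched as follows. This is a minimal illustration, not the exact training code; the total step count (`total_steps=10000`) is an assumed placeholder, since the card does not state it.

```python
def lr_at_step(step: int,
               base_lr: float = 1e-5,     # peak learning rate from the table
               warmup_steps: int = 750,   # warmup steps from the table
               total_steps: int = 10000   # assumption: not stated in the card
               ) -> float:
    """Linear warmup to base_lr, then linear decay toward zero."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For example, the rate is `0.0` at step 0, reaches the peak `1e-5` at step 750, and returns to `0.0` at the final step.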
The open-dataset image-text responses will be updated soon.
## References

- [DocVLM: Make Your VLM an Efficient Reader](https://arxiv.org/pdf/2412.08746v1)
- [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/pdf/2309.00071)
- [Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution](https://arxiv.org/pdf/2409.12191)
- [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966)
- [A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy](https://arxiv.org/pdf/2412.02210)