README.md · prithivMLmods/docscopeOCR-7B-050425-exp at d63da39b01548b178e1b241e135d022538ab59c6

metadata

license: apache-2.0
datasets:
  - allenai/olmOCR-mix-0225
  - prithivMLmods/Opendoc1-Analysis-Recognition
  - prithivMLmods/Opendoc2-Analysis-Recognition
  - prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text

Training Details

| Parameter | Value |
|---|---|
| Dataset Size | 274,209 samples (Modular Combination of Datasets) |
| Model Architecture | `Qwen2_5_VLForConditionalGeneration` |
| Hardware | 2 × NVIDIA A100 SXM (32 vCPUs) |
| Total Disk | 170,000 MB |
| Training Time | 9,020 seconds (~2.51 hours) |
| Learning Rate | 1e-5 |
| Scheduler | Linear Decay |
| Warmup Steps | 750 |
| Precision | bfloat16 |
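
To make the learning-rate settings above concrete, here is a minimal sketch of a linear-decay schedule with 750 warmup steps and a peak rate of 1e-5. The total number of training steps is not stated in this card, so `total_steps` below is a hypothetical placeholder. The linear shape of the warmup ramp is also an assumption, and the function name is illustrative rather than taken from the training code.

```python
def linear_warmup_decay_lr(step, base_lr=1e-5, warmup_steps=750, total_steps=10_000):
    """Return the learning rate at a given optimizer step.

    base_lr and warmup_steps come from the table above.
    total_steps is a HYPOTHETICAL value (not reported in this card).
    The linear warmup ramp is an ASSUMPTION; the card only says "Linear Decay".
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    # Print the rate at a few representative steps.
    for s in (0, 375, 750, 5_000, 10_000):
        print(s, linear_warmup_decay_lr(s))
```

With these assumptions the rate peaks at step 750 and reaches zero at the final step. Substituting the real step count changes only the slope of the decay phase.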

The open dataset of image-text responses will be released soon.

References

- [DocVLM: Make Your VLM an Efficient Reader](https://arxiv.org/pdf/2412.08746v1)
- [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/pdf/2309.00071)
- [Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution](https://arxiv.org/pdf/2409.12191)
- [Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond](https://arxiv.org/pdf/2308.12966)
- [A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy](https://arxiv.org/pdf/2412.02210)