---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
---

![22.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_L1v41LZYfOQCLLwHtAEy.png)

## Training Details

| Parameter              | Value                                                        |
|------------------------|--------------------------------------------------------------|
| **Dataset Size**       | 274,209 samples (modular combination of the datasets above)  |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration`                         |
| **Hardware**           | 2 × NVIDIA A100 SXM (32 vCPUs)                               |
| **Total Disk**         | 170,000 MB (~170 GB)                                         |
| **Training Time**      | 9,020 seconds (~2.51 hours)                                  |
| **Learning Rate**      | 1e-5                                                         |
| **Scheduler**          | Linear decay                                                 |
| **Warmup Steps**       | 750                                                          |
| **Precision**          | bfloat16                                                     |

> [!note]
> The image-text responses for the open datasets will be updated soon.

## References

- **DocVLM: Make Your VLM an Efficient Reader** [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
- **YaRN: Efficient Context Window Extension of Large Language Models** [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
- **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution** [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
- **CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy** [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)
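
## Training Configuration

A minimal sketch of how the hyperparameters in the table above might map onto `transformers.TrainingArguments`; this is not the exact training script, and `output_dir`, batch size, and epoch count are placeholders not stated in this card.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the table's hyperparameters onto TrainingArguments.
args = TrainingArguments(
    output_dir="qwen2.5-vl-doc-finetune",  # placeholder, not stated in the card
    learning_rate=1e-5,                    # from the table
    lr_scheduler_type="linear",            # linear decay
    warmup_steps=750,                      # from the table
    bf16=True,                             # bfloat16 precision
    per_device_train_batch_size=4,         # placeholder, not stated in the card
    num_train_epochs=1,                    # placeholder, not stated in the card
)
```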
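
## Inference Example

A minimal inference sketch using the `Qwen2_5_VLForConditionalGeneration` class listed above, following the standard `transformers` image-text-to-text flow; the repository id, image path, and prompt are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder: replace with this model's Hugging Face repository id.
model_id = "your-org/your-finetuned-qwen2.5-vl"

# Load in bfloat16, matching the training precision listed above.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# One document image plus an OCR-style instruction.
image = Image.open("document_page.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this document."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens before decoding the model's answer.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```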