---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
- prithivMLmods/Opendoc1-Analysis-Recognition
- prithivMLmods/Opendoc2-Analysis-Recognition
- prithivMLmods/Openpdf-Analysis-Recognition
pipeline_tag: image-text-to-text
---

![22.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_L1v41LZYfOQCLLwHtAEy.png)

## Training Details

| Parameter              | Value                                                        |
|------------------------|--------------------------------------------------------------|
| **Dataset Size**       | 274,209 samples (modular combination of the datasets above)  |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration`                         |
| **Hardware**           | 2 × NVIDIA A100 SXM (32 vCPUs)                               |
| **Total Disk**         | 170,000 MB (~170 GB)                                         |
| **Training Time**      | 9,020 seconds (~2.51 hours)                                  |
| **Learning Rate**      | 1e-5                                                         |
| **Scheduler**          | Linear decay                                                 |
| **Warmup Steps**       | 750                                                          |
| **Precision**          | bfloat16                                                     |

> [!note]
> The image-text responses for the open datasets will be updated soon.

## References

- **DocVLM: Make Your VLM an Efficient Reader** [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
- **YaRN: Efficient Context Window Extension of Large Language Models** [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
- **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution** [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
- **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond** [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
- **CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy** [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)
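
## Training Configuration

A minimal sketch of how the hyperparameters in the table above might map onto `transformers.TrainingArguments`; this is not the exact training script, and `output_dir`, batch size, and epoch count are placeholders not stated in this card.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the table's hyperparameters onto TrainingArguments.
args = TrainingArguments(
    output_dir="qwen2.5-vl-doc-finetune",  # placeholder, not stated in the card
    learning_rate=1e-5,                    # from the table
    lr_scheduler_type="linear",            # linear decay
    warmup_steps=750,                      # from the table
    bf16=True,                             # bfloat16 precision
    per_device_train_batch_size=4,         # placeholder, not stated in the card
    num_train_epochs=1,                    # placeholder, not stated in the card
)
```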
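
## Inference Example

A minimal inference sketch using the `Qwen2_5_VLForConditionalGeneration` class listed above, following the standard `transformers` image-text-to-text flow; the repository id, image path, and prompt are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder: replace with this model's Hugging Face repository id.
model_id = "your-org/your-finetuned-qwen2.5-vl"

# Load in bfloat16, matching the training precision listed above.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# One document image plus an OCR-style instruction.
image = Image.open("document_page.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract all text from this document."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens before decoding the model's answer.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```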