scb10x
/

typhoon-ocr-7b

@@ -5,23 +5,65 @@ language:
 - th
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 ---
-Typhoon OCR is an open-source, bilingual document parsing model built specifically for real-world documents in Thai and English. Inspired by models like olmOCR, Typhoon OCR introduces a redesigned architecture that is:
-Robust to noisy inputs and complex, irregular layouts
-Multilingual, with dedicated support for both Thai and English
-Layout-aware, preserving the document’s structural integrity in its output
-Unlike conventional OCR tools, Typhoon OCR doesn't just extract raw text—it produces semantic, structured, and layout-preserving outputs that are optimized for downstream tasks such as:
-Retrieval-Augmented Generation (RAG)
-Comprehensive document parsing and understanding
-Accurate interpretation of tables, charts, and forms

 - th
 base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
+tags:
+- OCR
 ---
+**Typhoon-OCR-7B**: A bilingual document parsing model built specifically for real-world documents in Thai and English inspired by models like olmOCR.
+## **Model Description**
+- **Model type**: A 7B Vision-Language Models (VLMs) model based on Qwen2.5-VL-Instruction.
+- **Requirement**: transformers 4.50.0 or newer.
+- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
+- **License**:
+## **Real-World Document Support**
+**1. Structured Documents**: Financial reports, Academic papers, Books, Government forms
+**Output format**:
+- Markdown for general text
+- HTML for tables (including merged cells and complex layouts)
+- Figures, charts, and diagrams are represented using figure tags for structured visual understanding
+**Each figure undergoes multi-layered interpretation**:
+- **Observation**: Detects elements like landscapes, buildings, people, logos, and embedded text
+- **Context Analysis**: Infers context such as location, event, or document section
+- **Text Recognition**: Extracts and interprets embedded text (e.g., chart labels, captions) in Thai or English
+- **Artistic & Structural Analysis**: Captures layout style, diagram type, or design choices contributing to document tone
+- **Final Summary**: Combines all insights into a structured figure description for tasks like summarization and retrieval
+**2. Layout-Heavy & Informal Documents**: Receipts, Menus papers, Tickets, Infographics
+**Output format**:
+- Markdown with embedded tables and layout-aware structures
+## Summary of Findings
+Typhoon OCR outperforms both GPT-4o and Gemini 2.5 Flash in Thai document understanding, particularly on documents with complex layouts and mixed-language content.
+However, in the Thai books benchmark, performance slightly declined due to the high frequency and diversity of embedded figures. These images vary significantly in type and structure, which poses challenges for our current figure tag parsing. This highlights a potential area for future improvement—specifically, in enhancing the model's image understanding capabilities.
+For this version, our primary focus has been on achieving high-quality OCR for both English and Thai text. Future releases may extend support to more advanced image analysis and figure interpretation.
+## Usage Example
+## **Citation**
+- If you find Typhoon2 useful for your work, please cite it using:
+```
+@misc{typhoon2,
+      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
+      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
+      year={2024},
+      eprint={2412.13702},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2412.13702},
+}
+```