--- license: apache-2.0 language: - en datasets: - mychen76/invoices-and-receipts_ocr_v1 - unsloth/LaTeX_OCR - prithivMLmods/Latex-KIE base_model: - Qwen/Qwen2-VL-2B-Instruct pipeline_tag: image-text-to-text library_name: transformers tags: - text-generation-inference - image-caption - mini - art explain - visual report generation - photo captions - cutlines - qwen2 - inscription subtitle - representation --- ![2.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/yUKVKSX2E18k0h3YwCx1h.png) # **Imgscope-OCR-2B-0527** > The **Imgscope-OCR-2B-0527** model is a fine-tuned version of *Qwen2-VL-2B-Instruct*, specifically optimized for *messy handwriting recognition*, *document OCR*, *realistic handwritten OCR*, and *math problem solving with LaTeX formatting*. This model is trained on custom datasets for document and handwriting OCR tasks and integrates a conversational approach with strong visual and textual understanding for multi-modal applications. > [!note] Colab Demo : https://huggingface.co/prithivMLmods/Imgscope-OCR-2B-0527/blob/main/Imgscope%20OCR%202B%200527%20Demo/Imgscope-OCR-2B-0527.ipynb > [!note] Video Understanding Demo : https://huggingface.co/prithivMLmods/Imgscope-OCR-2B-0527/blob/main/Imgscope-OCR-2B-05270-Video-Understanding/Imgscope-OCR-2B-0527-Video-Understanding.ipynb --- ### Key Enhancements * **SoTA Understanding of Images of Various Resolution & Ratio** Imgscope-OCR-2B-0527 achieves state-of-the-art performance on visual understanding benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA. * **Enhanced Handwriting OCR** Specifically optimized for recognizing and interpreting **realistic and messy handwriting** with high accuracy. Ideal for digitizing handwritten documents and notes. * **Document OCR Fine-Tuning** Fine-tuned with curated and realistic **document OCR datasets**, enabling accurate extraction of text from various structured and unstructured layouts. * **Understanding Videos of 20+ Minutes** Capable of processing long videos for **video-based question answering**, **transcription**, and **content generation**. * **Device Control Agent** Supports decision-making and control capabilities for integration with **mobile devices**, **robots**, and **automation systems** using visual-textual commands. * **Multilingual OCR Support** In addition to English and Chinese, the model supports **OCR in multiple languages** including European languages, Japanese, Korean, Arabic, and Vietnamese. --- ### How to Use ```python from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor from qwen_vl_utils import process_vision_info # Load the model model = Qwen2VLForConditionalGeneration.from_pretrained( "prithivMLmods/Imgscope-OCR-2B-0527", # replace with updated model ID if available torch_dtype="auto", device_map="auto" ) # Optional: Flash Attention for performance optimization # model = Qwen2VLForConditionalGeneration.from_pretrained( # "prithivMLmods/Imgscope-OCR-2B-0527", # torch_dtype=torch.bfloat16, # attn_implementation="flash_attention_2", # device_map="auto", # ) # Load processor processor = AutoProcessor.from_pretrained("prithivMLmods/Imgscope-OCR-2B-0527") messages = [ { "role": "user", "content": [ { "type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", }, {"type": "text", "text": "Recognize the handwriting in this image."}, ], } ] # Prepare input text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) image_inputs, video_inputs = process_vision_info(messages) inputs = processor( text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt", ) inputs = inputs.to("cuda") # Generate output generated_ids = model.generate(**inputs, max_new_tokens=128) generated_ids_trimmed = [ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) ] output_text = processor.batch_decode( generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False ) print(output_text) ``` --- ### Demo Inference ![Screenshot 2025-05-27 at 03-40-34 Gradio.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/9KiRkOGPB8cLl6VHwh2UD.png) ![Screenshot 2025-05-27 at 03-40-56 (anonymous) - output_e0fbfa20-686e-4bce-b2e8-25991be5a5a0.pdf.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/VOHQIrT7hCs5afGMRROvD.png) ### Video Inference ![Screenshot 2025-05-27 at 20-14-22 Video Understanding with Imgscope-OCR-2B-0527.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fyAVI0hZICWpSXlcKaJF4.png) --- ### Buffering Output (Streaming) ```python buffer = "" for new_text in streamer: buffer += new_text buffer = buffer.replace("<|im_end|>", "") yield buffer ``` --- ### Key Features 1. **Realistic Messy Handwriting OCR** * Fine-tuned for **complex and hard-to-read handwritten inputs** using real-world handwriting datasets. 2. **Document OCR and Layout Understanding** * Accurately extracts text from structured documents, including scanned pages, forms, and academic papers. 3. **Image and Text Multi-modal Reasoning** * Combines **vision-language capabilities** for tasks like captioning, answering image-based queries, and understanding image+text prompts. 4. **Math Problem Solving and LaTeX Rendering** * Converts mathematical expressions and problem-solving steps into **LaTeX** format. 5. **Multi-turn Conversations** * Supports **dialogue-based reasoning**, retaining context for follow-up questions. 6. **Video + Image + Text-to-Text Generation** * Accepts inputs from videos, images, or combined media with text, and generates relevant output accordingly. --- ## **Intended Use** **Imgscope-OCR-2B-0527** is intended for: * Handwritten and printed document digitization * OCR pipelines for educational institutions and businesses * Academic and scientific content parsing, especially math-heavy documents * Assistive tools for visually impaired users * Robotic and mobile automation agents interpreting screen or camera data * Multilingual OCR processing for document translation or archiving