Model Card: mic7ch/llama-3.2-11b-vision-ManchuOCR

Model Overview

This model is a fine-tuned variant of meta-llama/Llama-3.2-11B-Vision-Instruct, adapted for Optical Character Recognition (OCR) of woodblock-printed Manchu texts. It transcribes images of Manchu script into Romanized Manchu, facilitating the digitization and linguistic analysis of historical sources.

The development of this model contributes to the broader digital preservation of endangered scripts and supports scholarly efforts in Qing history, philology, and digital humanities.

Model Architecture

  • Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
  • Architecture type: Multimodal vision-language model
  • Number of parameters: 11 billion
  • Output: Romanized Manchu

Intended Use

This model is intended for the transcription of woodblock-printed Manchu documents into Romanized text. It is particularly suitable for:

  • Historical document digitization
  • Corpus construction for linguistic and philological research
  • Projects in digital humanities involving Qing dynasty archives

The model does not support handwritten Manchu text or scripts printed in highly decorative or atypical fonts.

Training Data

The model was fine-tuned on the dataset mic7ch/manchu_sub1, which was adapted from the open-source dataset curated by tyotakuki/ManchuOCR. The training set comprises monochrome images of individual Manchu words, formatted as black text on a white background and preprocessed to a uniform size of 480 × 60 pixels with the text aligned to the left.

Evaluation

The model was evaluated on two separate datasets, with performance reported as follows:

Dataset: mic7ch/manchu_sub2 (1,000 images)

  • Character Error Rate (CER): 0.0015
  • Word Accuracy: 98.80%

Dataset: mic7ch/test_data_long (218 word images from the woodblock set)

  • Character Error Rate (CER): 0.0253
  • Word Accuracy: 91.74%

These results indicate high transcription accuracy, particularly on clean and consistently formatted inputs.
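For reference, the sketch below illustrates how these two metrics can be computed from paired prediction and reference strings. It is an assumption about the metric definitions (edit distance over reference characters for CER, exact-match rate for word accuracy), not the authors' evaluation script.

# Minimal metric sketch (assumed definitions, not the official evaluation code).
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(preds, refs):
    """Character Error Rate: total edit distance divided by total reference characters."""
    return sum(levenshtein(p, r) for p, r in zip(preds, refs)) / sum(len(r) for r in refs)

def word_accuracy(preds, refs):
    """Fraction of predictions that match the reference transcription exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)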

Recommended Use Conditions

To ensure optimal OCR performance, users are advised to format input images as follows:

  • Image size should be exactly 480 × 60 pixels
  • Text must appear as black characters on a white background
  • The Manchu word should be left-aligned, with white padding occupying the remainder of the image

For reference image formatting, consult the dataset mic7ch/test_data_long, which provides properly structured examples.
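The following sketch shows one way to bring an arbitrary word image into this layout using Pillow. It is a convenience example under the stated assumptions (grayscale input, left alignment on a white canvas), not an official preprocessing script; the file path is a placeholder.

from PIL import Image

def format_for_ocr(path: str, target=(480, 60)) -> Image.Image:
    """Resize a word image to height 60 px and left-align it on a white 480 x 60 canvas."""
    img = Image.open(path).convert("L")                    # grayscale
    scale = target[1] / img.height                         # match the 60 px height
    new_w = min(target[0], max(1, round(img.width * scale)))
    img = img.resize((new_w, target[1]))
    canvas = Image.new("L", target, color=255)             # white background
    canvas.paste(img, (0, 0))                              # left-aligned
    return canvas.convert("RGB")

formatted = format_for_ocr("manchu_word.png")  # placeholder path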

Limitations

While the model performs well on the domain of woodblock-printed Manchu, it has the following limitations:

  • It is not designed to process handwritten text or images with significant noise, degradation, or distortion.
  • The model outputs Romanized text based on standard transliteration and may not account for orthographic or dialectal variation present in historical documents.
  • Performance may degrade on images that do not conform to the 480 × 60 pixel format or that contain multiple words per image.

How to Use

The model may be invoked with the following instruction prompt to achieve optimal transcription accuracy:

instruction = (
    "You are an expert OCR system for Manchu script. "
    "Extract the text from the provided image with perfect accuracy. "
    "Format your answer exactly as follows: first line with 'Manchu:' followed by the Manchu script, "
    "then a new line with 'Roman:' followed by the romanized transliteration."
)

messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]

The input image must comply with the formatting specifications outlined above: 480 × 60 pixels, black text on a white background, and left-aligned text.
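A minimal end-to-end sketch is given below, following the standard transformers usage pattern for Llama 3.2 Vision models and reusing the messages list defined above. The file name word.png is a placeholder, and the dtype/device settings are assumptions to be adjusted to your hardware.

import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "mic7ch/llama-3.2-11b-vision-ManchuOCR"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder file: a 480 x 60 image, black text on white, left-aligned.
image = Image.open("word.png")

# `messages` as defined above.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens ("Manchu: ..." / "Roman: ...").
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))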

Authors and Acknowledgments

This model was developed by:

  • Dr. Yan Hon Michael Chung
  • Dr. Donghyeok Choi

The authors would like to thank Dr. Miguel Escobar Varela for his conceptual inspiration drawn from the Jawi OCR project, which demonstrated the feasibility of adapting large vision-language models for historical and minority scripts.

Special acknowledgment is also due to tyotakuki/ManchuOCR for providing the foundational training dataset.

Citation

Please cite this model as follows:

@misc{llama-3.2-11b-vision-ManchuOCR,
  title     = {Manchu OCR},
  author    = {Yan Hon Michael Chung and Donghyeok Choi}, 
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mic7ch/llama-3.2-11b-vision-ManchuOCR}
}