Model Card: mic7ch/llama-3.2-11b-vision-ManchuOCR

Model Overview

This model is a fine-tuned variant of meta-llama/Llama-3.2-11B-Vision-Instruct, adapted for Optical Character Recognition (OCR) of woodblock-printed Manchu texts. It transcribes images of Manchu script into Romanized Manchu, facilitating the digitization and linguistic analysis of historical sources.

The development of this model contributes to the broader digital preservation of endangered scripts and supports scholarly efforts in Qing history, philology, and digital humanities.

Model Architecture

  • Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
  • Architecture type: Multimodal vision-language model
  • Number of parameters: 11 billion
  • Output: Romanized Manchu

Intended Use

This model is intended for the transcription of woodblock-printed Manchu documents into Romanized text. It is particularly suitable for:

  • Historical document digitization
  • Corpus construction for linguistic and philological research
  • Projects in digital humanities involving Qing dynasty archives

The model does not support handwritten Manchu text or scripts printed in highly decorative or atypical fonts.

Training Data

The model was fine-tuned on the dataset mic7ch/manchu_sub1, which was adapted from the open-source dataset curated by tyotakuki/ManchuOCR. The training set comprises monochrome images of individual Manchu words, formatted as black text on a white background and preprocessed to a uniform size of 480 × 60 pixels with the text aligned to the left.

Evaluation

The model was evaluated on two separate datasets, with performance reported as follows:

Dataset: mic7ch/manchu_sub2 (1,000 images)

  • Character Error Rate (CER): 0.0015
  • Word Accuracy: 98.80%

Dataset: mic7ch/test_data_long (218 word images from the woodblock set)

  • Character Error Rate (CER): 0.0253
  • Word Accuracy: 91.74%

These results indicate high transcription accuracy, particularly on clean and consistently formatted inputs.
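For reference, the sketch below illustrates how these two metrics can be computed from paired prediction and reference strings. It is an assumption about the metric definitions (edit distance over reference characters for CER, exact-match rate for word accuracy), not the authors' evaluation script.

# Minimal metric sketch (assumed definitions, not the official evaluation code).
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(preds, refs):
    """Character Error Rate: total edit distance divided by total reference characters."""
    return sum(levenshtein(p, r) for p, r in zip(preds, refs)) / sum(len(r) for r in refs)

def word_accuracy(preds, refs):
    """Fraction of predictions that match the reference transcription exactly."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)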

Recommended Use Conditions

To ensure optimal OCR performance, users are advised to format input images as follows:

  • Image size should be exactly 480 × 60 pixels
  • Text must appear as black characters on a white background
  • The Manchu word should be left-aligned, with white padding occupying the remainder of the image

For reference image formatting, consult the dataset mic7ch/test_data_long, which provides properly structured examples.
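The following sketch shows one way to bring an arbitrary word image into this layout using Pillow. It is a convenience example under the stated assumptions (grayscale input, left alignment on a white canvas), not an official preprocessing script; the file path is a placeholder.

from PIL import Image

def format_for_ocr(path: str, target=(480, 60)) -> Image.Image:
    """Resize a word image to height 60 px and left-align it on a white 480 x 60 canvas."""
    img = Image.open(path).convert("L")                    # grayscale
    scale = target[1] / img.height                         # match the 60 px height
    new_w = min(target[0], max(1, round(img.width * scale)))
    img = img.resize((new_w, target[1]))
    canvas = Image.new("L", target, color=255)             # white background
    canvas.paste(img, (0, 0))                              # left-aligned
    return canvas.convert("RGB")

formatted = format_for_ocr("manchu_word.png")  # placeholder path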

Limitations

While the model performs well on the domain of woodblock-printed Manchu, it has the following limitations:

  • It is not designed to process handwritten text or images with significant noise, degradation, or distortion.
  • The model outputs Romanized text based on standard transliteration and may not account for orthographic or dialectal variation present in historical documents.
  • Performance may degrade on images that do not conform to the 480 × 60 pixel format or that contain multiple words per image.

How to Use

The model may be invoked with the following instruction prompt to achieve optimal transcription accuracy:

instruction = (
    "You are an expert OCR system for Manchu script. "
    "Extract the text from the provided image with perfect accuracy. "
    "Format your answer exactly as follows: first line with 'Manchu:' followed by the Manchu script, "
    "then a new line with 'Roman:' followed by the romanized transliteration."
)

messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]

The input image must comply with the formatting specifications outlined above: 480 × 60 pixels, black text on a white background, and left-aligned text.
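A minimal end-to-end sketch is given below, following the standard transformers usage pattern for Llama 3.2 Vision models and reusing the messages list defined above. The file name word.png is a placeholder, and the dtype/device settings are assumptions to be adjusted to your hardware.

import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "mic7ch/llama-3.2-11b-vision-ManchuOCR"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder file: a 480 x 60 image, black text on white, left-aligned.
image = Image.open("word.png")

# `messages` as defined above.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens ("Manchu: ..." / "Roman: ...").
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))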

Authors and Acknowledgments

This model was developed by:

  • Dr. Yan Hon Michael Chung
  • Dr. Donghyeok Choi

The authors would like to thank Dr. Miguel Escobar Varela for his conceptual inspiration drawn from the Jawi OCR project, which demonstrated the feasibility of adapting large vision-language models for historical and minority scripts.

Special acknowledgment is also due to tyotakuki/ManchuOCR for providing the foundational training dataset.

Citation

Please cite this model as follows:

@misc{llama-3.2-11b-vision-ManchuOCR,
  title     = {Manchu OCR},
  author    = {Yan Hon Michael Chung and Donghyeok Choi}, 
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mic7ch/llama-3.2-11b-vision-ManchuOCR}
}