Model Card: mic7ch/llama-3.2-11b-vision-ManchuOCR
Model Overview
This model is a fine-tuned variant of meta-llama/Llama-3.2-11B-Vision-Instruct, adapted specifically for Optical Character Recognition (OCR) of woodblock-printed Manchu texts. It is designed to transcribe images of Manchu script into Romanized Manchu, facilitating the digitization and linguistic analysis of historical sources.
The development of this model contributes to the broader digital preservation of endangered scripts and supports scholarly efforts in Qing history, philology, and digital humanities.
Model Architecture
- Base model: meta-llama/Llama-3.2-11B-Vision-Instruct
- Architecture type: Multimodal vision-language model
- Number of parameters: 11 billion
- Output: Romanized Manchu
Intended Use
This model is intended for the transcription of woodblock-printed Manchu documents into Romanized text. It is particularly suitable for:
- Historical document digitization
- Corpus construction for linguistic and philological research
- Projects in digital humanities involving Qing dynasty archives
The model does not support handwritten Manchu text or scripts printed in highly decorative or atypical fonts.
Training Data
The model was fine-tuned on the dataset mic7ch/manchu_sub1, which was adapted from the open-source dataset curated by tyotakuki/ManchuOCR. The training set comprises monochrome images of individual Manchu words, rendered as black text on a white background and preprocessed to a uniform size of 480 × 60 pixels with the text aligned to the left.
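If the training data is pulled directly from the Hugging Face Hub, the splits and column schema can be inspected with the datasets library before any training or evaluation run. This is a minimal sketch, assuming the dataset is public and loadable with load_dataset; it makes no assumptions about column names:

from datasets import load_dataset

# Load mic7ch/manchu_sub1 from the Hub (assumes the dataset is publicly accessible).
ds = load_dataset("mic7ch/manchu_sub1")
print(ds)                      # shows the available splits and their column schema
first_split = next(iter(ds.values()))
print(first_split[0])          # one example: a word image together with its transcription field(s)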
Evaluation
The model was evaluated on two separate datasets, with performance reported as follows:
Dataset: mic7ch/manchu_sub2 (1,000 images)
- Character Error Rate (CER): 0.0015
- Word Accuracy: 98.80%
Dataset: mic7ch/test_data_long (218 word images from the woodblock set)
- Character Error Rate (CER): 0.0253
- Word Accuracy: 91.74%
These results indicate high transcription accuracy, particularly on clean and consistently formatted inputs.
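For readers who wish to reproduce such figures, CER is the character-level edit distance between prediction and reference divided by the total reference length, and word accuracy is the fraction of images transcribed exactly. Below is a minimal sketch of how these metrics could be computed with the jiwer library; the dependency and the compute_metrics helper are illustrative assumptions, not the authors' evaluation code:

import jiwer  # pip install jiwer; assumed tooling, not necessarily what the authors used

def compute_metrics(references, predictions):
    """Illustrative CER / word-accuracy computation for Romanized Manchu outputs."""
    # Character Error Rate: aggregated edit distance over total reference characters.
    cer = jiwer.cer(references, predictions)
    # Word accuracy: fraction of images whose transcription matches the reference exactly.
    exact = sum(r == p for r, p in zip(references, predictions))
    return cer, exact / len(references)

# Example: one wrong character out of ten gives CER 0.1 and word accuracy 0.5.
print(compute_metrics(["manju", "gisun"], ["manju", "gisum"]))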
Recommended Use Conditions
To ensure optimal OCR performance, users are advised to format input images as follows:
- Image size should be exactly 480 × 60 pixels
- Text must appear as black characters on a white background
- The Manchu word should be left-aligned, with white padding occupying the remainder of the image
For reference image formatting, consult the dataset mic7ch/test_data_long, which provides properly structured examples.
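An input crop can be fitted to this layout with a few lines of Pillow. The helper below is a minimal sketch under the stated assumptions (the crop already contains dark glyphs on a light background) and is not part of the released code:

from PIL import Image

def format_for_manchu_ocr(path, target=(480, 60)):
    """Fit a single Manchu word crop onto a 480 x 60 white canvas, left-aligned.
    Hypothetical helper for illustration only."""
    img = Image.open(path).convert("L")            # grayscale; assumes dark text on a light background
    img.thumbnail(target, Image.LANCZOS)           # shrink proportionally if larger than the canvas
    canvas = Image.new("L", target, color=255)     # white 480 x 60 background
    canvas.paste(img, (0, (target[1] - img.height) // 2))  # left-aligned, vertically centered
    return canvas.convert("RGB")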
Limitations
While the model performs well on the domain of woodblock-printed Manchu, it has the following limitations:
- It is not designed to process handwritten text or images with significant noise, degradation, or distortion.
- The model outputs Romanized text based on standard transliteration and may not account for orthographic or dialectal variation present in historical documents.
- Performance may degrade on images that do not conform to the 480 × 60 pixel format or that contain multiple words.
How to Use
The model may be invoked with the following instruction prompt to achieve optimal transcription accuracy:
instruction = (
    "You are an expert OCR system for Manchu script. "
    "Extract the text from the provided image with perfect accuracy. "
    "Format your answer exactly as follows: first line with 'Manchu:' followed by the Manchu script, "
    "then a new line with 'Roman:' followed by the romanized transliteration."
)

messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]
The input image must comply with the formatting specifications outlined above: 480 × 60 pixels, black text on a white background, and left-aligned text.
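A minimal end-to-end sketch with the transformers library is shown below, reusing the messages defined above. It assumes the checkpoint loads directly with MllamaForConditionalGeneration (i.e. full fine-tuned weights rather than an adapter); the file name and generation settings are illustrative, not prescriptive:

import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "mic7ch/llama-3.2-11b-vision-ManchuOCR"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A 480 x 60 crop, black text on white, left-aligned (illustrative file name).
image = Image.open("manchu_word.png").convert("RGB")

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))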
Authors and Acknowledgments
This model was developed by:
- Dr. Yan Hon Michael Chung
- Dr. Donghyeok Choi
The authors would like to thank Dr. Miguel Escobar Varela for his conceptual inspiration drawn from the Jawi OCR project, which demonstrated the feasibility of adapting large vision-language models for historical and minority scripts.
Special acknowledgment is also due to tyotakuki/ManchuOCR for providing the foundational training dataset.
Citation
Please cite this model as follows:
@misc{llama-3.2-11b-vision-ManchuOCR,
  title     = {Manchu OCR},
  author    = {Yan Hon Michael Chung and Donghyeok Choi},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mic7ch/llama-3.2-11b-vision-ManchuOCR}
}