---
base_model: unsloth/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- mic7ch/manchu_sub1
license: mit
metrics:
- cer
---

# Model Card: `mic7ch/llama-3.2-11b-vision-ManchuOCR`

## Model Overview

This model is a fine-tuned variant of `meta-llama/Llama-3.2-11B-Vision-Instruct`, adapted for Optical Character Recognition (OCR) of **woodblock-printed Manchu texts**. It transcribes images of Manchu script into Romanized Manchu, facilitating the digitization and linguistic analysis of historical sources. The model contributes to the broader digital preservation of endangered scripts and supports scholarly work in Qing history, philology, and the digital humanities.

## Model Architecture

- **Base model**: `meta-llama/Llama-3.2-11B-Vision-Instruct`
- **Architecture type**: Multimodal vision-language model
- **Number of parameters**: 11 billion
- **Output**: Romanized Manchu

## Intended Use

This model is intended for the transcription of **woodblock-printed Manchu documents** into Romanized text. It is particularly suitable for:

- Historical document digitization
- Corpus construction for linguistic and philological research
- Digital humanities projects involving Qing dynasty archives

The model does not support handwritten Manchu or scripts printed in highly decorative or atypical fonts.

## Training Data

The model was fine-tuned on the dataset [`mic7ch/manchu_sub1`](https://huggingface.co/datasets/mic7ch/manchu_sub1), adapted from the open-source dataset curated by [tyotakuki/ManchuOCR](https://github.com/tyotakuki/ManchuOCR). The training set comprises monochrome images of individual Manchu words, formatted as **black text on a white background** and preprocessed to a uniform size of **480 × 60 pixels** with the text aligned to the left.

## Evaluation

The model was evaluated on two separate datasets, with the following results:

**Dataset: `mic7ch/manchu_sub2` (1,000 images)**

- Character Error Rate (CER): 0.0015
- Word Accuracy: 98.80%

**Dataset: `mic7ch/test_data_long` (218 word images from the woodblock set)**

- Character Error Rate (CER): 0.0253
- Word Accuracy: 91.74%

These results indicate high transcription accuracy, particularly on clean and consistently formatted inputs.

## Recommended Use Conditions

To ensure optimal OCR performance, format input images as follows:

- Image size should be exactly **480 × 60 pixels**
- Text must appear as **black characters on a white background**
- The Manchu word should be **left-aligned**, with white padding occupying the remainder of the image

For reference image formatting, consult the dataset [`mic7ch/test_data_long`](https://huggingface.co/datasets/mic7ch/test_data_long), which provides properly structured examples. A preprocessing sketch is included after the Limitations section below.

## Limitations

While the model performs well on woodblock-printed Manchu, it has the following limitations:

- It is not designed to process handwritten text or images with significant noise, degradation, or distortion.
- The model outputs Romanized text based on standard transliteration and may not account for orthographic or dialectal variation present in historical documents.
- Performance may degrade on text that does not conform to the 480 × 60 pixel format or that contains multiple words per image.
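For inputs that do not already follow the recommended 480 × 60 layout, a simple preprocessing step can normalize a word image before inference. The sketch below is illustrative only: it assumes Pillow is installed, the helper name `to_model_format` is hypothetical rather than part of this repository, and the binarization threshold may need tuning for your scans.

```python
from PIL import Image, ImageOps

TARGET_W, TARGET_H = 480, 60  # recommended input size (width x height)

def to_model_format(path: str) -> Image.Image:
    """Pad a Manchu word image to 480x60: black text, white background, left-aligned."""
    img = Image.open(path).convert("L")               # grayscale
    img = ImageOps.autocontrast(img)                  # stretch contrast before thresholding
    img = img.point(lambda p: 0 if p < 128 else 255)  # binarize: strokes black, paper white

    # Shrink only if the word exceeds the target box; never upscale.
    scale = min(TARGET_W / img.width, TARGET_H / img.height, 1.0)
    if scale < 1.0:
        img = img.resize((max(1, int(img.width * scale)),
                          max(1, int(img.height * scale))))

    # Paste onto a white 480x60 canvas, left-aligned and vertically centered.
    canvas = Image.new("L", (TARGET_W, TARGET_H), color=255)
    canvas.paste(img, (0, (TARGET_H - img.height) // 2))
    return canvas.convert("RGB")
```

Compare the result against the examples in [`mic7ch/test_data_long`](https://huggingface.co/datasets/mic7ch/test_data_long) to confirm the formatting matches before running inference at scale.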
## How to Use

Invoke the model with the following instruction prompt for optimal transcription accuracy:

```python
instruction = (
    "You are an expert OCR system for Manchu script. "
    "Extract the text from the provided image with perfect accuracy. "
    "Format your answer exactly as follows: first line with 'Manchu:' followed by the Manchu script, "
    "then a new line with 'Roman:' followed by the romanized transliteration."
)

# Chat-style message: an image placeholder followed by the instruction text.
messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]
```

The input image must comply with the formatting specifications outlined above: 480 × 60 pixels, black text on a white background, and left-aligned text. An end-to-end example that combines this prompt with model loading is sketched at the end of this card.

## Authors and Acknowledgments

This model was developed by:

- **Dr. Yan Hon Michael Chung**
- **Dr. Donghyeok Choi**

The authors thank **Dr. Miguel Escobar Varela** for conceptual inspiration drawn from the [Jawi OCR project](https://huggingface.co/culturalheritagenus/qwen-for-jawi-v1), which demonstrated the feasibility of adapting large vision-language models to historical and minority scripts. Special acknowledgment is also due to [tyotakuki/ManchuOCR](https://github.com/tyotakuki/ManchuOCR) for providing the foundational training dataset.

## Citation

Please cite this model as follows:

```bibtex
@misc{llama-3.2-11b-vision-ManchuOCR,
  title     = {Manchu OCR},
  author    = {Yan Hon Michael Chung and Donghyeok Choi},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mic7ch/llama-3.2-11b-vision-ManchuOCR}
}
```
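## Example Inference Sketch

The following end-to-end sketch combines the prompt above with a typical `transformers` + `peft` loading path. It is a minimal illustration under stated assumptions, not the authors' reference pipeline: it assumes this repository hosts a PEFT adapter for the base model declared in the card metadata, and that `word.png` has already been preprocessed to the recommended 480 × 60 format.

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, MllamaForConditionalGeneration

base_id = "unsloth/Llama-3.2-11B-Vision-Instruct"      # base model from the card metadata
adapter_id = "mic7ch/llama-3.2-11b-vision-ManchuOCR"   # this repository (assumed PEFT adapter)

processor = AutoProcessor.from_pretrained(base_id)
model = MllamaForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)   # attach the fine-tuned adapter
model.eval()

instruction = (
    "You are an expert OCR system for Manchu script. "
    "Extract the text from the provided image with perfect accuracy. "
    "Format your answer exactly as follows: first line with 'Manchu:' followed by the Manchu script, "
    "then a new line with 'Roman:' followed by the romanized transliteration."
)
messages = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": instruction}],
    }
]

image = Image.open("word.png").convert("RGB")          # a preprocessed 480x60 word image
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (skip the prompt).
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

The decoded text should follow the `Manchu:` / `Roman:` format requested by the instruction prompt; parse the `Roman:` line to obtain the transliteration.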