Tachiwin OCR
Tachiwin OCR is a specialized Tesseract 5 model fine-tuned from the Latin base model to support Optical Character Recognition (OCR) for Mexico's 68 indigenous languages and their hundreds of variants.
Model Description
Tachiwin OCR addresses a critical gap in OCR technology for underrepresented languages. Its goal is to provide accurate OCR for documents, such as PDFs or images, containing text written in any of Mexico's 68 indigenous languages and their hundreds of variants, which often use complex diacritics and characters that standard OCR workflows do not support.
- Model Name: Tachiwin OCR
- Base Model: Tesseract 5 Latin
- Languages Supported: 68 indigenous languages of Mexico
- Script: Latin with extended character support
- License: MIT
Model Details
Intended Use
This model is designed to digitize and preserve texts written in Mexico's indigenous languages, addressing the critical gap in OCR technology for underrepresented languages. It can be used for:
- Digitization of historical documents
- Educational material processing
- Cultural preservation projects
- Academic research
- Community documentation efforts
Languages Covered
This model supports all 68 indigenous languages of Mexico and hundreds of their variants.
Sample Text Examples
The model excels at recognizing complex diacritics and special characters found in indigenous Mexican languages:
Latin OCR: (stock Tesseract Latin model, not fine-tuned)
Séli kie' ja sei Wa i ai sé ne ard néi jeu yéi pei la. Jau sét ma kid kota. Anii la a ngéi jeu jud pa la. Wai ai sgt la t& jau sét ma kid kota. Jau sé, ajo ikd anii la jau set, ma kid kotd .
Tachiwin OCR: (trained by Tachiwin)
Sëü kie’ jä sɇɨ Wa̱ i aɨ śɨ ne̱ aṟö nëi jeu yëi pei la̱. Jau sɇ́ɨ m̱a̱ kiä kötä. Añii Ia̱ a ngʉ́ɨ jeu juö poa̱̱ la̱. Wa̱ i aɨ sʉ́ɨ la̱ tä̱ jau sɇɨ ma̱ kiä kötä̱. Jau sɇ́ɨ, ajö̱ ikö añii Ia̱ jau sɇ́ɨ, ma̱ kiä kötä̱ .
Fragment of "Libro de literatura en lengua chinanteca de Usila, Oaxaca" Lorenzo-Isidrio A, SEP 1999
Performance
- Word Error Rate (WER): 1.04% on evaluation dataset
- Character Error Rate (CER): ~1% on evaluation dataset
- Accuracy: 95% word-level accuracy
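The metrics above are edit-distance ratios. The following is a minimal sketch of how CER and WER are commonly computed, not the evaluation script used for this model. The function names are hypothetical, and the NFC normalization step is an assumption: it is included so that a precomposed character and its base-plus-combining-mark equivalent are counted as the same character, which matters for text with heavy diacritics.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _nfc(text):
    # Assumption: compare texts in NFC so equivalent encodings of a
    # diacritic do not count as errors.
    return unicodedata.normalize("NFC", text)


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Two of four characters lose their diacritic.
    print(cer("kötä", "kota"))
    # One of three words differs.
    print(wer("jau sét ma", "jau set ma"))
```

Because a single wrong diacritic makes the whole word count as an error, WER on diacritic-heavy text is typically at least as high as CER.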
Model Variants
- Best Model (Tachiwin.traineddata): Higher accuracy, slower processing (floating-point)
- Fast Model (Tachiwin_fast.traineddata): Faster processing, slightly lower accuracy (integer-based)
Usage
Requirements
- Tesseract 5.0 or higher
- Compatible with the pytesseract Python wrapper
Installation
# Download the model files
wget [model-url]/Tachiwin.traineddata
wget [model-url]/Tachiwin_fast.traineddata
# Place in your tessdata directory or use --tessdata-dir flag
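After downloading, it can help to confirm that the files actually landed in the directory Tesseract will read. The sketch below assumes only the two file names listed under Model Variants; `installed_models` and `choose_model` are hypothetical helper names, not part of Tesseract or pytesseract.

```python
from pathlib import Path

BEST = "Tachiwin"       # Tachiwin.traineddata
FAST = "Tachiwin_fast"  # Tachiwin_fast.traineddata


def installed_models(tessdata_dir):
    """Return the -l names of the Tachiwin models present in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [name for name in (BEST, FAST)
            if (directory / f"{name}.traineddata").is_file()]


def choose_model(tessdata_dir, prefer_fast=False):
    """Pick the preferred variant, falling back to whichever is installed."""
    available = installed_models(tessdata_dir)
    if not available:
        raise FileNotFoundError(
            f"no Tachiwin .traineddata file found in {tessdata_dir}")
    preferred = FAST if prefer_fast else BEST
    return preferred if preferred in available else available[0]
```

The returned name can be passed as the value of `-l` on the command line or as `lang` in pytesseract.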
Basic Usage
# Command line usage
tesseract input_image.png output_text --tessdata-dir /path/to/tessdata -l Tachiwin
# With the fast model variant
tesseract input_image.png output_text --tessdata-dir /path/to/tessdata -l Tachiwin_fast
# Python usage
import pytesseract
from PIL import Image
# Load image
image = Image.open('document.png')
# OCR with Tachiwin model
text = pytesseract.image_to_string(
image,
lang='Tachiwin',
config='--tessdata-dir /path/to/tessdata'
)
print(text)
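For digitization projects with many scanned pages, the command-line call shown above can be scripted. The following is a rough sketch that assumes PNG inputs and a `tesseract` binary on the PATH; `build_command` and `ocr_folder` are hypothetical helper names. It uses only the flags already shown in this section.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, tessdata_dir, lang="Tachiwin"):
    """Assemble the same tesseract call shown in the command-line usage."""
    return [
        "tesseract", str(image), str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,
    ]


def ocr_folder(folder, tessdata_dir, lang="Tachiwin"):
    """Run tesseract on every PNG in folder and return the .txt paths."""
    written = []
    for image in sorted(Path(folder).glob("*.png")):
        # Tesseract appends ".txt" to the output base itself.
        output_base = image.with_suffix("")
        subprocess.run(
            build_command(image, output_base, tessdata_dir, lang),
            check=True,  # raise if tesseract exits with an error
        )
        written.append(image.with_suffix(".txt"))
    return written
```

Passing `lang="Tachiwin_fast"` switches to the faster variant when throughput matters more than accuracy.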
Recommended Settings
For best results with indigenous language texts:
# For printed text
tesseract image.png output -l Tachiwin --psm 6
# For single text lines
tesseract image.png output -l Tachiwin --psm 7
# For a single word (--psm 8 treats the image as one word)
tesseract image.png output -l Tachiwin --psm 8
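When calling the model from Python, the same page segmentation modes go into the pytesseract `config` string. The sketch below is illustrative: `PSM_BY_LAYOUT` and `make_config` are hypothetical names, and the layout labels follow Tesseract's own description of modes 6 (single uniform block of text), 7 (single text line), and 8 (single word).

```python
import shlex

# Layout label -> Tesseract page segmentation mode.
PSM_BY_LAYOUT = {
    "block": 6,  # a single uniform block of text
    "line": 7,   # a single text line
    "word": 8,   # a single word
}


def make_config(layout, tessdata_dir=None):
    """Build a config string suitable for pytesseract.image_to_string."""
    if layout not in PSM_BY_LAYOUT:
        raise ValueError(f"unknown layout {layout!r}; "
                         f"expected one of {sorted(PSM_BY_LAYOUT)}")
    parts = ["--psm", str(PSM_BY_LAYOUT[layout])]
    if tessdata_dir is not None:
        # Quote the path so a directory name with spaces stays one argument.
        parts += ["--tessdata-dir", shlex.quote(str(tessdata_dir))]
    return " ".join(parts)
```

For example, `pytesseract.image_to_string(image, lang='Tachiwin', config=make_config('line', '/path/to/tessdata'))` would apply the single-line mode.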
Training Data
The model was trained on a comprehensive dataset covering:
- Contemporary indigenous language texts
- Historical documents
- Educational materials
- Community publications
- Various fonts and writing styles
Note: Training data sources respect community intellectual property and cultural sensitivity.
Performance
- Optimized for: Latin script with indigenous language diacritics and special characters
- Best suited for: Clear, well-contrasted text images
Limitations
- Performance may vary with handwritten texts
- Works best with Latin script; does not support indigenous syllabaries
- Quality depends on image resolution and contrast
- May require fine-tuning for specific regional variants
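Since quality depends on resolution and contrast, a light preprocessing pass before OCR may help. This is a sketch under stated assumptions, not a step the model requires: the 300 DPI target is a commonly cited rule of thumb for Tesseract rather than a figure from this model card, and `scale_for_dpi` and `preprocess` are hypothetical helper names. It relies on Pillow, which the Python usage example already imports.

```python
def scale_for_dpi(current_dpi, target_dpi=300):
    """Upscale factor needed to reach target_dpi; never downscales."""
    if current_dpi <= 0:
        return 1.0  # unknown or invalid DPI: leave the size alone
    return max(1.0, target_dpi / current_dpi)


def preprocess(path, target_dpi=300):
    """Return a grayscale, contrast-stretched, optionally upscaled image."""
    # Imported lazily so scale_for_dpi can be used without Pillow installed.
    from PIL import Image, ImageOps

    img = Image.open(path)
    # Assumption: if the file carries no DPI metadata, treat it as on target.
    dpi = float(img.info.get("dpi", (target_dpi, target_dpi))[0])

    gray = ImageOps.autocontrast(ImageOps.grayscale(img))
    factor = scale_for_dpi(dpi, target_dpi)
    if factor > 1.0:
        new_size = (round(gray.width * factor), round(gray.height * factor))
        gray = gray.resize(new_size, Image.LANCZOS)
    return gray
```

The returned image can be passed directly to `pytesseract.image_to_string` in place of the original.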
Ethical Considerations
Cultural Sensitivity
- This model is developed with respect for indigenous communities
- Training data collection followed ethical guidelines
- Community consultation was prioritized throughout development
Bias and Fairness
- Efforts made to include diverse regional variants
- Ongoing work to identify and address potential biases
- Community feedback encouraged for improvements
Data Privacy
- No personally identifiable information in training data
- Respects cultural and linguistic sovereignty
Citation
@misc{tachiwin2024,
title={Tachiwin OCR: A Tesseract Model for Mexico's Indigenous Languages},
author={[Tachiwin]},
year={2025},
publisher={Hugging Face},
license={MIT}
}
License
This model is released under the MIT License. See the LICENSE file for details.
Contributing
We welcome contributions from indigenous language communities and researchers:
- Report issues with specific language variants
- Contribute additional training data
- Suggest improvements for regional variants
- Share use cases and applications
Contact
For questions, collaborations, or community partnerships, please open an issue in this repository.
In Memoriam: Fidencio Hernández (1990-2025), founder of this initiative. May the indigenous languages of Mexico and the world never be lost. Tachiwin is an initiative committed to the preservation of the indigenous languages of Mexico.