Authors
Loore Lehtmets, Mari-Anna Meimer
Bachelor's thesis: "Ajalooliste eestikeelsete OCR tekstide järeltöötluse ja hindamise automatiseerimine Eesti Rahvusraamatukogu jaoks" (2025, TalTech), in English: "Automation of Post-Processing and Evaluation of Historical Estonian OCR Texts for the National Library of Estonia".
Model Description
This model was developed as a Bachelor's thesis at Tallinn University of Technology. It is trained to correct OCR errors in historical Estonian texts and is primarily intended for use by the National Library of Estonia on materials from their digital archive.
Model Sources
All training and testing code, including datasets and results, is available on GitHub:
- Repository: https://github.com/mari-annam/estonian-ocr
Dataset
The model was trained on 5,145 text examples from the digital archive of the National Library of Estonia. These examples consist of OCR-generated text paired with human-corrected versions. The model was evaluated on 2,001 separate text examples from the same archive. More information about the datasets and results is available on GitHub and in the thesis document.
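The card does not show the exact dataset schema, but given the instruction-style prompt used at inference time in the code below, each OCR/ground-truth pair plausibly follows an Alpaca-style record. A purely hypothetical illustration (the field names and the corrected text are assumptions, not taken from the actual data):

```python
# Hypothetical illustration of one training pair.
# Field names and the corrected text are assumptions, not the real schema.
example = {
    "instruction": "Paranda vead selles eestikeelses OCR tekstis.",
    # raw OCR output, with historical orthography and misrecognized characters
    "input": "Misso wallas awati awalik telefoni-kõnc-punkt.",
    # human-corrected version (historical 'w' -> 'v', fixed 'c' -> 'e')
    "output": "Misso vallas avati avalik telefoni-kõne-punkt.",
}
```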
How to Use
To try the model, copy the example code below into Google Colab or any other Python environment with the transformers and peft libraries installed. Set the ocr_text variable to the OCR text you want to correct.
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
from huggingface_hub import login  # optional: call login() only if the base model or adapter requires authentication
# PEFT adapter that corrects OCR errors; two fine-tuned variants are available on the Hub
peft_model_id = "mariannam/llammas-OCR-FT5k"
config = PeftConfig.from_pretrained(peft_model_id)
# load base model
base_model = AutoModelForCausalLM.from_pretrained(
config.base_model_name_or_path,
device_map="auto", # for CPU use device_map=None
torch_dtype="auto"
)
# load adapters
model = PeftModel.from_pretrained(base_model, peft_model_id)
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
# put the OCR text you want to correct here
ocr_text = "Misso wallas awati awalik telefoni-kõnc-punkt Hinol ja PiiganbiS mõlemal, Kanepi kanbu."
# prompt template (must match the format used during fine-tuning)
prompt = f"""### Instruction:
Paranda vead selles eestikeelses OCR tekstis.
### Input:
{ocr_text}
### Response:
"""
# allow slightly more new tokens than the input length, since the correction should be roughly as long as the input
input_length = len(tokenizer(ocr_text, truncation=True)['input_ids'])
max_new_tokens = input_length + 10
# generate and print output
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):].strip())
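The thesis also covers automated evaluation; a standard metric for OCR post-processing is the character error rate (CER), i.e. the Levenshtein edit distance between the model output and a human-corrected reference, normalized by the reference length. A minimal sketch, assuming plain string comparison (function names here are illustrative, not taken from the repository):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# one substituted character ('v' read as historical 'w') in a 12-character reference:
print(round(cer("Misso vallas", "Misso wallas"), 3))  # -> 0.083
```

Comparing the CER of the raw OCR text and of the model output against the same reference shows how much the correction step actually helps; lower is better.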