Authors

"Ajalooliste eestikeelsete OCR tekstide järeltöötluse ja hindamise automatiseerimine Eesti Rahvusraamatukogu jaoks" (2025, TalTech)

"Automation of Post-Processing and Evaluation of Historical Estonian OCR Texts for the National Library of Estonia"

Loore Lehtmets, Mari-Anna Meimer

Model Description

This model was developed as part of a Bachelor's thesis at Tallinn University of Technology. It is trained to predict the probability that a corrected OCR text is of better quality than the original OCR text. The model is intended to be used together with OCR text correction models such as our llammas-OCR-FT5k or llammas-OCR-FT13k. It is primarily intended for use by the National Library of Estonia on materials from their digital archive.

Model Sources

All training and testing code, together with the datasets and results, is available on GitHub.

Dataset

The model was trained on 10,300 text examples from the digital archive of the National Library of Estonia. Each example pairs an OCR-generated text with the same text after correction by an OCR correction model such as llammas-OCR-FT5k or llammas-OCR-FT13k. The model was evaluated on 2,001 separate text examples from the same archive. More information about the datasets and results is available on GitHub and in the thesis document.
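Illustratively, one training pair consists of the raw OCR output and its corrected version, as in the snippet below. The field names here are hypothetical for illustration only, not the actual dataset schema; the texts are taken from the usage example further down.

```python
# A hypothetical training pair (field names are illustrative, not the real schema).
example = {
    "ocr_text": "Misso wallas awati awalik telefoni-kõnc-punkt Hinol ja PiiganbiS mõlemal, Kanepi kanbu.",
    "corrected_text": "Misso wallas awati awalik telefoni-kõne-punkt Hinol ja Piigandil mõlemal, Kanepi kandu.",
}
```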

How To Use

To try out the model, copy the example code below into Google Colab or any Python environment with `transformers` and `peft` installed. Set the `ocr_text` and `prediction` variables to the OCR text and the corrected text you want the probability for.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
from huggingface_hub import login

# login()  # uncomment and authenticate if the model requires Hugging Face access

# load the adapter config for the probability-grading model
peft_model_id = "mariannam/llammas-prediction-grading"
config = PeftConfig.from_pretrained(peft_model_id)

# load base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto", # for CPU use device_map=None
    torch_dtype="auto"
)

# load adapters
model = PeftModel.from_pretrained(base_model, peft_model_id)

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# input your OCR text here
ocr_text = "Misso wallas awati awalik telefoni-kõnc-punkt Hinol ja PiiganbiS mõlemal, Kanepi kanbu."

# input the corrected text you want the probability for here
prediction = "Misso wallas awati awalik telefoni-kõne-punkt Hinol ja Piigandil mõlemal, Kanepi kandu."

# prompt template (the Estonian instruction asks: "What is the probability
# that the corrected text is better than the OCR text? Return the probability
# as an integer percentage.")
prompt = f"""### Instruction:
Kui suur on tõenäosus, et parandatud tekst on OCR tekstist parem? Tagasta tõenäosus täisarvulise protsendina.

### Input:
OCR TEKST: {ocr_text}

PARANDATUD TEKST: {prediction}

### Response:
"""

# generate and print only the newly generated tokens (the predicted percentage)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=3)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
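The model returns the probability as text. If you want to work with it as a number, a small helper can extract the integer percentage; note that the exact response format may vary, so the parsing below is an assumption, not part of the original pipeline.

```python
import re
from typing import Optional

def parse_probability(response: str) -> Optional[int]:
    """Extract an integer percentage from the model's text response.

    Assumes the response contains a number such as "87" or "87%";
    returns None when no integer can be found.
    """
    match = re.search(r"\d+", response)
    if match is None:
        return None
    # clamp to a valid percentage range as a safety net
    return max(0, min(100, int(match.group())))

print(parse_probability("87%"))  # -> 87
```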
Model tree for mariannam/llammas-prediction-grading

This model is a PEFT adapter fine-tuned from the base model tartuNLP/Llammas.