turgutguvercin/pix2struct-turkish-receipts

Model Details

This model is a fine-tuned version of google/pix2struct-base, trained on a private Turkish receipt dataset. It extracts structured fields such as:

Mağaza adı (store name)
Toplam tutar (total amount)
Tarih (date)
Ürünler (line items)

Intended uses:

Document understanding for Turkish receipts
Key information extraction from scanned or photographed receipts
Internal automation, ERP pre-filling, or financial apps

⚠️ Limitations

This model is an experimental prototype, fine-tuned from Pix2Struct for Turkish receipt extraction on a custom dataset of approximately 1,100 receipts.

While it performs reasonably well on short and clean receipts, it has notable limitations:

❌ Performance degrades significantly on long or complex receipts
⚠️ GPU usage is relatively high due to the nature of the Pix2Struct architecture
❌ Not optimized for real-time or production-level use cases; more robust and efficient alternatives exist

Several learning rates were tested during training, but suboptimal hyperparameter choices (especially the learning rate and the generation strategy) may still have hurt generalization.
This may partially explain the weaker performance on longer receipts.

Additionally:

⚠️ The training data included high-frequency entities, such as the same store name (e.g. "A101") appearing repeatedly. This may have caused the model to hallucinate common store names even when they're not present in the image.
⚠️ Overfitting to dominant patterns in the dataset (e.g. fixed receipt templates) may also reduce generalization

We recommend using this model primarily for research, benchmarking, or experimentation purposes.

📊 Metrics

The model was evaluated on a private Turkish receipt test set.

Metric	Score
val_edit_distance	0.112
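The card does not state how val_edit_distance is computed. One common definition in Donut-style and Pix2Struct-style fine-tuning is the character-level Levenshtein distance between the predicted and reference strings, divided by the length of the longer string, so that 0 means an exact match and lower is better. The sketch below illustrates that definition only; the function names are hypothetical, and the normalization is an assumption rather than the author's confirmed method.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, target: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    if not pred and not target:
        return 0.0
    return levenshtein(pred, target) / max(len(pred), len(target))


# One wrong character out of five
score = normalized_edit_distance("57,75", "57,15")
```

Under this assumed definition, a score of 0.112 would mean that roughly 11% of the characters in an average prediction need to be edited to match the reference.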

📥 How to Use

from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from PIL import Image
import torch

model_id = "turgutguvercin/pix2struct-turkish-receipts"

# Load model and processor
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
processor = Pix2StructProcessor.from_pretrained(model_id)

# Put model in eval mode and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()


# Load your image (make sure it's in RGB)
image_path = "test_1.jpg"
image = Image.open(image_path).convert("RGB")

# Preprocess image
inputs = processor(images=image, return_tensors="pt").to(device)

# Generate prediction with greedy decoding (num_beams=1, no sampling).
# early_stopping is omitted because it only applies to beam search.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=768,
        num_beams=1,  # greedy decoding; faster than beam search
        do_sample=False,
        use_cache=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Decode tokens
predicted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(predicted_text)
# returns <s_store_name> SOK MARKETLER T.A.S.</s_store_name><s_tax_id> 81301318199</s_tax_id><s_date> 21/02/2025</s_date><s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_cnt> 1 x</s_cnt><s_price> 39,75</s_price><s_tax_rate> %01</s_tax_rate> <sep/><s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_cnt> 4 x</s_cnt><s_price> 18,00</s_price><s_tax_rate> %01</s_tax_rate></s_menu><s_sub_total><s_tax_price> 0,57</s_tax_price></s_sub_total><s_total><s_total_price> 57,75</s_total_price></s_total>
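The decoded string uses `<s_key>…</s_key>` tags, with `<sep/>` between repeated items, so downstream code usually needs to turn it into a nested structure first. The following is a minimal sketch of one way to do that with regular expressions. The helper names (`parse_receipt_tags`, `parse_value`, `to_decimal`) are hypothetical and not part of the model or of transformers. It assumes well-formed tags, no tag nested inside a tag of the same name, and `<sep/>` appearing only directly inside list-like fields such as `s_menu`. The amount conversion additionally assumes Turkish number formatting (comma as the decimal separator, dot as the thousands separator).

```python
import re
from decimal import Decimal

TAG_PATTERN = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def parse_value(body: str):
    """Return a nested dict if the body contains tags, else the stripped text."""
    if re.search(r"<s_\w+>", body):
        return parse_receipt_tags(body)
    return body.strip()


def parse_receipt_tags(text: str) -> dict:
    """Convert <s_key>value</s_key> markup into nested dicts and lists."""
    result = {}
    for match in TAG_PATTERN.finditer(text):
        key, body = match.group(1), match.group(2)
        if "<sep/>" in body:
            # Repeated items (e.g. line items) become a list
            result[key] = [parse_value(part) for part in body.split("<sep/>")]
        else:
            result[key] = parse_value(body)
    return result


def to_decimal(amount: str) -> Decimal:
    """Assumes Turkish formatting, e.g. '1.234,50' means 1234.50."""
    return Decimal(amount.replace(".", "").replace(",", "."))


# Shortened version of the example output above
sample = (
    "<s_store_name> SOK MARKETLER T.A.S.</s_store_name>"
    "<s_date> 21/02/2025</s_date>"
    "<s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_price> 39,75</s_price> <sep/>"
    "<s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_price> 18,00</s_price></s_menu>"
    "<s_total><s_total_price> 57,75</s_total_price></s_total>"
)
parsed = parse_receipt_tags(sample)
total = to_decimal(parsed["total"]["total_price"])
```

Note that a receipt with a single line item contains no `<sep/>`, so in this sketch `parsed["menu"]` would be a dict rather than a one-element list; callers may want to normalize that case. Because the model can produce truncated or malformed output on long receipts, the parsed result should be validated before use.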
