Model Details

This model is a fine-tuned version of google/pix2struct-base on a private Turkish receipt dataset. It is capable of extracting structured text such as:

  • Mağaza adı (store name)

  • Toplam tutar (total amount)

  • Tarih (date)

  • Ürünler (line items)

Intended Uses

  • Document understanding for Turkish receipts

  • Key information extraction from scanned or photographed receipts

  • Useful for internal automation, ERP pre-filling, or financial apps
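For automation or ERP pre-filling, extracted amounts usually need to be converted to numbers and sanity-checked before use. The following is a minimal sketch, not part of the model itself. It assumes amounts use the Turkish convention (comma as decimal separator, period as thousands separator) and that each line-item price is the line total, so the prices should add up to the receipt total. The function names `parse_amount` and `totals_match` are hypothetical.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse a Turkish-formatted amount such as '1.234,56' into a Decimal.

    Assumption: '.' is a thousands separator and ',' is the decimal separator.
    """
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def totals_match(line_prices, total):
    """Return True when the line-item prices add up to the stated total.

    Assumption: each price is a line total, not a unit price.
    """
    line_sum = sum((parse_amount(p) for p in line_prices), Decimal("0"))
    return line_sum == parse_amount(total)


# Values as they might appear in the extracted fields.
print(totals_match(["39,75", "18,00"], "57,75"))
```

A mismatch does not prove the extraction is wrong (discounts or rounding can also cause one), but it is a cheap signal for routing a receipt to manual review.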

⚠️ Limitations

This model is an experimental prototype, fine-tuned for Turkish receipt extraction using Pix2Struct on a custom dataset. Approximately 1,100 receipts were used to fine-tune it.

While it performs reasonably well on short and clean receipts, it has notable limitations:

  • ❌ Performance degrades significantly on long or complex receipts
  • ⚠️ GPU usage is relatively high due to the nature of the Pix2Struct architecture
  • ❌ Not optimized for real-time or production-level use cases
  • More robust and efficient alternatives exist.

During training, various learning rates were tested, but it's possible that suboptimal hyperparameter tuning (especially learning rate and generation strategy) affected generalization.
This may partially explain its weaker performance on longer receipts.

Additionally:

  • ⚠️ The training data included high-frequency entities, such as the same store name (e.g. "A101") appearing repeatedly. This may have caused the model to hallucinate common store names even when they're not present in the image.
  • ⚠️ Overfitting to dominant patterns in the dataset (e.g. fixed receipt templates) may also reduce generalization.
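One way to check for this kind of entity imbalance is to measure how much of a label set each store name accounts for. The sketch below is illustrative only: the function name `store_name_shares` is hypothetical, and the example labels are made up rather than taken from the actual training data.

```python
from collections import Counter


def store_name_shares(store_names):
    """Return (name, share) pairs sorted from most to least frequent.

    Names are normalized with strip() and casefold() so that trivial
    spelling variants are counted together.
    """
    counts = Counter(name.strip().casefold() for name in store_names)
    total = sum(counts.values())
    return [(name, count / total) for name, count in counts.most_common()]


# Hypothetical labels, for illustration only.
labels = ["A101", "A101", "a101 ", "SOK MARKETLER T.A.S.", "STORE B"]
for name, share in store_name_shares(labels):
    print(f"{name}: {share:.0%}")
```

If a single name dominates, predictions of that name on unseen receipts deserve extra scrutiny.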

We recommend using this model primarily for research, benchmarking, or experimentation purposes.

📊 Metrics

The model was evaluated on a private Turkish receipt test set.

Metric              Score
val_edit_distance   0.112
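The card does not state how `val_edit_distance` is computed. A common choice for this kind of model is the character-level Levenshtein distance between the predicted and reference strings, divided by the length of the longer string, so that 0 means an exact match. The sketch below illustrates that assumed definition in plain Python; it is not necessarily the author's evaluation code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction, reference):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(prediction), len(reference))
    return levenshtein(prediction, reference) / longest if longest else 0.0


print(normalized_edit_distance("39,75", "39,15"))
```

Under this definition, a score of 0.112 would mean that roughly 11% of characters differ on average, which can still hide errors concentrated in a few critical fields such as totals.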

📥 How to Use

from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from PIL import Image
import torch

model_id = "turgutguvercin/pix2struct-turkish-receipts"

# Load model and processor
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
processor = Pix2StructProcessor.from_pretrained(model_id)

# Put model in eval mode and move to device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()


# Load your image (make sure it's in RGB)
image_path = "test_1.jpg"
image = Image.open(image_path).convert("RGB")

# Preprocess image
inputs = processor(images=image, return_tensors="pt").to(device)

# Generate prediction (greedy decoding: single beam, no sampling)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=768,
        num_beams=1,  # greedy search; beam search (num_beams > 1) is slower
        do_sample=False,
        use_cache=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Decode tokens
predicted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(predicted_text)
# returns <s_store_name> SOK MARKETLER T.A.S.</s_store_name><s_tax_id> 81301318199</s_tax_id><s_date> 21/02/2025</s_date><s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_cnt> 1 x</s_cnt><s_price> 39,75</s_price><s_tax_rate> %01</s_tax_rate> <sep/><s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_cnt> 4 x</s_cnt><s_price> 18,00</s_price><s_tax_rate> %01</s_tax_rate></s_menu><s_sub_total><s_tax_price> 0,57</s_tax_price></s_sub_total><s_total><s_total_price> 57,75</s_total_price></s_total>
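The model returns a single tagged string, so downstream code usually needs it as a nested structure. The sketch below is one possible way to parse it, based only on the example output above: it assumes every field is wrapped as `<s_key>...</s_key>`, that repeated groups are separated by `<sep/>`, and that `menu` should always be a list. The names `parse_receipt` and `list_keys` are hypothetical, and malformed or truncated output (which is likely on long receipts) is silently skipped rather than repaired.

```python
import re

# Matches one <s_key>value</s_key> pair; the backreference \1 keeps
# the opening and closing tag names paired.
FIELD_RE = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def parse_receipt(text, list_keys=("menu",)):
    """Convert the tagged output string into nested dicts and lists."""
    result = {}
    for key, value in FIELD_RE.findall(text):
        if "<sep/>" in value or key in list_keys:
            # Repeated groups (e.g. line items) become a list of dicts.
            result[key] = [parse_receipt(part, list_keys)
                           for part in value.split("<sep/>")]
        elif "<s_" in value:
            # Nested fields become a nested dict.
            result[key] = parse_receipt(value, list_keys)
        else:
            result[key] = value.strip()
    return result


# The example output shown above.
sample = (
    "<s_store_name> SOK MARKETLER T.A.S.</s_store_name>"
    "<s_tax_id> 81301318199</s_tax_id><s_date> 21/02/2025</s_date>"
    "<s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_cnt> 1 x</s_cnt>"
    "<s_price> 39,75</s_price><s_tax_rate> %01</s_tax_rate> <sep/>"
    "<s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_cnt> 4 x</s_cnt>"
    "<s_price> 18,00</s_price><s_tax_rate> %01</s_tax_rate></s_menu>"
    "<s_sub_total><s_tax_price> 0,57</s_tax_price></s_sub_total>"
    "<s_total><s_total_price> 57,75</s_total_price></s_total>"
)

receipt = parse_receipt(sample)
print(receipt["store_name"], receipt["total"]["total_price"])
for item in receipt["menu"]:
    print(item["nm"], item["price"])
```

Values are kept as strings; converting amounts and dates is left to the caller, since the formats seen in practice may vary.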

Model size: 282M parameters (Safetensors, tensor type F32)