Model Details
This model is a fine-tuned version of google/pix2struct-base on a private Turkish receipt dataset. It extracts structured fields such as:
- Mağaza adı (store name)
- Toplam tutar (total amount)
- Tarih (date)
- Ürünler (line items)

Intended uses:
- Document understanding for Turkish receipts
- Key information extraction from scanned or photographed receipts
- Internal automation, ERP pre-filling, or financial apps
⚠️ Limitations
This model is an experimental prototype, fine-tuned for Turkish receipt extraction using Pix2Struct on a custom dataset of approximately 1,100 receipts.
While it performs reasonably well on short and clean receipts, it has notable limitations:
- ❌ Performance degrades significantly on long or complex receipts
- ⚠️ GPU usage is relatively high due to the nature of the Pix2Struct architecture
- ❌ Not optimized for real-time or production-level use cases
- More robust and efficient alternatives exist for production receipt extraction.
During training, various learning rates were tested, but it's possible that suboptimal hyperparameter tuning (especially of the learning rate and generation strategy) affected generalization.
This may partially explain its weaker performance on longer receipts.
Additionally:
- ⚠️ The training data included high-frequency entities, such as the same store name (e.g. "A101") appearing repeatedly. This may have caused the model to hallucinate common store names even when they are not present in the image.
- ⚠️ Overfitting to dominant patterns in the dataset (e.g. fixed receipt templates) may also reduce generalization.
We recommend using this model primarily for research, benchmarking, or experimentation purposes.
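Because the model can hallucinate frequent values, one cheap safeguard is to check that the extracted line-item prices add up to the extracted total before trusting a result. The sketch below assumes amounts use the Turkish format (`.` as thousands separator, `,` as decimal separator) and that each line-item price is already the line total, as in the example output under How to Use. `parse_tr_amount` and `totals_consistent` are hypothetical helpers, not part of this model or of `transformers`.

```python
from decimal import Decimal, InvalidOperation


def parse_tr_amount(text: str) -> Decimal:
    """Parse a Turkish-formatted amount such as '57,75' or '1.234,50' into a Decimal."""
    # Assumption: '.' is the thousands separator and ',' the decimal separator.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not an amount: {text!r}") from exc


def totals_consistent(item_prices, total, tolerance=Decimal("0.01")) -> bool:
    """Return True if the extracted line-item prices sum to the extracted total.

    Assumes each item price already includes its quantity and that the receipt
    has no separate discounts or surcharges (hypothetical simplification).
    """
    items_sum = sum((parse_tr_amount(p) for p in item_prices), Decimal("0"))
    return abs(items_sum - parse_tr_amount(total)) <= tolerance


# Values taken from the example output in How to Use
print(totals_consistent(["39,75", "18,00"], "57,75"))  # True
print(totals_consistent(["39,75"], "57,75"))           # False: an item is missing
```

A failed check does not say which field is wrong, only that the result should be reviewed; receipts with discounts will need a more complete rule.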
Metrics
The model was evaluated on a private Turkish receipt test set.
| Metric | Score |
|---|---|
| val_edit_distance (lower is better) | 0.112 |
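The card does not define `val_edit_distance`. A common choice for this kind of tag-sequence extraction is the character-level Levenshtein distance between prediction and reference, divided by the length of the longer string and averaged over the evaluation set. The sketch below computes that quantity under this assumption; it is not necessarily how the reported 0.112 was obtained, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when characters match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, target: str) -> float:
    """Edit distance scaled by the longer string's length: 0.0 is an exact match."""
    longest = max(len(pred), len(target))
    return levenshtein(pred, target) / longest if longest else 0.0


def mean_normalized_edit_distance(preds, targets) -> float:
    """Average normalized edit distance over paired predictions and references."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equally sized, non-empty lists")
    return sum(normalized_edit_distance(p, t) for p, t in zip(preds, targets)) / len(preds)


print(levenshtein("kitten", "sitting"))  # 3
print(mean_normalized_edit_distance(["abc", "abd"], ["abc", "abc"]))
```

Normalizing by the longer string keeps each score in [0, 1], so long receipts do not dominate the average simply by having more characters.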
How to Use
```python
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from PIL import Image
import torch

model_id = "turgutguvercin/pix2struct-turkish-receipts"

# Load model and processor
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
processor = Pix2StructProcessor.from_pretrained(model_id)

# Put model in eval mode and move it to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Load your image (make sure it's in RGB)
image_path = "test_1.jpg"
image = Image.open(image_path).convert("RGB")

# Preprocess image into flattened patches
inputs = processor(images=image, return_tensors="pt").to(device)

# Generate prediction
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=768,
        num_beams=1,  # greedy decoding (no beam search) for faster processing
        do_sample=False,
        use_cache=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Decode tokens
predicted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(predicted_text)
# Example output:
# <s_store_name> SOK MARKETLER T.A.S.</s_store_name><s_tax_id> 81301318199</s_tax_id><s_date> 21/02/2025</s_date><s_menu><s_nm> MIS UHT SUT LAKTOZSU</s_nm><s_cnt> 1 x</s_cnt><s_price> 39,75</s_price><s_tax_rate> %01</s_tax_rate> <sep/><s_nm> LEZZCAFE SALEP 17 GR</s_nm><s_cnt> 4 x</s_cnt><s_price> 18,00</s_price><s_tax_rate> %01</s_tax_rate></s_menu><s_sub_total><s_tax_price> 0,57</s_tax_price></s_sub_total><s_total><s_total_price> 57,75</s_total_price></s_total>
```
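The model returns a flat tag sequence (resembling the Donut/CORD format) rather than JSON. To use the result programmatically, it can be converted into a nested dictionary. The following is a minimal sketch that assumes the `<s_key>value</s_key>` and `<sep/>` conventions visible in the example output above; `parse_receipt` is a hypothetical helper, not part of `transformers` or this model's repository.

```python
import re

# Opening tags look like <s_store_name>; closing tags like </s_store_name>.
_OPEN_TAG = re.compile(r"<s_([a-z0-9_]+)>")


def parse_receipt(seq: str) -> dict:
    """Convert a <s_key>value</s_key> sequence into a nested dict.

    A value containing further tags becomes a dict; <sep/> inside such a value
    splits it into a list of dicts (e.g. the line items under "menu").
    """
    result = {}
    pos = 0
    while (m := _OPEN_TAG.search(seq, pos)) is not None:
        key = m.group(1)
        close = f"</s_{key}>"
        end = seq.find(close, m.end())
        if end == -1:
            break  # unclosed tag, e.g. generation stopped at max_length
        inner = seq[m.end():end]
        if "<s_" in inner:
            items = [parse_receipt(part) for part in inner.split("<sep/>")]
            result[key] = items if len(items) > 1 else items[0]
        else:
            result[key] = inner.strip()
        pos = end + len(close)
    return result


sample = ("<s_store_name> SOK MARKETLER T.A.S.</s_store_name>"
          "<s_total><s_total_price> 57,75</s_total_price></s_total>")
print(parse_receipt(sample))
# {'store_name': 'SOK MARKETLER T.A.S.', 'total': {'total_price': '57,75'}}
```

A receipt with a single line item comes back as a dict under `menu` rather than a one-element list, so normalize that case if downstream code always expects a list. Stopping at the first unclosed tag keeps the fields parsed so far, which matters for long receipts where generation may be cut off.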