Smart Receipt Reader: Automatic Information Extraction with LayoutLMv2 (CORD-v2)

Overview and Project Contribution

This project aims to automatically and intelligently extract structured information from receipt images using the naver-clova-ix/cord-v2 dataset. For this purpose, a state-of-the-art multimodal model, microsoft/layoutlmv2-base-uncased, was fine-tuned and achieved strong results.

So, Why This Model Instead of Just Using OCR? What is the Advantage of Our Model?

This project provides a practical answer to the frequently asked question, "If OCR (Optical Character Recognition) is already being used, what is the additional benefit of such a model?":

  1. Limitations of OCR:

    • OCR converts the text in an image into digital characters. That is, it reads strings such as "Toast" or "35.50" on the receipt.
    • However, OCR cannot tell what that text means (its semantic role) or where it sits within the overall structure of the receipt (its context). Is "35.50" a product price, a total amount, or a date? OCR alone cannot distinguish this.
    • Traditionally, this information is extracted after OCR with complex, fragile rule-based systems or with templates that must be prepared separately for each receipt format. These methods break with even small changes in the receipt layout and are difficult to maintain.
  2. Contribution of Our LayoutLMv2 Model and the Points It Addresses:

    • Semantic Understanding: Our trained LayoutLMv2 model evaluates the text from OCR, the positions of that text on the receipt (bounding boxes), and the visual features of the receipt together. In this way, it understands that the word "Toast" is a menu.nm (product name) and that the number "35.50" is a total.total_price (total price), and labels them accordingly (a concrete contrast is sketched after this list).
    • Structural Information Extraction: The model learns the general layout of the receipt and accurately extracts which information belongs to which field (e.g., which price belongs to which product).
    • Flexibility and Robustness: It is much more robust to changes in different receipt formats, fonts, and text positions compared to rule-based systems. Instead of manually writing rules for new formats, the model learns from data and generalizes.
    • Automation and Efficiency: This project eliminates manual data entry from receipts, saving time and money, and reducing human error. It can be used in many areas such as expense tracking, accounting automation, and retail analytics.
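
To make this contrast concrete, here is roughly what plain OCR returns versus what the fine-tuned model adds. The values are illustrative only; the field names follow the CORD label schema mentioned above.

# What OCR alone produces: a flat list of strings with no semantic roles.
ocr_output = ["Toast", "35.50"]

# What the fine-tuned LayoutLMv2 model adds: a semantic label per text piece.
labeled_output = {
    "Toast": "menu.nm",            # product name
    "35.50": "total.total_price",  # total price
}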

What Has This Project Achieved? This project has developed an intelligent system capable of extracting structured data from complex and variable receipt images with high accuracy (a 95%+ F1 score) and minimal human intervention. It offers a modern artificial intelligence solution where traditional methods fall short.

Dataset

  • Name: CORD (Consolidated Receipt Dataset for Post-OCR Parsing)
  • Source: naver-clova-ix/cord-v2 (Hugging Face Datasets)
  • Content: Receipt images together with structured JSON labels and bounding boxes for the text in those images. In this project, 13,500 samples were used for training and 1,500 for validation (see the loading snippet below).
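
The dataset can be inspected directly from the Hub; a minimal sketch (field names as in the public cord-v2 release, where the labels are stored as a JSON string):

from datasets import load_dataset
import json

# Load CORD-v2 and peek at one training sample.
dataset = load_dataset("naver-clova-ix/cord-v2")
print(dataset)  # available splits and their sizes

sample = dataset["train"][0]
print(sample["image"].size)                        # PIL image of the receipt
ground_truth = json.loads(sample["ground_truth"])  # structured labels stored as a JSON string
print(list(ground_truth.keys()))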

Model

  • Base Model: microsoft/layoutlmv2-base-uncased
  • Task: Token Classification - assigning each text piece (token) in a receipt to a predefined category (e.g., menu.nm, total.total_price); the full label map can be inspected as shown below.
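
The set of categories the fine-tuned classification head predicts can be read from the checkpoint's configuration; a small sketch, assuming the label map was saved with the model (the usual Transformers behaviour):

from transformers import LayoutLMv2ForTokenClassification

# Load the fine-tuned checkpoint and inspect its label map.
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "ogulcanakca/layoutlmv2-base-uncased-finetuned-cordv2-receipts"
)
print(model.config.num_labels)                    # number of token categories
print(list(model.config.id2label.values())[:10])  # e.g. menu.nm, total.total_price, ...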

Key Features and Approach

  • The LayoutLMv2 model was trained for 10 epochs on the CORD-v2 dataset to label each text piece (token) in receipts with the relevant category.
  • Multimodal Learning: The model uses text information, the geometric position of the text in the document, and the visual features of the document together for semantic inference.
  • Data preprocessing steps include converting the texts, bounding boxes (normalized to the 0-1000 range; a helper is sketched after this list), and images into a format suitable for the model.
  • Hugging Face Transformers, Datasets, and Accelerate libraries were used for training and evaluation.
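
The 0-1000 bounding-box normalization mentioned above follows the standard LayoutLMv2 convention; a minimal helper, assuming pixel-space boxes in [x_min, y_min, x_max, y_max] order:

def normalize_box(box, width, height):
    """Scale a pixel-space [x_min, y_min, x_max, y_max] box to the 0-1000 range LayoutLMv2 expects."""
    x_min, y_min, x_max, y_max = box
    return [
        int(1000 * x_min / width),
        int(1000 * y_min / height),
        int(1000 * x_max / width),
        int(1000 * y_max / height),
    ]

# Example: a word box on an 800x1200 px receipt image
print(normalize_box([100, 200, 220, 240], width=800, height=1200))  # -> [125, 166, 275, 200]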

Performance (After 10 Epochs)

The model showed the following performance on the CORD-v2 test set:

  • Overall F1-Score (Weighted Avg): 0.9575 (approximately 96%)
  • Overall Precision (Weighted Avg): 0.9582 (approximately 96%)
  • Overall Recall (Weighted Avg): 0.9567 (approximately 96%)
  • Overall Accuracy: 0.9690 (approximately 97%)
  • Macro Avg F1-Score: 0.80 (Reflects lower performance in rare classes)

Best validation set performance (Epoch 10):

  • eval_f1: 0.9739
  • eval_accuracy: 0.9767
  • eval_loss: 0.1475

(See the training logs and the test-set evaluation report for detailed class-based metrics; a sketch of how such a report is produced follows.)
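
For reference, weighted and macro averages of this kind can be computed from flattened token-level predictions with a standard classification report; a minimal sketch with made-up toy labels (not the exact evaluation script used for this card):

from sklearn.metrics import classification_report

# Toy example: flattened token-level label strings with padding/special tokens already removed.
y_true = ["menu.nm", "menu.nm", "menu.price", "total.total_price", "others"]
y_pred = ["menu.nm", "menu.nm", "menu.price", "total.total_price", "menu.price"]

# The report includes per-class precision/recall/F1 plus the weighted and macro averages
# quoted above.
print(classification_report(y_true, y_pred, digits=4, zero_division=0))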

Usage (Example)

This model can be loaded directly from Hugging Face Hub. The following example Python code shows how to load the model, prepare the necessary inputs for a new receipt image (including OCR output), and extract information:

from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification
from PIL import Image
import torch

# 1. Load Your Model and Processor from Hugging Face Hub
MODEL_ID = "ogulcanakca/layoutlmv2-base-uncased-finetuned-cordv2-receipts"
try:
    processor = LayoutLMv2Processor.from_pretrained(MODEL_ID)
    model = LayoutLMv2ForTokenClassification.from_pretrained(MODEL_ID)
    print(f"'{MODEL_ID}' loaded successfully.")
except Exception as e:
    print(f"Error: Model or Processor could not be loaded. Check Model ID or your internet connection: {e}")
    # It might be better for the script not to continue in this case.
    raise

# Set the model to evaluation mode and move to the appropriate device
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Model moved to '{device}' and is in evaluation mode.")

# 2. Prepare the New Receipt Image and OCR Output
#    !!! THIS STEP IS MANDATORY: You must run an OCR engine for your own receipt image
#    to obtain 'words' and 'boxes' lists !!!

image_path = "path/to/your/receipt_image.jpg"  # <<< ENTER THE PATH TO YOUR OWN RECEIPT IMAGE HERE
try:
    image_pil = Image.open(image_path).convert("RGB")
    img_width, img_height = image_pil.size
    print(f"'{image_path}' loaded successfully (Size: {img_width}x{img_height}).")
except FileNotFoundError:
    print(f"ERROR: Receipt image '{image_path}' not found. Please enter a valid path.")
    # It would be good to stop here if the example doesn't work.
    exit()

# EXAMPLE OCR OUTPUT (MUST BE REPLACED with the output of a real OCR engine):
# These 'words' and 'boxes' lists should belong to the 'image_pil' image above.
example_words = ["Market", "Receipt", "1", "Piece", "Bread", "5.00", "TOTAL", "5.00"]
# Each box in the 'boxes' list should be in [x_min, y_min, x_max, y_max] format
# and normalized to the 0-1000 range.
# Example: norm_box = [int(1000*x1/img_width), int(1000*y1/img_height), int(1000*x2/img_width), int(1000*y2/img_height)]
example_boxes = [
    [70, 50, 180, 75], [65, 80, 190, 105], [80, 120, 100, 140], [110, 120, 180, 140],
    [80, 145, 190, 165], [300, 145, 380, 165], [75, 180, 200, 205], [300, 180, 380, 205]
]
# Please update the example_words and example_boxes above with your own OCR output.
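
# --- Optional, not part of the original example: obtaining 'words' and 'boxes' with pytesseract ---
# A hedged sketch; it assumes pytesseract and the Tesseract binary are installed.
USE_PYTESSERACT = False  # set to True to replace the example OCR output above
if USE_PYTESSERACT:
    import pytesseract
    ocr = pytesseract.image_to_data(image_pil, output_type=pytesseract.Output.DICT)
    example_words, example_boxes = [], []
    for text, x, y, w, h in zip(ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]):
        if text.strip():
            example_words.append(text)
            example_boxes.append([
                int(1000 * x / img_width), int(1000 * y / img_height),
                int(1000 * (x + w) / img_width), int(1000 * (y + h) / img_height),
            ])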

# 3. Bring the Data to the Format Expected by the Model (with Processor)
# Similar to the `preprocess_data_cord_separated` function in training:
tokenized_inputs = processor.tokenizer(
    text=[example_words],   # list within a list, for batching
    boxes=[example_boxes],  # list within a list, for batching
    padding="max_length",
    truncation=True,
    max_length=512, # Same max_length as in training
    return_token_type_ids=True,
    return_attention_mask=True,
    return_tensors="pt"
)
# (In newer transformers versions, processor.image_processor is the preferred name for this attribute.)
image_features = processor.feature_extractor(images=[image_pil], return_tensors="pt")

inputs = {
    'input_ids': tokenized_inputs.input_ids.to(device),
    'attention_mask': tokenized_inputs.attention_mask.to(device),
    'token_type_ids': tokenized_inputs.token_type_ids.to(device),
    'bbox': tokenized_inputs.bbox.to(device),
    'image': image_features.pixel_values.to(device) # LayoutLMv2 expects the visual input under the 'image' key
}

# 4. Make Predictions with the Model
print("\nMaking predictions with the model...")
with torch.no_grad(): # Disable gradient calculation
    outputs = model(**inputs)

logits = outputs.logits
predicted_ids_tensor = torch.argmax(logits, dim=2)
predicted_ids_list = predicted_ids_tensor[0].cpu().tolist() # For the first (and only) example

# 5. Interpret the Predictions
id2label = model.config.id2label # Get the label map from the model's configuration

# To match words with labels (tokens need to be mapped to original words)
# For a simple demonstration, let's print the predicted label for each token
# (Excluding special tokens and padding)
input_tokens = processor.tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].cpu().tolist())

print("\nPredicted Labels:")
extracted_info = {}
current_word_tokens = ""
current_word_label = ""
# This part can be done more robustly using word IDs: with the fast tokenizer,
# tokenized_inputs.word_ids() maps each token back to its source word
# (see the word-level alignment sketch after this example).
# The following is a simplified approach and may not combine sub-tokens correctly.
for token_str, pred_id in zip(input_tokens, predicted_ids_list):
    # Skip special tokens and padding (the -100 sentinel only appears in training labels, not in predictions).
    if token_str not in (processor.tokenizer.cls_token, processor.tokenizer.sep_token, processor.tokenizer.pad_token):
        label_str = id2label[pred_id]
        print(f"Token: {token_str:<20} => Label: {label_str}")
        # Simple grouping (more advanced post-processing may be needed)
        if label_str != "others": # Let's ignore the "others" label
            if label_str not in extracted_info:
                extracted_info[label_str] = []
            extracted_info[label_str].append(token_str.replace("##", "")) # strip the WordPiece "##" prefix

print("\nExtracted Information (Simple Grouping):")
for label, texts in extracted_info.items():
    print(f"{label}: {' '.join(texts)}")

Training Hyperparameters

  • Learning Rate: 5e-5
  • Number of Training Epochs: 10
  • Per Device Train Batch Size: 2
  • Per Device Eval Batch Size: 2
  • Gradient Accumulation Steps: 1
  • Warmup Ratio: 0.1
  • Weight Decay: 0.01
  • Optimizer: AdamW
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-8
  • LR Scheduler Type: linear
  • Mixed Precision: weights published in FP32 (the training configuration listed fp16=True & bf16=True)
  • Seed for Reproducibility: 42
  • Max Sequence Length: 512
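
For reference, the values above map onto Hugging Face TrainingArguments roughly as follows; this is a sketch rather than the exact training script, and output_dir is a placeholder:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="layoutlmv2-cordv2-receipts",  # placeholder
    learning_rate=5e-5,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=1,
    warmup_ratio=0.1,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    seed=42,
    # Mixed-precision flags are omitted here; the published weights are FP32.
)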

Environment Information

  • model.safetensors (or pytorch_model.bin): ~802 MB
  • Dataset: CORD-v2 (naver-clova-ix/cord-v2) - 13,500 training examples
  • GPU: NVIDIA P100 (on Kaggle)
  • Total Training Time (for 10 epochs): Approximately 34 minutes 15 seconds
  • Inference Speed (Indicative): approximately 8.17 samples per second with Trainer.predict() on the test set (NVIDIA P100)

Citation

@misc{ogulcanakca_layoutlmv2_cordv2_receipts_2025,
  author = {Oğulcan Akca},
  title = {Fine-tuned LayoutLMv2 for Receipt Information Extraction on CORD-v2},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/ogulcanakca/layoutlmv2-base-uncased-finetuned-cordv2-receipts}}
}
