---
language:
- en
- id
- my
license: apache-2.0
tags:
- receipt-parsing
- information-extraction
- bart
- ocr
- text-to-text
datasets:
- dhiaznaidi/receiptdatasetssd300v2
library_name: transformers
pipeline_tag: text2text-generation
---
# BART-base Receipt Parser
## Model Description
This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for receipt parsing. It takes OCR-extracted receipt text as input and generates three fields:
- **Date**: Transaction date from the receipt
- **Company Name**: Name of the merchant/store
- **Total Amount**: Final amount paid
## Dataset
The model was trained using the [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2) from Kaggle, which contains receipt images with corresponding labels.
## Data Processing Pipeline
1. **OCR Processing**: All receipt images from the dataset were processed using [EasyOCR](https://github.com/JaidedAI/EasyOCR) to extract raw text
2. **Input-Output Mapping**: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
3. **Fine-tuning**: Supervised fine-tuning was performed on the facebook/bart-base model
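The first two steps can be sketched roughly as follows. This is a minimal illustration, not the script used to train this model: the helper names `ocr_lines_to_text`, `build_target`, and `run_easyocr`, the top-to-bottom reading order, and the newline-joined target layout are all assumptions. The only real library call is `easyocr.Reader(...).readtext(...)`, which is imported lazily so the pure-Python helpers work without EasyOCR installed.

```python
from typing import List, Sequence, Tuple

# One EasyOCR detection: (bounding box of four [x, y] points, text, confidence).
Detection = Tuple[Sequence[Sequence[float]], str, float]


def ocr_lines_to_text(detections: List[Detection]) -> str:
    """Join detections into one string, top-to-bottom then left-to-right.

    Assumption: the reading order used for training is not documented;
    sorting by the top-left corner is just one plausible choice.
    """
    ordered = sorted(detections, key=lambda d: (d[0][0][1], d[0][0][0]))
    return "\n".join(text.strip() for _, text, _ in ordered if text.strip())


def build_target(date: str, company: str, total: str) -> str:
    """Format labels in the three-field layout this card documents (assumed separator)."""
    return f"Date: {date}\nCompany Name: {company}\nTotal Amount: {total}"


def run_easyocr(image_path: str, languages=("en",)) -> List[Detection]:
    """Run EasyOCR on one image (requires `pip install easyocr`)."""
    import easyocr  # imported lazily because it is a heavy dependency

    reader = easyocr.Reader(list(languages))
    return reader.readtext(image_path)
```

Under these assumptions, `ocr_lines_to_text(run_easyocr("receipt.jpg"))` would produce the model input for one image (the file name is hypothetical), and `build_target(...)` the matching training target.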
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")
# Example usage
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""
# Tokenize input
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)
# Generate output
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True,
)
# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
## Expected Output Format
The model outputs structured information in the following format:
```
Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
```
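For downstream use, the generated string usually needs to be turned back into structured data. The sketch below parses the format above into a dictionary. The function name `parse_receipt_output` and the dictionary keys are hypothetical, and the parser assumes the model emits the three labels verbatim; it tolerates the fields arriving on one line or on separate lines, and returns `None` for any field that is missing.

```python
import re
from typing import Dict, Optional

# Field labels from the format above, mapped to hypothetical dict keys.
FIELDS = {
    "Date": "date",
    "Company Name": "company_name",
    "Total Amount": "total_amount",
}


def parse_receipt_output(text: str) -> Dict[str, Optional[str]]:
    """Extract the three labeled fields; missing fields come back as None."""
    labels = "|".join(re.escape(label) for label in FIELDS)
    # Match "Label: value", where value runs until the next label or the end.
    pattern = re.compile(
        rf"({labels})\s*:\s*(.*?)(?=\s*(?:{labels})\s*:|\Z)", re.S
    )
    parsed: Dict[str, Optional[str]] = {key: None for key in FIELDS.values()}
    for label, value in pattern.findall(text):
        key = FIELDS[label]
        # Keep the first non-empty value seen for each field.
        if parsed[key] is None and value.strip():
            parsed[key] = value.strip()
    return parsed
```

Calling `parse_receipt_output(result)` on the decoded string from the usage example would then give a dictionary with `date`, `company_name`, and `total_amount` entries. Values are kept as raw strings, since the card does not specify date or currency normalization.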
## Training Details
- **Base Model**: facebook/bart-base
- **Task**: Text-to-Text Generation (Receipt Information Extraction)
- **Training Data**: OCR-processed receipt text with labeled ground truth
- **Data Source**: [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2)
- **OCR Tool**: EasyOCR
## Limitations
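- Extraction quality depends on OCR quality: the model only sees the text produced by OCR, so misread characters can carry over into the extracted fields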
- Trained specifically on the format and style of receipts in the training dataset
- May require additional fine-tuning for receipts with significantly different formats or languages
## Use Cases
- Automated receipt processing for expense management
- Financial document digitization
- Retail analytics and data extraction
- Accounting automation
## Citation
If you use this model, please cite the original dataset:
```bibtex
@dataset{dhiaznaidi2024receipt,
title={Receipt Dataset SSD300 V2},
author={Dhiaz Naidi},
year={2024},
url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}
```