---
language:
- en
- id
- my
license: apache-2.0
tags:
- receipt-parsing
- information-extraction
- bart
- ocr
- text-to-text
datasets:
- dhiaznaidi/receiptdatasetssd300v2
library_name: transformers
pipeline_tag: text2text-generation
---

# BART-base Receipt Parser

## Model Description

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for receipt parsing. Given the OCR text of a receipt, it extracts three fields:

- **Date**: Transaction date from the receipt
- **Company Name**: Name of the merchant/store
- **Total Amount**: Final amount paid

## Dataset

The model was trained on the [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2) from Kaggle, which pairs receipt images with corresponding labels.

## Data Processing Pipeline

1. **OCR Processing**: All receipt images from the dataset were processed using [EasyOCR](https://github.com/JaidedAI/EasyOCR) to extract raw text
2. **Input-Output Mapping**: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
3. **Fine-tuning**: Supervised fine-tuning was performed on the facebook/bart-base model

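The input-output mapping in step 2 can be sketched roughly as follows. This is an illustration only, not the preprocessing code actually used for this model: the function name `build_training_pair`, the label keys (`date`, `company`, `total`), and the choice to join OCR fragments with newlines are all assumptions. The target string follows the three-field format described under Expected Output Format.

```python
from typing import Dict, List


def build_training_pair(ocr_fragments: List[str], label: Dict[str, str]) -> Dict[str, str]:
    """Turn OCR text fragments and a label record into one seq2seq example."""
    # Drop empty fragments and join the rest, one fragment per line (assumed layout).
    source = "\n".join(frag.strip() for frag in ocr_fragments if frag.strip())
    # Serialize the label in the three-field format the model is expected to emit.
    target = (
        f"Date: {label['date']}\n"
        f"Company Name: {label['company']}\n"
        f"Total Amount: {label['total']}"
    )
    return {"input_text": source, "target_text": target}


# Hypothetical example; in practice the fragments could come from
# easyocr.Reader(["en"]).readtext(image_path, detail=0).
pair = build_training_pair(
    ["SUPERMARKET ABC", " ", "Date: 2024-01-15", "Total: $10.25"],
    {"date": "2024-01-15", "company": "SUPERMARKET ABC", "total": "10.25"},
)
print(pair["target_text"])
```

Keeping the source and target as plain strings means the pair can be passed straight to a tokenizer, with `input_text` as the encoder input and `target_text` as the labels.
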
## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")

# Example usage
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""

# Tokenize input
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)

# Generate output
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True
)

# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Expected Output Format

The model outputs structured information in the following format:

```
Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
```

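Because the output is a plain string, downstream code usually needs to split it back into fields. The sketch below shows one possible way to do that with a regular expression. The function name `parse_receipt_output` and the dictionary keys are hypothetical, and the parser assumes the model reproduces the three labels exactly as shown above; fields that are missing or empty are returned as `None`.

```python
import re
from typing import Dict, Optional

# Field labels taken from the output format above; the dict keys are arbitrary.
FIELD_LABELS = {
    "date": "Date",
    "company_name": "Company Name",
    "total_amount": "Total Amount",
}


def parse_receipt_output(text: str) -> Dict[str, Optional[str]]:
    """Extract the three labeled fields; missing or empty fields map to None."""
    next_label = "|".join(re.escape(label) for label in FIELD_LABELS.values())
    parsed: Dict[str, Optional[str]] = {}
    for key, label in FIELD_LABELS.items():
        # Capture everything after "Label:" up to the next label or the end of
        # the text, so fields may be separated by newlines or by spaces.
        pattern = rf"{re.escape(label)}\s*:\s*(.*?)\s*(?=(?:{next_label})\s*:|$)"
        match = re.search(pattern, text, flags=re.DOTALL)
        value = match.group(1) if match else ""
        parsed[key] = value or None
    return parsed


example = "Date: 2024-01-15\nCompany Name: SUPERMARKET ABC\nTotal Amount: 10.25"
fields = parse_receipt_output(example)
print(fields)
```

Returning `None` for absent fields, rather than raising, lets callers decide how to handle partially parsed receipts.
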
## Training Details

- **Base Model**: facebook/bart-base
- **Task**: Text-to-Text Generation (Receipt Information Extraction)
- **Training Data**: OCR-processed receipt text with labeled ground truth
- **Data Source**: [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2)
- **OCR Tool**: EasyOCR

## Limitations

- Extraction quality depends on OCR quality; characters misread in the input can carry over into the extracted fields
- The model was trained only on receipts in the format and style of the training dataset
- Receipts with significantly different formats or languages may require additional fine-tuning

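Given these limitations, a cheap sanity check is to confirm that each extracted value literally appears in the OCR text it was generated from. The sketch below is one possible guard, not part of the model or its training setup: the function names and field keys are hypothetical, and matching is done after lowercasing and removing whitespace. A value that the model has legitimately reformatted (for example, a date written in a different order) would also be flagged, so flagged fields are candidates for review rather than confirmed errors.

```python
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and remove all whitespace so spacing differences are ignored."""
    return "".join(text.lower().split())


def unsupported_fields(ocr_text: str, fields: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of extracted fields whose value is not found in the OCR text."""
    haystack = normalize(ocr_text)
    return [
        name
        for name, value in fields.items()
        if value and normalize(value) not in haystack
    ]


ocr_text = "SUPERMARKET ABC\nDate: 2024-01-15\nTotal: $10.25"
extracted = {
    "date": "2024-01-15",
    "company_name": "Supermarket ABC",
    "total_amount": "10.52",  # digits transposed relative to the receipt
}
flagged = unsupported_fields(ocr_text, extracted)
print(flagged)
```
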
## Use Cases

- Automated receipt processing for expense management
- Financial document digitization
- Retail analytics and data extraction
- Accounting automation

## Citation

If you use this model, please cite the original dataset:

```bibtex
@dataset{dhiaznaidi2024receipt,
  title={Receipt Dataset SSD300 V2},
  author={Dhiaz Naidi},
  year={2024},
  url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}
```