---
language: 
- en
- id
- my
license: apache-2.0
tags:
- receipt-parsing
- information-extraction
- bart
- ocr
- text-to-text
datasets:
- dhiaznaidi/receiptdatasetssd300v2
library_name: transformers
pipeline_tag: text2text-generation
---

# BART-base Receipt Parser

## Model Description

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for receipt parsing. Given the OCR text of a receipt, it is trained to extract three fields:

- **Date**: Transaction date from the receipt
- **Company Name**: Name of the merchant/store
- **Total Amount**: Final amount paid

## Dataset

The model was trained using the [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2) from Kaggle, which contains receipt images with corresponding labels.

## Data Processing Pipeline

1. **OCR Processing**: All receipt images from the dataset were processed using [EasyOCR](https://github.com/JaidedAI/EasyOCR) to extract raw text
2. **Input-Output Mapping**: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
3. **Fine-tuning**: Supervised fine-tuning was performed on the facebook/bart-base model
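
The OCR step above can be sketched as follows. This is a minimal illustration, not the exact preprocessing used for training: the card does not state which EasyOCR languages or settings were used or how the recognized fragments were joined, so the `["en"]` default and the newline join are assumptions, and `ocr_lines_to_text` and `image_to_text` are hypothetical helper names.

```python
from typing import Iterable, List, Optional


def ocr_lines_to_text(lines: Iterable[str]) -> str:
    """Join OCR text fragments into one newline-separated string, dropping blanks.

    Assumption: one fragment per line; the training pipeline may have joined
    fragments differently.
    """
    cleaned = [line.strip() for line in lines]
    return "\n".join(line for line in cleaned if line)


def image_to_text(image_path: str, languages: Optional[List[str]] = None) -> str:
    """Run EasyOCR on a receipt image and return the recognized text."""
    # Imported lazily so the pure-Python helper above works without EasyOCR installed.
    import easyocr

    # Assumption: English only; pass other EasyOCR language codes as needed.
    reader = easyocr.Reader(languages or ["en"])
    # detail=0 makes readtext return only the recognized strings.
    fragments = reader.readtext(image_path, detail=0)
    return ocr_lines_to_text(fragments)
```

The string returned by `image_to_text` can then be passed to the tokenizer in place of a hand-typed `receipt_text`.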

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")

# Example usage
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""

# Tokenize input
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)

# Generate output
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True
)

# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Expected Output Format

The model outputs structured information in the following format:
```
Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
```
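
For downstream use, the generated string usually needs to be turned into structured data. The sketch below shows one way to do that, assuming the model emits the three labels exactly as written above; `parse_receipt_output` is a hypothetical helper, not part of this model or of `transformers`. Fields the model omits come back as `None`.

```python
import re
from typing import Dict, Optional

# Labels taken from the output format described in this card.
FIELD_LABELS = ("Date", "Company Name", "Total Amount")


def parse_receipt_output(text: str) -> Dict[str, Optional[str]]:
    """Split the model's generated text into a dict keyed by field label."""
    fields: Dict[str, Optional[str]] = {label: None for label in FIELD_LABELS}
    labels = "|".join(re.escape(label) for label in FIELD_LABELS)
    # Capture everything after "Label:" up to the next known label or the end
    # of the text, so the fields may sit on one line or on separate lines.
    pattern = re.compile(
        rf"({labels})\s*:\s*(.*?)(?=(?:{labels})\s*:|$)", re.DOTALL
    )
    for match in pattern.finditer(text):
        value = match.group(2).strip()
        fields[match.group(1)] = value or None
    return fields
```

For example, `parse_receipt_output(result)` can be called on the decoded string from the usage snippet. Because generated text is not guaranteed to follow the template, callers should check for `None` values before relying on a field.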

## Training Details

- **Base Model**: facebook/bart-base
- **Task**: Text-to-Text Generation (Receipt Information Extraction)
- **Training Data**: OCR-processed receipt text with labeled ground truth
- **Data Source**: [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2)
- **OCR Tool**: EasyOCR

## Limitations

- Performance may vary depending on OCR quality
- Trained specifically on the format and style of receipts in the training dataset
- May require additional fine-tuning for receipts with significantly different formats or languages

## Use Cases

- Automated receipt processing for expense management
- Financial document digitization
- Retail analytics and data extraction
- Accounting automation

## Citation

If you use this model, please cite the original dataset:

```bibtex
@dataset{dhiaznaidi2024receipt,
  title={Receipt Dataset SSD300 V2},
  author={Dhiaz Naidi},
  year={2024},
  url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}
```