farhamu/bart-base-receipt-parser-1

	---
	language:
	- en
	- id
	- my
	license: apache-2.0
	tags:
	- receipt-parsing
	- information-extraction
	- bart
	- ocr
	- text-to-text
	datasets:
	- dhiaznaidi/receiptdatasetssd300v2
	library_name: transformers
	pipeline_tag: text2text-generation
	---

	# BART-base Receipt Parser

	## Model Description

	This model is [facebook/bart-base](https://huggingface.co/facebook/bart-base) fine-tuned for receipt parsing. Given the OCR text of a receipt, it extracts three fields:

	- Date: Transaction date from the receipt
	- Company Name: Name of the merchant/store
	- Total Amount: Final amount paid

	## Dataset

	The model was trained using the [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2) from Kaggle, which contains receipt images with corresponding labels.

	## Data Processing Pipeline

	1. OCR Processing: All receipt images from the dataset were processed using [EasyOCR](https://github.com/JaidedAI/EasyOCR) to extract raw text
	2. Input-Output Mapping: The extracted OCR text serves as input, while the labeled data from the Kaggle dataset serves as the target output
	3. Fine-tuning: Supervised fine-tuning was performed on the facebook/bart-base model
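Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the author's preprocessing script. The helper names `ocr_results_to_text` and `build_training_pair`, the top-to-bottom ordering of OCR lines, the newline-separated target, and the label keys (`date`, `company`, `total`) are all assumptions made for the example. The input is assumed to have the shape returned by EasyOCR's `Reader.readtext`: a list of `(bounding_box, text, confidence)` entries, where the bounding box is four `[x, y]` corners starting at the top-left.

```python
from typing import Dict, List, Sequence, Tuple

# One OCR detection: (four [x, y] corners starting top-left, text, confidence).
Detection = Tuple[Sequence[Sequence[float]], str, float]


def ocr_results_to_text(results: List[Detection], min_confidence: float = 0.0) -> str:
    """Join OCR detections into one string, ordered top-to-bottom, then left-to-right."""
    kept = [r for r in results if r[2] >= min_confidence]
    # Sort by the top-left corner: y coordinate first, then x coordinate.
    kept.sort(key=lambda r: (r[0][0][1], r[0][0][0]))
    return "\n".join(text.strip() for _, text, _ in kept if text.strip())


def build_training_pair(results: List[Detection], label: Dict[str, str]) -> Dict[str, str]:
    """Pair the OCR text (model input) with the labeled fields (target output)."""
    target = (
        f"Date: {label['date']}\n"
        f"Company Name: {label['company']}\n"
        f"Total Amount: {label['total']}"
    )
    return {"input_text": ocr_results_to_text(results), "target_text": target}


# Hypothetical detections standing in for easyocr.Reader(["en"]).readtext(image_path).
sample_results = [
    ([[10, 60], [120, 60], [120, 80], [10, 80]], "Total: $10.25", 0.97),
    ([[10, 10], [150, 10], [150, 30], [10, 30]], "SUPERMARKET ABC", 0.99),
    ([[10, 35], [130, 35], [130, 55], [10, 55]], "Date: 2024-01-15", 0.95),
]
# Hypothetical label record; the real dataset's label schema may differ.
sample_label = {"date": "2024-01-15", "company": "SUPERMARKET ABC", "total": "10.25"}

pair = build_training_pair(sample_results, sample_label)
print(pair["input_text"])
print(pair["target_text"])
```

Sorting by the top-left corner is a simple heuristic and can misorder text on skewed or multi-column receipts.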

	## Usage

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	# Load model and tokenizer (repository ID as shown on the model page)
	model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-1")
	tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-1")

	# Example usage
	receipt_text = """
	SUPERMARKET ABC
	123 Main Street
	City, State 12345
	Date: 2024-01-15
	Item 1: $5.99
	Item 2: $3.50
	Tax: $0.76
	Total: $10.25
	Thank you for shopping!
	"""

	# Tokenize input
	inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)

	# Generate output
	outputs = model.generate(
	    **inputs,
	    max_length=150,
	    num_beams=4,
	    early_stopping=True,
	)

	# Decode result
	result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(result)
	```

	## Expected Output Format

	The model outputs structured information in the following format:
	```
	Date: [extracted_date]
	Company Name: [extracted_company_name]
	Total Amount: [extracted_total_amount]
	```
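Downstream code usually wants these fields as a dictionary rather than a single string. The sketch below shows one way to parse the format above. The function name `parse_receipt_output` is hypothetical, and the parser assumes the three labels appear literally as shown; it tolerates output where the fields sit on one line instead of three, and returns `None` for a field that is missing or empty.

```python
import re
from typing import Dict, Optional

FIELD_NAMES = ("Date", "Company Name", "Total Amount")


def parse_receipt_output(text: str) -> Dict[str, Optional[str]]:
    """Extract the three labeled fields from a generated string."""
    # One alternation of the field labels, used both to find a field and to
    # detect where the next field starts (the output may lack newlines).
    labels = "|".join(re.escape(name) for name in FIELD_NAMES)
    pattern = re.compile(rf"({labels}):\s*(.*?)\s*(?=(?:{labels}):|$)", re.DOTALL)

    fields: Dict[str, Optional[str]] = {name: None for name in FIELD_NAMES}
    for match in pattern.finditer(text):
        value = match.group(2).strip()
        fields[match.group(1)] = value or None  # empty value counts as missing
    return fields


# Hypothetical model output, used only to illustrate the parser.
example_output = "Date: 2024-01-15\nCompany Name: SUPERMARKET ABC\nTotal Amount: 10.25"
parsed = parse_receipt_output(example_output)
print(parsed)
```

A company name that itself contains one of the labels followed by a colon would confuse this parser, so treat it as a starting point rather than a robust solution.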

	## Training Details

	- Base Model: facebook/bart-base
	- Task: Text-to-Text Generation (Receipt Information Extraction)
	- Training Data: OCR-processed receipt text with labeled ground truth
	- Data Source: [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2)
	- OCR Tool: EasyOCR

	## Limitations

	- Extraction accuracy depends on OCR quality; characters misread by the OCR step can carry through into the extracted fields
	- Trained specifically on the format and style of receipts in the training dataset
	- May require additional fine-tuning for receipts with significantly different formats or languages
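Given these limitations, it can help to flag outputs for manual review. The sketch below is one possible check, not part of the model: it reports any field that is missing or whose value does not appear verbatim in the OCR text that was fed to the model. The function name `find_ungrounded_fields` and the field dictionary layout are assumptions for illustration. Note that a correct value which the model reformats (for example, a rewritten date) would also be flagged.

```python
import re
from typing import Dict, List, Optional


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks do not matter."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def find_ungrounded_fields(receipt_text: str, fields: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of fields that are missing or absent from the OCR text."""
    haystack = _normalize(receipt_text)
    flagged = []
    for name, value in fields.items():
        if not value or _normalize(value) not in haystack:
            flagged.append(name)
    return flagged


# Hypothetical OCR text and extracted fields, used only to illustrate the check.
ocr_text = "SUPERMARKET ABC\n123 Main Street\nDate: 2024-01-15\nTotal: $10.25"
good = {"Date": "2024-01-15", "Company Name": "Supermarket ABC", "Total Amount": "$10.25"}
bad = {"Date": "2024-01-15", "Company Name": None, "Total Amount": "$12.25"}

print(find_ungrounded_fields(ocr_text, good))
print(find_ungrounded_fields(ocr_text, bad))
```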

	## Use Cases

	- Automated receipt processing for expense management
	- Financial document digitization
	- Retail analytics and data extraction
	- Accounting automation

	## Citation

	If you use this model, please cite the original dataset:

	```bibtex
	@dataset{dhiaznaidi2024receipt,
	  title={Receipt Dataset SSD300 V2},
	  author={Dhiaz Naidi},
	  year={2024},
	  url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
	}
	```