---
license: mit
language:
- en
base_model:
- google/paligemma-3b-pt-896
pipeline_tag: image-to-text
---

# google/paligemma-3b-pt-896 model fine-tuned for US IRS Form 1040 (2023) data parsing and extraction

This repository provides only the PEFT LoRA weights. The LoRA layers were fine-tuned to parse and extract data from the first page of US IRS Form 1040 (tax year 2023). The model performs OCR and returns the extracted data as JSON from a zero-shot prompt.

```python
import json

import torch
from peft import PeftModel
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = 'google/paligemma-3b-pt-896'
peft_model_id = 'hsarfraz/google-paligemma-irs-form-1040-2023-parser-pg1'
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# load base model and its processor
processor = AutoProcessor.from_pretrained(model_id, padding_side="right", add_eos_token=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# load fine-tuned PEFT weights on top of the base model, then move everything
# to the selected device (this also works on CPU-only machines)
fine_tuned_model = PeftModel.from_pretrained(model, peft_model_id)
fine_tuned_model.to(device)
fine_tuned_model.eval()

# prompt for OCR
prompt = "extract data in JSON format"

# path to local image file (set this before running)
image_file = ''
image = Image.open(image_file).convert("RGB")

# tokenize the prompt and preprocess the image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
prefix_length = inputs["input_ids"].shape[-1]

# switch to inference mode and generate
with torch.inference_mode():
    generation = fine_tuned_model.generate(**inputs, max_new_tokens=1152)

# drop the prompt tokens and decode only the newly generated ones
generation = generation[0][prefix_length:]
decoded = processor.decode(generation, skip_special_tokens=True)

# parse output as JSON; fall back to the raw string if it is not valid JSON
try:
    output_json = json.dumps(json.loads(decoded), indent=4)
except json.JSONDecodeError as error:
    print(f'Error: {error}')
    output_json = decoded

# display parsed JSON
print(output_json)
```

# Synthetic (fake) data for IRS Form 1040 (2023), page 1
[Image: fake form]
# Parsed output in JSON

```json
{
    "lbl_0_03": "Andrew Huffman",
    "lbl_0_04": "Phillips",
    "lbl_0_05": "247-27-3525",
    "lbl_0_06": "Martin",
    "lbl_0_08": "797-83-3491",
    "lbl_0_09": "PSC 8861, Box 7908 APO AE 15945",
    "lbl_0_11": "Andrewhaven",
    "lbl_0_12": "IA",
    "lbl_0_13": "16560",
    "lbl_0_55": "504583.65",
    "lbl_0_66": "473782.31",
    "lbl_0_67": "626674.66",
    "lbl_0_79": "559436.54"
}
```
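
Every value in the output above is a string, so downstream code will probably want to normalize it. The following is a minimal sketch of one way to do that, not part of this repository. The `postprocess` helper, the two regular expressions, and the choice to mask SSN-like values are all assumptions made for illustration. It classifies values only by their shape and does not assume any meaning for the `lbl_*` keys.

```python
import json
import re
from decimal import Decimal

# Hypothetical patterns: values are classified by shape, not by field name.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
AMOUNT_PATTERN = re.compile(r"^-?\d+\.\d{2}$")


def postprocess(decoded: str) -> dict:
    """Parse the model's JSON string and normalize values by their shape."""
    fields = json.loads(decoded)
    cleaned = {}
    for key, value in fields.items():
        text = str(value).strip()
        if SSN_PATTERN.match(text):
            # Mask all but the last four digits before logging or storing.
            cleaned[key] = "***-**-" + text[-4:]
        elif AMOUNT_PATTERN.match(text):
            # Decimal avoids binary floating-point rounding on money values.
            cleaned[key] = Decimal(text)
        else:
            cleaned[key] = text
    return cleaned


# A subset of the synthetic output shown above.
sample = (
    '{"lbl_0_03": "Andrew Huffman", "lbl_0_05": "247-27-3525", '
    '"lbl_0_13": "16560", "lbl_0_55": "504583.65"}'
)
result = postprocess(sample)
print(result)
```

In the inference script, the same helper could be applied to `decoded` once it has parsed successfully. Values that match neither pattern, such as names and ZIP codes, are passed through unchanged.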