---
library_name: peft
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
tags:
- base_model:adapter:HuggingFaceTB/SmolVLM2-500M-Video-Instruct
- lora
- transformers
- finance
model-index:
- name: Susant-Achary/SmolVLM2-500M-Video-Instruct-VQA2
  results:
  - task:
      type: visual-question-answering
    dataset:
      type: jinaai/table-vqa
      name: jinaai/table-vqa
    metrics:
    - type: training_loss
      value: 0.7473664236068726
datasets:
- jinaai/table-vqa
language:
- en
pipeline_tag: visual-question-answering
---

# SmolVLM2-500M-Video-Instruct-vqav2

This model is a fine-tuned version of [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) on the [jinaai/table-vqa](https://huggingface.co/datasets/jinaai/table-vqa) dataset.

## Model description

This is a SmolVLM2-500M-Video-Instruct model fine-tuned for Visual Question Answering (VQA) on table images using the jinaai/table-vqa dataset. It was fine-tuned with QLoRA for efficient training on consumer GPUs.

## Intended uses & limitations

This model is intended for Visual Question Answering on images containing tables: it answers questions about the content of tables shown in an image.

Limitations:

- Performance may vary on image types or questions outside the table-VQA domain.
- The model was fine-tuned on a small subset of the dataset for demonstration purposes.
- Performance depends on the quality and nature of the jinaai/table-vqa dataset.

## Training and evaluation data

The model was trained on a subset of the [jinaai/table-vqa](https://huggingface.co/datasets/jinaai/table-vqa) dataset: 800 training examples and 200 test examples.
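For reference, a split of this size can be reproduced roughly as sketched below. This is a minimal sketch, not the original preprocessing script: the use of the default `train` split, the 1,000-example subset, and the shuffle seed are assumptions.

```python
from datasets import load_dataset

# Load the table-VQA dataset (the "train" split name is an assumption;
# adjust if the dataset exposes different splits).
ds = load_dataset("jinaai/table-vqa", split="train")

# Take a 1,000-example subset and split it 800 / 200, matching the sizes
# reported above. The seed is illustrative only.
subset = ds.shuffle(seed=42).select(range(1000))
split = subset.train_test_split(test_size=200, seed=42)
train_ds, eval_ds = split["train"], split["test"]

print(len(train_ds), len(eval_ds))  # 800 200
```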
## Training procedure

The model was fine-tuned with QLoRA using the following configuration:

- `r=8`
- `lora_alpha=8`
- `lora_dropout=0.1`
- `target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj']`
- `use_dora=False`
- `init_lora_weights="gaussian"`
- 4-bit quantization (`bnb_4bit_use_double_quant=True`, `bnb_4bit_quant_type="nf4"`, `bnb_4bit_compute_dtype=torch.bfloat16`)

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: paged_adamw_8bit with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- num_epochs: 1

### Direct Use

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, Idefics3ForConditionalGeneration, BitsAndBytesConfig
from PIL import Image

# Define the base model and the fine-tuned adapter repository
base_model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
adapter_model_id = "Susant-Achary/SmolVLM2-500M-Video-Instruct-vqav2"

# Load the processor from the base model
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the fine-tuned LoRA adapter to the base model
model = PeftModel.from_pretrained(model, adapter_model_id)

# Prepare an example image and question
# Replace the path and question with your own
image_path = "/content/VQA-20-standard-test-set-results-comparison-of-state-of-the-art-methods.png"
image = Image.open(image_path)
question = "What is in the image?"

# Build the chat-formatted prompt for the model
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)  # Move inputs to the model device

# Generate and decode a response
generated_ids = model.generate(**inputs, max_new_tokens=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```

### Framework versions

- PEFT 0.16.0
- Transformers 4.53.2
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.21.2
- bitsandbytes 0.46.1
- num2words
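### Training configuration (reference sketch)

For completeness, the QLoRA setup listed under "Training procedure" can be reconstructed roughly as below. This is a minimal sketch, not the original training script: the `Trainer` and data-collator wiring is omitted, `output_dir` and `bf16=True` are placeholders/assumptions, and only the configuration values documented above are filled in.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration, TrainingArguments

# 4-bit quantization config matching the values documented above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter config matching the values documented above
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
    use_dora=False,
    init_lora_weights="gaussian",
)

# Load the quantized base model and wrap it with the LoRA adapter
base_model = Idefics3ForConditionalGeneration.from_pretrained(
    "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(base_model, lora_config)

# Training hyperparameters from the list above; output_dir is a placeholder
# and bf16=True is an assumption based on the bfloat16 compute dtype.
training_args = TrainingArguments(
    output_dir="smolvlm2-table-vqa-qlora",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_steps=50,
    optim="paged_adamw_8bit",
    seed=42,
    bf16=True,
)
```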