---
library_name: peft
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
tags:
- base_model:adapter:HuggingFaceTB/SmolVLM2-500M-Video-Instruct
- lora
- transformers
- finance
model-index:
- name: Susant-Achary/SmolVLM2-500M-Video-Instruct-VQA2
  results:
  - task:
      type: visual-question-answering
    dataset:
      type: jinaai/table-vqa
      name: jinaai/table-vqa
    metrics:
    - type: training_loss
      value: 0.7473664236068726
datasets:
- jinaai/table-vqa
language:
- en
pipeline_tag: visual-question-answering
---
# SmolVLM2-500M-Video-Instruct-vqav2
This model is a fine-tuned version of [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) on the [jinaai/table-vqa](https://huggingface.co/datasets/jinaai/table-vqa) dataset.
## Model description
This model is SmolVLM2-500M-Video-Instruct fine-tuned for Visual Question Answering on table images from the jinaai/table-vqa dataset. Fine-tuning used QLoRA (4-bit quantization plus LoRA adapters), which makes training feasible on consumer GPUs.
## Intended uses & limitations
This model is intended for Visual Question Answering tasks specifically on images containing tables. It can be used to answer questions about the content of tables within images.
Limitations:
- Performance may vary on different types of images or questions outside of the table VQA domain.
- The model was fine-tuned on a small subset of the dataset for demonstration purposes.
- The model's performance is dependent on the quality and nature of the jinaai/table-vqa dataset.
## Training and evaluation data
The model was trained on a subset of the [jinaai/table-vqa](https://huggingface.co/datasets/jinaai/table-vqa) dataset. The training dataset size is 800 examples, and the test dataset size is 200 examples.
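The exact subsetting code is not part of this card; the snippet below is a minimal sketch of how an 800/200 split could be produced with the `datasets` library (the split name, shuffling, and seed are assumptions, not taken from the card).

```python
from datasets import load_dataset

# Sketch only: carve an 800-example train / 200-example test subset
# out of the full dataset. Split name, shuffling, and seed are assumptions.
ds = load_dataset("jinaai/table-vqa", split="train").shuffle(seed=42)
train_ds = ds.select(range(800))
test_ds = ds.select(range(800, 1000))
print(len(train_ds), len(test_ds))  # 800 200
```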
## Training procedure
The model was fine-tuned using the QLoRA method with the following configuration (a code sketch of this setup follows the list):
- `r=8`
- `lora_alpha=8`
- `lora_dropout=0.1`
- `target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj']`
- `use_dora=False`
- `init_lora_weights="gaussian"`
- 4-bit quantization (`bnb_4bit_use_double_quant=True`, `bnb_4bit_quant_type="nf4"`, `bnb_4bit_compute_dtype=torch.bfloat16`)
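For reference, here is the same configuration expressed in code: a sketch using `peft` and `bitsandbytes` with the values copied from the list above. During training, the base model would be loaded with `quantization_config=bnb_config` and wrapped with `get_peft_model(model, lora_config)`.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter configuration, matching the values listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["down_proj", "o_proj", "k_proj", "q_proj",
                    "gate_proj", "up_proj", "v_proj"],
    use_dora=False,
    init_lora_weights="gaussian",
)

# 4-bit NF4 quantization for the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```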
### Training hyperparameters
The following hyperparameters were used during training (an equivalent `TrainingArguments` sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: paged_adamw_8bit (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- num_epochs: 1
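A minimal `TrainingArguments` sketch matching these hyperparameters; the `output_dir` is an assumption, and any option not listed above is left at its default.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="smolvlm2-500m-table-vqa",  # assumed output path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    optim="paged_adamw_8bit",              # paged 8-bit AdamW; betas/epsilon at their defaults
    lr_scheduler_type="linear",
    warmup_steps=50,
    num_train_epochs=1,
)
```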
### Direct Use
```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, Idefics3ForConditionalGeneration, BitsAndBytesConfig
from PIL import Image

# Define the base model and the fine-tuned adapter repository
base_model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
adapter_model_id = "Susant-Achary/SmolVLM2-500M-Video-Instruct-vqav2"

# Load the processor from the base model
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the base model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load the LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(model, adapter_model_id)

# Prepare an example image and question
# Replace the path with your own table image and the question you want to ask
image_path = "/content/VQA-20-standard-test-set-results-comparison-of-state-of-the-art-methods.png"
image = Image.open(image_path)
question = "What is in the image?"

# Build the chat-formatted prompt for the model
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)  # move inputs to the model device

# Generate and print a response
generated_ids = model.generate(**inputs, max_new_tokens=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
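Note that decoding the full `generated_ids` returns the chat prompt together with the answer. Continuing from the example above, you can slice off the prompt tokens to keep only the newly generated answer text:

```python
# Keep only the tokens generated after the prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(answer)
```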
### Framework versions
- PEFT 0.16.0
- Transformers 4.53.2
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.21.2
- bitsandbytes 0.46.1
- num2words