---
library_name: peft
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
tags:
- base_model:adapter:HuggingFaceTB/SmolVLM2-500M-Video-Instruct
- lora
- transformers
- finance
model-index:
- name: Susant-Achary/SmolVLM2-500M-Video-Instruct-VQA2
  results:
  - task:
      type: visual-question-answering
    dataset:
      type: jinaai/table-vqa
      name: jinaai/table-vqa
    metrics:
    - type: training_loss
      value: 0.7473664236068726
datasets:
- jinaai/table-vqa
language:
- en
pipeline_tag: visual-question-answering
---


# SmolVLM2-500M-Video-Instruct-vqav2

This model is a fine-tuned version of [HuggingFaceTB/SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) on the [jinaai/table-vqa](https://huggingface.co/datasets/jinaai/table-vqa) dataset.

## Model description

This is a SmolVLM2-500M-Video-Instruct model fine-tuned for Visual Question Answering on table images using the jinaai/table-vqa dataset. It was fine-tuned with QLoRA to keep training feasible on consumer GPUs.

## Intended uses & limitations

This model is intended for Visual Question Answering tasks specifically on images containing tables. It can be used to answer questions about the content of tables within images.

Limitations:
- Performance may vary on different types of images or questions outside of the table VQA domain.
- The model was fine-tuned on a small subset of the dataset for demonstration purposes.
- The model's performance is dependent on the quality and nature of the jinaai/table-vqa dataset.

## Training and evaluation data

The model was trained on a subset of the [jinaai/table-vqa](https://huggingface.co/datasets/jinaai/table-vqa) dataset. The training dataset size is 800 examples, and the test dataset size is 200 examples.
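
The exact subsetting code is not included in this card; the snippet below is a minimal sketch of how such an 800/200 split could be taken with the `datasets` library. The split name, shuffling, and seed are assumptions, not the procedure actually used.

```python
from datasets import load_dataset

# Sketch: carve an 800-example train set and a 200-example test set out of
# jinaai/table-vqa. Split name, shuffle, and seed are assumptions.
ds = load_dataset("jinaai/table-vqa", split="train")
ds = ds.shuffle(seed=42)
train_ds = ds.select(range(800))
test_ds = ds.select(range(800, 1000))

print(len(train_ds), len(test_ds))  # 800 200
```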

## Training procedure

The model was fine-tuned using the QLoRA method with the following configuration (a code sketch follows the list):
- `r=8`
- `lora_alpha=8`
- `lora_dropout=0.1`
- `target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj']`
- `use_dora=False`
- `init_lora_weights="gaussian"`
- 4-bit quantization (`bnb_4bit_use_double_quant=True`, `bnb_4bit_quant_type="nf4"`, `bnb_4bit_compute_dtype=torch.bfloat16`)
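
The sketch below shows how this configuration maps onto `peft.LoraConfig` and `transformers.BitsAndBytesConfig`. It reproduces the values listed above; the rest of the training script (model wrapping, data collation, the trainer loop) is not part of this card.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA adapter configuration mirroring the values listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["down_proj", "o_proj", "k_proj", "q_proj",
                    "gate_proj", "up_proj", "v_proj"],
    use_dora=False,
    init_lora_weights="gaussian",
)

# 4-bit (NF4) quantization for the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

In a typical QLoRA setup, the base model is loaded with `quantization_config=bnb_config` and then wrapped with `get_peft_model(model, lora_config)` before training.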

### Training hyperparameters

The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- optimizer: paged_adamw_8bit (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- num_epochs: 1
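
For reference, these values correspond roughly to the `transformers.TrainingArguments` sketch below; the output directory, the `bf16` flag, and any gradient-accumulation or logging settings are assumptions not listed above.

```python
from transformers import TrainingArguments

# Sketch of TrainingArguments matching the hyperparameters above.
# output_dir and bf16 are assumptions.
training_args = TrainingArguments(
    output_dir="SmolVLM2-500M-Video-Instruct-vqav2",
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    seed=42,
    optim="paged_adamw_8bit",
    lr_scheduler_type="linear",
    warmup_steps=50,
    num_train_epochs=1,
    bf16=True,
)
```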

### Direct Use
```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, Idefics3ForConditionalGeneration, BitsAndBytesConfig
from PIL import Image

# Define the base model and the fine-tuned adapter repository
base_model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
adapter_model_id = "Susant-Achary/SmolVLM2-500M-Video-Instruct-vqav2"

# Load the processor from the base model
processor = AutoProcessor.from_pretrained(base_model_id)

# Load the base model with quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

# Load the adapter and add it to the base model
model = PeftModel.from_pretrained(model, adapter_model_id)

# Prepare an example image and question
# (replace the local path below with your own table image)
image_path = "/content/VQA-20-standard-test-set-results-comparison-of-state-of-the-art-methods.png"
image = Image.open(image_path)
question = "What is in the image?"

# Prepare the input for the model
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer briefly."},
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device) # Move inputs to model device

# Generate a response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=100)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

# Print the generated response
print(generated_text)
```


### Framework versions

- PEFT 0.16.0
- Transformers 4.53.2
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.21.2
- bitsandbytes 0.46.1
- num2words