Model Card for aryamanpathak/blip-vqa-abo

This is a fine-tuned BLIP (Bootstrapping Language-Image Pre-training) model specifically adapted for object-centric Visual Question Answering (VQA) with a focus on generating single-word answers. Using Parameter-Efficient Fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), the model is trained to answer natural language questions about the content of provided images, primarily aiming for concise, single-word responses about objects based on the Amazon Berkeley Objects (ABO) dataset.

Model Details

Model Description

This model is a fine-tuned version of the Salesforce/blip-vqa-base model, specialized for an object-centric VQA task using the Amazon Berkeley Objects (ABO) dataset. A key characteristic of this fine-tuning is its focus on enabling the model to provide single-word answers to questions about objects and their properties. The fine-tuning was performed using Low-Rank Adaptation (LoRA), a parameter-efficient technique that injects small, trainable matrices into the pre-trained model's layers. This approach allows for efficient adaptation of the large BLIP model without the need to fine-tune all of its parameters, significantly reducing computational requirements while aiming to maintain strong performance on the specific target VQA domain and answer format. The model takes an image and a natural language question as input and generates a textual answer, primarily optimized for brevity.

  • Developed by: Aryaman, Rutul, Shreyas
  • Model type: Vision-Language Model (VLM), specifically designed for Visual Question Answering (VQA) with single-word answer generation.
  • Language(s) (NLP): English
  • Finetuned from model: Salesforce/blip-vqa-base

Downstream Use

This model is suitable for integration into applications that require VQA capabilities focused on objects and where a concise, single-word answer is acceptable or preferred. Potential downstream uses include:

  • Rapid object identification in visual search systems.
  • Simple query responses about object properties in VR environments.
  • Initial classification steps based on visual questions.

Out-of-Scope Use

The model is not recommended for:

  • Generating long, descriptive, or multi-sentence responses. It is specifically fine-tuned for single-word answers.
  • VQA on images significantly different from the object-centric domain of the ABO dataset (e.g., abstract art, medical images, complex scenes requiring deep contextual understanding beyond simple object properties).
  • Tasks other than VQA (e.g., image captioning, object detection, visual grounding without a question, classification).
  • Applications requiring extremely high precision or safety-critical decisions, due to the observed Exact Match limitations and potential biases inherited from the data.
  • Handling questions or languages other than English.

Bias, Risks, and Limitations

Like all large models trained on vast datasets, this model may exhibit biases present in the pre-training data (Salesforce/blip-vqa-base) and the custom fine-tuning dataset (Amazon Berkeley Objects dataset). Potential biases could relate to the specific objects represented in the ABO dataset, the attributes commonly associated with them, or the phrasing of questions and the corresponding single-word answers.

Specific limitations observed during evaluation include:

  • A relatively low Exact Match (EM) score (20.00%). EM is an especially strict metric for single-word answers, since a semantically similar but non-identical word receives no credit.
  • Performance is highly dependent on the domain covered by the Amazon Berkeley Objects dataset. The model may struggle with questions about objects, attributes, or scenarios not adequately represented in the training batches.
  • The model is explicitly trained for single-word answers, enforced by padding/truncation of labels to 10 tokens during training (which effectively limits output length significantly). It will not be able to generate multi-word descriptions or complex answers.
  • The high BERTScore F1 (0.9463) should be interpreted in the context of single-word answers, where semantic matching can be inherently higher compared to matching longer phrases.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model, particularly its domain dependency on the Amazon Berkeley Objects dataset and its limitation to single-word answers. It is highly recommended to perform thorough evaluation on representative data for the intended use case and ensure that single-word answers are sufficient for the application. Evaluate the model's outputs critically, especially in sensitive applications.

How to Get Started with the Model

You can load and use this model for inference with the Hugging Face transformers and peft libraries. Because the fine-tuned weights are published as a LoRA adapter, the base Salesforce/blip-vqa-base model is loaded first and the adapter is attached on top. Ensure the required libraries are installed (pip install transformers peft torch Pillow).

import os
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
from peft import PeftModel  # attaches the LoRA adapter to the base model

# The fine-tuned weights live in this adapter repository; the base model is loaded first.
base_model_id = "Salesforce/blip-vqa-base"
adapter_id = "aryamanpathak/blip-vqa-abo"

# Load the processor and base model, then attach the LoRA adapter.
# (With a recent transformers release that has built-in PEFT integration,
# BlipForQuestionAnswering.from_pretrained(adapter_id) may also resolve the
# adapter automatically; the explicit PeftModel route below is unambiguous.)
processor = BlipProcessor.from_pretrained(base_model_id)
base_model = BlipForQuestionAnswering.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Move the model to GPU if available and set it to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Example inference: load an image from a local file
image_path = "/path/to/your/local/image.jpg"  # <--- REPLACE WITH YOUR IMAGE PATH
if not os.path.exists(image_path):
    raise FileNotFoundError(f"Image not found at {image_path}")
image = Image.open(image_path).convert("RGB")

# Example questions expected to have single-word answers in the ABO domain
question = "What color is this object?"  # <--- REPLACE WITH YOUR QUESTION
# Or: question = "Is this made of wood?"
# Or: question = "How many items are there?"

# Prepare the image and question as model-ready tensors
encoding = processor(image, question, return_tensors="pt").to(device)

# Generate the answer; keep max_new_tokens small given the single-word focus
with torch.no_grad():
    generated_ids = model.generate(**encoding, max_new_tokens=5)

# Decode the generated token IDs back into a string
generated_answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Question: {question}")
print(f"Generated Answer: {generated_answer}")

# You can also use the batch inference script provided in the repository
# [Link to your inference script here, e.g., 'run_inference.py']

Training Details

Training Data

The model was fine-tuned on the Amazon Berkeley Objects (ABO) dataset, specifically using a curated subset or format containing image-question-answer triplets focused on objects. A key characteristic of this dataset for this project is that the ground truth answers are primarily single words. The dataset was logically partitioned into 14 'master' batches for iterative training due to its size.

Training Procedure

The Salesforce/blip-vqa-base model was fine-tuned using the LoRA (Low-Rank Adaptation) technique, managed by the Hugging Face peft library.

Preprocessing

Data preprocessing was handled by a custom VQADataset PyTorch class. This class was responsible for loading images using Pillow (ensuring RGB format), parsing JSON annotation files to extract image paths and QA pairs, constructing full image file paths (including specific cleaning logic), and finally using the BlipProcessor to prepare the data into model-ready tensors. This involved tokenizing questions and answers. Crucially, the ground truth answers (labels) were padded/truncated to a max_length of 10 tokens. This setting reflects the objective of generating very short answers, suitable for single-word responses. Questions (inputs) were padded/truncated to max_length=128.
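
A minimal sketch of such a dataset class is shown below. The annotation field names (image_path, question, answer), the JSON layout, and the directory structure are assumptions for illustration only, not the exact schema used in training; the padding lengths (128 for questions, 10 for labels) mirror the values described above.

import json
import os
from PIL import Image
from torch.utils.data import Dataset
from transformers import BlipProcessor

class VQADataset(Dataset):
    """Sketch: turns (image, question, answer) records into BLIP-ready tensors."""

    def __init__(self, annotation_file, image_root, processor: BlipProcessor):
        # Each record is assumed to hold an image path plus one question/answer pair.
        with open(annotation_file) as f:
            self.records = json.load(f)
        self.image_root = image_root
        self.processor = processor

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        # Load the image in RGB, mirroring the Pillow-based loading described above.
        image = Image.open(os.path.join(self.image_root, record["image_path"])).convert("RGB")

        # Questions are padded/truncated to 128 tokens, answers (labels) to 10 tokens.
        encoding = self.processor(
            image,
            record["question"],
            padding="max_length",
            truncation=True,
            max_length=128,
            return_tensors="pt",
        )
        labels = self.processor.tokenizer(
            record["answer"],
            padding="max_length",
            truncation=True,
            max_length=10,
            return_tensors="pt",
        ).input_ids

        item = {k: v.squeeze(0) for k, v in encoding.items()}
        item["labels"] = labels.squeeze(0)
        return item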

Training Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 1e-4 (written as 10e-5 in the training configuration)
  • LoRA Configuration:
    • r = 8 (Rank of LoRA matrices)
    • lora_alpha = 32 (Scaling factor)
    • target_modules = ['qkv', 'projection'] (Modules targeted for LoRA)
    • lora_dropout = 0.05 (Dropout probability)
    • bias = 'none' (Bias terms not trained with LoRA)
  • Training regime: Likely fp16 mixed precision (torch.amp.autocast with torch.float16 was used during evaluation, a common efficiency choice on Tesla T4 GPUs).
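
As a concrete illustration, the hyperparameters listed above correspond to a peft configuration along the following lines. This is a hedged sketch, not the exact training script; only the values shown in the list are taken from the card.

import torch
from transformers import BlipForQuestionAnswering
from peft import LoraConfig, get_peft_model

# Base model to be adapted
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# LoRA configuration mirroring the hyperparameters listed above
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["qkv", "projection"],
    lora_dropout=0.05,
    bias="none",
)

# Wrap the base model so only the LoRA matrices are trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# AdamW over the (small number of) trainable parameters, lr = 1e-4
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)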

Speeds, Sizes, Times

The model was trained iteratively over 14 master batches of data from the ABO dataset. Checkpoints were saved periodically during training (e.g., every 1000 steps) and at the end of each master batch. The training progress was monitored through loss history and evaluation metrics (EM and BERTScore F1) on a global test set after each batch.

  • Total Training Time: [Estimate the total time spent across all 14 batches if possible, otherwise state "Iterative training was performed across 14 batches."]
  • Checkpoint Size: [Provide the size of the saved model directory, e.g., "approx. XX MB"]
  • Inference Speed (on test set): 33.99 iterations/second (on a Tesla T4 GPU)
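
Sketched below is what this iterative loop could look like, continuing from the LoRA setup above. load_master_batch is a hypothetical helper that builds a VQADataset for one master batch; the batch size, paths, and checkpoint naming are illustrative and not taken from the actual training script.

import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # the PEFT-wrapped model from the LoRA sketch above

global_step = 0
for batch_idx in range(14):  # 14 "master" batches of ABO data
    # Hypothetical helper returning a VQADataset for this master batch
    train_loader = DataLoader(load_master_batch(batch_idx), batch_size=16, shuffle=True)

    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # cross-entropy over the answer label tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        global_step += 1
        if global_step % 1000 == 0:  # periodic checkpoint, as described above
            model.save_pretrained(f"checkpoints/step-{global_step}")

    # Checkpoint after each master batch, then evaluate EM / BERTScore F1
    # on the global test set (evaluation loop omitted here).
    model.save_pretrained(f"checkpoints/master-batch-{batch_idx}")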

Evaluation

Testing Data, Factors & Metrics

Testing Data

Evaluation was performed on a constant, global test dataset distinct from the training batches. This test set consisted of 482,036 samples (image-question-answer triplets) derived from the Amazon Berkeley Objects (ABO) dataset, similar in structure and domain to the training data, with ground truth answers primarily being single words.

Factors

Evaluation results are reported on the aggregate global test dataset. No specific disaggregation by factors (e.g., question type, object category) was performed for the reported metrics.

Metrics

The model's performance was evaluated using two primary metrics, chosen for assessing single-word VQA performance:

  • Exact Match (EM): Measures the percentage of generated answers that are character-for-character identical to the ground truth answers after stripping leading/trailing whitespace and lowercasing. This is a strict metric for single-word answers.
  • BERTScore: A metric that assesses the semantic similarity between the generated answers and the ground truth answers using contextual embeddings from a pre-trained BERT model. Reported as Precision (P), Recall (R), and F1 scores. While the ground truth answers are single words, BERTScore can still provide a measure of how semantically close the predicted word is to the expected word in context. The high BERTScore F1 should be considered in this context of single-word targets.
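
As a reference for how these metrics can be computed, a minimal sketch follows. The actual evaluation script may normalize answers slightly differently, and the bert-score defaults shown here (English, default model) are assumptions; the example strings are illustrative.

from bert_score import score as bert_score

predictions = ["red", "wood", "two"]      # model outputs (illustrative)
references  = ["red", "wooden", "two"]    # ground-truth single-word answers

# Exact Match: strict string equality after lowercasing and stripping whitespace
em = sum(
    p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references)
) / len(references)

# BERTScore: semantic similarity from contextual embeddings
P, R, F1 = bert_score(predictions, references, lang="en")

print(f"Exact Match: {em:.2%}")
print(f"BERTScore P/R/F1: {P.mean():.4f} / {R.mean():.4f} / {F1.mean():.4f}")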

Results

Evaluation was performed on the global test dataset after completing training on the final master batch.

Metric Value
Exact Match (EM) 20.00%
BERTScore Precision 0.9586
BERTScore Recall 0.9361
BERTScore F1 0.9463

Summary

The evaluation results on the ABO test dataset indicate that the fine-tuned model achieves an Exact Match score of 20.00% on the single-word VQA task. While the EM is low, the high BERTScore F1 of 0.9463 suggests that the model's generated single-word answers are semantically very close to the ground-truth words. This indicates that the model has learned to identify relevant concepts, even when the exact wording does not match the reference. The iterative training across 14 master batches showed an increasing trend in both EM and BERTScore F1 on the global test set, indicating progressive adaptation to the object-centric, concise-answer VQA task.

Model Examination

[Optional: Describe any model interpretability work here, such as analyzing attention maps, looking at failure cases (e.g., common types of incorrect single-word answers), etc.]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA Tesla T4 GPU
  • Hours used: [Estimate based on your total training time]
  • Cloud Provider: Kaggle platform (runs on Google Cloud infrastructure)
  • Compute Region: Kaggle platform region [Specify if you know the exact region, otherwise state "Region abstracted by Kaggle"]
  • Carbon Emitted: [Estimate using the calculator with the above information and your hours used]

Technical Specifications

Model Architecture and Objective

The model architecture is based on Salesforce/blip-vqa-base, which combines a Vision Transformer (ViT) image encoder with a BERT-based text encoder/decoder for question processing and answer generation. The fine-tuning process, using LoRA adapters applied to key modules (specifically 'qkv' and 'projection' in the attention layers), minimized the cross-entropy loss between the model's predicted answer tokens and the ground-truth answer tokens under teacher forcing. The label tokenization setup (answers padded/truncated to 10 tokens) further steered the model toward single-word answers.
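
As a minimal, hedged illustration of this objective (the image file name and example answer below are hypothetical), a single teacher-forced forward pass on the base model looks like the following; during fine-tuning the same call is made on the LoRA-wrapped model with batches from the dataset.

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("chair.jpg").convert("RGB")  # hypothetical example image
inputs = processor(image, "What color is this chair?", return_tensors="pt")
labels = processor.tokenizer(
    "brown", padding="max_length", truncation=True, max_length=10, return_tensors="pt"
).input_ids

# Teacher-forced forward pass: the returned loss is the cross-entropy between
# the predicted answer-token distributions and the ground-truth label tokens.
with torch.no_grad():
    outputs = model(**inputs, labels=labels)
print(outputs.loss)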

Compute Infrastructure

Hardware

The model was trained and evaluated on hardware typically available through the Kaggle platform, specifically utilizing an NVIDIA Tesla T4 GPU.

Software

The project utilizes Python 3.11 and key libraries from the PyTorch and Hugging Face ecosystems, including:

  • PyTorch (for tensor computation and neural networks)
  • transformers (for the BLIP model and processor)
  • peft (for LoRA implementation)
  • datasets (for dataset handling, although a custom VQADataset was used)
  • accelerate (for distributed training/evaluation preparation)
  • Pillow (for image loading)
  • bitsandbytes (likely used for efficient loading/quantization)
  • bert-score (for evaluation metric calculation)
  • scikit-learn (for general utilities, potentially used in evaluation)

Citation

BibTeX:

@article{Li2022BLIPBE,
  title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.12086},
  url={https://api.semanticscholar.org/CorpusID:245861095}
}

APA: Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. ArXiv, abs/2201.12086.

Glossary

  • VQA: Visual Question Answering - The task of answering a natural language question about the content of an image.
  • BLIP: Bootstrapping Language-Image Pre-training - The base vision-language model architecture.
  • PEFT: Parameter-Efficient Fine-Tuning - Techniques that allow efficient fine-tuning of large pre-trained models by training only a small number of additional parameters.
  • LoRA: Low-Rank Adaptation - A specific PEFT method involving the injection and training of low-rank update matrices.
  • Amazon Berkeley Objects (ABO) Dataset: The object-centric dataset used for fine-tuning this model.
  • Exact Match (EM): An evaluation metric where a prediction is correct only if it exactly matches the ground truth string (specifically relevant for single-word answers here).
  • BERTScore: An evaluation metric that measures semantic similarity between text sequences using contextualized embeddings.

More Information

[Optional: Add links to related work, dataset details, or any further information about the project or model.]

Model Card Authors

  • Aryaman
  • Rutul
  • Shreyas

Model Card Contact

[Your email address or preferred contact method]

Framework versions

  • PEFT 0.14.0
  • transformers [Specify version from your environment, e.g., 4.51.1]
  • torch [Specify version from your environment, e.g., 2.5.1+cu124]
  • datasets [Specify version from your environment, e.g., 3.5.0]
  • accelerate [Specify version from your environment, e.g., 1.3.0]
  • Pillow [Specify version from your environment, e.g., 11.1.0]
