ViLT VQA (Fine-tuned on VizWiz)
This model is a fine-tuned version of ViLT (Vision-and-Language Transformer) on the VizWiz dataset—a collection of real-world visual questions submitted by blind and visually impaired users.
ViLT is a lightweight and efficient vision-language model that feeds image patch embeddings and text token embeddings into a single transformer encoder, without a separate deep visual backbone such as a CNN or region-based detector, resulting in faster inference and reduced computational cost.
Model Details
- Base Model: dandelin/vilt-b32-finetuned-vqa
- Fine-tuned on: a sample of the VizWiz VQA dataset (training sketch below)
- Framework: Hugging Face Transformers (PyTorch)
- Use Case: Assistive VQA systems for accessibility and inclusion
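The exact training script is not part of this card; the sketch below is a minimal, hypothetical illustration of how the base checkpoint could be fine-tuned on a VizWiz-style sample with Hugging Face Transformers. `train_dataset` and its `label_scores` field are assumed names for a preprocessed dataset of image-question-answer examples.

```python
import torch
from torch.utils.data import DataLoader
from transformers import ViltProcessor, ViltForQuestionAnswering

# Hypothetical dataset: each example holds a PIL image, a question string, and
# "label_scores", a soft answer-score vector of length model.config.num_labels.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

def collate(batch):
    # Encode image-question pairs, padding questions to the longest in the batch.
    encoding = processor(
        images=[ex["image"] for ex in batch],
        text=[ex["question"] for ex in batch],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    encoding["labels"] = torch.stack([ex["label_scores"] for ex in batch])
    return encoding

# train_dataset: hypothetical preprocessed VizWiz sample.
loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)  # ViLT's VQA head computes a BCE loss when labels are given
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```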
Intended Use
Designed for Visual Question Answering in practical, assistive settings. Suitable for low-latency deployments where model speed is critical.
Example Usage
from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import requests

# Load the fine-tuned processor and model
processor = ViltProcessor.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")
model = ViltForQuestionAnswering.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")

# Fetch an image and pose a question (replace the URL with your own image)
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
question = "What colour is the jacket?"

# Encode the image-question pair and predict the most likely answer
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
predicted_answer = model.config.id2label[outputs.logits.argmax(-1).item()]
print(predicted_answer)
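A single top-1 answer can hide how sure the model is. The optional snippet below, reusing `model` and `outputs` from the example above, lists the five highest-scoring answers with rough confidences; using sigmoid scores reflects ViLT's per-answer binary training objective and is an interpretation choice, not part of the original example.

```python
import torch

# ViLT's VQA head is trained with a per-answer binary objective, so sigmoid
# scores give a rough per-answer confidence.
scores = torch.sigmoid(outputs.logits)[0]
top = torch.topk(scores, k=5)
for score, idx in zip(top.values, top.indices):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")
```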
Evaluation Results
| Metric | Score |
|---|---|
| Accuracy | 29.01% |
| BLEU-1 | 0.3017 |
| Response Time (avg) | 13.93 ms |
ViLT offers faster inference than larger VLMs (e.g. Florence-2), making it ideal for edge deployment and resource-constrained environments. However, its performance is comparatively lower on unanswerable and complex reasoning tasks.
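Latency depends heavily on hardware, so the average response time reported above may not transfer to your device. A simple way to measure a comparable number on your own machine is to time repeated forward passes, as in this sketch that reuses `model` and `encoding` from the usage example:

```python
import time
import torch

# Rough wall-clock latency of the forward pass, averaged over repeated runs
# after a short warm-up. Results vary with hardware, image size, and device.
model.eval()
with torch.no_grad():
    for _ in range(5):  # warm-up
        model(**encoding)
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(**encoding)
avg_ms = (time.perf_counter() - start) / runs * 1000
print(f"Average forward-pass latency: {avg_ms:.2f} ms")
```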
Limitations
- Weaker performance on complex and compositional reasoning
- Struggles with low-quality or cluttered images typical in VizWiz
- May produce uncertain answers for ambiguous or unanswerable questions
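One practical mitigation for the last point is to flag low-confidence predictions rather than always returning an answer. The sketch below, again reusing `model` and `outputs` from the usage example, applies an arbitrary confidence threshold; the exact cut-off is an assumption that would need tuning on validation data.

```python
import torch

# Flag low-confidence predictions instead of returning a possibly wrong answer.
# The 0.3 threshold is an arbitrary illustrative value, not a tuned setting.
THRESHOLD = 0.3
scores = torch.sigmoid(outputs.logits)[0]
best_idx = scores.argmax().item()
if scores[best_idx].item() < THRESHOLD:
    print("Low confidence: the question may be unanswerable from this image.")
else:
    print(model.config.id2label[best_idx])
```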
Citation
If you use this model, please cite:
@misc{sukhbaatar2025visionaidvqa,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/vilt-finetuned-vizwiz}
}
License
MIT License