ViLT VQA (Fine-tuned on VizWiz)

This model is a fine-tuned version of ViLT (Vision-and-Language Transformer) on the VizWiz dataset—a collection of real-world visual questions submitted by blind and visually impaired users.

ViLT is a lightweight and efficient vision-and-language model (VLM) that fuses text and image embeddings in a single transformer encoder. Instead of a separate deep visual backbone (e.g. a CNN or a region-based object detector), images are embedded with a simple linear patch projection, which reduces computational cost and speeds up inference.
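
Because there is no separate visual backbone, the processor feeds the transformer the resized image itself rather than pre-extracted CNN or detector features. A minimal sketch illustrating this (the checkpoint name is this model's; the image path is a placeholder):

from transformers import ViltProcessor
from PIL import Image

processor = ViltProcessor.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")
image = Image.open("example.jpg").convert("RGB")  # placeholder: any local RGB image

# The processor returns tokenized text plus raw pixel values and a pixel mask;
# no pre-extracted CNN or region features appear in the inputs.
encoding = processor(image, "What is in the picture?", return_tensors="pt")
print(list(encoding.keys()))           # input_ids, token_type_ids, attention_mask, pixel_values, pixel_mask
print(encoding["pixel_values"].shape)  # (1, 3, H, W): the resized image itself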

Model Details

  • Base Model: dandelin/vilt-b32-finetuned-vqa
  • Fine-tuned on: a sample of the VizWiz VQA dataset
  • Model Size: ~118M parameters (FP32, Safetensors)
  • Framework: Hugging Face Transformers (PyTorch)
  • Use Case: Assistive VQA systems for accessibility and inclusion

Intended Use

Designed for Visual Question Answering in practical, assistive settings, and suitable for low-latency deployments where inference speed is critical.

Example Usage

from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import requests
import torch

# Load the fine-tuned processor and model from the Hub
processor = ViltProcessor.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")
model = ViltForQuestionAnswering.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")

# Fetch an image and pose a question about it
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw).convert("RGB")
question = "What colour is the jacket?"

# Tokenize the question and preprocess the image into a single batch
encoding = processor(image, question, return_tensors="pt")

# Run inference and map the highest-scoring class index to its answer string
with torch.no_grad():
    outputs = model(**encoding)
predicted_answer = model.config.id2label[outputs.logits.argmax(-1).item()]
print(predicted_answer)

Evaluation Results

Metric                 Score
Accuracy               29.01%
BLEU-1                 0.3017
Response Time (avg)    13.93 ms

ViLT offers faster inference than larger VLMs (e.g. Florence-2), making it ideal for edge deployment and resource-constrained environments. However, its performance is comparatively lower on unanswerable and complex reasoning tasks.
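
If latency matters for your deployment, it is worth measuring it on your own hardware rather than relying on the average above. A minimal timing sketch (the image path and question are placeholders; results depend on hardware, input size, and batch size):

import time
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz")
model = ViltForQuestionAnswering.from_pretrained("Zagarsuren/vilt-finetuned-vizwiz").eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image
encoding = processor(image, "What is this?", return_tensors="pt")

# Warm up once, then average wall-clock time over repeated forward passes
with torch.no_grad():
    model(**encoding)
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(**encoding)
    avg_ms = (time.perf_counter() - start) / runs * 1000
print(f"Average forward pass: {avg_ms:.2f} ms")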

Limitations

  • Weaker performance on complex and compositional reasoning
  • Struggles with low-quality or cluttered images typical in VizWiz
  • May produce unreliable answers for ambiguous or unanswerable questions (a simple confidence check is sketched below)
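
One lightweight mitigation is to surface a confidence score and flag low-confidence predictions instead of always returning the top answer. The sketch below reads per-answer scores with a sigmoid (ViLT's VQA head is trained as a multi-label classifier); the 0.3 threshold is an illustrative placeholder, not a tuned value.

import torch

# Reuses `model` and `encoding` from the Example Usage section above
with torch.no_grad():
    logits = model(**encoding).logits[0]

# Per-answer scores from the multi-label head
scores = torch.sigmoid(logits)
top_score, top_idx = scores.max(dim=-1)
answer = model.config.id2label[top_idx.item()]

# Illustrative threshold only; tune it on a held-out VizWiz split
if top_score.item() < 0.3:
    print(f"Unsure (score={top_score.item():.2f}); best guess: {answer}")
else:
    print(f"{answer} (score={top_score.item():.2f})")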

Citation

If you use this model, please cite:

@misc{sukhbaatar2025visionaid,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/vilt-finetuned-vizwiz}
}

License

MIT License
