# Florence-2 VQA (Fine-tuned on VizWiz)
Florence-2 is a state-of-the-art Vision-Language Model (VLM) developed by Microsoft, designed to perform a wide range of multimodal tasks, including Visual Question Answering (VQA). This version has been fine-tuned on the VizWiz dataset, which contains real-world visual questions submitted by blind and low-vision users.
## Model Description
- Model Type: Vision-Language Transformer (Florence-2)
- Architecture: Image encoder (DaViT vision transformer) + text decoder (see the configuration check after this list)
- Pretrained by: Microsoft
- Fine-tuned on: VizWiz VQA dataset
- Framework: PyTorch + Hugging Face Transformers
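
Because the checkpoint's `vision_config.model_type` is expected to be `davit` (the usage example below patches it if it is not), a quick configuration check can confirm this before loading the full model. This is a minimal sketch, assuming the repository id used in the example below:

```python
from transformers import AutoConfig

model_id = "Zagarsuren/florence2-finetuned-vizwiz"  # repository id used in the example below

# Florence-2 ships custom configuration code, so trust_remote_code is required.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

print(config.model_type)                # overall model type registered by the remote code
print(config.vision_config.model_type)  # vision encoder type; expected to be "davit"
```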
## Intended Uses
This model is specifically optimised for inclusive AI applications, such as assistive technology for visually impaired users. Given an image and a natural language question, the model predicts a textual answer.
## Example

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig
from PIL import Image
import requests

# Choose device (GPU preferred if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and processor from the Hugging Face Hub
model_id = "Zagarsuren/florence2-finetuned-vizwiz"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Ensure compatibility with the Florence-2 vision encoder
if getattr(config.vision_config, "model_type", None) != "davit":
    config.vision_config.model_type = "davit"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True).to(device)

# Prepare inputs
task_prompt = "Question answering: "
text_input = "What is written on the sign?"
image_url = "https://example.com/sample.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Preprocess
inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to(device)

# Generate
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)

# Decode and post-process
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
answer = processor.post_process_generation(
    generated_text,
    task=task_prompt,
    image_size=(image.width, image.height),
)

print("Answer:", answer)
```
## Evaluation Results
| Metric | Score |
|---|---|
| Accuracy | 58.21% |
| BLEU-1 | 0.6386 |
| Response time, CPU (avg) | ~10.3 s |
The model was benchmarked on the VizWiz sample dataset. It performs strongly across categories including yes/no, number, other, and unanswerable questions.
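
For context on the accuracy number, VizWiz uses the standard VQA accuracy metric, which compares a prediction against the ten crowdsourced answers for each question. The exact evaluation script behind the table above is not reproduced here; the snippet below is only an illustrative sketch of the commonly used simplified form of that metric.

```python
from collections import Counter

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA/VizWiz accuracy: min(#annotators giving this answer / 3, 1)."""
    pred = prediction.strip().lower()
    counts = Counter(answer.strip().lower() for answer in human_answers)
    return min(counts.get(pred, 0) / 3.0, 1.0)

# Hypothetical example: 2 of 10 annotators answered "stop sign"
print(vqa_accuracy("stop sign", ["stop sign"] * 2 + ["sign"] * 8))  # ~0.667
```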
## Limitations
- Not optimised for synthetic datasets such as CLEVR.
- Computationally heavy; a GPU is recommended for real-time inference (see the half-precision loading sketch below).
- May produce hallucinated answers when the question is ambiguous or the image is of low quality.
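
As noted above, a GPU helps with real-time use. Loading the weights in half precision typically reduces memory use and latency on GPU; this is a minimal sketch that assumes (not verified for this checkpoint) the weights convert cleanly to fp16:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Zagarsuren/florence2-finetuned-vizwiz"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: fp16 works for this checkpoint; keep float32 on CPU
    trust_remote_code=True,
).to("cuda")

# When running in fp16, cast the floating-point inputs (pixel_values) to match:
# inputs = processor(text=..., images=..., return_tensors="pt").to("cuda", torch.float16)
```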
## Citation
If you use this model, please cite:
```bibtex
@misc{sukhbaatar2025visionaidvqa,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/florence-2-finetuned-vizwiz}
}
```
## License
MIT License