Florence-2 VQA (Fine-tuned on VizWiz)

Florence-2 is a state-of-the-art Vision-Language Model (VLM) developed by Microsoft, designed to perform a wide range of multimodal tasks, including Visual Question Answering (VQA). This version has been fine-tuned on the VizWiz dataset, which contains real-world visual questions submitted by blind and low-vision users.

Model Description

  • Model Type: Vision-Language Transformer (Florence-2)
  • Architecture: DaViT image encoder + transformer encoder-decoder text model
  • Model size: ~271M parameters (FP32, safetensors)
  • Pretrained by: Microsoft
  • Fine-tuned on: VizWiz VQA dataset
  • Framework: PyTorch + Hugging Face Transformers

Intended Uses

This model is specifically optimised for inclusive AI applications, such as assistive technology for visually impaired users. Given an image and a natural language question, the model predicts a textual answer.

Example

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig
from PIL import Image
import requests

# Choose device (GPU preferred if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and processor from Hugging Face Hub
model_id = "Zagarsuren/florence2-finetuned-vizwiz"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Ensure compatibility with Florence-2 vision encoder
if getattr(config.vision_config, "model_type", None) != "davit":
    config.vision_config.model_type = "davit"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, trust_remote_code=True).to(device)

# Prepare inputs
task_prompt = "Question answering: "
text_input = "What is written on the sign?"
image_url = "https://example.com/sample.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Preprocess
inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt").to(device)

# Generate
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)

# Decode and post-process
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
answer = processor.post_process_generation(
    generated_text,
    task=task_prompt,
    image_size=(image.width, image.height)
)

print("Answer:", answer)

Evaluation Results

| Metric                   | Score   |
|--------------------------|---------|
| Accuracy                 | 58.21%  |
| BLEU-1                   | 0.6386  |
| Response time (CPU, avg) | ~10.3 s |

The model was benchmarked on a sample of the VizWiz dataset and performs well across the standard VizWiz answer categories: yes/no, number, other, and unanswerable questions.
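
For reference, below is a minimal sketch of how VQA-style accuracy is commonly computed on VizWiz, assuming ten crowdsourced reference answers per question. It is a simplified version (the official script also normalises answers and averages over annotator subsets) and is not necessarily the exact evaluation code behind the numbers above.

def vqa_accuracy(predicted, reference_answers):
    """Simplified VQA/VizWiz accuracy: an answer counts as fully correct
    when at least 3 of the (typically 10) annotators gave the same answer."""
    pred = predicted.strip().lower()
    matches = sum(1 for a in reference_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Hypothetical example: 4 of 10 annotators answered "stop sign" -> score 1.0
print(vqa_accuracy("stop sign", ["stop sign"] * 4 + ["sign"] * 6))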

Limitations

  • Not optimised for synthetic datasets such as CLEVR.
  • Computationally heavy; a GPU is needed for real-time inference (see the half-precision loading sketch after this list).
  • May produce hallucinated answers if the question is ambiguous or the image is of low quality.
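
One way to reduce inference latency and memory use on a CUDA GPU is to load the model in half precision. This is a hedged sketch, not an official recommendation from the model author; float16 may slightly change outputs.

import torch
from transformers import AutoModelForCausalLM, AutoConfig, AutoProcessor

model_id = "Zagarsuren/florence2-finetuned-vizwiz"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
if getattr(config.vision_config, "model_type", None) != "davit":
    config.vision_config.model_type = "davit"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Assumption: a CUDA device is available; float16 roughly halves memory use.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")
# Remember to cast the inputs as well, e.g. processor(...).to("cuda", torch.float16)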

Citation

If you use this model, please cite:

@misc{sukhbaatar2025visionaidvqa,
  title={VisionAid-VQA: Inclusive Visual Question Answering Using Deep Learning and Multimodal Attention Mechanisms},
  author={Zagarsuren Sukhbaatar},
  year={2025},
  url={https://huggingface.co/Zagarsuren/florence-2-finetuned-vizwiz}
}

License

MIT License
