Florence 2 VQA - Engineering Drawings
Model Overview
The Florence 2 VQA model is fine-tuned for visual question answering (VQA) tasks, specifically for engineering drawings. It takes both an image (e.g., a technical drawing) and a textual question as input, and generates a text-based answer related to the content of the image.
Model Details
- Base Model: microsoft/Florence-2-base-ft
- Task: Visual Question Answering (VQA)
- Architecture: Sequence-to-sequence vision-language model (loaded through AutoModelForCausalLM)
- Framework: Hugging Face Transformers
How to Use the Model
Install Dependencies
Make sure the required libraries are installed. Florence-2's remote code also imports einops and timm, so they are included here:
pip install transformers torch datasets pillow gradio einops timm
Load the Model
To load the model for inference, use the following code:
from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Determine if a GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load configuration from the base model
config = AutoConfig.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)

# Load the fine-tuned weights using the base model's configuration
model = AutoModelForCausalLM.from_pretrained(
    "fauzail/Florence-2-VQA",
    config=config,
    trust_remote_code=True
).to(device)
Load the Processor
from transformers import AutoProcessor
# Load the processor for the model
processor = AutoProcessor.from_pretrained("fauzail/Florence-2-VQA", trust_remote_code=True)
Define the Prediction Function
Once the model and processor are loaded, define a prediction function that takes an image and question as input:
from PIL import Image

def predict(image_path, question):
    # Load the image and convert it to RGB
    image = Image.open(image_path).convert("RGB")
    # Prepare inputs using the processor
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    # Generate the output; max_new_tokens=256 is an assumed cap, raise it for longer answers
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode the output tokens into a human-readable answer
    answer = processor.tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
Test with an Example
Now test the model with an image and a question:
image_path = "test.png" # Replace with your image path
question = "Tell me in detail about the image?"
# Call the prediction function
answer = predict(image_path, question)
print("Answer:", answer)
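To ask several questions about the same drawing, the call above can be wrapped in a small loop. The sketch below is illustrative only: `ask_many` is a hypothetical helper that is not part of the model repository. It takes the prediction function as a parameter, so it should work with the `predict(image_path, question)` function defined earlier or with any function that has the same signature.

```python
from typing import Callable, Dict, Iterable


def ask_many(
    predict_fn: Callable[[str, str], str],
    image_path: str,
    questions: Iterable[str],
) -> Dict[str, str]:
    """Hypothetical helper: run predict_fn once per question on one image."""
    answers = {}
    for question in questions:
        # Each call processes the image again; fine for a handful of questions.
        answers[question] = predict_fn(image_path, question)
    return answers
```

For example, `ask_many(predict, "test.png", ["What is shown in the drawing?", "Which views are present?"])` would return a dictionary that maps each question to its answer. The two questions are placeholders.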
Alternative: Use Gradio for an Interactive Web Interface
If you prefer an interactive interface, you can use Gradio to deploy the model:
import gradio as gr

# Define the prediction function for Gradio
def predict(image, question):
    # The image arrives as a PIL image because the input component uses type="pil"
    image = image.convert("RGB")
    inputs = processor(text=[question], images=[image], return_tensors="pt", padding=True).to(device)
    # max_new_tokens=256 is an assumed cap, raise it for longer answers
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Create the Gradio interface
interface = gr.Interface(
    fn=predict,
    inputs=[gr.Image(type="pil"), "text"],
    outputs="text",
    title="Florence 2 VQA - Engineering Drawings",
    description="Upload an engineering drawing and ask a related question."
)
# Launch the Gradio interface
interface.launch()
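In an interactive interface, users may submit the form without an image or with an empty question, in which case the prediction function would raise an error. A minimal sketch of an input guard follows. `make_guarded` is a hypothetical wrapper, not part of Gradio or the model repository, and it assumes that a missing image arrives as `None`.

```python
from typing import Any, Callable, Optional


def make_guarded(
    predict_fn: Callable[[Any, str], str],
) -> Callable[[Optional[Any], Optional[str]], str]:
    """Hypothetical wrapper: validate inputs before calling predict_fn."""

    def guarded(image: Optional[Any], question: Optional[str]) -> str:
        # A missing upload is assumed to arrive as None.
        if image is None:
            return "Please upload an image."
        # Reject empty or whitespace-only questions.
        if not question or not question.strip():
            return "Please enter a question."
        return predict_fn(image, question.strip())

    return guarded
```

To use it, pass `fn=make_guarded(predict)` to `gr.Interface` instead of `fn=predict`.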
Training Details
- Preprocessing:
  - Images were resized and normalized.
  - Text data (questions and answers) was tokenized using the Florence tokenizer.
- Hyperparameters:
  - Learning Rate: 1e-6
  - Batch Size: 2
  - Gradient Accumulation Steps: 4
  - Epochs: 10

Training was performed using mixed precision for efficiency.
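The batch size and gradient accumulation steps listed above combine into an effective batch size of 2 × 4 = 8 examples per optimizer update. The sketch below works through that arithmetic. `training_schedule` is a hypothetical helper, and the dataset size of 1000 is a placeholder, because the real number of training examples is not stated here. It also assumes that a final, partially filled accumulation window still triggers an update.

```python
import math


def training_schedule(num_examples, batch_size=2, grad_accum_steps=4, epochs=10):
    """Hypothetical helper: derive update counts from the listed hyperparameters."""
    # Examples that contribute to each optimizer update.
    effective_batch_size = batch_size * grad_accum_steps
    # Forward/backward passes needed to see every example once.
    micro_batches_per_epoch = math.ceil(num_examples / batch_size)
    # Optimizer updates per epoch (a partial final window counts as one update).
    updates_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
    return {
        "effective_batch_size": effective_batch_size,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


# num_examples=1000 is a placeholder, not the actual dataset size.
schedule = training_schedule(num_examples=1000)
```

With the placeholder of 1000 examples, this gives 125 optimizer updates per epoch and 1250 over the 10 epochs.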