Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct specifically optimized for medical image analysis tasks in the FLARE 2025 2D Medical Multimodal Dataset challenge.
Model Description
- Base Model: Qwen2.5-VL-7B-Instruct
- Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
- Target Domain: Medical imaging across 8 modalities (Clinical, Dermatology, Endoscopy, Mammography, Microscopy, Retinography, Ultrasound, Xray)
- Tasks: Medical image captioning, visual question answering, report generation
- Training Data: 19 FLARE 2025 datasets with comprehensive medical annotations
Training Details
Training Data
The model was fine-tuned on 19 diverse medical imaging datasets from FLARE 2025. Details can be found at: https://huggingface.co/datasets/FLARE-MedFM/FLARE-Task5-MLLM-2D
Training Configuration
# LoRA Configuration
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
target_modules: ['k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'q_proj']
task_type: CAUSAL_LM
# Training Statistics
total_steps: 1000
learning_rate: N/A
final_eval_loss: 5.4849
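For reference, the LoRA settings above map directly onto a peft LoraConfig. The following is a minimal sketch (not the original training script), using the values from the listing above:
from peft import LoraConfig

# LoRA adapter settings mirroring the configuration listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)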
Training Procedure
- Optimization: 4-bit quantization with BitsAndBytesConfig
- LoRA Configuration:
  - r=64, alpha=16, dropout=0.1
  - Target modules: all linear layers
- Memory Optimization: Gradient checkpointing, flash attention
- Batch Size: Dynamic based on image resolution
- Learning Rate: 1e-4 with cosine scheduling
- Training Steps: 1000 steps with evaluation every 500 steps
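The quantization and optimizer settings listed above can be approximated with standard transformers and bitsandbytes arguments. The sketch below is illustrative only, not the exact training script; NF4, double quantization, bf16 compute, and the output path are assumptions (the card only states 4-bit quantization with BitsAndBytesConfig):
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit (QLoRA-style) quantization of the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Schedule and step counts matching the bullets above
training_args = TrainingArguments(
    output_dir="./qwen2.5vl-flare2025",   # hypothetical output path
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    max_steps=1000,
    eval_strategy="steps",                # named evaluation_strategy in older transformers
    eval_steps=500,
    gradient_checkpointing=True,
    bf16=True,
)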
Model Performance
The model has been evaluated across multiple medical imaging tasks and provides the following capabilities:
- Image Captioning: Generates detailed medical reports from imaging studies
- Visual Question Answering: Answers clinical questions about medical images
- Diagnosis Support: Identifies pathological findings and abnormalities
- Multi-modal Understanding: Integrates visual and textual medical information
Evaluation Metrics
The model is evaluated using task-specific metrics following FLARE 2025 specifications:
Classification Tasks:
- Balanced Accuracy (PRIMARY): Handles class imbalance in medical diagnosis
- Accuracy: Standard classification accuracy
- F1 Score: Weighted F1 for multi-class scenarios
Multi-label Classification:
- F1 Score (PRIMARY): Sample-wise F1 across multiple labels
- Precision: Label prediction precision
- Recall: Label coverage recall
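To make the primary classification metrics concrete, here is a small sketch using scikit-learn (an assumption; the official FLARE 2025 evaluation scripts may compute them differently):
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Single-label classification: balanced accuracy is the primary metric
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(balanced_accuracy_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="weighted"))

# Multi-label classification: sample-wise F1 is the primary metric
Y_true = np.array([[1, 0, 1], [0, 1, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])
print(f1_score(Y_true, Y_pred, average="samples"))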
Detection Tasks:
- F1 Score @ IoU > 0.5 (PRIMARY): Standard computer vision detection metric
- Precision: Detection precision at IoU threshold
- Recall: Detection recall at IoU threshold
Instance Detection (Identity-Aware):
- F1 Score @ IoU > 0.3 (PRIMARY): Medical imaging standard for chromosome detection
- F1 Score @ IoU > 0.5: Computer vision standard
- Average F1: COCO-style average across IoU thresholds (0.3-0.7)
- Per-chromosome metrics: Detailed breakdown by chromosome identity
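For the detection metrics, the sketch below shows one common way to compute F1 at an IoU threshold using greedy one-to-one matching, plus the threshold-averaged variant used for instance detection. It is illustrative only; the official FLARE 2025 matching procedure may differ:
def iou(a, b):
    # Boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_f1(preds, gts, thr=0.5):
    # Greedily match each prediction to an unused ground-truth box with IoU > thr
    matched, used = 0, set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and iou(p, g) > thr:
                matched += 1
                used.add(i)
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# COCO-style average over IoU thresholds 0.3-0.7 (instance detection)
def average_f1(preds, gts, thrs=(0.3, 0.4, 0.5, 0.6, 0.7)):
    return sum(detection_f1(preds, gts, t) for t in thrs) / len(thrs)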
Counting Tasks:
- Mean Absolute Error (PRIMARY): Cell counting accuracy
- Root Mean Squared Error: Additional counting precision metric
Regression Tasks:
- Mean Absolute Error (PRIMARY): Continuous value prediction accuracy
- Root Mean Squared Error: Regression precision metric
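The counting and regression metrics are straightforward; a minimal NumPy version for reference:
import numpy as np

def mae(pred, true):
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(true, float))))

def rmse(pred, true):
    return float(np.sqrt(np.mean((np.asarray(pred, float) - np.asarray(true, float)) ** 2)))

# Example: predicted vs. reference cell counts
print(mae([10, 12, 7], [11, 12, 5]))   # 1.0
print(rmse([10, 12, 7], [11, 12, 5]))  # ~1.29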
Report Generation:
- GREEN Score (PRIMARY): Comprehensive medical report evaluation with 7 components:
  - Entity matching with severity assessment (30%)
  - Location accuracy with laterality (20%)
  - Negation and uncertainty handling (15%)
  - Temporal accuracy (10%)
  - Size/measurement accuracy (10%)
  - Clinical significance weighting (10%)
  - Report structure completeness (5%)
- BLEU Score: Text generation quality
- Clinical Efficacy: Medical relevance scoring
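The GREEN score is a weighted combination of the seven components listed above. How each component is scored is defined by the FLARE 2025 evaluation and is not reproduced here; the sketch below shows only the weighted aggregation, assuming each component score is normalized to [0, 1] (the component keys are illustrative labels, not official names):
# Component weights from the list above (sum to 1.0)
GREEN_WEIGHTS = {
    "entity_matching": 0.30,
    "location_accuracy": 0.20,
    "negation_uncertainty": 0.15,
    "temporal_accuracy": 0.10,
    "size_measurement": 0.10,
    "clinical_significance": 0.10,
    "report_structure": 0.05,
}

def green_score(component_scores):
    # component_scores: dict mapping component name to a score in [0, 1]
    return sum(w * component_scores.get(name, 0.0) for name, w in GREEN_WEIGHTS.items())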
Usage
Installation
pip install transformers torch peft accelerate bitsandbytes pillow
Basic Usage
import torch
from transformers import AutoTokenizer, AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image

# Model identifiers
base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
adapter_model_name = "leoyinn/qwen2.5vl-flare2025"

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
processor = AutoProcessor.from_pretrained(base_model_name)

# Load the 4-bit quantized base model
base_model = AutoModelForVision2Seq.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)

# Attach the fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_name)
# Prepare the input as a chat message so the image placeholder tokens are inserted
image = Image.open("medical_image.jpg")
prompt = "Describe the medical findings in this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process and generate
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# Decode only the newly generated tokens (strip the prompt)
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
Limitations and Ethical Considerations
Limitations
- Model outputs may contain inaccuracies and should be verified by medical professionals
- Performance may vary across different medical imaging modalities and populations
- Training data may contain biases present in medical literature and datasets
- Model has not been validated in clinical settings
Intended Use
- Medical education and training
- Research in medical AI and computer vision
- Development of clinical decision support tools (with proper validation)
- Academic research in multimodal medical AI
Out-of-Scope Use
- Direct clinical diagnosis without physician oversight
- Treatment recommendations without medical professional validation
- Use in emergency medical situations
- Deployment in production clinical systems without extensive validation
Citation
If you use this model in your research, please cite:
@misc{qwen25vl-flare2025,
title={Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis},
author={Shuolin Yin},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/leoyinn/qwen2.5vl-flare2025}
}
@misc{qwen25vl-base,
title={Qwen2.5-VL-7B-Instruct},
author={Qwen Team},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct}
}
Model Details
- Model Type: Vision-Language Model (VLM)
- Architecture: Qwen2.5-VL with LoRA adapters
- Parameters: ~7B base parameters + LoRA adapters
- Precision: 4-bit quantized base model + full precision adapters
- Framework: PyTorch, Transformers, PEFT
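For deployment, the LoRA adapter can optionally be merged into the base weights. The following is a sketch, assuming the base model is loaded in 16-bit precision (merging is not supported on 4-bit quantized weights); the output path is hypothetical:
import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

# Load the base model in bfloat16 so the adapter weights can be merged
base = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "leoyinn/qwen2.5vl-flare2025").merge_and_unload()
merged.save_pretrained("./qwen2.5vl-flare2025-merged")   # hypothetical local path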
Contact
For questions or issues, please open an issue in the model repository or contact the authors.