Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct specifically optimized for medical image analysis tasks in the FLARE 2025 2D Medical Multimodal Dataset challenge.
Model Description
- Base Model: Qwen2.5-VL-7B-Instruct
- Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
- Target Domain: Medical imaging across 8 modalities (Clinical, Dermatology, Endoscopy, Mammography, Microscopy, Retinography, Ultrasound, Xray)
- Tasks: Medical image captioning, visual question answering, report generation
- Training Data: 19 FLARE 2025 datasets with comprehensive medical annotations
Training Details
Training Data
The model was fine-tuned on 19 diverse medical imaging datasets from FLARE 2025. Details can be found at: https://huggingface.co/datasets/FLARE-MedFM/FLARE-Task5-MLLM-2D
Training Configuration
# LoRA Configuration
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
target_modules: ['k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'q_proj']
task_type: CAUSAL_LM
# Training Statistics
total_steps: 1000
learning_rate: N/A
final_eval_loss: 5.4849
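For reference, the LoRA settings above map directly onto a peft LoraConfig. The following is a minimal sketch (not the original training script), using the values from the listing above:
from peft import LoraConfig

# LoRA adapter settings mirroring the configuration listed above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)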
Training Procedure
- Optimization: 4-bit quantization with BitsAndBytesConfig
- LoRA Configuration:
  - r=64, alpha=16, dropout=0.1
  - Target modules: all linear layers
- Memory Optimization: Gradient checkpointing, flash attention
- Batch Size: Dynamic based on image resolution
- Learning Rate: 1e-4 with cosine scheduling
- Training Steps: 1000 steps with evaluation every 500 steps
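The quantization and optimizer settings listed above can be approximated with standard transformers and bitsandbytes arguments. The sketch below is illustrative only, not the exact training script; NF4, double quantization, bf16 compute, and the output path are assumptions (the card only states 4-bit quantization with BitsAndBytesConfig):
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit (QLoRA-style) quantization of the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Schedule and step counts matching the bullets above
training_args = TrainingArguments(
    output_dir="./qwen2.5vl-flare2025",   # hypothetical output path
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    max_steps=1000,
    eval_strategy="steps",                # named evaluation_strategy in older transformers
    eval_steps=500,
    gradient_checkpointing=True,
    bf16=True,
)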
Model Performance
The model has been evaluated across multiple medical imaging tasks and provides the following capabilities:
- Image Captioning: Generates detailed medical reports from imaging studies
- Visual Question Answering: Answers clinical questions about medical images
- Diagnosis Support: Identifies pathological findings and abnormalities
- Multi-modal Understanding: Integrates visual and textual medical information
Evaluation Metrics
The model is evaluated using task-specific metrics following FLARE 2025 specifications:
Classification Tasks:
- Balanced Accuracy (PRIMARY): Handles class imbalance in medical diagnosis
- Accuracy: Standard classification accuracy
- F1 Score: Weighted F1 for multi-class scenarios
Multi-label Classification:
- F1 Score (PRIMARY): Sample-wise F1 across multiple labels
- Precision: Label prediction precision
- Recall: Label coverage recall
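To make the primary classification metrics concrete, here is a small sketch using scikit-learn (an assumption; the official FLARE 2025 evaluation scripts may compute them differently):
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Single-label classification: balanced accuracy is the primary metric
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
print(balanced_accuracy_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="weighted"))

# Multi-label classification: sample-wise F1 is the primary metric
Y_true = np.array([[1, 0, 1], [0, 1, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])
print(f1_score(Y_true, Y_pred, average="samples"))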
Detection Tasks:
- F1 Score @ IoU > 0.5 (PRIMARY): Standard computer vision detection metric
- Precision: Detection precision at IoU threshold
- Recall: Detection recall at IoU threshold
Instance Detection (Identity-Aware):
- F1 Score @ IoU > 0.3 (PRIMARY): Medical imaging standard for chromosome detection
- F1 Score @ IoU > 0.5: Computer vision standard
- Average F1: COCO-style average across IoU thresholds (0.3-0.7)
- Per-chromosome metrics: Detailed breakdown by chromosome identity
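For the detection metrics, the sketch below shows one common way to compute F1 at an IoU threshold using greedy one-to-one matching, plus the threshold-averaged variant used for instance detection. It is illustrative only; the official FLARE 2025 matching procedure may differ:
def iou(a, b):
    # Boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_f1(preds, gts, thr=0.5):
    # Greedily match each prediction to an unused ground-truth box with IoU > thr
    matched, used = 0, set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in used and iou(p, g) > thr:
                matched += 1
                used.add(i)
                break
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# COCO-style average over IoU thresholds 0.3-0.7 (instance detection)
def average_f1(preds, gts, thrs=(0.3, 0.4, 0.5, 0.6, 0.7)):
    return sum(detection_f1(preds, gts, t) for t in thrs) / len(thrs)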
Counting Tasks:
- Mean Absolute Error (PRIMARY): Cell counting accuracy
- Root Mean Squared Error: Additional counting precision metric
Regression Tasks:
- Mean Absolute Error (PRIMARY): Continuous value prediction accuracy
- Root Mean Squared Error: Regression precision metric
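The counting and regression metrics are straightforward; a minimal NumPy version for reference:
import numpy as np

def mae(pred, true):
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(true, float))))

def rmse(pred, true):
    return float(np.sqrt(np.mean((np.asarray(pred, float) - np.asarray(true, float)) ** 2)))

# Example: predicted vs. reference cell counts
print(mae([10, 12, 7], [11, 12, 5]))   # 1.0
print(rmse([10, 12, 7], [11, 12, 5]))  # ~1.29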
Report Generation:
- GREEN Score (PRIMARY): Comprehensive medical report evaluation with 7 components:
  - Entity matching with severity assessment (30%)
  - Location accuracy with laterality (20%)
  - Negation and uncertainty handling (15%)
  - Temporal accuracy (10%)
  - Size/measurement accuracy (10%)
  - Clinical significance weighting (10%)
  - Report structure completeness (5%)
- BLEU Score: Text generation quality
- Clinical Efficacy: Medical relevance scoring
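The GREEN score is a weighted combination of the seven components listed above. How each component is scored is defined by the FLARE 2025 evaluation and is not reproduced here; the sketch below shows only the weighted aggregation, assuming each component score is normalized to [0, 1] (the component keys are illustrative labels, not official names):
# Component weights from the list above (sum to 1.0)
GREEN_WEIGHTS = {
    "entity_matching": 0.30,
    "location_accuracy": 0.20,
    "negation_uncertainty": 0.15,
    "temporal_accuracy": 0.10,
    "size_measurement": 0.10,
    "clinical_significance": 0.10,
    "report_structure": 0.05,
}

def green_score(component_scores):
    # component_scores: dict mapping component name to a score in [0, 1]
    return sum(w * component_scores.get(name, 0.0) for name, w in GREEN_WEIGHTS.items())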
Usage
Installation
pip install transformers torch peft accelerate bitsandbytes pillow
Basic Usage
import torch
from transformers import AutoTokenizer, AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image

# Model identifiers
base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
adapter_model_name = "leoyinn/qwen2.5vl-flare2025"

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
processor = AutoProcessor.from_pretrained(base_model_name)

# Load the 4-bit quantized base model
base_model = AutoModelForVision2Seq.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)

# Attach the fine-tuned adapter
model = PeftModel.from_pretrained(base_model, adapter_model_name)
# Prepare the input as a chat message so the image placeholder tokens are inserted
image = Image.open("medical_image.jpg")
prompt = "Describe the medical findings in this image."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Process and generate
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

# Decode only the newly generated tokens (strip the prompt)
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
Limitations and Ethical Considerations
Limitations
- Model outputs may contain inaccuracies and should be verified by medical professionals
- Performance may vary across different medical imaging modalities and populations
- Training data may contain biases present in medical literature and datasets
- Model has not been validated in clinical settings
Intended Use
- Medical education and training
- Research in medical AI and computer vision
- Development of clinical decision support tools (with proper validation)
- Academic research in multimodal medical AI
Out-of-Scope Use
- Direct clinical diagnosis without physician oversight
- Treatment recommendations without medical professional validation
- Use in emergency medical situations
- Deployment in production clinical systems without extensive validation
Citation
If you use this model in your research, please cite:
@misc{qwen25vl-flare2025,
title={Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis},
author={Shuolin Yin},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/leoyinn/qwen2.5vl-flare2025}
}
@misc{qwen25vl-base,
title={Qwen2.5-VL-7B-Instruct},
author={Qwen Team},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct}
}
Model Details
- Model Type: Vision-Language Model (VLM)
- Architecture: Qwen2.5-VL with LoRA adapters
- Parameters: ~7B base parameters + LoRA adapters
- Precision: 4-bit quantized base model + full precision adapters
- Framework: PyTorch, Transformers, PEFT
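For deployment, the LoRA adapter can optionally be merged into the base weights. The following is a sketch, assuming the base model is loaded in 16-bit precision (merging is not supported on 4-bit quantized weights); the output path is hypothetical:
import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

# Load the base model in bfloat16 so the adapter weights can be merged
base = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "leoyinn/qwen2.5vl-flare2025").merge_and_unload()
merged.save_pretrained("./qwen2.5vl-flare2025-merged")   # hypothetical local path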
Contact
For questions or issues, please open an issue in the model repository or contact the authors.