Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct specifically optimized for medical image analysis tasks in the FLARE 2025 2D Medical Multimodal Dataset challenge.

Model Description

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation)
  • Target Domain: Medical imaging across 8 modalities (Clinical, Dermatology, Endoscopy, Mammography, Microscopy, Retinography, Ultrasound, Xray)
  • Tasks: Medical image captioning, visual question answering, report generation
  • Training Data: 19 FLARE 2025 datasets with comprehensive medical annotations

Training Details

Training Data

The model was fine-tuned on 19 diverse medical imaging datasets from FLARE 2025. Details can be found at: https://huggingface.co/datasets/FLARE-MedFM/FLARE-Task5-MLLM-2D
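
As a sketch, the dataset repository can be pulled locally with huggingface_hub; the exact on-Hub layout is not described here, so treat the call below as illustrative:

from huggingface_hub import snapshot_download

# Download the FLARE 2025 2D multimodal dataset repository
# (repo layout and subfolder structure may vary on the Hub)
local_dir = snapshot_download(
    repo_id="FLARE-MedFM/FLARE-Task5-MLLM-2D",
    repo_type="dataset",
)
print(local_dir)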

Training Configuration

# LoRA Configuration
lora_r: 16
lora_alpha: 32
lora_dropout: 0.1
target_modules: ['k_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'q_proj']
task_type: CAUSAL_LM

# Training Statistics
total_steps: 1000
learning_rate: 1e-4  # cosine schedule; see Training Procedure below
final_eval_loss: 5.4849
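
For reference, a minimal sketch of how this adapter configuration maps onto peft's LoraConfig, using the values from the block above:

from peft import LoraConfig

# Reconstruction of the adapter configuration listed above (illustrative)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)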

Training Procedure

  • Optimization: 4-bit quantization with BitsAndBytesConfig
  • LoRA Configuration:
    • r=16, alpha=32, dropout=0.1 (see configuration above)
    • Target modules: All linear layers
  • Memory Optimization: Gradient checkpointing, flash attention
  • Batch Size: Dynamic based on image resolution
  • Learning Rate: 1e-4 with cosine scheduling
  • Training Steps: 1000 steps with evaluation every 500 steps
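
A minimal sketch of the quantization and schedule described above, assuming standard QLoRA settings in Hugging Face Transformers; the NF4 quantization type and output directory are illustrative assumptions:

import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit quantization for the frozen base model (nf4 is an assumed default)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Schedule matching the numbers quoted above; the batch size is set
# dynamically based on image resolution, so it is omitted here
training_args = TrainingArguments(
    output_dir="qwen2.5vl-flare2025",  # illustrative path
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    max_steps=1000,
    eval_strategy="steps",
    eval_steps=500,
    gradient_checkpointing=True,
)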

Model Performance

This model has been evaluated across multiple medical imaging tasks with the following capabilities:

  • Image Captioning: Generates detailed medical reports from imaging studies
  • Visual Question Answering: Answers clinical questions about medical images
  • Diagnosis Support: Identifies pathological findings and abnormalities
  • Multi-modal Understanding: Integrates visual and textual medical information

Evaluation Metrics

The model is evaluated using task-specific metrics following FLARE 2025 specifications:

Classification Tasks:

  • Balanced Accuracy (PRIMARY): Handles class imbalance in medical diagnosis
  • Accuracy: Standard classification accuracy
  • F1 Score: Weighted F1 for multi-class scenarios
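
All three metrics are available in scikit-learn (installed separately); a toy example with a 3-class diagnosis task:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy 3-class labels (illustrative only)
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 1, 0, 1, 2]

print(balanced_accuracy_score(y_true, y_pred))       # primary metric
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="weighted"))  # weighted multi-class F1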

Multi-label Classification:

  • F1 Score (PRIMARY): Sample-wise F1 across multiple labels
  • Precision: Label prediction precision
  • Recall: Label coverage recall
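
For multi-label outputs encoded as binary indicator matrices, the sample-wise averaging looks like this (toy data):

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows = samples, columns = labels (illustrative)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

print(f1_score(y_true, y_pred, average="samples"))         # primary metric
print(precision_score(y_true, y_pred, average="samples"))
print(recall_score(y_true, y_pred, average="samples"))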

Detection Tasks:

  • F1 Score @ IoU > 0.5 (PRIMARY): Standard computer vision detection metric
  • Precision: Detection precision at IoU threshold
  • Recall: Detection recall at IoU threshold
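
A sketch of the computation, assuming boxes in (x1, y1, x2, y2) format and simple greedy matching; the challenge's official matcher may differ:

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def detection_f1(preds, gts, thr=0.5):
    # Greedily match each prediction to the best unused ground-truth box
    used, tp = set(), 0
    for p in preds:
        scores = [(iou(p, g), j) for j, g in enumerate(gts) if j not in used]
        if scores:
            best, j = max(scores)
            if best > thr:
                used.add(j)
                tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0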

Instance Detection (Identity-Aware):

  • F1 Score @ IoU > 0.3 (PRIMARY): Medical imaging standard for chromosome detection
  • F1 Score @ IoU > 0.5: Computer vision standard
  • Average F1: COCO-style average across IoU thresholds (0.3-0.7)
  • Per-chromosome metrics: Detailed breakdown by chromosome identity
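
Reusing detection_f1 from the sketch above, the COCO-style average over thresholds can be computed as follows (the 0.05 step size is an assumption):

import numpy as np

preds = [(10, 10, 50, 50), (60, 60, 90, 90)]   # illustrative boxes
gts = [(12, 12, 48, 48), (100, 100, 130, 130)]

# Average F1 over IoU thresholds 0.3, 0.35, ..., 0.7
thresholds = np.arange(0.3, 0.701, 0.05)
avg_f1 = float(np.mean([detection_f1(preds, gts, thr=t) for t in thresholds]))
print(avg_f1)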

Counting Tasks:

  • Mean Absolute Error (PRIMARY): Cell counting accuracy
  • Root Mean Squared Error: Additional counting precision metric

Regression Tasks:

  • Mean Absolute Error (PRIMARY): Continuous value prediction accuracy
  • Root Mean Squared Error: Regression precision metric
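
The counting and regression tasks share the same two error metrics; a plain NumPy sketch with toy cell counts:

import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Illustrative cell counts: ground truth vs. predictions
print(mae([12, 30, 7], [10, 33, 7]))    # 1.67
print(rmse([12, 30, 7], [10, 33, 7]))   # ~2.08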

Report Generation:

  • GREEN Score (PRIMARY): Comprehensive medical report evaluation with 7 components:
    • Entity matching with severity assessment (30%)
    • Location accuracy with laterality (20%)
    • Negation and uncertainty handling (15%)
    • Temporal accuracy (10%)
    • Size/measurement accuracy (10%)
    • Clinical significance weighting (10%)
    • Report structure completeness (5%)
  • BLEU Score: Text generation quality
  • Clinical Efficacy: Medical relevance scoring
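
The component weights above sum to 100%. As a sketch, assuming each component is scored in [0, 1] (the normalization is not specified here), the aggregate could be computed as:

GREEN_WEIGHTS = {
    "entity_matching": 0.30,
    "location_accuracy": 0.20,
    "negation_uncertainty": 0.15,
    "temporal_accuracy": 0.10,
    "size_measurement": 0.10,
    "clinical_significance": 0.10,
    "structure_completeness": 0.05,
}

def green_score(components):
    # Weighted sum of normalized per-component scores
    return sum(GREEN_WEIGHTS[k] * components[k] for k in GREEN_WEIGHTS)

print(green_score({
    "entity_matching": 0.8, "location_accuracy": 0.9,
    "negation_uncertainty": 1.0, "temporal_accuracy": 0.7,
    "size_measurement": 0.6, "clinical_significance": 0.9,
    "structure_completeness": 1.0,
}))  # 0.84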

Usage

Installation

pip install transformers torch peft accelerate bitsandbytes

Basic Usage

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

# Base model and fine-tuned adapter
base_model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
adapter_model_name = "leoyinn/qwen2.5vl-flare2025"

# The processor handles both text tokenization and image preprocessing
processor = AutoProcessor.from_pretrained(base_model_name)

# Load the base model in 4-bit, matching the training setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForVision2Seq.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the fine-tuned LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_model_name)

# Prepare input via the Qwen2.5-VL chat template so the image
# placeholder tokens are inserted correctly
image = Image.open("medical_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the medical findings in this image."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
generated = outputs[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)

Limitations and Ethical Considerations

Limitations

  • Model outputs may contain inaccuracies and should be verified by medical professionals
  • Performance may vary across different medical imaging modalities and populations
  • Training data may contain biases present in medical literature and datasets
  • Model has not been validated in clinical settings

Intended Use

  • Medical education and training
  • Research in medical AI and computer vision
  • Development of clinical decision support tools (with proper validation)
  • Academic research in multimodal medical AI

Out-of-Scope Use

  • Direct clinical diagnosis without physician oversight
  • Treatment recommendations without medical professional validation
  • Use in emergency medical situations
  • Deployment in production clinical systems without extensive validation

Citation

If you use this model in your research, please cite:

@misc{qwen25vl-flare2025,
  title={Qwen2.5-VL Fine-tuned for FLARE 2025 Medical Image Analysis},
  author={Shuolin Yin},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/leoyinn/qwen2.5vl-flare2025}
}

@misc{qwen25vl-base,
  title={Qwen2.5-VL Technical Report},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct}
}

Model Details

  • Model Type: Vision-Language Model (VLM)
  • Architecture: Qwen2.5-VL with LoRA adapters
  • Parameters: ~7B base parameters + LoRA adapters
  • Precision: 4-bit quantized base model + full precision adapters
  • Framework: PyTorch, Transformers, PEFT

Contact

For questions or issues, please open an issue in the model repository or contact the authors.
