Model Card for kelkalot/medgemma-4b-it-sft-lora-kvasir-vqa

This model is a fine-tuned version of google/medgemma-4b-it adapted for Visual Question Answering (VQA) on medical endoscopic imagery using the Kvasir-VQA dataset. It utilizes LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

Model Details

Model Description

This repository contains a LoRA adapter for the google/medgemma-4b-it model. MedGemma is a family of open-weight generative models specialized for the medical domain, built upon Google's Gemma. This adapter enables the base MedGemma model to answer questions about medical images, specifically those found in the Kvasir-VQA dataset which primarily consists of endoscopic images.

  • Developed by: Michael A. Riegler (fine-tuned adapter)
  • Model type: Multimodal (Image and Text) - Vision Language Model
  • Language(s) (NLP): English (en)
  • License: Apache 2.0 (for the adapter, consistent with the base MedGemma model)
  • Finetuned from model: google/medgemma-4b-it

Model Sources

  • Repository: https://huggingface.co/kelkalot/medgemma-4b-it-sft-lora-kvasir-vqa

Uses

Direct Use

This model adapter is intended for direct use in Visual Question Answering tasks involving medical endoscopic images similar to those in the Kvasir-VQA dataset. It can be loaded with the base google/medgemma-4b-it model to perform inference. Users can provide an image and a textual question to receive a textual answer generated by the model. This is primarily for research and exploration of multimodal AI in medicine.

Example applications:

  • Answering questions about findings in endoscopic images.
  • Educational tool for understanding medical VQA.
  • Research baseline for further improvements in medical VQA.

Downstream Use

The LoRA adapter could potentially be further fine-tuned on more specific medical VQA datasets or integrated into larger medical AI systems that require image understanding and question answering capabilities.
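
If the adapter is to be fine-tuned further, one possible starting point is sketched below. This is a minimal sketch, not part of this repository: the downstream dataset and trainer setup are left out, and only the loading pattern is shown.

import torch
from transformers import AutoModelForImageTextToText
from peft import PeftModel

# Load the base model, then attach this adapter with its LoRA weights left trainable.
base_model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base_model,
    "kelkalot/medgemma-4b-it-sft-lora-kvasir-vqa",
    is_trainable=True,  # keep the LoRA parameters unfrozen for continued training
)
model.print_trainable_parameters()  # only the LoRA parameters should be trainable
# From here, pass `model` into your own SFTTrainer / Trainer setup on the new dataset.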

Out-of-Scope Use

  • Clinical Diagnosis or Treatment Decisions: This model is NOT a medical device and should NOT be used for making clinical diagnoses, treatment decisions, or any other direct patient care. It is a research model.
  • Use outside the medical endoscopic domain: Performance on general VQA, or on significantly different types of medical images (e.g., X-rays, MRIs) not represented in Kvasir-VQA or MedGemma's pretraining, is not guaranteed and is likely to be poor.
  • High-stakes applications: Any application where an incorrect answer could lead to harm.
  • Generating medical advice for patients.

Bias, Risks, and Limitations

  • Dataset Bias: The model's performance and potential biases are heavily influenced by the Kvasir-VQA dataset and the pre-training data of MedGemma. It may underperform on underrepresented patient demographics or rare conditions.
  • Accuracy: The model may not always provide accurate answers and can hallucinate or provide plausible-sounding but incorrect information.
  • Limited Scope: Its knowledge is confined to what it learned during pre-training and fine-tuning. It does not have real-time information or common sense beyond its training.
  • Not a Medical Professional: The model does not possess the understanding, reasoning, or ethical judgment of a qualified healthcare professional.
  • Technical Limitations: As a LoRA adapter, its performance is tied to the capabilities of the base MedGemma model.

Recommendations

Users (both direct and downstream) must be fully aware of the risks, biases, and limitations of this model.

  • Always validate outputs: Any information provided by the model, especially in a medical context, should be critically evaluated and verified by human experts before any consideration for real-world use.
  • Use with caution: Do not rely on this model for decisions impacting health or safety.
  • Consider ethical implications: Be mindful of the potential for misuse or over-reliance on AI-generated information in sensitive domains like medicine.

How to Get Started with the Model

Use the code below to load the fine-tuned model and run inference. Ensure you have accepted the terms of use for the base google/medgemma-4b-it model.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, pipeline
from peft import PeftModel
from PIL import Image
import requests # For fetching image from URL
from io import BytesIO
# from IPython.display import display, HTML # For notebook display

# --- Configuration ---
base_model_name = "google/medgemma-4b-it"
adapter_hub_id = "kelkalot/medgemma-4b-it-sft-lora-kvasir-vqa" # This model

# Determine torch_dtype based on GPU capability
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    dtype = torch.bfloat16
    print("Using torch.bfloat16.")
else:
    dtype = torch.float32
    print("Warning: bfloat16 not supported or no GPU. Using float32.")

# --- Load Processor ---
# The processor should have been pushed with your adapter.
try:
    processor = AutoProcessor.from_pretrained(adapter_hub_id)
    print(f"Loaded processor from adapter repository: {adapter_hub_id}")
except Exception as e:
    print(f"Could not load processor from {adapter_hub_id}: {e}. Loading from base model as fallback.")
    processor = AutoProcessor.from_pretrained(base_model_name)
processor.tokenizer.padding_side = "right"

# --- Load Base Model ---
print(f"Loading base model: {base_model_name}")
base_model = AutoModelForImageTextToText.from_pretrained(
    base_model_name,
    torch_dtype=dtype,
    device_map="auto" # Automatically uses GPU if available
)

# --- Apply LoRA Adapter ---
print(f"Applying LoRA adapter from: {adapter_hub_id}")
model = PeftModel.from_pretrained(base_model, adapter_hub_id)
model = model.eval() # Set to evaluation mode

print("Fine-tuned model ready.")

# --- Create Pipeline ---
device = model.device # Get device where PEFT model is loaded
vqa_pipeline = pipeline(
    "image-text-to-text",
    model=model,
    processor=processor,
    device=device
)
print(f"Pipeline created on device: {device}")

# --- Prepare Sample Input ---
# Example image from Kvasir-VQA dataset (replace with your image path or URL)
sample_image_url = "https://huggingface.co/datasets/SimulaMet-HOST/Kvasir-VQA/resolve/main/images/cju0v82i38xlp0835wz7s6x0k.jpg"
sample_question = "What is the main color of the area indicated by the green box?" # Corresponds to the example image

pil_image = None
try:
    response = requests.get(sample_image_url)
    response.raise_for_status() 
    pil_image = Image.open(BytesIO(response.content)).convert("RGB")
except Exception as e:
    print(f"Could not load sample image from URL ({sample_image_url}): {e}")

if pil_image:
    # Format messages for the pipeline (image embedded in messages)
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": sample_question},
            {"type": "image", "image": pil_image} # Embed the actual PIL image
        ]},
    ]

    # --- Run Inference ---
    print(f"\nQuestion: {sample_question}")
    if vqa_pipeline.device.type == "cuda": torch.cuda.empty_cache()

    output = vqa_pipeline(
        text=[messages], # Pipeline expects a list of conversations
        return_full_text=False,
        max_new_tokens=50 # Adjust as needed
    )

    # Parse output (based on observed structure [[{'generated_text': '...'}]])
    generated_text = "Could not parse output."
    if output and isinstance(output, list) and len(output) > 0:
        first_result_list = output[0]
        if isinstance(first_result_list, list) and len(first_result_list) > 0 and isinstance(first_result_list[0], dict):
            generated_text = first_result_list[0].get("generated_text", "Key 'generated_text' not found").strip()
        elif isinstance(first_result_list, dict): # If output is [{...}]
             generated_text = first_result_list.get("generated_text", "Key 'generated_text' not found").strip()
    
    print(f"Model Answer: {generated_text}")
    
    # For display in Jupyter/Colab:
    # from IPython.display import display, HTML
    # display(HTML(f"<h3>Question: {sample_question}</h3> <p style='color:green;'><b>Model Answer:</b> {generated_text}</p>"))
    # display(pil_image.resize((300, int(300 * pil_image.height / pil_image.width))))
else:
    print("Sample image could not be loaded. Skipping inference example.")

Training Details

Training Data

This model was fine-tuned on the Kvasir-VQA dataset (SimulaMet-HOST/Kvasir-VQA). This dataset contains pairs of endoscopic images and corresponding medical questions and answers.

  • Dataset size used for fine-tuning (subset): Approximately 2000 training samples and 400 validation samples (as per the fine-tuning script parameters).
  • Data preprocessing: Images were used as provided. Questions and answers were formatted into a chat-like structure for the multimodal model.

Training Procedure

The model was fine-tuned using Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation) via the Hugging Face transformers and trl (SFTTrainer) libraries. 4-bit quantization (QLoRA) was employed to reduce memory footprint during training.
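
For illustration, the sketch below shows what this QLoRA-style loading step typically looks like. The exact arguments of the original fine-tuning script are not reproduced here; the quantization settings shown are assumptions.

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

# Assumed 4-bit (QLoRA) quantization settings; the actual script may have differed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("google/medgemma-4b-it")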

Preprocessing

Input images and text prompts were processed using the AutoProcessor associated with google/medgemma-4b-it. Questions and answers were structured into a conversational format suitable for the SFTTrainer.
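
A minimal sketch of such a conversational structure is shown below. The sample field names (image, question, answer) are assumed Kvasir-VQA columns, and the exact chat template applied during training may differ.

# Hypothetical mapping of one Kvasir-VQA sample into the chat format expected by SFTTrainer.
def format_sample(sample):
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["image"]},
                    {"type": "text", "text": sample["question"]},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": sample["answer"]}],
            },
        ]
    }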

Training Hyperparameters

  • LoRA r: 16
  • LoRA alpha: 16
  • LoRA dropout: 0.05
  • Target modules: all-linear layers
  • Learning rate: 1e-4
  • Batch size (per device): 2
  • Gradient accumulation steps: 8 (Effective batch size: 16)
  • Number of epochs: 1 (as per the example script)
  • Optimizer: AdamW (fused)
  • Precision: bfloat16 mixed precision with 4-bit quantization (QLoRA)
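
Expressed as configuration objects, these hyperparameters correspond roughly to the sketch below. Argument names follow the current peft and trl APIs; the output directory is a placeholder, and the original script may have used additional settings.

from peft import LoraConfig
from trl import SFTConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="medgemma-4b-it-sft-lora-kvasir-vqa",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    num_train_epochs=1,
    bf16=True,                       # bfloat16 mixed precision
    optim="adamw_torch_fused",
)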

Speeds, Sizes, Times

Fine-tuning was performed on a Google Colab A100 GPU (40GB). Training time for the specified subset and 1 epoch was approximately 45-60 minutes.

Summary

Fine-tuning with LoRA on the Kvasir-VQA dataset showed an improvement in the model's ability to answer questions specific to the dataset's domain compared to the baseline model.

Environmental Impact

  • Hardware Type: Google Colab A100 GPU (40GB)
  • Hours used: Approximately 1 hour for fine-tuning (1 epoch on the subset).
  • Cloud Provider: Google Cloud (via Colab)
  • Compute Region: (Varies for Colab, often US regions)
  • Carbon Emitted: Estimations can be made using the Machine Learning Impact calculator. For a rough estimate: an A100 system draws roughly 0.4 kW, so 1 hour of fine-tuning uses about 0.4 kWh. At an average US grid carbon intensity of ~400 gCO2eq/kWh, this corresponds to roughly 160 gCO2eq. (This is a very rough estimate.)

Technical Specifications

Model Architecture and Objective

The base model google/medgemma-4b-it is a Gemma-based multimodal model. This adapter fine-tunes it using LoRA for the objective of generating textual answers to textual questions conditioned on visual (image) input.

Compute Infrastructure

Hardware

  • Google Colab A100 GPU (40GB).

Software

  • PyTorch
  • Hugging Face transformers, peft, trl, datasets, evaluate
  • bitsandbytes for quantization.

Citation

Please cite the original Kvasir-VQA dataset and MedGemma if you use this model in your research.

Kvasir-VQA & Kvasir Dataset Collection:

@article{borgli2020hyperkvasir,
  title={HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy},
  author={Borgli, Hanna and Thambawita, Vajira and Smedsrud, Pia H and Hicks, Steven and Jha, Debesh and Eskeland, Sigrun L and Randel, Kristin Ranheim and Pogorelov, Konstantin and Lux, Mathias and Nguyen, Duc Tien Dang and others},
  journal={Scientific data},
  volume={7},
  number={1},
  pages={283},
  year={2020},
  publisher={Nature Publishing Group UK London}
}

@inproceedings{gautam2024kvasir,
  title={{Kvasir-VQA}: A Text-Image Pair {GI} Tract Dataset},
  author={Gautam, Sushant and Stor{\aa}s, Andrea M and Midoglu, Cise and Hicks, Steven A and Thambawita, Vajira and Halvorsen, P{\aa}l and Riegler, Michael A},
  booktitle={Proceedings of the First International Workshop on Vision-Language Models for Biomedical Applications},
  pages={3--12},
  year={2024}
}