danielhanchen committed on
Commit da1c5d1 · verified · 1 Parent(s): e55a4c8

Upload folder using huggingface_hub

Files changed (1): README.md +439 -160
README.md CHANGED
@@ -53,14 +53,17 @@ base_model:
53
 
54
  **Resources:**
55
 
56
- * Model on Google Cloud Model Garden: [MedGemma](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medgemma)
57
- * Models on Hugging Face: [Collection](https://huggingface.co/collections/google/medgemma-release-680aade845f90bec6a3f60c4)
58
- * GitHub repository (supporting code, Colab notebooks, discussions, and issues): [MedGemma](https://github.com/google-health/medgemma)
59
- * Quick start notebook: [GitHub](https://github.com/google-health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
60
- * Fine-tuning notebook: [GitHub](https://github.com/google-health/medgemma/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
61
- * Concept applications built using MedGemma: [Collection](https://huggingface.co/collections/google/medgemma-concept-apps-686ea036adb6d51416b0928a)
62
- * Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/medgemma/get-started.md#contact)
63
- * License: The use of MedGemma is governed by the [Health AI Developer Foundations terms of use](https://developers.google.com/health-ai-developer-foundations/terms).
64
 
65
  **Author:** Google
66
 
@@ -70,23 +73,59 @@ This section describes the MedGemma model and how to use it.
70
 
71
  ### Description
72
 
73
- MedGemma is a collection of [Gemma 3](https://ai.google.dev/gemma/docs/core) variants that are trained for performance on medical text and image comprehension. Developers can use MedGemma to accelerate building healthcare-based AI applications. MedGemma currently comes in three variants: a 4B multimodal version and 27B text-only and multimodal versions.
74
-
75
- Both MedGemma multimodal versions utilize a [SigLIP](https://arxiv.org/abs/2303.15343) image encoder that has been specifically pre-trained on a variety of de-identified medical data, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. Their LLM components are trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images.
76
-
77
- MedGemma 4B is available in both pre-trained (suffix: `-pt`) and instruction-tuned (suffix: `-it`) versions. The instruction-tuned version is a better starting point for most applications. The pre-trained version is available for those who want to experiment more deeply with the models.
78
-
79
- MedGemma 27B multimodal has been pre-trained on medical image and medical record comprehension tasks. MedGemma 27B text-only has been trained exclusively on medical text. Both models have been optimized for inference-time computation on medical reasoning, which gives the text-only variant slightly higher performance on some text benchmarks than MedGemma 27B multimodal. Users who want a single model for medical text, medical record, and medical image tasks are better served by MedGemma 27B multimodal; those with text-only use cases may be better served by the text-only variant. Both MedGemma 27B variants are available only in instruction-tuned versions.
80
-
81
- MedGemma variants have been evaluated on a range of clinically relevant benchmarks to illustrate their baseline performance. These evaluations are based on both open benchmark datasets and curated datasets. Developers can fine-tune MedGemma variants for improved performance. Consult the [Intended Use](#intended-use) section below for more details.
82
-
83
- MedGemma is optimized for medical applications that involve a text generation component. For medical image-based applications that do not involve text generation, such as data-efficient classification, zero-shot classification, or content-based or semantic image retrieval, the [MedSigLIP image encoder](https://developers.google.com/health-ai-developer-foundations/medsiglip/model-card) is recommended. MedSigLIP is based on the same image encoder that powers MedGemma.
84
-
85
- Please consult the [MedGemma Technical Report](https://arxiv.org/abs/2507.05201) for more details.
86
 
87
  ### How to use
88
 
89
- Below are some example code snippets to help you quickly get started running the model locally on GPU. If you want to use the model at scale, we recommend that you create a production version using [Model Garden](https://cloud.google.com/model-garden).
90
 
91
  First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
92
 
@@ -136,7 +175,7 @@ messages = [
136
  "role": "user",
137
  "content": [
138
  {"type": "text", "text": "Describe this X-ray"},
139
- {"type": "image", "image": image},
140
  ]
141
  }
142
  ]
@@ -225,130 +264,203 @@ print(decoded)
225
 
226
  See the following Colab notebooks for examples of how to use MedGemma:
227
 
228
- * To give the model a quick try, running it locally with weights from Hugging Face, see [Quick start notebook in Colab](https://colab.research.google.com/github/google-health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb). Note that you will need to use Colab Enterprise to obtain adequate GPU resources to run either 27B model without quantization.
229
- * For an example of fine-tuning, see the [Fine-tuning notebook in Colab](https://colab.research.google.com/github/google-health/medgemma/blob/main/notebooks/fine_tune_with_hugging_face.ipynb). The 27B models can be fine-tuned in a similar manner but will require more time and compute resources than the 4B model.
230
 
231
  ### Model architecture overview
232
 
233
- The MedGemma model is built on [Gemma 3](https://ai.google.dev/gemma/) and uses the same decoder-only transformer architecture as Gemma 3. To read more about the architecture, consult the Gemma 3 [model card](https://ai.google.dev/gemma/docs/core/model_card_3).
234
 
235
  ### Technical specifications
236
 
237
- * **Model type**: Decoder-only Transformer architecture, see the [Gemma 3 Technical Report](https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf)
238
- * **Input Modalities**: **4B and 27B multimodal**: Text, vision; **27B text**: Text only
239
- * **Output Modality:** Text only (all models)
240
- * **Attention mechanism**: Grouped-query attention (GQA)
241
- * **Context length**: Supports long context, at least 128K tokens
242
- * **Key publication**: https://arxiv.org/abs/2507.05201
243
- * **Model created**: July 9, 2025
244
- * **Model version**: 1.0.0
 
 
245
 
246
  ### Citation
247
 
248
- When using this model, please cite:
249
- Sellergren et al. "MedGemma Technical Report." *arXiv preprint arXiv:2507.05201* (2025).
250
 
251
- @article{sellergren2025medgemma,title={MedGemma Technical Report},author={Sellergren, Andrew and Kazemzadeh, Sahar and Jaroensri, Tiam and Kiraly, Atilla and Traverse, Madeleine and Kohlberger, Timo and Xu, Shawn and Jamil, Fayaz and Hughes, Cían and Lau, Charles and others},journal={arXiv preprint arXiv:2507.05201},year={2025}}
252
  ### Inputs and outputs
253
 
254
  **Input**:
255
 
256
- * Text string, such as a question or prompt
257
- * Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
258
- * Total input length of up to 128K tokens
259
 
260
  **Output**:
261
 
262
- * Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
263
- * Total output length of up to 8192 tokens
 
264
 
265
  ### Performance and validation
266
 
267
- MedGemma was evaluated across a range of different multimodal classification, report generation, visual question answering, and text-based tasks.
 
268
 
269
- #### Key performance metrics
270
 
271
- ##### Imaging evaluations
272
 
273
- The multimodal performance of MedGemma 4B and 27B multimodal was evaluated across a range of benchmarks, focusing on radiology, dermatology, histopathology, ophthalmology, and multimodal clinical reasoning.
 
 
274
 
275
- MedGemma 4B outperforms the base Gemma 3 4B model across all tested multimodal health benchmarks.
 
276
 
277
- | Task and metric | Gemma 3 4B | MedGemma 4B | Gemma 3 27B | MedGemma 27B multimodal |
278
- | :-------------------------------------------- | :--------: | :---------: | :---------: | :---------------------: |
279
- | **Medical image classification** | | | | |
280
- | MIMIC CXR\*\* - macro F1 for top 5 conditions | 81.2 | 88.9 | 71.7 | 90.0 |
281
- | CheXpert CXR - macro F1 for top 5 conditions | 32.6 | 48.1 | 26.2 | 49.9 |
282
- | CXR14 - macro F1 for 3 conditions | 32.0 | 50.1 | 31.4 | 45.3 |
283
- | PathMCQA\* (histopathology, internal\*\*) - Accuracy | 37.1 | 69.8 | 42.2 | 71.6 |
284
- | US-DermMCQA\* - Accuracy | 52.5 | 71.8 | 66.9 | 71.7 |
285
- | EyePACS\* (fundus, internal) - Accuracy | 14.4 | 64.9 | 20.3 | 75.3 |
286
- | **Visual question answering** | | | | |
287
- | SLAKE (radiology) - Tokenized F1 | 40.2 | 72.3 | 42.5 | 70.0 |
288
- | VQA-RAD\*\*\* (radiology) - Tokenized F1 | 33.6 | 49.9 | 42.7 | 46.7 |
289
- | **Knowledge and reasoning** | | | | |
290
- | MedXpertQA (text + multimodal questions) - Accuracy | 16.4 | 18.8 | 22.0 | 26.8 |
291
 
292
- \* Internal datasets. US-DermMCQA is described in [Liu (2020, Nature medicine)](https://www.nature.com/articles/s41591-020-0842-3), presented as a 4-way MCQ per example for skin condition classification. PathMCQA is based on multiple datasets, presented as 3-9 way MCQ per example for identification, grading, and subtype for breast, cervical, and prostate cancer. EyePACS is a dataset of fundus images with classification labels based on 5-level diabetic retinopathy severity (None, Mild, Moderate, Severe, Proliferative). More details in the [MedGemma Technical Report](https://arxiv.org/abs/2507.05201).
293
 
294
- \*\* Based on radiologist adjudicated labels, described in [Yang (2024, arXiv)](https://arxiv.org/pdf/2405.03162) Section A.1.1.
 
295
 
296
- \*\*\*Based on "balanced split," described in [Yang (2024, arXiv)](https://arxiv.org/pdf/2405.03162).
 
297
 
298
- ##### Chest X-ray report generation
299
 
300
- MedGemma chest X-ray (CXR) report generation performance was evaluated on [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.1.0/) using the [RadGraph F1 metric](https://arxiv.org/abs/2106.14463). We compare the MedGemma pre-trained checkpoint with our previous best model for CXR report generation, [PaliGemma 2](https://arxiv.org/abs/2412.03555).
301
 
302
- | Metric | MedGemma 4B (pre-trained) | MedGemma 4B (tuned for CXR) | MedGemma 27B multimodal (pre-trained)\* | PaliGemma 2 3B (tuned for CXR) | PaliGemma 2 10B (tuned for CXR) |
303
- | :---------------------------------- | :-----------------------: | :-------------------------: | :-------------------------------------: | :----------------------------: | :-----------------------------: |
304
- | **Chest X-ray report generation** | | | | | |
305
- | MIMIC CXR - RadGraph F1 | 29.5 | 30.3 | 27.0 | 28.8 | 29.5 |
306
 
307
- \*Not released
308
 
309
- The instruction-tuned versions of MedGemma 4B and MedGemma 27B achieve lower scores (21.9 and 21.3, respectively) due to the differences in reporting style compared to the MIMIC ground truth reports. Further fine-tuning on MIMIC reports enables users to achieve improved performance, as shown by the improved performance of the MedGemma 4B model that was tuned for CXR.
310
 
311
- ##### Text evaluations
312
 
313
- MedGemma 4B and text-only MedGemma 27B were evaluated across a range of text-only benchmarks for medical knowledge and reasoning.
 
314
 
315
- The MedGemma models outperform their respective base Gemma models across all tested text-only health benchmarks.
 
316
 
317
- | Metric | Gemma 3 4B | MedGemma 4B | Gemma 3 27B | MedGemma 27B text-only | MedGemma 27B multimodal |
318
- | :---------------------------------- | :--------: | :---------: | :---------: | :--------------------------------: | :----------------------------: |
319
- | MedQA (4-op) | 50.7 | 64.4 | 74.9 | 89.8 (best-of-5) <br> 87.7 (0-shot) | 87.0 (best of 5) <br> 85.3 (0-shot) |
320
- | MedMCQA | 45.4 | 55.7 | 62.6 | 74.2 | 70.2 |
321
- | PubMedQA | 68.4 | 73.4 | 73.4 | 76.8 | 77.2 |
322
- | MMLU Med | 67.2 | 70.0 | 83.3 | 87.0 | 86.2 |
323
- | MedXpertQA (text only) | 11.6 | 14.2 | 15.7 | 25.7 | 23.7 |
324
- | AfriMed-QA (25 question test set) | 48.0 | 52.0 | 72.0 | 84.0 | 72.0 |
325
 
326
- For all MedGemma 27B results, [test-time scaling](https://arxiv.org/abs/2501.19393) is used to improve performance.
 
327
 
328
- ##### Medical Record Evaluations
329
 
330
- All models were evaluated on a question-answer dataset drawn from synthetic FHIR data, answering questions about patient records. MedGemma 27B multimodal’s FHIR-specific training gives it a significant improvement over the other MedGemma and Gemma models.
331
 
332
  | Metric | Gemma 3 4B | MedGemma 4B | Gemma 3 27B | MedGemma 27B text-only | MedGemma 27B multimodal |
333
- | :----: | :--------: | :---------: | :---------: | :--------------------: | :---------------------: |
334
- | EHRQA | 70.9 | 67.6 | 84.2 | 86.3 | 90.5 |
335
 
336
  ### Ethics and safety evaluation
337
 
338
  #### Evaluation approach
339
 
340
- Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
341
-
342
- * **Child safety**: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
343
- * **Content safety:** Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
344
- * **Representational harms**: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.
345
- * **General medical harms:** Evaluation of text-to-text and image-to-text prompts covering safety policies, including information quality and harmful associations or inaccuracies.
346
-
347
- In addition to development-level evaluations, we conduct "assurance evaluations," which are our "arms-length" internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
348
 
349
  #### Evaluation results
350
 
351
- For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms. All testing was conducted without safety filters to evaluate the model capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, and across both MedGemma model sizes, the model produced minimal policy violations. A limitation of our evaluations was that they included primarily English language prompts.
352
 
353
  ## Data card
354
 
@@ -356,70 +468,195 @@ For all areas of safety testing, we saw safe levels of performance across the ca
356
 
357
  #### Training
358
 
359
- The base Gemma models are pre-trained on a large corpus of text and code data. MedGemma multimodal variants utilize a [SigLIP](https://arxiv.org/abs/2303.15343) image encoder that has been specifically pre-trained on a variety of de-identified medical data, including radiology images, histopathology images, ophthalmology images, and dermatology images. Their LLM component is trained on a diverse set of medical data, including medical text, medical question-answer pairs, FHIR-based electronic health record data (27B multimodal only), radiology images, histopathology patches, ophthalmology images, and dermatology images.
360
 
361
  #### Evaluation
362
 
363
- MedGemma models have been evaluated on a comprehensive set of clinically relevant benchmarks, including over 22 datasets across 6 different tasks and 4 medical image modalities. These benchmarks include both open and internal datasets.
364
 
365
  #### Source
366
 
367
  MedGemma utilizes a combination of public and private datasets.
368
 
369
- This model was trained on diverse public datasets including MIMIC-CXR (chest X-rays and reports), ChestImaGenome (bounding boxes linking image findings with anatomical regions for MIMIC-CXR; MedGemma 27B multimodal only), SLAKE (multimodal medical images and questions), PAD-UFES-20 (skin lesion images and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON (lymph node histopathology images), PMC-OA (biomedical literature with images), and Mendeley Digital Knee X-Ray (knee X-rays).
370
-
371
- Additionally, multiple diverse proprietary datasets were licensed and incorporated (described next).
372
-
373
- ### Data Ownership and Documentation
374
-
375
- * [**MIMIC-CXR**](https://physionet.org/content/mimic-cxr/2.1.0/)**:** MIT Laboratory for Computational Physiology and Beth Israel Deaconess Medical Center (BIDMC).
376
- * [**SLAKE**](https://www.med-vqa.com/slake/)**:** The Hong Kong Polytechnic University (PolyU), with collaborators including West China Hospital of Sichuan University and Sichuan Academy of Medical Sciences / Sichuan Provincial People's Hospital.
377
- * [**PAD-UFES-20**](https://pmc.ncbi.nlm.nih.gov/articles/PMC7479321/)**:** Federal University of Espírito Santo (UFES), Brazil, through its Dermatological and Surgical Assistance Program (PAD).
378
- * [**SCIN**](https://github.com/google-research-datasets/scin)**:** A collaboration between Google Health and Stanford Medicine.
379
- * [**TCGA**](https://portal.gdc.cancer.gov/) **(The Cancer Genome Atlas):** A joint effort of the National Cancer Institute and the National Human Genome Research Institute. Data from TCGA are available via the Genomic Data Commons (GDC).
380
- * [**CAMELYON**](https://camelyon17.grand-challenge.org/Data/)**:** The data was collected from Radboud University Medical Center and University Medical Center Utrecht in the Netherlands.
381
- * [**PMC-OA (PubMed Central Open Access Subset)**](https://catalog.data.gov/dataset/pubmed-central-open-access-subset-pmc-oa)**:** Maintained by the National Library of Medicine (NLM) and National Center for Biotechnology Information (NCBI), which are part of the NIH.
382
- * [**Mendeley Digital Knee X-Ray**](https://data.mendeley.com/datasets/t9ndx37v5h/1)**:** This dataset is from Rani Channamma University, and is hosted on Mendeley Data.
383
- * [**VQA-RAD**](https://www.nature.com/articles/sdata2018251)**:** This dataset was created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman and their affiliated institutions (the US National Library of Medicine and National Institutes of Health).
384
- * [**Chest ImaGenome**](https://physionet.org/content/chest-imagenome/1.0.0/)**:** IBM Research.
385
- * [**MedQA**](https://arxiv.org/pdf/2009.13081)**:** This dataset was created by a team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits.
386
- * [**AfriMed-QA**](https://afrimedqa.com/)**:** This dataset was developed by multiple collaborating organizations and researchers, including key contributors Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of Technology, and MasakhaneNLP.
387
- * [**MedExpQA**](https://www.sciencedirect.com/science/article/pii/S0933365724001805)**:** This dataset was created by researchers at the HiTZ Center (Basque Center for Language Technology and Artificial Intelligence).
388
- * [**MedXpertQA**](https://huggingface.co/datasets/TsinghuaC3I/MedXpertQA)**:** This dataset was developed by researchers at Tsinghua University (Beijing, China) and Shanghai Artificial Intelligence Laboratory (Shanghai, China).
389
- * [**HealthSearchQA**](https://huggingface.co/datasets/katielink/healthsearchqa)**:** This dataset consists of 3,173 commonly searched consumer questions.
390
-
391
- In addition to the public datasets listed above, MedGemma was also trained on de-identified, licensed datasets or datasets collected internally at Google from consented participants.
392
-
393
- * **Radiology dataset 1:** De-identified dataset of different CT studies across body parts from a US-based radiology outpatient diagnostic center network.
394
- * **Ophthalmology dataset 1 (EyePACS):** De-identified dataset of fundus images from diabetic retinopathy screening.
395
- * **Dermatology dataset 1:** De-identified dataset of teledermatology skin condition images (both clinical and dermatoscopic) from Colombia.
396
- * **Dermatology dataset 2:** De-identified dataset of skin cancer images (both clinical and dermatoscopic) from Australia.
397
- * **Dermatology dataset 3:** De-identified dataset of non-diseased skin images from an internal data collection effort.
398
- * **Pathology dataset 1:** De-identified dataset of histopathology H&E whole slide images created in collaboration with an academic research hospital and biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes.
399
- * **Pathology dataset 2:** De-identified dataset of lung histopathology H&E and IHC whole slide images created by a commercial biobank in the United States.
400
- * **Pathology dataset 3:** De-identified dataset of prostate and lymph node H&E and IHC histopathology whole slide images created by a contract research organization in the United States.
401
- * **Pathology dataset 4:** De-identified dataset of histopathology whole slide images created in collaboration with a large, tertiary teaching hospital in the United States. Comprises a diverse set of tissue and stain types, predominantly H&E.
402
- * **EHR dataset 1:** Question/answer dataset drawn from synthetic FHIR records created by [Synthea](https://synthetichealth.github.io/synthea/). The test set includes 19 unique patients with 200 questions per patient divided into 10 different categories.
403
 
404
  ### Data citation
405
 
406
- * **MIMIC-CXR:** Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng, S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet. [https://physionet.org/content/mimic-cxr/2.1.0/](https://physionet.org/content/mimic-cxr/2.1.0/) *and* Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of Chest Radiographs with Free-Text Reports." *Scientific Data 6* (1): 1–8.
407
- * **SLAKE:** Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering." [http://arxiv.org/abs/2102.09542](http://arxiv.org/abs/2102.09542).
408
- * **PAD-UFES-20:** Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones." *Data in brief* 32 (2020): 106221.
409
- * **SCIN:** Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements." *JAMA Network Open 7* (11): e2446615–e2446615.
410
- * **TCGA:** The results shown here are in whole or part based upon data generated by the TCGA Research Network: [https://www.cancer.gov/tcga](https://www.cancer.gov/tcga).
411
- * **CAMELYON16:** Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M. van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer." *JAMA 318* (22): 2199–2210.
412
- * **Mendeley Digital Knee X-Ray:** Gornale, Shivanand; Patravali, Pooja (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi: 10.17632/t9ndx37v5h.1
413
- * **VQA-RAD:** Lau, Jason J., Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. "A Dataset of Clinically Generated Visual Questions and Answers about Radiology Images." *Scientific Data 5* (1): 1–10.
414
- * **Chest ImaGenome:** Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio, J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi, L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. [https://doi.org/10.13026/wv01-y230](https://doi.org/10.13026/wv01-y230)
415
- * **MedQA:** Jin, Di, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. "What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams." [http://arxiv.org/abs/2009.13081](http://arxiv.org/abs/2009.13081).
416
- * **AfriMed-QA:** Olatunji, Tobi, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, et al. 2024. "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset." [http://arxiv.org/abs/2411.15640](http://arxiv.org/abs/2411.15640).
417
- * **MedExpQA:** Alonso, I., Oronoz, M., & Agerri, R. (2024). MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering. *arXiv preprint arXiv:2404.05590*. Retrieved from [https://arxiv.org/abs/2404.05590](https://arxiv.org/abs/2404.05590)
418
- * **MedXpertQA:** Zuo, Yuxin, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025. "MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding." [http://arxiv.org/abs/2501.18362](http://arxiv.org/abs/2501.18362).
419
 
420
  ### De-identification/anonymization:
421
 
422
- Google and its partners utilize datasets that have been rigorously anonymized or de-identified to ensure the protection of individual research participants and patient privacy.
 
 
423
 
424
  ## Implementation information
425
 
@@ -429,33 +666,75 @@ Details about the model internals.
429
 
430
  Training was done using [JAX](https://github.com/jax-ml/jax).
431
 
432
- JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.
 
433
 
434
  ## Use and limitations
435
 
436
  ### Intended use
437
 
438
- MedGemma is an open multimodal generative AI model intended to be used as a starting point that enables more efficient development of downstream healthcare applications involving medical text and images. MedGemma is intended for developers in the life sciences and healthcare space. Developers are responsible for training, adapting and making meaningful changes to MedGemma to accomplish their specific intended use. MedGemma models can be fine-tuned by developers using their own proprietary data for their specific tasks or solutions.
439
-
440
- MedGemma is based on Gemma 3 and has been further trained on medical images and text. MedGemma enables further development in any medical context (image and textual); however, the model was pre-trained using chest X-ray, pathology, dermatology, and fundus images. Examples of tasks within MedGemma's training include visual question answering pertaining to medical images, such as radiographs, or providing answers to textual medical questions. Full details of all the tasks on which MedGemma has been evaluated can be found in the [MedGemma Technical Report](https://arxiv.org/abs/2507.05201).
441
 
442
  ### Benefits
443
 
444
- * Provides strong baseline medical image and text comprehension for models of its size.
445
- * This strong performance makes it efficient to adapt for downstream healthcare-based use cases, compared to models of similar size without medical data pre-training.
446
- * This adaptation may involve prompt engineering, grounding, agentic orchestration or fine-tuning depending on the use case, baseline validation requirements, and desired performance characteristics.
447
 
448
  ### Limitations
449
 
450
- MedGemma is not intended to be used without appropriate validation, adaptation, and/or meaningful modification by developers for their specific use case. The outputs generated by MedGemma are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. Performance benchmarks highlight baseline capabilities on relevant benchmarks, but even for image and text domains that constitute a substantial portion of training data, inaccurate model output is possible. All outputs from MedGemma should be considered preliminary and require independent verification, clinical correlation, and further investigation through established research and development methodologies.
451
-
452
- MedGemma's multimodal capabilities have been primarily evaluated on single-image tasks. MedGemma has not been evaluated in use cases that involve comprehension of multiple images.
453
 
454
  MedGemma has not been evaluated or optimized for multi-turn applications.
455
 
456
- MedGemma's training may make it more sensitive to the specific prompt used than Gemma 3.
 
457
 
458
When adapting MedGemma, developers should consider the following:
459
 
460
- * **Bias in validation data:** As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, condition, imaging device, etc.).
461
- * **Data contamination concerns**: When evaluating the generalization capabilities of a large model like MedGemma in a medical context, there is a risk of data contamination, where the model might have inadvertently seen related medical information during its pre-training, potentially overestimating its true ability to generalize to novel medical concepts. Developers should validate MedGemma on datasets not publicly available or otherwise made available to non-institutional researchers to mitigate this risk.
 
53
 
54
  **Resources:**
55
 
56
+ * Model on Google Cloud Model Garden: [MedGemma](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medgemma)
57
+ * Models on Hugging Face: [Collection](https://huggingface.co/collections/google/medgemma-release-680aade845f90bec6a3f60c4)
58
+ * GitHub repository (supporting code, Colab notebooks, discussions, and
59
+ issues): [MedGemma](https://github.com/google-health/medgemma)
60
+ * Quick start notebook: [GitHub](https://github.com/google-health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb)
61
+ * Fine-tuning notebook: [GitHub](https://github.com/google-health/medgemma/blob/main/notebooks/fine_tune_with_hugging_face.ipynb)
62
+ * Concept applications built using MedGemma: [Collection](https://huggingface.co/collections/google/medgemma-concept-apps-686ea036adb6d51416b0928a)
63
+ * Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/medgemma/get-started.md#contact)
64
+ * License: The use of MedGemma is governed by the [Health AI Developer
65
+ Foundations terms of
66
+ use](https://developers.google.com/health-ai-developer-foundations/terms).
67
 
68
  **Author:** Google
69
 
 
73
 
74
  ### Description
75
 
76
+ MedGemma is a collection of [Gemma 3](https://ai.google.dev/gemma/docs/core)
77
+ variants that are trained for performance on medical text and image
78
+ comprehension. Developers can use MedGemma to accelerate building
79
+ healthcare-based AI applications. MedGemma currently comes in three variants: a
80
+ 4B multimodal version and 27B text-only and multimodal versions.
81
+
82
+ Both MedGemma multimodal versions utilize a
83
+ [SigLIP](https://arxiv.org/abs/2303.15343) image encoder that has been
84
+ specifically pre-trained on a variety of de-identified medical data, including
85
+ chest X-rays, dermatology images, ophthalmology images, and histopathology
86
+ slides. Their LLM components are trained on a diverse set of medical data,
87
+ including medical text, medical question-answer pairs, FHIR-based electronic
88
+ health record data (27B multimodal only), radiology images, histopathology
89
+ patches, ophthalmology images, and dermatology images.
90
+
91
+ MedGemma 4B is available in both pre-trained (suffix: `-pt`) and
92
+ instruction-tuned (suffix: `-it`) versions. The instruction-tuned version is a
93
+ better starting point for most applications. The pre-trained version is
94
+ available for those who want to experiment more deeply with the models.
95
+
96
+ MedGemma 27B multimodal has been pre-trained on medical image and medical
97
+ record comprehension tasks. MedGemma 27B text-only has been trained
98
+ exclusively on medical text. Both models have been optimized for inference-time
99
+ computation on medical reasoning, which gives the text-only variant slightly
100
+ higher performance on some text benchmarks than MedGemma 27B multimodal. Users
101
+ who want a single model for medical text, medical record, and medical image
102
+ tasks are better served by MedGemma 27B multimodal; those with text-only use
103
+ cases may be better served by the text-only variant. Both MedGemma 27B
104
+ variants are available only in instruction-tuned versions.
105
+
106
+ MedGemma variants have been evaluated on a range of clinically relevant
107
+ benchmarks to illustrate their baseline performance. These evaluations are based
108
+ on both open benchmark datasets and curated datasets. Developers can fine-tune
109
+ MedGemma variants for improved performance. Consult the [Intended
110
+ use](#intended-use) section below for more details.
111
+
112
+ MedGemma is optimized for medical applications that involve a text generation
113
+ component. For medical image-based applications that do not involve text
114
+ generation, such as data-efficient classification, zero-shot classification, or
115
+ content-based or semantic image retrieval, the [MedSigLIP image
116
+ encoder](https://developers.google.com/health-ai-developer-foundations/medsiglip/model-card)
117
+ is recommended. MedSigLIP is based on the same image encoder that powers
118
+ MedGemma.
119
+
120
+ Please consult the [MedGemma Technical Report](https://arxiv.org/abs/2507.05201)
121
+ for more details.
122
 
123
  ### How to use
124
 
125
+ Below are some example code snippets to help you quickly get started running the
126
+ model locally on GPU. If you want to use the model at scale, we recommend that
127
+ you create a production version using [Model
128
+ Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/medgemma).
129
 
130
  First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
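A minimal install sketch (the version pin mirrors the sentence above; `accelerate` is an optional assumption for device placement, not a stated requirement of this card):

```bash
# Transformers >= 4.50.0 is required for Gemma 3 / MedGemma support.
pip install -U "transformers>=4.50.0" accelerate
```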
131
 
 
175
  "role": "user",
176
  "content": [
177
  {"type": "text", "text": "Describe this X-ray"},
178
+ {"type": "image", "image": image}
179
  ]
180
  }
181
  ]
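For orientation, here is one minimal end-to-end sketch of how a message list like the one above can be passed to the Transformers chat-style pipeline. The model ID, image path, dtype, and generation length are illustrative assumptions, not the card's fixed recipe:

```python
# Hedged sketch: run MedGemma via the image-text-to-text pipeline.
# "google/medgemma-4b-it", the image file, and max_new_tokens are assumptions.
import torch
from PIL import Image
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

image = Image.open("chest_xray.png")  # any local X-ray image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this X-ray"},
            {"type": "image", "image": image}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
# The pipeline returns the full chat; the last turn is the model's reply.
print(output[0]["generated_text"][-1]["content"])
```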
 
264
 
265
  See the following Colab notebooks for examples of how to use MedGemma:
266
 
267
+ * To give the model a quick try, running it locally with weights from Hugging
268
+ Face, see [Quick start notebook in
269
+ Colab](https://colab.research.google.com/github/google-health/medgemma/blob/main/notebooks/quick_start_with_hugging_face.ipynb).
270
+ Note that you will need to use Colab Enterprise to obtain adequate GPU
271
+ resources to run either 27B model without quantization (a quantized-loading sketch follows this list).
272
+
273
+ * For an example of fine-tuning the 4B model, see the [Fine-tuning notebook in
274
+ Colab](https://colab.research.google.com/github/google-health/medgemma/blob/main/notebooks/fine_tune_with_hugging_face.ipynb).
275
+ The 27B models can be fine-tuned in a similar manner (see the LoRA sketch
276
+ after this list) but will require more time and compute resources than the 4B model.
277
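If Colab Enterprise is not an option, quantized loading is one common workaround for fitting a 27B model on a smaller GPU. The sketch below is an illustration under stated assumptions (the model ID, bitsandbytes availability, and memory headroom are assumptions, not guarantees from this card):

```python
# Hedged sketch: load the 27B model with 4-bit quantization via bitsandbytes.
# The model ID and settings are assumptions, not requirements.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

model_id = "google/medgemma-27b-it"  # assumed 27B multimodal checkpoint name
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```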
 
278
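And a minimal LoRA setup in the spirit of the fine-tuning notebook, using the peft library; the rank, target modules, and other hyperparameters are illustrative assumptions rather than the notebook's exact configuration:

```python
# Hedged sketch: attach LoRA adapters to MedGemma 4B for parameter-efficient
# fine-tuning. Hyperparameters here are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "google/medgemma-4b-it", torch_dtype=torch.bfloat16, device_map="auto"
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From here, training proceeds with any standard supervised trainer (for example, TRL's SFTTrainer).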
  ### Model architecture overview
279
 
280
+ The MedGemma model is built on [Gemma 3](https://ai.google.dev/gemma/) and
281
+ uses the same decoder-only transformer architecture as Gemma 3. To read more
282
+ about the architecture, consult the Gemma 3 [model
283
+ card](https://ai.google.dev/gemma/docs/core/model_card_3).
284
 
285
  ### Technical specifications
286
 
287
+ * **Model type**: Decoder-only Transformer architecture, see the [Gemma 3
288
+ Technical
289
+ Report](https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf)
290
+ * **Input modalities**: **4B and 27B multimodal**: Text, vision; **27B text**: Text only
291
+ * **Output modality:** Text only (all models)
292
+ * **Attention mechanism**: Grouped-query attention (GQA)
293
+ * **Context length**: Supports long context, at least 128K tokens
294
+ * **Key publication**: [https://arxiv.org/abs/2507.05201](https://arxiv.org/abs/2507.05201)
295
+ * **Model created**: July 9, 2025
296
+ * **Model version**: 1.0.0
297
 
298
  ### Citation
299
 
300
+ When using this model, please cite: Sellergren et al. "MedGemma Technical
301
+ Report." *arXiv preprint arXiv:2507.05201* (2025).
302
+
303
+ ```none
304
+ @article{sellergren2025medgemma,
305
+ title={MedGemma Technical Report},
306
+ author={Sellergren, Andrew and Kazemzadeh, Sahar and Jaroensri, Tiam and Kiraly, Atilla and Traverse, Madeleine and Kohlberger, Timo and Xu, Shawn and Jamil, Fayaz and Hughes, Cían and Lau, Charles and others},
307
+ journal={arXiv preprint arXiv:2507.05201},
308
+ year={2025}
309
+ }
310
+ ```
311
 
 
312
  ### Inputs and outputs
313
 
314
  **Input**:
315
 
316
+ * Text string, such as a question or prompt
317
+ * Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
318
+ * Total input length of up to 128K tokens
319
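As a quick budget check using the figures above: a prompt with ten images consumes 10 × 256 = 2,560 image tokens, so even image-heavy prompts leave the vast majority of the 128K-token context available for text.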
 
320
  **Output**:
321
 
322
+ * Generated text in response to the input, such as an answer to a question,
323
+ analysis of image content, or a summary of a document
324
+ * Total output length of up to 8192 tokens
325
 
326
  ### Performance and validation
327
 
328
+ MedGemma was evaluated across a range of different multimodal classification,
329
+ report generation, visual question answering, and text-based tasks.
330
 
331
+ ### Key performance metrics
332
 
333
+ #### Imaging evaluations
334
 
335
+ The multimodal performance of MedGemma 4B and 27B multimodal was evaluated
336
+ across a range of benchmarks, focusing on radiology, dermatology,
337
+ histopathology, ophthalmology, and multimodal clinical reasoning.
338
 
339
+ MedGemma 4B outperforms the base Gemma 3 4B model across all tested multimodal
340
+ health benchmarks.
341
 
342
+ | Task and metric | Gemma 3 4B | MedGemma 4B | Gemma 3 27B | MedGemma 27B multimodal |
343
+ | :---- | :---- | :---- | :---- | :---- |
344
+ | **Medical image classification** | | | | |
345
+ | MIMIC CXR** - macro F1 for top 5 conditions | 81.2 | 88.9 | 71.7 | 90.0 |
346
+ | CheXpert CXR - macro F1 for top 5 conditions | 32.6 | 48.1 | 26.2 | 49.9 |
347
+ | CXR14 - macro F1 for 3 conditions | 32.0 | 50.1 | 31.4 | 45.3 |
348
+ | PathMCQA* (histopathology, internal**) - Accuracy | 37.1 | 69.8 | 42.2 | 71.6 |
349
+ | US-DermMCQA* - Accuracy | 52.5 | 71.8 | 66.9 | 71.7 |
350
+ | EyePACS* (fundus, internal) - Accuracy | 14.4 | 64.9 | 20.3 | 75.3 |
351
+ | **Visual question answering** | | | | |
352
+ | SLAKE (radiology) - Tokenized F1 | 40.2 | 72.3 | 42.5 | 70.0 |
353
+ | VQA-RAD*** (radiology) - Tokenized F1 | 33.6 | 49.9 | 42.7 | 46.7 |
354
+ | **Knowledge and reasoning** | | | | |
355
+ | MedXpertQA (text + multimodal questions) - Accuracy | 16.4 | 18.8 | 22.0 | 26.8 |
356
 
357
+ *Internal datasets. US-DermMCQA is described in [Liu (2020, Nature
358
+ medicine)](https://www.nature.com/articles/s41591-020-0842-3), presented as a
359
+ 4-way MCQ per example for skin condition classification. PathMCQA is based on
360
+ multiple datasets, presented as 3-9 way MCQ per example for identification,
361
+ grading, and subtype for breast, cervical, and prostate cancer. EyePACS is a
362
+ dataset of fundus images with classification labels based on 5-level diabetic
363
+ retinopathy severity (None, Mild, Moderate, Severe, Proliferative). More details
364
+ in the [MedGemma Technical Report](https://arxiv.org/abs/2507.05201).
365
 
366
+ **Based on radiologist adjudicated labels, described in [Yang (2024,
367
+ arXiv)](https://arxiv.org/pdf/2405.03162) Section A.1.1.
368
 
369
+ ***Based on "balanced split," described in [Yang (2024,
370
+ arXiv)](https://arxiv.org/pdf/2405.03162).
371
 
372
+ #### Chest X-ray report generation
373
 
374
+ MedGemma chest X-ray (CXR) report generation performance was evaluated on
375
+ [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.1.0/) using the [RadGraph
376
+ F1 metric](https://arxiv.org/abs/2106.14463). We compare the MedGemma
377
+ pre-trained checkpoint with our previous best model for CXR report generation,
378
+ [PaliGemma 2](https://arxiv.org/abs/2412.03555).
379
 
380
+ | Metric | MedGemma 4B (pre-trained) | MedGemma 4B (tuned for CXR) | MedGemma 27B multimodal (pre-trained)* | PaliGemma 2 3B (tuned for CXR) | PaliGemma 2 10B (tuned for CXR) |
381
+ | :---- | :---- | :---- | :---- | :---- | :---- |
382
+ | **Chest X-ray report generation** | | | | | |
383
+ | MIMIC CXR - RadGraph F1 | 29.5 | 30.3 | 27.0 | 28.8 | 29.5 |
384
 
385
+ *Not released
386
 
387
+ The instruction-tuned versions of MedGemma 4B and MedGemma 27B achieve lower
388
+ scores (21.9 and 21.3, respectively) due to the differences in reporting style
389
+ compared to the MIMIC ground truth reports. Further fine-tuning on MIMIC reports
390
+ enables users to achieve improved performance, as shown by the improved
391
+ performance of the MedGemma 4B model that was tuned for CXR.
392
 
393
+ #### Text evaluations
394
 
395
+ MedGemma 4B and text-only MedGemma 27B were evaluated across a range of
396
+ text-only benchmarks for medical knowledge and reasoning.
397
 
398
+ The MedGemma models outperform their respective base Gemma models across all
399
+ tested text-only health benchmarks.
400
 
401
+ | Metric | Gemma 3 4B | MedGemma 4B | Gemma 3 27B | MedGemma 27B text-only | MedGemma 27B multimodal |
402
+ | :---- | :---- | :---- | :---- | :---- | :---- |
403
+ | MedQA (4-op) | 50.7 | 64.4 | 74.9 | 89.8 (best-of-5) <br> 87.7 (0-shot) | 87.0 (best-of-5) <br> 85.3 (0-shot) |
404
+ | MedMCQA | 45.4 | 55.7 | 62.6 | 74.2 | 70.2 |
405
+ | PubMedQA | 68.4 | 73.4 | 73.4 | 76.8 | 77.2 |
406
+ | MMLU Med | 67.2 | 70.0 | 83.3 | 87.0 | 86.2 |
407
+ | MedXpertQA (text only) | 11.6 | 14.2 | 15.7 | 25.7 | 23.7 |
408
+ | AfriMed-QA (25 question test set) | 48.0 | 52.0 | 72.0 | 84.0 | 72.0 |
409
 
410
+ For all MedGemma 27B results, [test-time
411
+ scaling](https://arxiv.org/abs/2501.19393) is used to improve performance.
412
 
413
+ #### Medical record evaluations
414
 
415
+ All models were evaluated on a question-answer dataset drawn from synthetic
416
+ FHIR data, answering questions about patient records. MedGemma 27B multimodal's
417
+ FHIR-specific training gives it a significant improvement over the other
418
+ MedGemma and Gemma models.
419
 
420
  | Metric | Gemma 3 4B | MedGemma 4B | Gemma 3 27B | MedGemma 27B text-only | MedGemma 27B multimodal |
421
+ | :---- | :---- | :---- | :---- | :---- | :---- |
422
+ | EHRQA | 70.9 | 67.6 | 84.2 | 86.3 | 90.5 |
423
 
424
  ### Ethics and safety evaluation
425
 
426
  #### Evaluation approach
427
 
428
+ Our evaluation methods include structured evaluations and internal red-teaming
429
+ testing of relevant content policies. Red-teaming was conducted by a number of
430
+ different teams, each with different goals and human evaluation metrics. These
431
+ models were evaluated against a number of different categories relevant to
432
+ ethics and safety, including:
433
+
434
+ * **Child safety**: Evaluation of text-to-text and image-to-text prompts
435
+ covering child safety policies, including child sexual abuse and
436
+ exploitation.
437
+ * **Content safety**: Evaluation of text-to-text and image-to-text prompts
438
+ covering safety policies, including harassment, violence and gore, and hate
439
+ speech.
440
+ * **Representational harms**: Evaluation of text-to-text and image-to-text
441
+ prompts covering safety policies, including bias, stereotyping, and harmful
442
+ associations or inaccuracies.
443
+ * **General medical harms**: Evaluation of text-to-text and image-to-text
444
+ prompts covering safety policies, including information quality and harmful
445
+ associations or inaccuracies.
446
+
447
+ In addition to development-level evaluations, we conduct "assurance evaluations,"
448
+ which are our "arms-length" internal evaluations for responsibility governance
449
+ decision making. They are conducted separately from the model development team,
450
+ to inform decision making about release. High-level findings are fed back to the
451
+ model team, but prompt sets are held out to prevent overfitting and preserve the
452
+ results' ability to inform decision making. Notable assurance evaluation results
453
+ are reported to our Responsibility & Safety Council as part of release review.
454
 
455
  #### Evaluation results
456
 
457
+ For all areas of safety testing, we saw safe levels of performance across the
458
+ categories of child safety, content safety, and representational harms. All
459
+ testing was conducted without safety filters to evaluate the model capabilities
460
+ and behaviors. For text-to-text, image-to-text, and audio-to-text, and across
461
+ both MedGemma model sizes, the model produced minimal policy violations. A
462
+ limitation of our evaluations was that they included primarily English language
463
+ prompts.
464
 
465
  ## Data card
466
 
 
468
 
469
  #### Training
470
 
471
+ The base Gemma models are pre-trained on a large corpus of text and code data.
472
+ MedGemma multimodal variants utilize a
473
+ [SigLIP](https://arxiv.org/abs/2303.15343) image encoder that has been
474
+ specifically pre-trained on a variety of de-identified medical data, including
475
+ radiology images, histopathology images, ophthalmology images, and dermatology
476
+ images. Their LLM component is trained on a diverse set of medical data,
477
+ including medical text, medical question-answer pairs, FHIR-based electronic
478
+ health record data (27B multimodal only), radiology images, histopathology
479
+ patches, ophthalmology images, and dermatology images.
480
 
481
  #### Evaluation
482
 
483
+ MedGemma models have been evaluated on a comprehensive set of clinically
484
+ relevant benchmarks, including over 22 datasets across 6 different tasks and 4
485
+ medical image modalities. These benchmarks include both open and internal
486
+ datasets.
487
 
488
  #### Source
489
 
490
  MedGemma utilizes a combination of public and private datasets.
491
 
492
+ This model was trained on diverse public datasets including MIMIC-CXR (chest
493
+ X-rays and reports), ChestImaGenome (bounding boxes linking image
494
+ findings with anatomical regions for MIMIC-CXR; MedGemma 27B multimodal only),
495
+ SLAKE (multimodal medical images and questions), PAD-UFES-20 (skin lesion images
496
+ and data), SCIN (dermatology images), TCGA (cancer genomics data), CAMELYON
497
+ (lymph node histopathology images), PMC-OA (biomedical literature with images),
498
+ and Mendeley Digital Knee X-Ray (knee X-rays).
499
+
500
+ Additionally, multiple diverse proprietary datasets were licensed and
501
+ incorporated (described next).
502
+
503
+ ### Data ownership and documentation
504
+
505
+ * [MIMIC-CXR](https://physionet.org/content/mimic-cxr/2.1.0/): MIT Laboratory
506
+ for Computational Physiology and Beth Israel Deaconess Medical Center
507
+ (BIDMC).
508
+ * [Slake-VQA](https://www.med-vqa.com/slake/): The Hong Kong Polytechnic
509
+ University (PolyU), with collaborators including West China Hospital of
510
+ Sichuan University and Sichuan Academy of Medical Sciences / Sichuan
511
+ Provincial People's Hospital.
512
+ * [PAD-UFES-20](https://pmc.ncbi.nlm.nih.gov/articles/PMC7479321/): Federal
513
+ University of Espírito Santo (UFES), Brazil, through its Dermatological and
514
+ Surgical Assistance Program (PAD).
515
+ * [SCIN](https://github.com/google-research-datasets/scin): A collaboration
516
+ between Google Health and Stanford Medicine.
517
+ * [TCGA](https://portal.gdc.cancer.gov/) (The Cancer Genome Atlas): A joint
518
+ effort of the National Cancer Institute and the National Human Genome Research
519
+ Institute. Data from TCGA are available via the Genomic Data Commons (GDC).
520
+ * [CAMELYON](https://camelyon17.grand-challenge.org/Data/): The data was
521
+ collected from Radboud University Medical Center and University Medical
522
+ Center Utrecht in the Netherlands.
523
+ * [PMC-OA (PubMed Central Open Access
524
+ Subset)](https://catalog.data.gov/dataset/pubmed-central-open-access-subset-pmc-oa):
525
+ Maintained by the National Library of Medicine (NLM) and National Center for
526
+ Biotechnology Information (NCBI), which are part of the NIH.
527
+ * [MedQA](https://arxiv.org/pdf/2009.13081): This dataset was created by a
528
+ team of researchers led by Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung
529
+ Weng, Hanyi Fang, and Peter Szolovits.
530
+ * [Mendeley Digital Knee
531
+ X-Ray](https://data.mendeley.com/datasets/t9ndx37v5h/1): This dataset is
532
+ from Rani Channamma University, and is hosted on Mendeley Data.
533
+ * [AfriMed-QA](https://afrimedqa.com/): This dataset was developed by
534
+ multiple collaborating organizations and researchers, including key
535
+ contributors Intron Health, SisonkeBiotik, BioRAMP, Georgia Institute of
536
+ Technology, and MasakhaneNLP.
537
+ * [VQA-RAD](https://www.nature.com/articles/sdata2018251): This dataset was
538
+ created by a research team led by Jason J. Lau, Soumya Gayen, Asma Ben
539
+ Abacha, and Dina Demner-Fushman and their affiliated institutions (the US
540
+ National Library of Medicine and National Institutes of Health).
541
+ * [Chest ImaGenome](https://physionet.org/content/chest-imagenome/1.0.0/): IBM
542
+ Research.
543
+ * [MedExpQA](https://www.sciencedirect.com/science/article/pii/S0933365724001805):
544
+ This dataset was created by researchers at the HiTZ Center (Basque Center
545
+ for Language Technology and Artificial Intelligence).
546
+ * [MedXpertQA](https://huggingface.co/datasets/TsinghuaC3I/MedXpertQA): This
547
+ dataset was developed by researchers at Tsinghua University (Beijing, China)
548
+ and Shanghai Artificial Intelligence Laboratory (Shanghai, China).
549
+ * [HealthSearchQA](https://huggingface.co/datasets/katielink/healthsearchqa):
550
+ This dataset consists of 3,173 commonly searched consumer
551
+ questions.
552
+
553
+ In addition to the public datasets listed above, MedGemma was also trained on
554
+ de-identified, licensed datasets or datasets collected internally at Google from
555
+ consented participants.
556
+
557
+ * **Radiology dataset 1:** De-identified dataset of different CT studies
558
+ across body parts from a US-based radiology outpatient diagnostic center
559
+ network.
560
+ * **Ophthalmology dataset 1 (EyePACS):** De-identified dataset of fundus
561
+ images from diabetic retinopathy screening.
562
+ * **Dermatology dataset 1:** De-identified dataset of teledermatology skin
563
+ condition images (both clinical and dermatoscopic) from Colombia.
564
+ * **Dermatology dataset 2:** De-identified dataset of skin cancer images (both
565
+ clinical and dermatoscopic) from Australia.
566
+ * **Dermatology dataset 3:** De-identified dataset of non-diseased skin images
567
+ from an internal data collection effort.
568
+ * **Pathology dataset 1:** De-identified dataset of histopathology H&E whole
569
+ slide images created in collaboration with an academic research hospital and
570
+ biobank in Europe. Comprises de-identified colon, prostate, and lymph nodes.
571
+ * **Pathology dataset 2:** De-identified dataset of lung histopathology H&E
572
+ and IHC whole slide images created by a commercial biobank in the United
573
+ States.
574
+ * **Pathology dataset 3:** De-identified dataset of prostate and lymph node
575
+ H&E and IHC histopathology whole slide images created by a contract
576
+ research organization in the United States.
577
+ * **Pathology dataset 4:** De-identified dataset of histopathology whole slide
578
+ images created in collaboration with a large, tertiary teaching hospital in
579
+ the United States. Comprises a diverse set of tissue and stain types,
580
+ predominantly H&E.
581
+ * **EHR dataset 1:** Question/answer dataset drawn from synthetic FHIR records
582
+ created by [Synthea](https://synthetichealth.github.io/synthea/). The test
583
+ set includes 19 unique patients with 200 questions per patient divided into
584
+ 10 different categories.
585
 
586
  ### Data citation
587
 
588
+ * **MIMIC-CXR:** Johnson, A., Pollard, T., Mark, R., Berkowitz, S., & Horng,
589
+ S. (2024). MIMIC-CXR Database (version 2.1.0). PhysioNet.
590
+ [https://physionet.org/content/mimic-cxr/2.1.0/](https://physionet.org/content/mimic-cxr/2.1.0/)
591
+ *and* Johnson, Alistair E. W., Tom J. Pollard, Seth J. Berkowitz, Nathaniel
592
+ R. Greenbaum, Matthew P. Lungren, Chih-Ying Deng, Roger G. Mark, and Steven
593
+ Horng. 2019. "MIMIC-CXR, a de-Identified Publicly Available Database of
594
+ Chest Radiographs with Free-Text Reports." *Scientific Data 6* (1): 1–8.
595
+
596
+ * **SLAKE:** Liu, Bo, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu.
597
+ 2021. "SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical
598
+ Visual Question Answering."
599
+ [http://arxiv.org/abs/2102.09542](http://arxiv.org/abs/2102.09542).
600
+
601
+ * **PAD-UFES-20:** Pacheco, Andre GC, et al. "PAD-UFES-20: A skin lesion
602
+ dataset composed of patient data and clinical images collected from
603
+ smartphones." *Data in brief* 32 (2020): 106221.
604
+
605
+ * **SCIN:** Ward, Abbi, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley
606
+ Carrick, Bilson Campana, Jay Hartford, et al. 2024. "Creating an Empirical
607
+ Dermatology Dataset Through Crowdsourcing With Web Search Advertisements."
608
+ *JAMA Network Open 7* (11): e2446615–e2446615.
609
+
610
+ * **TCGA:** The results shown here are in whole or part based upon data
611
+ generated by the TCGA Research Network:
612
+ [https://www.cancer.gov/tcga](https://www.cancer.gov/tcga).
613
+
614
+ * **CAMELYON16:** Ehteshami Bejnordi, Babak, Mitko Veta, Paul Johannes van
615
+ Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A. W. M.
616
+ van der Laak, et al. 2017. "Diagnostic Assessment of Deep Learning
617
+ Algorithms for Detection of Lymph Node Metastases in Women With Breast
618
+ Cancer." *JAMA 318* (22): 2199–2210.
619
+
620
+ * **Mendeley Digital Knee X-Ray:** Gornale, Shivanand; Patravali, Pooja
621
+ (2020), "Digital Knee X-ray Images", Mendeley Data, V1, doi:
622
+ 10.17632/t9ndx37v5h.1
623
+
624
+ * **VQA-RAD:** Lau, Jason J., Soumya Gayen, Asma Ben Abacha, and Dina
625
+ Demner-Fushman. 2018. "A Dataset of Clinically Generated Visual Questions
626
+ and Answers about Radiology Images." *Scientific Data 5* (1): 1–10.
627
+
628
+ * **Chest ImaGenome:** Wu, J., Agu, N., Lourentzou, I., Sharma, A., Paguio,
629
+ J., Yao, J. S., Dee, E. C., Mitchell, W., Kashyap, S., Giovannini, A., Celi,
630
+ L. A., Syeda-Mahmood, T., & Moradi, M. (2021). Chest ImaGenome Dataset
631
+ (version 1.0.0). PhysioNet. RRID:SCR_007345.
632
+ [https://doi.org/10.13026/wv01-y230](https://doi.org/10.13026/wv01-y230)
633
+
634
+ * **MedQA:** Jin, Di, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang,
635
+ and Peter Szolovits. 2020. "What Disease Does This Patient Have? A
636
+ Large-Scale Open Domain Question Answering Dataset from Medical Exams."
637
+ [http://arxiv.org/abs/2009.13081](http://arxiv.org/abs/2009.13081).
638
+
639
+ * **AfriMed-QA:** Olatunji, Tobi, Charles Nimo, Abraham Owodunni, Tassallah
640
+ Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, et al. 2024.
641
+ "AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering
642
+ Benchmark Dataset."
643
+ [http://arxiv.org/abs/2411.15640](http://arxiv.org/abs/2411.15640).
644
+
645
+ * **MedExpQA:** Alonso, I., Oronoz, M., & Agerri, R. (2024). MedExpQA:
646
+ Multilingual Benchmarking of Large Language Models for Medical Question
647
+ Answering. *arXiv preprint arXiv:2404.05590*. Retrieved from
648
+ [https://arxiv.org/abs/2404.05590](https://arxiv.org/abs/2404.05590)
649
+
650
+ * **MedXpertQA:** Zuo, Yuxin, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu,
651
+ Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. 2025. "MedXpertQA:
652
+ Benchmarking Expert-Level Medical Reasoning and Understanding."
653
+ [http://arxiv.org/abs/2501.18362](http://arxiv.org/abs/2501.18362).
654
 
655
  ### De-identification/anonymization:
656
 
657
+ Google and its partners utilize datasets that have been rigorously anonymized or
658
+ de-identified to ensure the protection of individual research participants and
659
+ patient privacy.
660
 
661
  ## Implementation information
662
 
 
666
 
667
  Training was done using [JAX](https://github.com/jax-ml/jax).
668
 
669
+ JAX allows researchers to take advantage of the latest generation of hardware,
670
+ including TPUs, for faster and more efficient training of large models.
671
 
672
  ## Use and limitations
673
 
674
  ### Intended use
675
 
676
+ MedGemma is an open multimodal generative AI model intended to be used as a
677
+ starting point that enables more efficient development of downstream healthcare
678
+ applications involving medical text and images. MedGemma is intended for
679
+ developers in the life sciences and healthcare space. Developers are responsible
680
+ for training, adapting and making meaningful changes to MedGemma to accomplish
681
+ their specific intended use. MedGemma models can be fine-tuned by developers
682
+ using their own proprietary data for their specific tasks or solutions.
683
+
684
+ MedGemma is based on Gemma 3 and has been further trained on medical images and
685
+ text. MedGemma enables further development in any medical context (image and
686
+ textual); however, the model was pre-trained using chest X-ray, pathology,
687
+ dermatology, and fundus images. Examples of tasks within MedGemma's training
688
+ include visual question answering pertaining to medical images, such as
689
+ radiographs, or providing answers to textual medical questions. Full details of
690
+ all the tasks on which MedGemma has been evaluated can be found in the [MedGemma
691
+ Technical Report](https://arxiv.org/abs/2507.05201).
692
 
693
  ### Benefits
694
 
695
+ * Provides strong baseline medical image and text comprehension for models of
696
+ its size.
697
+ * This strong performance makes it efficient to adapt for downstream
698
+ healthcare-based use cases, compared to models of similar size without
699
+ medical data pre-training.
700
+ * This adaptation may involve prompt engineering, grounding, agentic
701
+ orchestration or fine-tuning depending on the use case, baseline validation
702
+ requirements, and desired performance characteristics.
703
 
704
  ### Limitations
705
 
706
+ MedGemma is not intended to be used without appropriate validation, adaptation
707
+ and/or meaningful modification by developers for their specific use case.
708
+ The outputs generated by MedGemma are not intended to directly inform clinical
709
+ diagnosis, patient management decisions, treatment recommendations, or any other
710
+ direct clinical practice applications. Performance benchmarks highlight baseline
711
+ capabilities on relevant benchmarks, but even for image and text domains that
712
+ constitute a substantial portion of training data, inaccurate model output is
713
+ possible. All outputs from MedGemma should be considered preliminary and require
714
+ independent verification, clinical correlation, and further investigation
715
+ through established research and development methodologies.
716
+
717
+ MedGemma's multimodal capabilities have been primarily evaluated on single-image
718
+ tasks. MedGemma has not been evaluated in use cases that involve comprehension
719
+ of multiple images.
720
 
721
  MedGemma has not been evaluated or optimized for multi-turn applications.
722
 
723
+ MedGemma's training may make it more sensitive to the specific prompt used than
724
+ Gemma 3.
725
 
726
When adapting MedGemma, developers should consider the following:
727
 
728
+ * **Bias in validation data:** As with any research, developers should ensure
729
+ that any downstream application is validated to understand performance using
730
+ data that is appropriately representative of the intended use setting for
731
+ the specific application (e.g., age, sex, gender, condition, imaging device,
732
+ etc.).
733
+ * **Data contamination concerns**: When evaluating the generalization
734
+ capabilities of a large model like MedGemma in a medical context, there is a
735
+ risk of data contamination, where the model might have inadvertently seen
736
+ related medical information during its pre-training, potentially
737
+ overestimating its true ability to generalize to novel medical concepts.
738
+ Developers should validate MedGemma on datasets not publicly available or
739
+ otherwise made available to non-institutional researchers to mitigate this
740
+ risk.