arXiv:2411.16778

GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis

Published on Nov 25, 2024
Abstract

AI-generated summary: GEMeX is a large-scale, explainable, and multi-format chest X-ray VQA dataset that enhances answer comprehensibility and practical usability, demonstrating improved performance when fine-tuning existing models.

Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images. However, current Med-VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, hindering comprehension for patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements of practical scenarios. These limitations pose significant challenges to the development of reliable and user-friendly Med-VQA systems. To address them, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring two innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, enhancing answer comprehensibility; (2) four question types (open-ended, closed-ended, single-choice, and multiple-choice) to better reflect practical needs. With 151,025 images and 1,605,575 questions, GEMeX is currently the largest chest X-ray VQA dataset. Evaluating 12 representative large vision-language models (LVLMs) on GEMeX reveals suboptimal performance, underscoring the dataset's difficulty. We also obtain a strong baseline by fine-tuning an existing LVLM on the GEMeX training set; the substantial performance improvement demonstrates the dataset's effectiveness. The benchmark is available at https://www.med-vqa.com/GEMeX.
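To make the described structure concrete, below is a minimal sketch of what a GEMeX-style record could look like: one question-answer pair carrying a question type, a textual explanation, and a visual explanation (grounding boxes). All field names, the bounding-box format, and the example contents are illustrative assumptions, not the dataset's actual schema; consult https://www.med-vqa.com/GEMeX for the real format.

```python
# Hypothetical sketch of a GEMeX-style record. Field names and the
# (x, y, w, h) box format are assumptions for illustration only.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class GEMeXExample:
    image_path: str                 # chest X-ray image file
    question: str
    question_type: str              # "open", "closed", "single_choice", or "multi_choice"
    options: list[str] = field(default_factory=list)   # empty for open/closed questions
    answer: str = ""
    text_explanation: str = ""      # textual rationale for the answer
    visual_explanation: list[tuple[int, int, int, int]] = field(default_factory=list)
    # visual_explanation: bounding boxes grounding the answer in the image

# Example record (contents invented for illustration).
example = GEMeXExample(
    image_path="images/00012345.png",
    question="Is there evidence of pleural effusion?",
    question_type="closed",
    answer="Yes",
    text_explanation="Blunting of the right costophrenic angle suggests effusion.",
    visual_explanation=[(412, 780, 160, 120)],
)

def by_question_type(examples: list[GEMeXExample]) -> dict[str, list[GEMeXExample]]:
    """Group examples by question type, e.g. for per-type accuracy reporting."""
    buckets: dict[str, list[GEMeXExample]] = defaultdict(list)
    for ex in examples:
        buckets[ex.question_type].append(ex)
    return buckets

print(by_question_type([example]).keys())  # dict_keys(['closed'])
```

A per-type breakdown like `by_question_type` is a natural way to report results on a benchmark with four distinct question formats, though the paper's own evaluation protocol may differ.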
