Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images
Abstract
Vision-Language Models struggle with determining relative positions in medical images, relying more on prior knowledge than image content, and visual prompts offer limited improvement.
Clinical decision-making relies heavily on understanding the relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions in medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs (GPT-4o, Llama 3.2, Pixtral, and JanusPro) to determine relative positions in medical images and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results on medical images remain substantially lower than those reported on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content when answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.
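To make the marker-based visual prompting described above more concrete, here is a minimal sketch of how such an evaluation item might be constructed: a colored marker is drawn on each structure of interest and the question refers to the markers rather than to anatomical names. The file path, coordinates, colors, and prompt wording are illustrative assumptions, not the paper's actual pipeline or the MIRP dataset format.

```python
# Hypothetical sketch of marker-based visual prompting for a relative-position
# question. Paths, coordinates, and prompt phrasing are placeholders.
from PIL import Image, ImageDraw

def add_marker(draw: ImageDraw.ImageDraw, xy: tuple[int, int],
               color: str, radius: int = 12) -> None:
    """Draw a filled circular marker centered at xy."""
    x, y = xy
    draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                 fill=color, outline="black", width=2)

# Hypothetical image slice and (column, row) pixel coordinates of two structures.
image = Image.open("ct_slice.png").convert("RGB")
draw = ImageDraw.Draw(image)
add_marker(draw, (180, 240), color="red")    # e.g. placed on the liver
add_marker(draw, (320, 250), color="blue")   # e.g. placed on the stomach
image.save("ct_slice_marked.png")

# The question refers to the markers instead of anatomical names, so the model
# must read the image rather than recall typical human anatomy.
question = ("In this image, is the structure marked in red to the left of "
            "the structure marked in blue? Answer 'yes' or 'no'.")
```

The key design choice, as described in the abstract, is that the markers ground the question in the pixel content itself, which is exactly where the evaluated models were found to fall short.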
Community
Can we really trust AI models like GPT to write radiology reports or even assist in planning surgeries?
🤔 Well… not so fast.
In our study accepted at MICCAI 2025, we found:
➡️ Current AI models cannot reliably determine the relative positions of anatomical structures in medical images. Without that capability, how can they describe a location in a medical report? 🚨
➡️ Even more alarming: They often ignore the image and fall back on memorized language patterns. For example, when asked, “Is the liver to the right of the stomach?” they just say yes, because they learned that this is usually true for most humans, even when the image clearly shows otherwise. ⚠️
These findings raise important questions about whether such models are ready for critical clinical tasks.
For more details about the paper, watch our YouTube Video 🎥
👉 What are your experiences with Vision-Language Models for Medical Imaging tasks?