arxiv:2508.00549

Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images

Published on Aug 1
AI-generated summary

Vision-Language Models struggle with determining relative positions in medical images, relying more on prior knowledge than on image content; visual prompts offer only limited improvement.

Abstract

Clinical decision-making relies heavily on understanding the relative positions of anatomical structures and anomalies. For Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions in medical images is therefore a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate state-of-the-art VLMs (GPT-4o, Llama3.2, Pixtral, and JanusPro) and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results on medical images remain significantly lower than those observed on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content when answering relative-position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.
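As a concrete illustration of the visual prompting described in the abstract, here is a minimal sketch that overlays colored, alphanumeric markers on an image using Pillow. The coordinates, labels, and file names are hypothetical placeholders, not the MIRP construction pipeline.

```python
# Minimal sketch of visual prompting: overlay colored, alphanumeric markers
# on an image at (assumed) anatomical landmark coordinates.
# Coordinates, labels, and file names are hypothetical placeholders.
from PIL import Image, ImageDraw

def add_markers(image_path, landmarks, radius=12, out_path="marked.png"):
    """Draw a filled colored disc with a letter label at each (x, y) landmark."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    colors = ["red", "blue", "green", "yellow"]
    for i, (label, (x, y)) in enumerate(landmarks.items()):
        color = colors[i % len(colors)]
        draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill=color)
        draw.text((x - 4, y - 7), label, fill="white")
    img.save(out_path)
    return out_path

# Hypothetical usage: mark the liver ("A") and the stomach ("B") on a CT slice,
# so the VLM can be asked about marker positions instead of anatomical names.
add_markers("ct_slice.png", {"A": (180, 240), "B": (320, 250)})
```

The model is then queried about the markers (e.g., "Is A to the right of B?") rather than about the structures by name, which is the kind of prompt the paper evaluates.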

Community

Can we really trust AI models like GPT to write radiology reports or even assist in planning surgeries?

🤔 Well… not so fast.

In our study accepted at MICCAI 2025, we found:

➡️ Current AI models cannot reliably determine the relative positions of anatomical structures in medical images. Without that capability, how can they describe a location in a medical report? 🚨

➡️ Even more alarming: they often ignore the image and fall back on memorized language patterns. For example, when asked, “Is the liver to the right of the stomach?”, they simply answer yes, because that is usually true in humans, even when the image clearly shows otherwise. ⚠️ A minimal sketch of this kind of query is shown below.
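To make that failure mode concrete, here is a minimal sketch of how such a relative-position question can be posed to a VLM (here GPT-4o via the OpenAI Python SDK). The image file, prompt wording, and model choice are illustrative assumptions, not the authors' evaluation code or the MIRP protocol.

```python
# Hedged sketch of a relative-position query to a VLM (GPT-4o via the OpenAI API).
# File name and prompt wording are illustrative, not the paper's evaluation setup.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("ct_slice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "In this image, is the liver to the right of the stomach? "
                     "Answer yes or no based only on the image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```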

These findings raise important questions about whether such models are ready for critical clinical tasks.

For more details about the paper, watch our YouTube Video 🎥

👉 What are your experiences with Vision-Language Models for Medical Imaging tasks?
