arxiv:2507.16863

Pixels, Patterns, but No Poetry: To See The World like Humans

Published on Jul 21 · Submitted by HongchengGao on Jul 24 · #2 Paper of the day
Abstract

The Turing Eye Test (TET) evaluates MLLMs' perceptual abilities on synthetic images, revealing vision tower generalization as a significant gap relative to human perception.

AI-generated summary

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. Both in-context learning and training on the language backbone, which are effective on previous benchmarks, fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges vision tower generalization rather than the knowledge and reasoning capabilities of the language backbone, a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
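As a rough illustration of the adaptation setup the abstract contrasts (fine-tuning the vision tower versus the language backbone), here is a minimal PyTorch sketch. It is not the paper's code: the toy layers and the module names `vision_tower`, `projector`, and `language_model` are placeholders standing in for a real MLLM's components.

```python
# Minimal sketch (assumed names, toy layers): freeze the language backbone and
# fine-tune only the vision-side parameters, mirroring the comparison in the abstract.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # stand-in for a vision encoder
        self.projector = nn.Linear(32, 16)                                 # vision-to-language adapter
        self.language_model = nn.Sequential(nn.Linear(16, 16), nn.ReLU())  # stand-in for the LLM backbone

    def forward(self, pixels):
        return self.language_model(self.projector(self.vision_tower(pixels)))

model = ToyMLLM()

# Keep only vision-side parameters trainable; the language backbone stays frozen.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("vision_tower", "projector"))

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# One illustrative training step on random data.
pixels = torch.randn(8, 64)
target = torch.randn(8, 16)
loss = nn.functional.mse_loss(model(pixels), target)
loss.backward()
optimizer.step()
```

Swapping which parameter groups have `requires_grad = True` is all it takes to run the reverse experiment (training the language backbone with a frozen vision tower) under the same assumptions.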

Community

Paper submitter

solid work!

About the topic: a topological vision approach seems like a better alternative to the 'by pixels' approach, and it might be possible with risper pylib.


Models citing this paper: 0

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 8