VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
Community
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Current LVLMs struggle with complex reasoning such as multi-hop geometry problems. How can AI agents construct and utilize more useful visual hints?
Key insight: when LVLMs reason, they need not only "WHAT to do" but also a mental model of "WHAT WILL HAPPEN after each action". This predictive look-ahead gives LVLMs substantially stronger reasoning performance. #NextLevelAI
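To make the look-ahead idea concrete, the sketch below shows one plausible way a beam-style look-ahead tree search over interleaved visual-textual reasoning steps could be structured. This is an illustrative assumption, not the paper's actual code: ReasoningState, propose_steps, predict_outcome, and score are hypothetical placeholders for the LVLM and visual-tool calls the framework would make.

```python
# Hypothetical sketch of look-ahead tree search over visual-textual reasoning.
# All names below are placeholders, not the paper's actual interfaces.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReasoningState:
    trace: List[str] = field(default_factory=list)   # textual steps so far
    hints: List[str] = field(default_factory=list)   # e.g. rendered auxiliary-line sketches
    value: float = 0.0                                # estimated quality of this state


def lookahead_search(
    root: ReasoningState,
    propose_steps: Callable[[ReasoningState], List[str]],        # LVLM proposes candidate next actions
    predict_outcome: Callable[[ReasoningState, str], ReasoningState],  # simulate "what will happen"
    score: Callable[[ReasoningState], float],                     # value a partial state
    depth: int = 2,
    beam: int = 3,
) -> ReasoningState:
    """Expand candidate next steps, roll each forward `depth` levels,
    and keep only the top-`beam` states per level before committing."""
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for step in propose_steps(state):
                nxt = predict_outcome(state, step)  # predicted outcome, not just the action
                nxt.value = score(nxt)
                candidates.append(nxt)
        if not candidates:
            break
        frontier = sorted(candidates, key=lambda s: s.value, reverse=True)[:beam]
    return max(frontier, key=lambda s: s.value)
```

The beam-style pruning here is one way to keep inference-time cost bounded while still evaluating predicted outcomes before committing to an action; the paper's exact search policy may differ.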