Papers
arxiv:2506.21656

Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs

Published on Jun 26
· Submitted by SivanSX on Jun 30
Authors:
,
,
,
,
,
,
,
,

Abstract

SpatialReasoner-R1, a vision-language reasoning model, uses Multi-Model Monte Carlo Tree Search and fine-grained Direct Preference Optimization to improve spatial reasoning, setting a new state-of-the-art on SPATIALRGPT-Bench.

AI-generated summary

Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.

Community

Paper submitter

We propose a novel fine-grained preference optimization approach that significantly improves spatial reasoning capabilities in Vision-Language Models (VLMs). Our method leverages carefully designed preference data and training strategies to enhance spatial understanding without compromising general visual capabilities.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2506.21656 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2506.21656 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2506.21656 in a Space README.md to link it from this page.

Collections including this paper 5