arxiv:2503.13111

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Published on Mar 17 · Submitted by edaxberger on Mar 19
Abstract

Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.

Community

Paper author · Paper submitter

🚀 Excited to share our new work on exploring 3D Spatial Understanding with Multimodal LLMs!

Multimodal LLMs excel at 2D image understanding, but their 3D spatial reasoning remains weak — a crucial limitation for robotics, AR/VR, and other spatial AI applications. To address this, we introduce:

📀 CA-VQA — A fine-tuning dataset & benchmark for spatial understanding that: (1) is based on high-quality 3D ground truth data, (2) covers rich input signals incl. metric depth maps (both sensor-captured and monocularly estimated) and multi-view images, and (3) includes diverse spatial understanding tasks (e.g., spatial relationship prediction, metric object size / distance estimation, 3D bounding box prediction).

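To make the data format concrete, below is a minimal, hypothetical sketch (in Python) of what a single CA-VQA-style sample could contain. The field names, file paths, and values are illustrative assumptions, not the released schema.

```python
# Hypothetical sketch of a single CA-VQA-style sample.
# All field names and paths are illustrative assumptions, not the released schema.
from dataclasses import dataclass


@dataclass
class CAVQASample:
    image_paths: list[str]            # one or more views of the indoor scene
    sensor_depth_path: str | None     # metric depth map captured by a depth sensor, if available
    monocular_depth_path: str | None  # monocular depth estimate for the reference view
    question: str                     # e.g. a metric distance or spatial relationship question
    answer: str                       # ground-truth answer derived from the 3D annotations
    task: str                         # e.g. "metric_distance", "spatial_relation", "3d_grounding"


sample = CAVQASample(
    image_paths=["scene_0001/view_0.jpg", "scene_0001/view_1.jpg"],
    sensor_depth_path="scene_0001/depth_0.png",
    monocular_depth_path="scene_0001/mono_depth_0.png",
    question="What is the distance between the sofa and the coffee table, in meters?",
    answer="1.2 m",
    task="metric_distance",
)
print(sample.task, "->", sample.question)
```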

🤖 MM-Spatial — A strong generalist MLLM trained on CA-VQA + other multimodal fine-tuning datasets (incl. general VQA, knowledge, text-rich), resulting in state-of-the-art performance on spatial understanding tasks. The model supports Chain-of-Thought-style spatial reasoning involving 2D bounding box prediction and monocular depth estimation (which we found to be very accurate), and it can also leverage depth map input via tool-use.

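As one illustration of how depth-map tool-use could support metric estimates, the sketch below turns a predicted 2D bounding box plus a metric depth map into a distance estimate via median pooling. The median heuristic and function name are assumptions for illustration, not the method described in the paper.

```python
# Minimal sketch of one way a depth-map "tool" could convert a predicted
# 2D bounding box into a metric distance estimate. The median-pooling heuristic
# and function name are illustrative assumptions, not the paper's implementation.
import numpy as np


def distance_from_depth(depth_map: np.ndarray, bbox: tuple[int, int, int, int]) -> float:
    """Return a robust metric distance for the object inside bbox = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    region = depth_map[y1:y2, x1:x2]
    valid = region[region > 0]      # ignore missing / zero depth readings
    return float(np.median(valid))  # median is robust to background pixels inside the box


# Toy example: a 480x640 depth map with an object region ~1.5 m from the camera.
depth = np.full((480, 640), 3.0, dtype=np.float32)  # background at 3 m
depth[200:300, 250:350] = 1.5                       # object region at 1.5 m
print(f"Estimated distance: {distance_from_depth(depth, (250, 200, 350, 300)):.2f} m")
```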

🔗 Check out our preprint for more details and results, and stay tuned for our data release: https://arxiv.org/abs/2503.13111
