VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Abstract
VLM4D is a benchmark that evaluates VLMs' spatiotemporal reasoning, identifying significant gaps relative to human performance and suggesting improvements such as 4D feature field reconstruction and targeted spatiotemporal fine-tuning.
Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open- and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
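As a rough illustration of how such a video question-answering benchmark might be scored, the minimal Python sketch below evaluates per-category multiple-choice accuracy for an arbitrary VLM prediction function. The exact VLM4D data format and evaluation protocol are not specified in the abstract, so the field names (`video_path`, `choices`, `category`) and the `query_vlm` callable are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SpatiotemporalQA:
    """One hypothetical benchmark item: a video clip plus a multiple-choice question."""
    video_path: str      # real-world or synthetic video clip
    question: str        # e.g. "Which direction does the red car turn?"
    choices: List[str]   # candidate answers
    answer: str          # ground-truth choice
    category: str        # e.g. "rotational_motion", "perspective", "continuity"


def evaluate(items: List[SpatiotemporalQA],
             query_vlm: Callable[[str, str, List[str]], str]) -> Dict[str, float]:
    """Compute per-category accuracy for a VLM prediction function.

    `query_vlm(video_path, question, choices)` is assumed to return one of the
    provided choices; any model-specific prompting happens inside it.
    """
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item in items:
        pred = query_vlm(item.video_path, item.question, item.choices)
        total[item.category] = total.get(item.category, 0) + 1
        correct[item.category] = correct.get(item.category, 0) + int(pred == item.answer)
    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    # Toy example with a trivial stand-in "model" that always picks the first choice.
    items = [
        SpatiotemporalQA("clip_001.mp4", "Does the camera move left or right?",
                         ["left", "right"], "right", "translational_motion"),
    ]
    print(evaluate(items, lambda video, question, choices: choices[0]))
```

Reporting accuracy per category mirrors the abstract's emphasis on separately probing translational and rotational motion, perspective awareness, and motion continuity, rather than collapsing everything into a single score.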
Community
The first benchmark explicitly designed to evaluate the spatiotemporal (4D) reasoning capabilities of Vision Language Models (VLMs).
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding (2025)
- EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs (2025)
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments (2025)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning (2025)
- InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models (2025)
- SIFThinker: Spatially-Aware Image Focus for Visual Reasoning (2025)
- CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning (2025)