VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Abstract
VIR-Bench, a new benchmark for travel videos, evaluates and enhances MLLMs' geospatial-temporal intelligence, improving itinerary recommendations in real-world applications.
Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning (2025)
- STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes (2025)
- MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents (2025)
- Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes (2025)
- EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering (2025)
- Fine-grained Spatiotemporal Grounding on Egocentric Videos (2025)
- B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper