Time Blindness: Why Video-Language Models Can't See What Humans Can?
Abstract
SpookyBench is a benchmark for temporal pattern recognition in videos that exposes the inability of vision-language models to recover information from noise-like frames carrying no per-frame spatial signal.
Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and to bridge the gap between human and machine video understanding. The dataset and code are available on our project website: https://timeblindness.github.io/.
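To make the idea of "information encoded solely in temporal sequences of noise-like frames" concrete, here is a minimal NumPy sketch. It is not the authors' generation pipeline; the function names, flip probabilities (p_flip_fg, p_flip_bg), and the temporal-stability decoder are illustrative assumptions. The point it demonstrates is that every individual frame looks like spatially uniform noise, while the hidden shape emerges only from how pixels change over time.

```python
# Illustrative sketch only: the exact SpookyBench encoding is not described here,
# so this is an assumed temporal-noise scheme, not the authors' method.
# Idea: each frame is binary noise, but pixels inside a target mask keep their
# value across consecutive frames more often than background pixels do.
import numpy as np

def make_temporal_noise_video(mask, n_frames=60, p_flip_fg=0.05, p_flip_bg=0.5, seed=0):
    """mask: (H, W) bool array marking the hidden shape."""
    rng = np.random.default_rng(seed)
    h, w = mask.shape
    frames = np.empty((n_frames, h, w), dtype=np.uint8)
    frames[0] = rng.integers(0, 2, size=(h, w))       # first frame: pure noise
    p_flip = np.where(mask, p_flip_fg, p_flip_bg)     # per-pixel flip probability
    for t in range(1, n_frames):
        flips = rng.random((h, w)) < p_flip           # which pixels toggle this frame
        frames[t] = np.where(flips, 1 - frames[t - 1], frames[t - 1])
    return frames

def decode_by_temporal_stability(frames):
    """Low frame-to-frame change rate marks the hidden shape."""
    change = np.abs(np.diff(frames.astype(np.int8), axis=0)).mean(axis=0)
    return change < 0.25

if __name__ == "__main__":
    mask = np.zeros((64, 64), dtype=bool)
    mask[20:44, 20:44] = True                         # hidden square
    video = make_temporal_noise_video(mask)
    recovered = decode_by_temporal_stability(video)
    print("mean of a single frame (~0.5, i.e. noise):", video[0].mean())
    print("recovered-shape IoU:", (recovered & mask).sum() / (recovered | mask).sum())
```

A single frame from this toy stimulus is statistically identical inside and outside the mask, so any model that reasons frame by frame sees nothing; only an observer that integrates change across time can read out the shape.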
Community
Paper: https://arxiv.org/abs/2505.24867
Code: https://github.com/TimeBlindness/time-blindness
Project webpage: https://timeblindness.github.io/
Awesome work! I hope this enables people to really evaluate temporal understanding of video models and improve on it!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Mavors: Multi-granularity Video Representation for Multimodal Large Language Model (2025)
- VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro (2025)
- Caption This, Reason That: VLMs Caught in the Middle (2025)
- VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? (2025)
- Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency (2025)
- Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains (2025)
- VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding (2025)