Do generative video models learn physical principles from watching videos?
Abstract
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, such as fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
Community
Do AI video models truly understand the world, or are they just creating visually appealing illusions? This is the core question behind Physics-IQ, a new benchmark designed to rigorously test the physical reasoning abilities of AI video generation models. We provide a comprehensive dataset of 396 real-world videos covering diverse physical scenarios, including fluid dynamics, solid mechanics, and more.
Our benchmark challenges models to predict the future of a scene, pushing them beyond simple pattern recognition. We evaluate different aspects of physical understanding with several metrics: Spatial IoU, Spatiotemporal IoU, Weighted Spatial IoU, and MSE. The Physics-IQ score aggregates these measures and is normalized by the physical variance observed between real-world videos of the same scene. Our findings reveal that while current models produce visually realistic videos, they exhibit a significant lack of true physical understanding. A minimal sketch of the motion-based metrics is shown below.
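To make the metrics concrete, here is a minimal sketch of how motion-based IoU metrics and MSE could be computed from a real and a generated clip. This is not the benchmark's reference implementation (see the GitHub repository for that); the function names, the frame-difference motion estimate, and the threshold value are illustrative assumptions, and clips are assumed to be float arrays of shape (T, H, W) with values in [0, 1].

```python
import numpy as np

def motion_mask(video, threshold=0.05):
    """Per-frame binary mask of where motion occurred.

    Motion is approximated as frame-to-frame absolute difference
    exceeding a threshold (the threshold value is an assumption).
    `video` has shape (T, H, W); the result has shape (T-1, H, W).
    """
    diff = np.abs(np.diff(video, axis=0))
    return diff > threshold

def spatial_iou(real, generated, threshold=0.05):
    """IoU of the spatial footprint of motion, collapsed over time.

    Captures *where* action happened, ignoring when and how much.
    """
    real_any = motion_mask(real, threshold).any(axis=0)       # (H, W)
    gen_any = motion_mask(generated, threshold).any(axis=0)   # (H, W)
    union = np.logical_or(real_any, gen_any).sum()
    if union == 0:
        return 1.0  # neither clip moved; treat as perfect agreement
    return float(np.logical_and(real_any, gen_any).sum() / union)

def spatiotemporal_iou(real, generated, threshold=0.05):
    """IoU of per-frame motion masks, i.e. where *and* when action happened."""
    real_m = motion_mask(real, threshold)
    gen_m = motion_mask(generated, threshold)
    union = np.logical_or(real_m, gen_m).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(real_m, gen_m).sum() / union)

def mse(real, generated):
    """Plain pixel-wise mean squared error between the two clips."""
    return float(np.mean((real - generated) ** 2))

# Illustrative usage with random data standing in for real clips.
rng = np.random.default_rng(0)
real_clip = rng.random((16, 64, 64))
generated_clip = rng.random((16, 64, 64))
print(spatial_iou(real_clip, generated_clip),
      spatiotemporal_iou(real_clip, generated_clip),
      mse(real_clip, generated_clip))
```

In this sketch, lower-level choices such as how motion is detected and how the scores are weighted and aggregated into a single Physics-IQ value are left out; the released evaluation code defines those exactly.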
Explore our open-source dataset, evaluation code, and detailed results. The project aims to quantify physical understanding and to make it possible to track future progress in the field. Join us in pushing the boundaries of AI and physical reasoning!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- InTraGen: Trajectory-controlled Video Generation for Object Interactions (2024)
- AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era (2024)
- FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors (2025)
- Video Creation by Demonstration (2024)
- Can Generative Video Models Help Pose Estimation? (2024)
- Prediction with Action: Visual Policy Learning via Joint Denoising Process (2024)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents (2024)