Do generative video models learn physical principles from watching videos?
Abstract
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, such as fluid dynamics, optics, solid mechanics, magnetism, and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
Community
Do AI video models truly understand the world, or are they just creating visually appealing illusions? This is the core question behind Physics-IQ, a new benchmark designed to rigorously test the physical reasoning abilities of AI video generation models. We provide a comprehensive dataset of 396 real-world videos covering diverse physical scenarios, including fluid dynamics, solid mechanics, and more.
Our benchmark challenges models to predict the future of a scene, pushing them beyond simple pattern recognition. We evaluate different aspects of physical understanding with several metrics: Spatial IoU, Spatiotemporal IoU, Weighted Spatial IoU, and MSE. The Physics-IQ score aggregates these measures and is normalized by the physical variance observed between real-world videos of the same scene. Our findings reveal that while current models produce visually realistic videos, they exhibit a significant lack of true physical understanding. A minimal sketch of the motion-based metrics is shown below.
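To make the metrics concrete, here is a minimal sketch of how motion-based IoU metrics and MSE could be computed from a real and a generated clip. This is not the benchmark's reference implementation (see the GitHub repository for that); the function names, the frame-difference motion estimate, and the threshold value are illustrative assumptions, and clips are assumed to be float arrays of shape (T, H, W) with values in [0, 1].

```python
import numpy as np

def motion_mask(video, threshold=0.05):
    """Per-frame binary mask of where motion occurred.

    Motion is approximated as frame-to-frame absolute difference
    exceeding a threshold (the threshold value is an assumption).
    `video` has shape (T, H, W); the result has shape (T-1, H, W).
    """
    diff = np.abs(np.diff(video, axis=0))
    return diff > threshold

def spatial_iou(real, generated, threshold=0.05):
    """IoU of the spatial footprint of motion, collapsed over time.

    Captures *where* action happened, ignoring when and how much.
    """
    real_any = motion_mask(real, threshold).any(axis=0)       # (H, W)
    gen_any = motion_mask(generated, threshold).any(axis=0)   # (H, W)
    union = np.logical_or(real_any, gen_any).sum()
    if union == 0:
        return 1.0  # neither clip moved; treat as perfect agreement
    return float(np.logical_and(real_any, gen_any).sum() / union)

def spatiotemporal_iou(real, generated, threshold=0.05):
    """IoU of per-frame motion masks, i.e. where *and* when action happened."""
    real_m = motion_mask(real, threshold)
    gen_m = motion_mask(generated, threshold)
    union = np.logical_or(real_m, gen_m).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(real_m, gen_m).sum() / union)

def mse(real, generated):
    """Plain pixel-wise mean squared error between the two clips."""
    return float(np.mean((real - generated) ** 2))

# Illustrative usage with random data standing in for real clips.
rng = np.random.default_rng(0)
real_clip = rng.random((16, 64, 64))
generated_clip = rng.random((16, 64, 64))
print(spatial_iou(real_clip, generated_clip),
      spatiotemporal_iou(real_clip, generated_clip),
      mse(real_clip, generated_clip))
```

In this sketch, lower-level choices such as how motion is detected and how the scores are weighted and aggregated into a single Physics-IQ value are left out; the released evaluation code defines those exactly.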
Explore our open-source dataset, evaluation code, and detailed results. The project aims to quantify physical understanding and to make it possible to track future progress in the field. Join us in pushing the boundaries of AI and physical reasoning!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- InTraGen: Trajectory-controlled Video Generation for Object Interactions (2024)
- AniSora: Exploring the Frontiers of Animation Video Generation in the Sora Era (2024)
- FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors (2025)
- Video Creation by Demonstration (2024)
- Can Generative Video Models Help Pose Estimation? (2024)
- Prediction with Action: Visual Policy Learning via Joint Denoising Process (2024)
- REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents (2024)