Preprint Article (this version is not peer-reviewed)

Learning the Physical World from Videos: A Prospective Study on World Models

Submitted: 07 April 2026
Posted: 08 April 2026


Abstract
World models aim to enable agents to perceive states, predict future outcomes, and reason for decision-making by simulating real-world environments, and are widely regarded as a crucial pathway toward artificial general intelligence (AGI). Video, as one of the most accessible and intuitively representative media of dynamic environments, naturally contains rich implicit representations of the physical world. Consequently, learning world models from videos has become a prominent research direction. However, a significant gap remains between video data and the real physical world: videos capture only superficial visual phenomena and lack explicit representations of three-dimensional structure, physical properties, and causal mechanisms. This limitation severely constrains the physical consistency and practical applicability of world models. Motivated by this, the present work provides a prospective study of recent research in this domain, encompassing: (1) key challenges arising from the video–physical world gap and representative solutions; (2) three major construction paradigms of physical world models; (3) a thorough summary of existing evaluation benchmarks; and (4) future research directions and discussions. It is noteworthy that this study is the first to systematically examine video-driven world model research from the perspective of the physical world. In contrast to prior studies that primarily focus on generative modeling or provide broad overviews, this work emphasizes world models with tangible physical grounding, explicitly excluding generative tasks such as video synthesis or 3D/4D modeling that diverge conceptually from the goal of modeling the physical world. Adopting a problem-oriented perspective, this study aims to provide subsequent researchers with a systematic framework and decision-making guidance for understanding existing work, designing innovative methods, and facilitating the deployment of world models in real-world applications.

1. Introduction

World models refer to computational frameworks in which agents understand and predict external dynamics through internal environment simulations, thereby supporting perception, reasoning, and decision-making [1]. As a key pathway toward artificial general intelligence (AGI), world models aim to capture the physical laws, social interaction mechanisms, and environmental uncertainties inherent in the real world, enabling agents to perform efficient planning and adaptive behaviors in complex and dynamic scenarios. In practical applications, world models demonstrate broad potential: in robotics [2], they facilitate agents’ understanding and simulation of interactions with the environment; in autonomous driving [3,4,5], they enable the prediction of future traffic scenarios and potential risks; and in game AI [6,7], they support long-horizon strategy reasoning and multi-step planning.
Figure 1. Conventional world models primarily rely on learning from videos (e.g., web-scale or synthetic data), which constrains them to operate within cyber space. As a result, they face five fundamental challenges, including limitations in continuity, controllability, generalization, efficiency, and universality. To address these issues, we identify three key paradigms—hierarchical modeling, action decoupling, and physical prior integration—that provide complementary solutions. By incorporating these paradigms, world models can be extended beyond purely observational learning toward physically grounded modeling, enabling learning from embodied interaction with explicit physical properties. This transition ultimately leads to physical world models that generalize more effectively and perform robustly in real-world environments.
Over the past decade, with the rapid advancement of computational power, data scale, and model performance, research on world models has achieved significant progress. In 2018, Ha and Schmidhuber [1] proposed a world model framework that integrates a Variational Autoencoder (VAE) with a Recurrent Neural Network (RNN) [8], marking a paradigm shift from traditional symbolic representations to data-driven approaches. Subsequently, the Dreamer series of methods [9,10,11] and Joint Embedding Predictive Architectures (JEPA) [12] introduced latent state modeling and predictive learning, demonstrating the feasibility and advantages of latent dynamics for planning and decision-making tasks. Benefiting from the rise of large-scale models, world models have entered a phase of rapid iteration since 2024, accompanied by a surge in related research publications. Representative systems such as Sora2 [13] and Genie3 [14] have significantly advanced the practical realization of learning world models directly from video data, further highlighting the potential of this research direction.
Video, as the most intuitive and readily accessible medium for capturing the dynamics of the physical world, has increasingly emerged as a natural substrate for constructing world models. It inherently encodes temporal continuity, motion patterns, and rich contextual information, thereby providing an indispensable foundation for learning structured representations of the environment. However, video is fundamentally a 2D projection of reality—it captures only superficial visual phenomena and lacks explicit representations of three-dimensional structure, physical properties (e.g., mass, friction, elasticity), contact interactions, and underlying causal mechanisms. This intrinsic limitation causes video-driven world models to remain largely confined to the level of visual pattern fitting, giving rise to the so-called “pixel-to-physics” gap. Nevertheless, compared to real-world data acquisition, video data is safer, more cost-effective, and can support virtually unlimited experimentation within simulated environments. Bridging this gap is therefore a critical research direction toward developing physically grounded and practically deployable world models.
Existing surveys have provided valuable insights into the field of world models, yet they are often limited to specific application domains, generative tasks, or high-level framework overviews, and generally lack a systematic analysis from the perspective of physical consistency [15]. For instance, the surveys by Tu et al. [3], Feng et al. [4], and Guan et al. [5] on autonomous driving world models emphasize applications in perception, prediction, and planning, but treat video merely as an auxiliary input and do not specifically address biases arising from missing causal information. Ding et al. [16] and Zhu et al. [17] explore, at a broader level, the potential of extending generative models toward generalized world models, yet their focus lies primarily on delineating the boundaries of generation and simulation, overlooking in-depth analyses of physical modeling mechanisms. Some studies concentrate on physics simulation in embodied intelligence [18,19], but these are constrained to specific simulated environments or emphasize functional outcomes rather than the underlying gaps. With the rapid evolution of world model research, there is an urgent need for a survey that systematically reviews video-driven world model paradigms and key challenges, with physical consistency as the guiding principle.
Motivated by the above, this work presents a systematic survey of video-driven physical world models from a problem-oriented perspective, explicitly excluding directions that deviate from the goal of modeling the physical world (e.g., pure video generation or 3D/4D reconstruction [20]). Specifically, our discussion is organized along four key dimensions:
  • Pixel–Physics Challenges: We distill five core challenges—continuity, controllability, generalization, lightweight, and universality—and systematically summarize the sub-problems and representative solutions for each challenge through the lens of physical consistency.
  • Three Paradigms of Physical World Model: Existing approaches toward physically grounded modeling can be broadly categorized into three classes: prior injection, dynamic–static decoupling, and hierarchical abstraction.
  • Benchmarks: Existing evaluation benchmarks are systematically reviewed, with an emphasis on assessment frameworks related to physical perception and dynamics prediction.
  • Future Directions: We identify five major open problems for future research and provide a systematic discussion on industrial deployment and current safety issues.
Through this systematic organization, the survey aims to illuminate the key challenges and opportunities in bridging the gap from pixels to physics, providing a clear research framework and guidance for developing the next generation of world models with true physical understanding.

2. Challenges of Learning from Video

Videos naturally encode spatiotemporal continuity, object interactions, event evolution, and latent causal chains, making them highly valuable for capturing physical and dynamic processes. Large-scale video generation models even exhibit, to some extent, an implicit adherence to real-world constraints. However, as 2D projections, videos discard critical 3D geometry, object properties, and causal mechanisms, causing world models trained on them to overfit superficial visual patterns rather than underlying physical principles. This, in turn, gives rise to a series of core challenges in continuity, controllability, generalization, physical grounding, efficiency, and universality, as illustrated in Figure 2 and Table 1.

2.1. Physical Continuity

2.1.1. Temporal Continuity

The gap in temporal continuity arises because videos capture only discrete snapshots and lack explicit causal signals, causing autoregressive world models to accumulate prediction errors and break continuous state evolution. Short temporal horizons and stepwise error compounding further prevent generating long-range physically consistent sequences. Existing solutions can be summarized as follows:
Autoregressive improvements. Conditioning on preceding frames (single or multiple) provides a simple yet efficient means of ensuring continuity and has been widely adopted in recent works [21,22,23,24]. Several improvements have been proposed: EnerVerse [25] introduces a sparse keyframe memory mechanism that predicts future video blocks based on block-level context, effectively avoiding the redundant storage and error propagation associated with continuous memory while preserving long-term dependencies. EVA [26], on the other hand, incorporates a reflection-based conditioning mechanism into a block-level autoregressive framework, enabling the model to adaptively extend video length. Yume [27] combines block-level autoregression with FramePack-based historical context compression and hierarchical sampling, achieving theoretically unlimited interactive generation and effectively mitigating temporal discontinuities in long sequences.
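To make the block-level autoregressive pattern concrete, the following minimal sketch (our own illustration, not code from EnerVerse, EVA, or Yume; all names and dimensions are hypothetical) rolls out future frame blocks while conditioning on a sparse keyframe memory instead of the full generated history:

```python
# Minimal sketch (illustrative only): block-wise autoregressive rollout that
# conditions each new block of latent frames on a sparse keyframe memory.
import torch
import torch.nn as nn

class BlockPredictor(nn.Module):
    """Toy stand-in for a learned video predictor over latent frames."""
    def __init__(self, dim=64, block=4):
        super().__init__()
        self.block = block
        self.net = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, context):                  # context: (B, T_ctx, dim)
        _, h = self.net(context)                 # summarize conditioning frames
        h = h[-1].unsqueeze(1).repeat(1, self.block, 1)
        return self.head(h)                      # (B, block, dim) next frame block

def rollout(model, first_frames, n_blocks=8, keyframe_stride=4, max_keyframes=16):
    frames = first_frames                        # (B, T0, dim) observed latents
    keyframes = frames[:, ::keyframe_stride]     # sparse memory of past frames
    for _ in range(n_blocks):
        context = torch.cat([keyframes, frames[:, -model.block:]], dim=1)
        new_block = model(context)               # predict the next block
        frames = torch.cat([frames, new_block], dim=1)
        keyframes = torch.cat([keyframes, new_block[:, -1:]], dim=1)[:, -max_keyframes:]
    return frames

model = BlockPredictor()
video = rollout(model, torch.randn(2, 8, 64))    # (2, 8 + 8*4, 64)
```

Keeping only sparse keyframes bounds the conditioning context, which is what limits redundant storage and error propagation in the works cited above.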
Figure 2. Since the concept of World Models was first introduced in 2018, advancements in data availability, computational power, and model scalability have significantly accelerated progress in this field. By 2025, world model research has reached a historical milestone, with nearly a hundred academic papers published within a single year. Nevertheless, learning world models from videos continues to face five key challenges that remain unresolved.
Diffusion schedule optimization. Most diffusion-based approaches assume that all frames share the same noise level [28,29]. Recent studies, however, demonstrate that optimizing noise schedules—such as diffusion forcing [30,31]—facilitates autoregressive long-horizon generation [32]. Similarly, Yan [33] progressively injects noise within temporal windows and applies a sliding-window sampling strategy to ensure local continuity while preserving long-term coherence. GEM [34] maintains temporal continuity in long video generation by denoising in multiple stages, conditioning on previous frames, and enforcing progressively increasing noise levels over time during training.
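A minimal sketch of the underlying idea is shown below: each frame is assigned its own, progressively larger noise level, so the model learns to denoise the future from a cleaner past. It is our simplification of diffusion-forcing-style schedules, not the exact formulation used in the cited works, and the corruption rule is deliberately elementary:

```python
# Minimal sketch (illustrative simplification): per-frame noise levels that
# increase monotonically with frame index, applied as Gaussian corruption.
import torch

def per_frame_noise_levels(n_frames, t_max=1.0, growth=1.5):
    """Monotonically increasing noise level per frame index."""
    idx = torch.arange(n_frames, dtype=torch.float32)
    return (idx / max(n_frames - 1, 1)) ** growth * t_max   # shape (n_frames,)

def noise_video(clean, levels):
    """clean: (B, T, C, H, W); apply frame-wise corruption, later frames noisier."""
    sigma = levels.view(1, -1, 1, 1, 1)
    return (1 - sigma) * clean + sigma * torch.randn_like(clean)

clean = torch.randn(2, 16, 3, 32, 32)
noisy = noise_video(clean, per_frame_noise_levels(16))
```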
Conditional constraints. Imposing physical structural constraints on world models substantially enhances temporal physical continuity. Pathdreamer [35] employs a hierarchical two-stage architecture, first generating semantic and depth maps as structural contexts before predicting future frames conditioned on them. This layered strategy improves long-term fidelity since predicting structural representations grounded in physical geometry is inherently more continuous than directly generating RGB pixels. PlayerOne [36] adopts a joint reconstruction framework that simultaneously models 4D scenes and video frames, thereby ensuring physically consistent scene continuity. VRAG [22] incorporates global state information (e.g., character coordinates and poses) and retrieves relevant frames from a historical buffer, allowing the model to leverage past contexts and spatial awareness for improved temporal continuity.
Table 1. A detailed summary of the physical challenges encountered when learning world models from videos.

Continuity. Physical continuity is fundamental to stable world models. Maintaining continuity across time, space, and object identity is crucial for reliable long-term prediction and decision-making.
  • Temporal (Section 2.1.1). Gap: videos sample continuous processes as discrete frames and lack explicit causality, causing autoregressive models to accumulate temporal errors and break physical continuity. Solutions: autoregression improvements; optimization of schedules; conditional constraints; optimization-level methods.
  • Spatial (Section 2.1.2). Gap: videos are 2D projections that lose key 3D geometry (e.g., depth and structure). Solutions: implicit alignment; explicit alignment; memory mechanisms.
  • Identity (Section 2.1.3). Gap: videos lack explicit object properties (e.g., mass, material, shape). Solutions: identity perception.

Controllability. Videos passively record events without action–outcome causality, limiting world models in controllability, causal reasoning, and goal-directed behavior.
  • Semantic (Section 2.2.1). Gap: videos lack semantic–physical alignment, hindering grounded instruction. Solutions: semantic alignment mechanisms.
  • Interactivity (Section 2.2.2). Gap: videos record past events and cannot model counterfactuals or real-time responses to interventions. Solutions: low-level control signals; global control.

Generalization. Videos capture appearance rather than underlying physics, leading to overfitting. Generalization requires learning physics-grounded representations transferable across scenes and tasks.
  • Data Augmentation (Section 2.3.1). Gap: physically diverse, well-annotated video data are scarce, limiting coverage of rare dynamics. Solutions: automated data pipelines; simulation-to-real engines; foundation models.
  • Architecture (Section 2.3.2). Gap: naive architectures capture visual correlations rather than physical invariances. Solutions: advanced modeling.
  • Behavioral & Environmental (Section 2.3.3). Gap: videos from different embodiments and environments exhibit large distribution shifts in physical dynamics. Solutions: regularization & disentangling; context adaptation.

Lightweight. Lightweight models save resources, enable real-time interaction, and learn compact, generalizable dynamics, balancing efficiency and performance for constrained settings.
  • Representation & Efficiency (Section 2.4). Gap: videos are high-dimensional and redundant, with much irrelevant pixel data, hindering efficient extraction of physically meaningful representations. Solutions: latent modeling; sequence optimization; parameter-efficient methods and transfer; training and sample efficiency.

Universality. Videos capture only a limited slice of reality and lack unified cross-modal representations. Universal world models need shared physical abstractions that generalize broadly.
  • Modeling Multi-task Architecture (Section 2.5). Gap: videos’ passive, single-view, and modality-limited nature prevents learning unified physical representations. Solutions: shared representations; multi-task learning; knowledge generalization and transfer.
Optimization-level. Regularization strategies and loss-based enforcement provide direct optimization routes to penalize physical discontinuities in learned representations. SSD [37] employs a state-space modeling framework to efficiently handle long-horizon sequential information, enabling the extraction of rich long-term contextual dependencies. Vid2World [38] modifies both architecture and training objectives—particularly through adjustments to temporal attention layers and temporal convolution weight sharing—to enable causal generation and autoregressive capability in video diffusion models, ensuring predictions depend only on past physical states. SGF [39] enforces temporal continuity by minimizing the mean squared error between consecutive observations while introducing variance–covariance regularization to prevent representation collapse.
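As a concrete illustration of this optimization-level route, the sketch below combines a consecutive-step mean-squared-error term with variance–covariance regularization in the spirit of SGF; the weights and tensor shapes are our own assumptions rather than the original configuration:

```python
# Minimal sketch (illustrative, not the SGF implementation): penalize the gap
# between consecutive latent states while keeping per-dimension variance high
# and dimensions decorrelated, so representations do not collapse.
import torch

def temporal_continuity_loss(z, var_target=1.0, w_var=1.0, w_cov=0.04):
    """z: (B, T, D) latent states of a video sequence."""
    smooth = ((z[:, 1:] - z[:, :-1]) ** 2).mean()            # consecutive-step MSE

    flat = z.reshape(-1, z.shape[-1])
    flat = flat - flat.mean(dim=0)
    std = torch.sqrt(flat.var(dim=0) + 1e-4)
    var_reg = torch.relu(var_target - std).mean()             # keep variance up

    cov = (flat.T @ flat) / (flat.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_reg = (off_diag ** 2).sum() / z.shape[-1]             # decorrelate dims

    return smooth + w_var * var_reg + w_cov * cov_reg

loss = temporal_continuity_loss(torch.randn(4, 10, 128))
```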
Table 2. Summary of Works on Physical Continuity Challenges. Modalities: Vid = video or multi-frame; Img = single frame or image; Lat = latent representation; Spa = spatial representation; Txt = text; Act = action representation; Cam = camera (pose) information; Dep = depth feature; Neu = neural signal; Obj = object-level feature; Phy = physical information; Mem = historical memory feature. Evaluations: V assesses visual quality, generation, prediction, and control in downstream tasks, encompassing qualitative evaluations; R evaluates robotic tasks in simulation, while R̂ includes real-robot experiments; P measures physical understanding and perception; C evaluates planning and decision-making in games, tasks, and navigation.

| Work | Venue | Main Solution | Method | Input / Output | Conditions | Evals |
|---|---|---|---|---|---|---|
| Temporal Consistency | | | | | | |
| EnerVerse [25] | NeurIPS’25 | Autoregression Improvements | Sparse Chunks | Vid / Vid | Txt, Spa, Act, Cam, Dep | V, R̂ |
| EVA [26] | arXiv’25 | Autoregression Improvements | Reflection | Vid / Vid | Txt, Act | V, R̂ |
| Yume [27] | arXiv’25 | Autoregression Improvements | FramePack | Vid / Vid | Txt, Act, Obj, Neu | V |
| SAMPO [23] | NeurIPS’25 | Autoregression Improvements | Scale-Wise | Vid, Act, Txt / Vid | None | V, R̂ |
| Yan [33] | arXiv’25 | Diffusion Schedules | Progressive Noise | Vid / Vid | Txt, Act | V |
| GEM [34] | CVPR’25 | Diffusion Schedules | Increasing Noise | Vid, Img, Obj / Vid, Obj | Obj, Phy | V |
| Diamond [32] | NeurIPS’24 | Diffusion Schedules | Adaptive Noise | Vid, Act, Mem / Vid | None | V |
| Epona [31] | ICCV’25 | Diffusion Schedules | Diffusion Forcing | Vid / Vid | Act | V, C |
| Pathdreamer [35] | ICCV’21 | Condition Constraints | HR. Modeling | Img / Img | Spa, Dep | V, C |
| PlayerOne [36] | NeurIPS’25 | Condition Constraints | Rec. Constraints | Img, Spa, Act, Cam / Vid, Spa | Txt | V |
| VRAG [22] | NeurIPS’25 | Condition Constraints | Global Constraints | Vid, Act / Vid | None | C, V, R̂ |
| Vid2World [38] | ICLR’26 | Optimization Level | Loss-based | Vid, Act / Vid, Act | Txt | V |
| SSD [37] | NeurIPS’25 | Optimization Level | State-space | Vid, Act / Vid | None | C |
| SGF [39] | ICLR’25 | Optimization Level | Regularization | Vid, Act / Lat | None | C |
| Emu3.5 [40] | arXiv’25 | Optimization Level | Pre-training | Vid, Txt / Vid | None | V |
| Spatial Consistency | | | | | | |
| RoboScape [41] | NeurIPS’25 | Implicit Representation Alignment | HR. Modeling | Vid, Act, Dep / Img, Dep, Phy | None | V, R |
| ManipDreamer [42] | arXiv’25 | Implicit Representation Alignment | Action Tree | Img / Vid | Txt, Dep, Obj | V, R |
| WorldGrow [43] | AAAI’26 | Implicit Representation Alignment | Block Inpainting | Spa / Spa | Txt, Img, Obj | V |
| DeepVerse [21] | arXiv’25 | Implicit Representation Alignment | Structural Alignment | Img, Dep, Cam / Img, Dep, Cam | Txt, Mem, Act | V |
| GAIA-2 [44] | arXiv’25 | Implicit Representation Alignment | Semantic Alignment | Vid / Vid | Phy, Spa, Txt | V, C |
| WVD [45] | CVPR’25 | Explicit Representation Alignment | Spatial Joint Modeling | Vid, Spa / Vid, Spa | None | V |
| FlashWorld [46] | ICLR’26 | Explicit Representation Alignment | Dual-mode Pre-training | Img / Spa | Txt, Cam, Neu | V |
| Geom. Forcing [47] | ICLR’26 | Explicit Representation Alignment | Rep. Alignment | Vid / Vid | None | V |
| InfiniCube [48] | ICCV’25 | Explicit Representation Alignment | HR. Constraints | Act / Spa | Spa, Img, Obj | V |
| MindJourney [49] | NeurIPS’25 | Explicit Representation Alignment | Language Guidance | Img, Txt / Img, Txt | Cam, Txt | V, C, P |
| UniFuture [50] | ICRA’26 | Explicit Representation Alignment | Multi-modal | Vid, Dep / Vid, Dep | Spa | V |
| Edeline [51] | NeurIPS’25 | Explicit Representation Alignment | Mem. Enhancement | Img, Act / Lat | None | C |
| Ctrl-World [52] | ICLR’26 | Explicit Representation Alignment | Space Constraints | Img, Dep / Vid, Dep | Spa, Obj, Cam | V |
| Spatial-Mem [53] | NeurIPS’25 | Explicit Representation Alignment | Semantic Alignment | Vid, Spa / Img | Spa, Mem | V |
| WorldMEM [54] | NeurIPS’25 | Memory Mechanism | Memory Bank | Vid / Vid | Cam, Neu, Mem | V |
| Voyager (LLM) [55] | TMLR’24 | Memory Mechanism | Skill Library | Txt / Act | Phy, Mem | C |
| SSM-World [56] | ICCV’25 | Memory Mechanism | State-Space Models | Vid / Vid | Act | V |
| Mem. Forcing [57] | arXiv’25 | Memory Mechanism | Memory Replay | Vid, Mem / Img | Spa, Cam | V |
| Identity Consistency | | | | | | |
| SSWM [58] | arXiv’24 | Attention | Semantic Alignment | Img / Lat | Phy, Act | P |
| Loci-v1 [59] | ICLR’23 | Occlusion Imagination | Tracking | Vid / Lat | Obj | C |
| SAVi++ [60] | NeurIPS’22 | Tracking | Identity Tracking | Vid / Dep, Obj | None | P |
| ForeDiff [61] | arXiv’25 | Anchors | Arch. Decoupling | Vid / Vid | Act, Txt | V, R̂ |

2.1.2. Spatial Continuity

Videos are 2D projections that discard crucial 3D geometry, causing world models to miss physically grounded spatial relationships. Human spatial perception uses multi-view associations and long-term memory [47], inspiring recent methods to recover 3D geometry and preserve long-range spatial dependencies through memory mechanisms.
Implicit representation alignment. These methods do not directly parameterize 3D structures but instead encode physical spatial correlations implicitly within the model through cross-modal fusion or attention mechanisms. For instance, EnerVerse [25] introduces cross-view spatial attention, leveraging camera intrinsics/extrinsics and ray direction maps to model view correspondences and enhance holistic multi-view generation. RoboScape [41] jointly learns from RGB and depth, using depth features as geometric constraints to implicitly acquire 3D physical scene priors rather than merely fitting 2D images. Similarly, ManipDreamer [42], GEM [34] and DeepVerse [21] integrate depth, semantics, RGB, and dynamic masks to improve spatial physical continuity. GAIA-2 [44], as a surround-view world model, ensures multi-camera spatial continuity by aligning streams via structured conditions such as environmental factors and road semantics, enabling high-resolution spatiotemporally coherent generation.
Explicit representation alignment. These approaches provide explicit physical spatial information to supervise joint distributions, spatial guidance, and alignment. WVD [45] encodes global 3D coordinates into spatial pixels, learning the joint distribution of 2D space and 3D coordinates from 6D (RGB+XYZ) video with explicit geometric supervision. Geometry Forcing [47] aligns intermediate video model representations with features from pretrained geometric foundation models, guiding the model to internalize physical geometric alignment over latent perspectives and scales. Pathdreamer [35] accumulates past observations into a 3D point cloud and reprojects it into 2D space as context, explicitly reasoning about the 3D physical geometry of the next frame. InfiniCube [48] constructs an external voxel-based “3D ground buffer” from videos, providing explicit physical grounding to mitigate spatial drift in long sequences. DSG-World [62] explicitly builds a 3D Gaussian world model from dual-state video observations, introducing dual-segmentation-aware Gaussian fields, pseudo-intermediate symmetric alignment, and cooperative pruning/pasting to maintain geometric integrity under occlusions. Voyager-DM [63] aligns RGB with depth and employs efficient world caching for long-range geometrically consistent 3D reconstruction. Spatial-Mem [53] leverages external geometry-grounded 3D world representations (e.g., point clouds) to filter dynamic regions while explicitly memorizing static spatial structures.
Memory mechanisms. Many approaches incorporate external memory banks or retrieval modules [55] to store historical physical observations, mitigating error accumulation and preserving spatial continuity. WorldMem [54] leverages a memory bank to store past visual and state features, achieving long-term 3D physical continuity through a memory attention mechanism. Spatial-Mem [53], in contrast, maintains sparse keyframes as episodic memory and dynamically expands when new regions appear, thereby enhancing the preservation of spatial relationships. Unlike these, Memory Forcing [57] emphasizes backward retrieval of matched source frames from the current viewpoint, centering on geometric continuity to reduce drift and improve structural accuracy when revisiting scenes. Furthermore, PEWM-3D [64] continuously integrates observations into a shared 3D feature map (e.g., Plücker coordinate embeddings), which is referenced throughout the generation phase to ensure globally continuous spatial coherence. In addition, SSM-World [56] introduces structured memory along the spatial dimension via a block-wise scanning State Space Model (SSM) mechanism, partitioning the global space into controllable units. This design balances temporal memory with spatial physical continuity while incorporating frame-level local attention to enhance generation quality.
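The retrieval-and-fusion pattern shared by these memory mechanisms can be sketched as follows; this is our own illustration (pose-distance retrieval plus single-head cross-attention), not the actual implementation of WorldMem or Spatial-Mem, and all names and sizes are hypothetical:

```python
# Minimal sketch (illustrative only): retrieve stored keyframe features whose
# camera poses are close to the current view and fuse them into the current
# frame feature with a single cross-attention step.
import torch
import torch.nn.functional as F

class SpatialMemory:
    def __init__(self, top_k=4):
        self.feats, self.poses = [], []            # lists of (D,) and (P,) tensors
        self.top_k = top_k

    def write(self, feat, pose):
        self.feats.append(feat)
        self.poses.append(pose)

    def read(self, query_pose):
        poses = torch.stack(self.poses)            # (N, P)
        dist = (poses - query_pose).norm(dim=-1)   # nearest stored viewpoints
        idx = dist.topk(min(self.top_k, len(dist)), largest=False).indices
        return torch.stack([self.feats[i] for i in idx])   # (K, D)

def fuse(current, retrieved):
    """current: (D,); retrieved: (K, D); attention-weighted memory readout."""
    attn = F.softmax(retrieved @ current / current.shape[0] ** 0.5, dim=0)
    return current + attn @ retrieved

mem = SpatialMemory()
for t in range(8):                                  # populate the memory bank
    mem.write(torch.randn(64), torch.randn(6))
fused = fuse(torch.randn(64), mem.read(torch.randn(6)))
```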

2.1.3. Identity Continuity

World models that operate directly in pixel space often fail to capture the physically invariant properties of objects across frames. Since pixel-level representations are highly sensitive to noise, lighting changes, and viewpoint variation, they cannot maintain the stable attributes — such as shape, material, and appearance — that define object identity in the physical world, leading to identity drift in long sequences. Fine-grained object-level modeling is therefore crucial for maintaining physically grounded identity continuity across frames [65]. In this regard, the GEM model [34] introduces “identity embeddings” for objects to eliminate operation ambiguities and adopts customized integration mechanisms for different types of control signals (such as self-motion, object manipulation, and human posture), ensuring strict physical identity continuity between generated content and diverse control signals. Similarly, FOCUS [66] allocates a unique one-hot identity vector to each object, and its object latent extractor generates stable per-object latent representations s_obj^t conditioned on this identity, thereby ensuring identity continuity. SSWM [58] employs a differentiable, iterative competitive attention process to map visual inputs, automatically and end-to-end, to a set of stable, semantically aligned latent representations, ensuring that object identities remain coherent both spatially and temporally. Loci-v1 [59] and SAVi++ [60] approach identity continuity by decomposing the scene into multiple object slots. Similarly, ForeDiff’s [61] prediction flow learns during pretraining to precisely identify and preserve physically grounded, identity-related object features, such as spatial location, shape, and appearance, from video frames, providing an “identity anchor.” Recently, the unprecedented success of Sora2 [13] in cross-scene multi-view identity continuity demonstrates the immense potential of learning physically consistent representations from videos.
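A minimal sketch of the identity-embedding idea, loosely in the spirit of FOCUS and GEM, is given below: each object slot is tagged with a fixed one-hot identity before its per-frame latent is produced, so the latent stays tied to the same object across frames. The module and dimensions are hypothetical:

```python
# Minimal sketch (illustrative only): identity-tagged object latents. Each
# object keeps a fixed one-hot identity that is concatenated with its per-frame
# slot feature before projection to the object latent.
import torch
import torch.nn as nn

class ObjectLatentExtractor(nn.Module):
    def __init__(self, n_objects=8, feat_dim=32, latent_dim=64):
        super().__init__()
        self.identity = torch.eye(n_objects)             # fixed one-hot IDs
        self.proj = nn.Linear(n_objects + feat_dim, latent_dim)

    def forward(self, slot_feats):                       # (T, n_objects, feat_dim)
        T = slot_feats.shape[0]
        ids = self.identity.unsqueeze(0).expand(T, -1, -1)
        return self.proj(torch.cat([ids, slot_feats], dim=-1))   # (T, N, latent)

extractor = ObjectLatentExtractor()
object_latents = extractor(torch.randn(5, 8, 32))         # identity-tagged latents
```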

2.2. Controllability

The controllability gap exists because videos passively record events and cannot represent action-conditioned responses, causal interventions, or counterfactuals. Without this, models cannot predict “what if” scenarios, limiting policy exploration and intervention. Current research addresses this limitation by introducing explicit modeling of action–outcome mappings to ensure that environmental states respond to interventions in accordance with physical laws, while also striving to construct interactive world models.

2.2.1. Semantic Control

Since videos lack explicit representations of the semantics underlying scene changes — such as object material properties, force interactions, and spatial affordances — world models struggle to ground high-level semantic instructions in physically consistent outputs. Semantic control methods aim to bridge semantic intent and physical scene evolution by integrating structured semantic inputs, text-conditioned guidance, and multimodal semantic alignment mechanisms [67], effectively preventing semantic drift and physically inconsistent generation. Models such as GAIA-2 [44], InfiniCube [48], and GEM [34] achieve this by directly injecting structured physical and semantic conditions—such as road semantics, high-definition maps, vehicle bounding boxes, textual prompts, or DINOv2 [68] features—into the generative model. This enables the predictable manifestation of advanced semantic instructions, including multi-agent interaction, multi-camera consistency, dynamic object insertion, weather condition adjustments, and object movement or insertion, while maintaining physical plausibility in the generated scenes.
For finer semantic interaction, Yume [27] utilizes novel sampling strategies and stochastic differential equations to enhance the controllability of text conditions, and achieves physically grounded camera control through quantized camera trajectories. LaDi-WM [69] innovatively designs an interactive diffusion process that dynamically adjusts DINO geometric latent codes and SigLip semantic latent codes, aligning geometry with semantics to accurately model the physically consistent evolution of scene semantics within world dynamics. FlowDreamer [70] adopts 3D scene flow as a universal motion representation, capturing physically grounded semantic deformations of non-rigid objects, and supports fine-grained semantic motion control that respects plausibility. Pathdreamer [35] avoids semantic ambiguity in camera modeling by directly specifying future viewpoint trajectory sequences grounded in spatial geometry. Additionally, NWM [71] and ManipDreamer [42] implement real-time semantic planning and semantic command compliance in robotic operations using CDiT [72] architectures with semantic constraints, energy function-encoded semantic rules, and action tree-structured instructions, ensuring that semantic control signals translate into physically executable robot behaviors.

2.2.2. Interactivity

Interactivity enables world models not only to passively predict but also to actively respond to interventions and instructions, thereby supporting causal reasoning and goal-directed policy exploration. Existing studies mainly focus on two directions: (I) interaction video generation driven by local trajectories or action signals (emphasizing sequence-level action conditioning, suitable for stepwise prediction in robotics or games); and (II) real-time explorable or 3D world models driven by global signals (emphasizing holistic prompts or multimodal inputs, suitable for immersive simulation).
The first line of research emphasizes conditioning video generation models on low-level control signals (e.g., robot joint actions, game inputs, or reward feedback) to enable stepwise interactive prediction. Representative works include Vid2World [38], AVID [73], iVideoGPT [74], DWS [75], AirScape [76], and UnifoLM-WMA-0 [77], which employ action-injection mechanisms to map user-provided control signals into sequences of future video frames, thereby supporting embodied interactive exploration such as robotic trajectory evolution. These approaches essentially constitute controllable video-driven world models, establishing a tight coupling between low-level actions and environmental dynamics. In addition, some works further focus on multi-task trajectory integration and generalization. For instance, WLA [78] aggregates cross-environment trajectories to achieve continuous dynamic prediction, while the IAFM [79] extracts action signals from large-scale robotic video trajectories to support cross-task and cross-domain interactive simulation.
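The action-injection pattern underlying this first line of work reduces to a closed interaction loop, sketched below with a toy latent dynamics model; the interface and dimensions are our own assumptions rather than any cited system's API:

```python
# Minimal sketch (illustrative only): a closed interaction loop in which an
# agent supplies a low-level action each step and the world model returns the
# predicted next observation (here, a latent observation vector).
import torch
import torch.nn as nn

class ActionConditionedWorldModel(nn.Module):
    def __init__(self, obs_dim=64, act_dim=8):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim)
        )

    def step(self, obs, action):                 # predict the next observation latent
        return self.dynamics(torch.cat([obs, action], dim=-1))

wm = ActionConditionedWorldModel()
obs = torch.randn(1, 64)
for t in range(20):                              # interactive rollout
    action = torch.randn(1, 8)                   # e.g., controller or policy output
    obs = wm.step(obs, action)
```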
The second line of research is driven by global conditions (e.g., text, natural language instructions, or multimodal prompts), aiming to construct navigable and explorable spatial world representations (3D or 4D) and extend video generation into embodied, immersive, or panoramic scene simulation. For example, TesserAct [80] employs instruction-driven grid evolution mechanisms for controllable environment generation, whereas PlayerOne [36] models first-person-view (POV) videos to support navigation. LatticeWorld [81] addresses static 3D world modeling by seamlessly integrating with an industrial-grade computer graphics rendering engine. In virtual environments, several studies investigate open-world simulation based on game scenarios: MineWorld [82] and Hunyuan-Game [6,83] are specifically designed for game trajectories, supporting agent-controlled interactive exploration; Yan [33] introduces a multi-granularity editing framework that combines peripheral control devices for interactive video generation; while Voyager [63], Mirage 2 [84], and Matrix Game [7] further develop real-time explorable game-world modeling. Moreover, some works explicitly incorporate 3D spatial structures into generative models: for instance, Yume [27], Marble [85], and Matrix-3D [86] combine omnidirectional video generation with interactive control, enabling real-time 360° scene exploration driven by external devices.

2.3. Generalization

Although large-scale video data exist, annotated datasets for embodied intelligence and 3D environments remain scarce. Videos capture appearances rather than underlying physical laws, causing overfitting to visual patterns and poor coverage of rare dynamics. Recent work uses data augmentation, improved architectures, and domain generalization to enhance generalization and cross-scene adaptability.

2.3.1. Data

Data Synthesis and Utilization. A typical approach involves fine-tuning generative models to synthesize diverse new data. For example, DREAMGEN [87] employs a video-based world model to generate both familiar and novel tasks across diverse environments, while integrating a latent action model or inverse dynamics model to infer pseudo-action sequences, forming “neural trajectories.” A more advanced framework, Dream to Manipulate [67], introduces a learnable digital twin that combines Gaussian splatting [88] with simulators to generate new environments and action configurations with physically consistent dynamics, thereby expanding the diversity of the training distribution. As shown in Figure 3, the EnerVerse-D data engine [25] integrates world models with 4D Gaussian Splatting (4DGS) to establish a self-reinforcing data generation loop capable of producing high-quality, multi-view video data with geometrically consistent scene reconstruction.
In addition, many studies explore unsupervised or weakly supervised paradigms. GEM [34] leverages large-scale cross-domain data and pseudo-labels to generate depth maps, trajectories, and poses, improving instance-level perception and generalization. FLARE [89] enhances temporal representation learning by leveraging structured representations and action-aware future embeddings derived from unlabeled data. VLMWM [90] employs a dynamic modeling mechanism to automatically generate pseudo-labels, thereby strengthening the model’s supervisory signals for dynamics.
Sim2Real Transfer. Since simulators can explicitly model physical properties such as friction, mass, and contact dynamics, simulation-to-reality transfer offers a principled way to inject physical knowledge into world models. SRCC [91] introduces the Sim2Real Correlation Coefficient to quantify a simulator’s predictive fidelity, optimizing simulation parameters to better align dynamics with real-world performance. SimWorld [92], a simulator-conditioned scene generation engine, integrates scene construction and simulation modules to produce synthetic data and labels with physically accurate properties for world model training. Other methods exploit general semantic alignment to bridge the appearance gap. For instance, Lang4Sim2Real [91] employs natural language descriptions as a unified signal to bridge the visual gap between simulated and real-world imagery. In addition, generative editability-based approaches, such as Cosmos-transfer [93] and Dreamland [94], leverage multi-conditional multimodal pretraining to achieve photorealistic domain transfer while preserving physical plausibility.
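To illustrate what such a correlation-based fidelity measure computes, the short sketch below correlates simulated and real task scores across a set of evaluated policies; the exact definition used by SRCC [91] may differ, and the scores are hypothetical:

```python
# Minimal sketch (illustrative; the metric in SRCC [91] may be defined
# differently): correlate performance measured in simulation with performance
# measured on the real system across a set of evaluated policies.
import numpy as np

def sim2real_correlation(sim_scores, real_scores):
    """Pearson correlation between simulated and real task performance."""
    sim, real = np.asarray(sim_scores), np.asarray(real_scores)
    return np.corrcoef(sim, real)[0, 1]

sim_scores = [0.82, 0.61, 0.77, 0.40, 0.93]      # hypothetical success rates
real_scores = [0.70, 0.55, 0.66, 0.35, 0.81]
print(sim2real_correlation(sim_scores, real_scores))   # near 1 => faithful simulator
```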
Foundation Model Priors. Cross-scene generalization can be enhanced by leveraging the prior knowledge encoded in pretrained video or multimodal foundation models. For example, UWM [28] and Vidar [95] achieve broad semantic understanding and efficient domain transfer by fine-tuning on only a small amount of domain-specific data. Similarly, Founder [96] maps foundation model representations into the world model latent state space, enabling world models to handle cross-domain scenarios and multimodal tasks. Notably, DINO-WM [97] trains a world model on offline behavioral data using DINOv2-based pretrained visual features, thereby enabling task-agnostic zero-shot planning grounded in physically meaningful visual representations. In addition, Geometry Forcing [47] exploits pretrained 3D foundation models for physically grounded representation learning, alleviating the modeling bottlenecks caused by limited 3D supervision.

2.3.2. Architecture Generalization

One class of approaches introduces structural modeling of visual or geometric inputs using discrete tokens, virtual intermediate states, or hierarchical structures to better capture physically invariant patterns. WorldDreamer [98] learns world dynamics through masked-token prediction, while DSG-World [62] reconciles observational discrepancies via geometric constraints and pseudo-states that encode structure. HMBRL [99] adopts hierarchical models with low-dimensional abstract actions to generalize across complex tasks. Another class emphasizes regularization and robust training to address uncertainty and drift in latent physical representations. MoSim [100] introduces residual flow penalties to avoid physically uncertain regions, while SDE-based [101] and Jacobian-regularized frameworks analyze and correct latent drift errors in physical state representations, thereby improving robustness. A third line of work focuses on multimodal conditioning and policy structures: GenRL [102] achieves task generalization with vision/language prompts, and BPN [103] separates state and task representations with bilinear fusion to enhance transferability and robustness under environmental shifts.
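The masked-token route can be illustrated with a small sketch: discrete video tokens are randomly masked and the model is trained to recover them from spatiotemporal context. This is our own simplified illustration, not WorldDreamer's architecture, and the vocabulary, depth, and mask ratio are arbitrary:

```python
# Minimal sketch (illustrative only): masked-token dynamics learning over
# discrete video tokens; masked positions are predicted from context.
import torch
import torch.nn as nn

class MaskedTokenDynamics(nn.Module):
    def __init__(self, vocab=1024, dim=128, mask_id=0):
        super().__init__()
        self.mask_id = mask_id
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, mask_ratio=0.5):         # tokens: (B, T) discrete ids
        masked = tokens.clone()
        drop = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
        masked[drop] = self.mask_id                     # replace with a mask token
        logits = self.head(self.backbone(self.embed(masked)))
        return nn.functional.cross_entropy(logits[drop], tokens[drop])

model = MaskedTokenDynamics()
loss = model(torch.randint(1, 1024, (2, 64)))           # e.g., VQ-codebook video tokens
```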

2.3.3. Behavioral and Environmental Generalization

Behavioral Generalization. This line of work aims to enhance the adaptability of world models in cross-task and cross-platform action modeling, where distribution shifts in dynamics across embodiments pose a central challenge. RUWM [104] enhances behavioral robustness by analyzing latent representation errors and introducing regularization over physical state transitions. 3DFlowAction [29] and FlowDreamer [70] address challenges across heterogeneous robotic platforms, complex manipulation tasks, and novel objects by leveraging 3D optical flow as a physically grounded motion representation and large-scale pretraining. AdaWorld [105] employs latent action self-supervision to extract and recombine actions, thereby enabling flexible control and behavioral generalization across heterogeneous environments.
Environmental Generalization. This line of work focuses on platform differences and adaptation to new environments. Vidar [95] unifies observation spaces across different robotic platforms by integrating multi-view video data, and introduces a masked inverse dynamics mechanism to extract action features from generated videos, achieving few-shot generalization in dual-arm manipulation under varied configurations. DreamerV3 [11] leverages robustness techniques to handle multimodal returns and sparse rewards, improving adaptability in novel environments; its extension, cRSSM [106], systematically incorporates contextual information into DreamerV3, further strengthening generalization to unseen settings. AirScape [76] adopts a two-stage training scheme—intent-controllable modeling and spatiotemporal constraint learning—to address distribution mismatches in physical dynamics and limited diversity in embodied intelligence for aerial domains. MoSim [100] decouples physics modeling from policy learning via a neural motion simulator, effectively mitigating challenges posed by unseen scenarios and distribution shifts in dynamics, thus enhancing cross-task generalization.
Table 3. Hardware and configuration comparison for lightweight world-model approaches (for reference only).

| Work | Venue | GPUs | Batch Size | Training Steps / Time |
|---|---|---|---|---|
| DINO-world [107] | arXiv’25 | H100 ×16 | 1024 | 350K iter. |
| HWM [108] | arXiv’25 | A6000 ×2 | 128 | — |
| MinD [109] | arXiv’25 | A40 ×4 | — | 9 hours |
| Sparse Imagin. [110] | ICLR’26 | 3090 ×4 | 32 | 100 epochs |
| Simulus [111] | arXiv’25 | 4090 ×1 | 8 | 100 epochs |
| EMERALD [112] | ICML’25 | 3090 ×1 | 16 | — |
| D2-World [113] | arXiv’24 | V100 ×8 | 24 | 24 epochs |
| AVID [73] | RLC’25 | A100 ×4 | 64 | 7 days |
| ScaleZero [114] | arXiv’25 | A100 ×8 | 512 | — |
| KeyWorld [115] | arXiv’25 | A800 ×8 | 1 | 100 epochs |
| TWIST [116] | ICRA’24 | 3090 ×1 | — | 500K iter. |
| IRIS [117] | ICLR’23 | A100 ×8 | 256 | 3.5 days |
| Δ-IRIS [118] | ICML’24 | A100 ×1 | 32 | 1K epochs |
| HERO [119] | arXiv’25 | A100 ×1 | — | — |
| PosePilot [120] | IROS’25 | A100 ×8 | — | — |
| OCWM [121] | ICLR’25 | H100 ×4 | 32 | 40 epochs |

2.4. Lightweight

Lightweight world models are crucial because high spatiotemporal overheads limit real-time applications such as robotics and autonomous driving. Beyond efficiency, reducing model size helps filter out visually redundant information, forcing models to capture physically meaningful dynamics rather than irrelevant details. Compact architectures act as inductive biases for learning concise, generalizable, and physics-grounded representations. Existing methods focus on maximizing computational and sample efficiency while preserving essential physical dynamics. Methodologies can be grouped into several complementary strategies:
Learning in structured latent spaces. Since physical dynamics are low-dimensional relative to raw pixel space, learning in structured latent spaces provides a natural way to separate physically meaningful state representations from visual redundancy. Akbulut et al. [122] provided early experimental evidence that structured latent representations significantly improve both efficiency and performance, achieving over a 50% gain compared with modeling directly in observation space. Recent works such as DINO-World [107], MinD [109], and HWM [108] similarly compress high-dimensional pixel streams into compact latent spaces and perform temporal modeling therein, leveraging VQ-VAE [123], denoised latent representations, or low-resolution latent features to fundamentally reduce the computational and memory overhead of sequence modeling while retaining physically relevant state information.
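The core pattern (encode once, predict in a compact latent space, decode pixels only when needed) can be sketched as follows; the architecture and sizes are illustrative assumptions, not those of the cited models:

```python
# Minimal sketch (illustrative only): latent-space world modeling. Frames are
# encoded into compact latents, temporal prediction runs in that small space,
# and pixels are decoded only on demand.
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),
                                     nn.Conv2d(16, latent_dim, 4, 2, 1),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)   # cheap latent rollout
        self.decoder = nn.Linear(latent_dim, 3 * 32 * 32)    # decode on demand

    def imagine(self, frame, horizon=10):
        z = self.encoder(frame)                               # (B, latent_dim)
        latents = []
        for _ in range(horizon):
            z = self.dynamics(z, z)
            latents.append(z)
        return torch.stack(latents, dim=1)                    # (B, horizon, latent)

    def decode(self, z):
        return self.decoder(z).view(-1, 3, 32, 32)

wm = LatentWorldModel()
future_latents = wm.imagine(torch.randn(2, 3, 32, 32))
last_frame = wm.decode(future_latents[:, -1])                 # pixels only when needed
```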
Shortening sequence length via tokenization, sparsification, and parallelism. Physical dynamics are often locally sparse — most video frames contain physically uninformative redundancy between key state transitions. Methods such as Sparse Imagination (random token dropping and grouped sparse attention) [110], Simulus (modular tokenization) [111], and Δ-IRIS (reduced key-frame encoding with autoregressive optimization) [118] lessen redundant temporal dependencies through token sparsification, concentrating model capacity on significant state changes. In parallel, approaches like EMERALD (MaskGIT-style parallel prediction) [112] and D²-World (non-autoregressive single-stage occupancy prediction) [113] further reduce the cumulative cost of autoregressive training by enabling parallel or masked generation across time.
Parameter-efficient strategies for deployment and transfer. Models including AVID [73], PosePilot [120], and LAMP [124] adapt tasks through lightweight adapters, black-box masking, or plug-and-play modules, avoiding modification — or requiring only minimal fine-tuning — of pretrained backbones that already encode broad physical priors. ScaleZero [114] balances capacity and efficiency via dynamic parameter scaling and staged LoRA expansion. KeyWorld [115] concentrates computation on a sparse set of semantically critical frames and employs lightweight convolutional models to synthesize intermediate frames, while TWIST [116] distills a privileged-state teacher world model into a student trained solely on domain-randomized visual observations, effectively reducing both training time and data requirements while preserving generalization.
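A minimal LoRA-style adapter, sketched below, illustrates the parameter-efficient strategy of freezing a pretrained layer and training only a low-rank update; it is a generic illustration rather than the specific mechanism of ScaleZero or AVID, and the layer it wraps is hypothetical:

```python
# Minimal sketch (illustrative only): a frozen pretrained projection augmented
# with a trainable low-rank (LoRA-style) update, so only a small fraction of
# parameters is fine-tuned for a new task or domain.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # start as an identity adaptation
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

pretrained = nn.Linear(256, 256)                  # stands in for a backbone layer
adapted = LoRALinear(pretrained)
out = adapted(torch.randn(4, 256))
```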
Table 4. Summary of Works on Physical Grounding Challenges. Note: Abbreviations used in the table—HR. denotes hierarchical representation. Modalities: Preprints 206969 i005 indicates video or multi-frame; Preprints 206969 i006 denotes single-frame or image; Preprints 206969 i007 refers to latent representation; Preprints 206969 i008 to spatial representation; Preprints 206969 i009 to text modality; Preprints 206969 i010 to action representation; Preprints 206969 i011 to camera (pose) information; Preprints 206969 i012 to depth feature; Preprints 206969 i013 to neural signal; Preprints 206969 i014 to object-level feature; Preprints 206969 i015 to physical information; and Preprints 206969 i016 to historical memory feature. Evals and Downstream Apps: Physical generation, question answering, interaction, understanding, attributes are abbreviated as PG,Q,I,U,A, A refers to action prediction. M stands for motion planning. F stands for fluid dynamics. In downstream applications: W is real world, R is robotics, D is autonomous driving, and O is objects.
Table 4. Summary of Works on Physical Grounding Challenges. Note: Abbreviations used in the table—HR. denotes hierarchical representation. Modalities: Preprints 206969 i005 indicates video or multi-frame; Preprints 206969 i006 denotes single-frame or image; Preprints 206969 i007 refers to latent representation; Preprints 206969 i008 to spatial representation; Preprints 206969 i009 to text modality; Preprints 206969 i010 to action representation; Preprints 206969 i011 to camera (pose) information; Preprints 206969 i012 to depth feature; Preprints 206969 i013 to neural signal; Preprints 206969 i014 to object-level feature; Preprints 206969 i015 to physical information; and Preprints 206969 i016 to historical memory feature. Evals and Downstream Apps: Physical generation, question answering, interaction, understanding, attributes are abbreviated as PG,Q,I,U,A, A refers to action prediction. M stands for motion planning. F stands for fluid dynamics. In downstream applications: W is real world, R is robotics, D is autonomous driving, and O is objects.
Work Venue Method Input / Output Conditions Evals and Apps
Explicit Priors & Feedback Integration
Pandora [126] arXiv’24 Physical Prompts Preprints 206969 i006/Preprints 206969 i005 Preprints 206969 i009 P G W
WorldGPT [127] MM’24 Modality Alignment Preprints 206969 i007Preprints 206969 i010/Preprints 206969 i007 Preprints 206969 i005Preprints 206969 i009 P G W
LLMPhy [128] arXiv’24 Engine Integration Preprints 206969 i006/Preprints 206969 i005 Preprints 206969 i005Preprints 206969 i010Preprints 206969 i009 P Q O
DrivePhysica [129] arXiv’24 Positional Constraints Preprints 206969 i005/Preprints 206969 i005 Preprints 206969 i009Preprints 206969 i011Preprints 206969 i008Preprints 206969 i015 P G D
PhysTwin [130] ICCV’25 Attribute Fusion Preprints 206969 i005Preprints 206969 i012/Preprints 206969 i008 None A R O
SlotPi [131] SIGKDD’25 Physical Constraints Preprints 206969 i005/Preprints 206969 i005 None P G , Q F D
S2-SSM [132] arXiv’25 Sparse Regularization Preprints 206969 i005/Preprints 206969 i006 None P I O
RenderWorld [133] ICRA’25 Pretraining Preprints 206969 i006Preprints 206969 i012Preprints 206969 i014/Preprints 206969 i010Preprints 206969 i007 Preprints 206969 i008Preprints 206969 i015 M D
DINO-WM [97] ICML’25 Pretraining Priors Preprints 206969 i006Preprints 206969 i010/Preprints 206969 i007 None M O R
HERMES [134] ICCV’25 Multi-view Modeling Preprints 206969 i006Preprints 206969 i009/Preprints 206969 i007Preprints 206969 i009 Preprints 206969 i010Preprints 206969 i015 A D
Cosmos [93] arXiv’25 Multimodal Constraints Preprints 206969 i005/Preprints 206969 i005 Preprints 206969 i009Preprints 206969 i014Preprints 206969 i012 P Q W
Disentangling Static and Dynamic Factors
AdaWorld [105] ICML’25 Action Decoupling P I, M O
Dyn-O [65] NeurIPS’25 Dynamic Decoupling A O
ContextWM [135] NeurIPS’23 Dynamic Decoupling P I A M D R
DisWM [136] ICCV’25 Dynamic Decoupling None P A, U O
DreamDojo [70] RAL’26 Explicit Action Modeling A M R
DreamZero [29] arXiv’26 Action Decoupling A R
OC-STORM [137] arXiv’25 Object Extraction None P U M O
AD3 [138] ICML’24 Action Decoupling P U M O
LongDWM [139] arXiv’25 Action Decoupling P G D
Vidar [95] arXiv’25 Action Decoupling P G R
DREAMGEN [87] arXiv’25 Pseudo Action Estimation P G R
VLMWM [90] arXiv’25 Fine-tuning None P G A W
WorldDreamer [98] arXiv’24 Disentangled Modeling P G W
Simulus [111] arXiv’25 Dynamic Decoupling M O
SCALOR [140] ICLR’20 Background Modeling P U A O
AETHER [141] ICCV’25 Unified Modeling None P G A W
UWM [28] RSS’25 Action Decoupling None P G A R
FLARE [89] CoRL’25 Unified Modeling A W
Progressive Constraints & Hierarchical Abstraction
DWS [75] AAAI’26 Regularization P G, I R O
Dreamland [94] arXiv’25 Engine Simulation P U, I D
GWM [142] ICCV’25 Hierarchical Abstraction A R
PIWM [143] arXiv’24 Interpretability None P U, I O
Ross et al. [144] ICLR’25 Theoretical Framework None P A O
SimWorld [92] arXiv’25 Simulation-based Modeling P G, U D
MoSim [100] CVPR’25 Multi-constraint None A R
WALL-E [145] NeurIPS’25 Rule Learning M O
FOLIAGE [146] arXiv’25 Hierarchical Abstraction P A O
LLMPhy [128] arXiv’24 Hierarchical Abstraction P A, I O
PILWM [147] arXiv’25 Soft Mask None P I O
VLWM [148] arXiv’25 Hierarchical Abstraction P Q M W
V-JEPA 2 [12] arXiv’25 Hierarchical Pretraining A M R
These approaches have demonstrated substantial improvements in training/inference speed, model size, and downstream control sample efficiency. Nevertheless, they share several trade-offs rooted in the tension between fidelity and computational economy: the invertibility and fidelity of latent representations bound how well variables can be recovered and fine-grained dynamics can be modeled; sparsification and non-autoregressive strategies struggle to maintain long-horizon continuity and to handle multimodal uncertainty in state transitions; and black-box adaptation methods generalize poorly when transferred to domains whose environments or dynamics distributions differ substantially from those seen during pretraining.

2.5. Universality

The lack of unification arises because videos capture only limited slices of reality and lack a shared cross-modal physical representation. Truly universal world models rely on shared physical abstractions that generalize across domains. The Platonic representation hypothesis [149] suggests that vision, language, and action are projections of the same underlying reality, implying a unified semantic space. Contemporary models leverage this by mapping heterogeneous modalities into shared physical representations, enabling cross-modal consistency, knowledge transfer, and multi-task robustness.
As shown in Figure 4, Yue et al.’s unified framework [125] decomposes the system into five components: interaction and perception; unified reasoning (dynamics, causality, explicit and latent reasoning); memory; environment (learnable and generative); and multimodal generation (video, image, audio, 3D, prediction). This framework exhibits important complementarities with Eq. (1)–(3) in this paper: while Eq. (1)–(3) define the formal backbone of world models, the above framework explicitly incorporates memory modules and multimodal generative capabilities, thereby providing broader coverage for emerging variants of world models.
Data pretraining and shared latent-space modeling form the core technical foundation for generality. DINO-world [107] leverages large-scale, uncurated web videos for pretraining, combined with the DINOv2 [68] visual encoder to perform latent-space predictions. Its compact Transformer autoregressive architecture supports variable frame rates and context lengths, extracting physically generalizable knowledge from diverse data and demonstrating seamless cross-task adaptability. WorldDreamer [98] frames world modeling as an unsupervised visual sequence problem, integrating text, image, and action modalities through masked token prediction, with parallel masked-token prediction enhancing efficiency and exhibiting high generalization and flexibility across tasks. V-JEPA 2 [12] performs self-supervised pretraining on Internet videos combined with limited robotic data to learn visually predictable representations of physical dynamics, enabling zero-shot planning and future prediction across downstream tasks. WPT [150] systematically leverages reward-free, non-expert, multi-robot, uncurated offline data, filling in action dimensions to pretrain a single world model across embodied agents. Cosmos [93], as a foundational world model, is pretrained on diverse video and physical scenarios to acquire universal structure and causal patterns, and after finetuning, can generate, predict, and simulate future states, handle cross-modal inputs, and produce controllable outputs, providing a unified framework for AI applications.
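As a concrete illustration of latent-space prediction with a frozen pretrained encoder, the following is a minimal sketch in the spirit of such approaches; it is not any released implementation, and the module names, sizes, and the frozen_encoder interface are illustrative assumptions.

```python
# Minimal sketch: autoregressive prediction of future frame embeddings in the
# latent space of a frozen, pretrained image encoder. All names and sizes are
# illustrative assumptions, not a reproduction of any cited system.
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    def __init__(self, latent_dim=768, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, z):
        # z: (B, T, latent_dim) sequence of frame embeddings
        T = z.size(1)
        causal = torch.full((T, T), float("-inf"), device=z.device).triu(1)
        h = self.backbone(z, mask=causal)    # each step attends only to the past
        return self.head(h)                  # prediction of the next embedding

def latent_prediction_loss(frozen_encoder, model, frames):
    """frames: (B, T, C, H, W) video clip; the visual encoder stays frozen."""
    B, T = frames.shape[:2]
    with torch.no_grad():
        z = frozen_encoder(frames.flatten(0, 1)).view(B, T, -1)
    pred = model(z[:, :-1])                  # predict z_{t+1} from z_{<=t}
    return nn.functional.mse_loss(pred, z[:, 1:])
```

Because the loss is computed entirely in embedding space, pixel-level stochasticity is never reconstructed, which is the main efficiency argument behind this family of methods.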
Modular design and multi-task shared representations provide architectural guarantees for generality. DreamerV3 [11] jointly learns environment models to predict action outcomes and plan ahead, using normalization and balancing techniques to handle diverse reward distributions and fixed hyperparameters for stable multi-task learning, demonstrating general adaptability without task-specific tuning. AETHER [141] integrates 4D dynamic reconstruction, action prediction, and visual planning within a single physical backbone, emphasizing geometric reasoning and achieving zero-shot generalization to real environments from synthetic data, while its global action representation further unifies and enhances generalization in navigation and robotic tasks. HERMES [134] unifies 3D scene understanding and future evolution via bird’s-eye representations, preserving geometric relations while integrating multi-view spatial information, and introduces a “world query” mechanism to fuse physical knowledge within a unified architecture. UniFuture [50] employs dual latent-variable sharing and multi-scale interaction to integrate future generation and depth perception, refining cross-modal features and generating temporally consistent future images and depth maps from single frames, exhibiting zero-shot generalization in unseen environments.
Flexible architectural extensions and instruction-based control further expand the applicability of generality. Simulus [111] uses a modular unified design, independently handling tokenizers, embedding tables, and prediction heads for each modality, decoupling representation learning from physical dynamics modeling and supporting various modality combinations, achieving sample-efficient generality across benchmarks. 1X World Model [151] predicts environment changes from specific action trajectories, capturing cross-task patterns and promoting generalization through multi-task training, enabling unified planning and decision-making across physical tasks. Pandora [126] integrates pretrained language and video models in a two-stage unified training, with instruction finetuning enabling real-time natural language control over video generation, demonstrating cross-domain unification and controllability in indoor, outdoor, and gaming environments. UWM [28] unifies action and video diffusion within a Transformer architecture, decoupling diffusion step execution, policy learning, and world modeling, learning causal and physical understanding from heterogeneous data and supporting flexible unified reasoning.
Figure 5. Three paradigms for explicit physical knowledge injection in world models. Left: integrating domain knowledge via explicit physical priors and simulation feedback to mitigate hallucinations in dynamic prediction. Middle: disentangling physically invariant (static) and physically variant (dynamic) latent factors for scene representation. Right: progressive constraints with hierarchical abstractions, from pixel-level fidelity to abstract causal structures.

3. Three Paradigms of Physical World Models

Physical knowledge in videos is often implicitly present and challenging for models to learn effectively. Explicit physical embedding endows world models with physical perception—the ability to understand physical laws, object properties, and causal dynamics—and is crucial for bridging the pixel–physics gap. Existing works can be broadly categorized into three paradigms of explicit injection, as follows:

3.1. Learning from Physical Priors

A prominent strategy is to incorporate physical priors and simulation feedback, anchoring the world model within classical mechanics frameworks to mitigate hallucinations in dynamic prediction. By explicitly injecting domain knowledge—such as object properties, conservation laws, or interaction constraints—models are guided toward physically plausible trajectories even under partial observability or long-horizon rollout. This paradigm is particularly important in safety-critical applications, where purely data-driven models often suffer from compounding errors and unrealistic predictions.
Representative methods include: Pandora [126] and WorldGPT [127], which leverage LLMs to parse textual descriptions of physical properties (e.g., stiffness, friction) and translate them into perceptual priors within simulated environments; LLMPhy [128], which iteratively optimizes scene parameters via LLM-based program synthesis and physics engine feedback to support complex reasoning tasks; and DrivePhysica [129], which explicitly models motion understanding and spatial relationships in autonomous driving scenarios using coordinate alignment, instance flow guidance, and bounding-box conditioning. PhysTwin [130] integrates spring–mass models, sparse-to-dense optimization, and Gaussian splatting to construct digital twins of deformable objects. Other approaches, such as SlotPi [131], introduce Hamiltonian constraints into spatiotemporal reasoning modules to enforce energy-consistent predictions, while S2-SSM [132] infers causal graphs from object interactions using sparsity-regularized state-space models. RenderWorld [133] and TransDreamer [152] further enhance physical consistency via 2D–3D occupancy mapping and transformer-based long-range modeling, respectively. These methods exemplify the broader shift toward hybrid neuro-symbolic systems that combine differentiable physics priors with learned representations. Notably, GeoPT [153] proposes a synthetic-dynamics pretraining paradigm that lifts static geometry into a dynamic space, enabling models to acquire physical intuition from unlabeled particle trajectory evolution.
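To illustrate how such a prior can be injected in practice, the sketch below uses a differentiable spring–mass step (semi-implicit Euler) as a soft physics-consistency regularizer on predicted particle trajectories; the topology, constants, and loss weighting are hypothetical and are not taken from PhysTwin or any other cited system.

```python
# Illustrative sketch of a physics-prior regularizer: a differentiable spring-mass
# step penalizes predicted particle trajectories that drift from the prior rollout.
import torch
import torch.nn.functional as F

def spring_mass_step(x, v, edges, rest_len, k=50.0, damping=0.1, dt=1.0 / 30, mass=1.0):
    """x, v: (N, 3) positions/velocities; edges: (E, 2) long tensor of particle pairs."""
    i, j = edges[:, 0], edges[:, 1]
    diff = x[j] - x[i]                                             # spring vectors, (E, 3)
    length = diff.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    force = k * (length - rest_len.unsqueeze(-1)) * diff / length  # Hooke's law
    f = torch.zeros_like(x)
    f.index_add_(0, i, force)                                      # equal and opposite forces
    f.index_add_(0, j, -force)
    f = f - damping * v
    v_next = v + dt * f / mass                                     # semi-implicit Euler
    x_next = x + dt * v_next
    return x_next, v_next

def physics_consistency_loss(pred_traj, edges, rest_len, dt=1.0 / 30):
    """pred_traj: (T, N, 3) particle positions predicted by the world model."""
    v = (pred_traj[1] - pred_traj[0]) / dt
    loss, x = 0.0, pred_traj[1]
    for t in range(1, pred_traj.shape[0] - 1):
        x, v = spring_mass_step(x, v, edges, rest_len, dt=dt)
        loss = loss + F.mse_loss(x, pred_traj[t + 1])              # deviation from the prior
    return loss / (pred_traj.shape[0] - 2)
```

Such a term is typically added with a small weight to the main prediction loss, so the learned dynamics remain data-driven while being nudged toward mechanically plausible motion.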

3.2. Learning from Action Decoupling

Another strategy for enhancing physical perception is to explicitly decompose scene representations into physically invariant (static) and physically variant (dynamic) components. This decomposition reflects a physics-inspired inductive bias: background structures and scene geometry typically follow stable constraints over time, while foreground objects evolve under dynamic forces and interactions. By separating these factors, world models can reduce interference between unrelated signals, leading to improved robustness, generalization, and interpretability in complex environments.
Several recent works operationalize this idea through structured latent representations and disentangled dynamics modeling. Dyn-O [65] employs a Mamba-based state-space model [154] to disentangle object-centric features, enabling stable prediction in cluttered scenes. ContextWM [135] separates perception into a context encoder that extracts temporally invariant cues and a dynamic module that models action-dependent evolution. 3DFlowAction [29] leverages 3D optical flow to unify motion representations across different agents, providing object-centric motion signals independent of the actor identity. OC-STORM [137] extracts precise object states and fuses them with raw visual input to mitigate background noise and repeated recognition errors, while AD3 [138] explicitly isolates task-relevant dynamics from distractors in control scenarios. More recently, latent action disentanglement and embedding approaches [105,155], exemplified by DreamDojo [156], learn compact and transferable action representations directly from human videos. These methods significantly improve embodied physical reasoning by aligning latent action spaces with underlying physical causality.
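A minimal sketch of this decomposition is given below; the architecture, pooling choice, and invariance loss are simplified placeholders in the spirit of context/dynamics separation, not a reproduction of any cited method.

```python
# Minimal sketch: separate a time-invariant context latent from per-step dynamic
# latents, with a reconstruction loss and a simple context-invariance penalty.
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    def __init__(self, obs_dim, ctx_dim=64, dyn_dim=32):
        super().__init__()
        self.context_enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                         nn.Linear(256, ctx_dim))
        self.dynamic_enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                         nn.Linear(256, dyn_dim))
        self.decoder = nn.Sequential(nn.Linear(ctx_dim + dyn_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def forward(self, obs_seq):                      # obs_seq: (B, T, obs_dim)
        ctx = self.context_enc(obs_seq).mean(dim=1)  # pooled: shared across the clip
        dyn = self.dynamic_enc(obs_seq)              # per-step dynamic factors
        ctx_tiled = ctx.unsqueeze(1).expand(-1, obs_seq.size(1), -1)
        recon = self.decoder(torch.cat([ctx_tiled, dyn], dim=-1))
        return ctx, dyn, recon

def disentanglement_loss(model, obs_seq):
    _, _, recon = model(obs_seq)
    rec = nn.functional.mse_loss(recon, obs_seq)
    # Invariance: the static context of two halves of the same clip should agree.
    half = obs_seq.size(1) // 2
    ctx_a, _, _ = model(obs_seq[:, :half])
    ctx_b, _, _ = model(obs_seq[:, half:])
    return rec + 0.1 * nn.functional.mse_loss(ctx_a, ctx_b)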

3.3. Hierarchical Progressive Learning

Hierarchical modeling with progressively enforced supervision has emerged as a scalable framework for physical world modeling, enabling representations to evolve from low-level sensory fidelity to high-level causal abstraction. This paradigm reflects the multi-scale nature of physical systems, where local interactions (e.g., pixel motion or contact dynamics) aggregate into global structures and long-term dependencies. By organizing learning across multiple levels of abstraction, hierarchical models can better capture both fine-grained physical details and coarse-grained semantic regularities.
Representative approaches include RoboScape [41], which dynamically samples salient keypoints (e.g., robots or manipulable objects) to implicitly encode deformation and material properties through temporal consistency; and DWS [75], which emphasizes motion-aware supervision to prioritize dynamic over static features. Dreamland [94] introduces hierarchical and controllable world abstractions that enable precise manipulation at both pixel and object levels, while PSI [157] employs a three-phase loop—predictor training, structure extraction (e.g., optical flow and depth), and reintegration—to achieve structured and controllable representations. From a theoretical perspective, the system-theoretic framework of Ross et al. [144] formulates world-model learning as low-dimensional temporal projection and tokenization, spanning models from classical least-squares regression to generative approaches such as GANs [158].
Recent methods further integrate geometry and physics into hierarchical pipelines. GWM [142] and PIWM [143] combine Gaussian splatting propagation with physics-equation mappings to achieve interpretable 3D state prediction and reasoning. Notably, hierarchical modeling has been widely recognized for its advantages in physical interpretability, modularity, and adaptability across tasks. V-JEPA2 [12], for instance, encodes videos into high-level semantic representations, filters out stochastic or texture-level noise, and learns stable temporal dynamics via masked prediction and multi-scale modeling. Through this progressive abstraction process, such models gradually approximate real-world state transitions and underlying physical laws within the latent space.
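The progressive-supervision idea can be illustrated in its simplest form as follows; feat_fn, abstract_fn, and the weighting schedule are hypothetical placeholders for mid-level and high-level encoders rather than components of any specific cited pipeline.

```python
# Schematic sketch of hierarchical, progressively weighted supervision: losses at
# pixel, feature, and abstract levels, with the abstract level ramped up over training.
import torch.nn.functional as F

def hierarchical_loss(pred_frame, gt_frame, feat_fn, abstract_fn, step, total_steps):
    w_pixel = max(0.1, 1.0 - step / total_steps)   # anneal pixel-level weight down
    w_abs = min(1.0, step / total_steps)           # ramp abstract-level weight up
    l_pixel = F.l1_loss(pred_frame, gt_frame)
    l_feat = F.mse_loss(feat_fn(pred_frame), feat_fn(gt_frame))
    l_abs = F.mse_loss(abstract_fn(pred_frame), abstract_fn(gt_frame))
    return w_pixel * l_pixel + l_feat + w_abs * l_abs
```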

4. Benchmarks

In robotics and embodied AI, datasets are essential for training, evaluation, and generalization. They span simulated and real-world scenarios, include multimodal data (images, videos, sensors, language), and support challenges like sim-to-real transfer, multi-task learning, and sparse rewards. The datasets fall into four areas: robot manipulation, planning and decision-making, physical perception, and visual quality and understanding. They increasingly incorporate extensions for video prediction, lifelong learning, and multimodal fusion.
Figure 6. Visualization of representative datasets for downstream tasks. The red, orange, blue, and green boxes represent datasets for visual evaluation, robotic manipulation, decision-making and planning, and physical evaluation, respectively

4.1. Datasets

4.1.1. Robot Manipulation

Robotic manipulation research relies on diverse datasets to evaluate world models. Open X-Embodiment [159], contributed by 21 institutions, includes over one million trajectories from 22 robot embodiments and unifies 60 prior datasets under a standard format, supporting cross-platform dynamics learning. Its RT-1 component [160] provides 130k demonstrations across 13 robots and 700 tasks, enabling vision-and-language–conditioned actions and zero-shot multi-task control.
For lifelong and continual learning, the LIBERO dataset [161] introduced by UT Austin in 2023 provides 130 tasks organized into four subsets—spatial, object-centric, goal-directed, and composite—designed to evaluate physical knowledge transfer and policy generalization across varied configurations. The DROID dataset [162] contains 76k trajectories and 350 hours of manipulation data collected with unified hardware (e.g., Franka Panda arm and Oculus controllers), offering RGB, depth, joint states, and language annotations across 564 scenes and 86 tasks. The lightweight PushT benchmark [163] focuses on 2D planar pushing with contact dynamics, suitable for rapid prototyping of diffusion policies and sparse-reward actor–critic algorithms in Gym environments [164].
Simulation-based datasets further improve accessibility to physically diverse training scenarios. The RLBench dataset [165] uses CoppeliaSim [166] to simulate a Franka Panda robot performing 100 tabletop tasks with varied object properties, supporting imitation learning and few-shot learning. Bridge [167] aggregates 60k kitchen manipulation demonstrations across 24 scenes and 13 skills, enriched with language annotations for end-to-end policy learning.
Advanced benchmarks are tailored to assess specialized physical capabilities. VP2 [168] provides 310 task instances with scripted trajectories for MPC, revealing how model capacity affects uncertainty-aware planning. RoboCasa [169] is a kitchen simulation platform featuring 2,500 objects and 100 tasks across various robot embodiments, combining human and auto-generated demonstrations for home-robotics research. DexArt [170] targets fine-grained manipulation of articulated objects with complex joints, while ARMBench [171] focuses on warehouse logistics, comprising 235k pick-and-place operations involving 190k objects.
The DeepMind Control suite (DMC) [172] and its Remastered variant introduce visual perturbations across physically diverse tasks ranging from double pendulum to humanoid control, targeting robustness to visual overfitting while preserving dynamics evaluation. Franka Kitchen [173] simulates a 9-DoF arm interacting with multiple household appliances; Meta-World [174] provides 50 meta-RL tasks; DrawerWorld [175] evaluates Sawyer arm performance under visual invariance; MuJoCo Pusher [176] centers on precise 7-DoF object relocation under continuous dynamics. ManiSkill2 [177], built on SAPIEN, supports 20 task families with 4M demonstration frames spanning 2D/3D inputs. MyoSuite [178] incorporates biomechanical models with muscle fatigue, providing realistic musculoskeletal simulation for intricate dexterous tasks. AgiBot-World-Beta [179] covers 5 scenes, 217 tasks, and 100 robot types; Robomimic [180] provides a modular imitation-learning framework; RoboNet [181] aggregates 15M frames from 7 platforms for vision-based dynamics pretraining.
Further innovations include RH20T [182] with 110k multimodal sequences for evaluating contact-rich dexterous generalization; FurnitureBench [183], offering over 5,000 real teleoperated assembly demonstrations with complex physical constraints; RoboVerse [184], a unified multi-simulator platform with 10M physically diverse transitions; RoboMM [185], enabling cross-dataset transfer via 3D augmentation fusion; RoboAgent [186], integrating semantic enhancement for efficient multi-task learning; EmbodiedBench [187], featuring 1,128 multimodal LLM evaluations in physically grounded environments; and BEHAVIOR-1K [188], which simulates human-like household activities with realistic interactions for high-level planning research.

4.1.2. Planning and Decision-Making

Planning and decision-making research relies on diverse datasets covering embodied navigation, autonomous driving, and procedurally generated scenarios. They capture a range of dynamics—from rigid-body and multi-agent interactions to long-horizon state prediction—providing a foundation for evaluating world-model-driven planning. In embodied navigation, PointMaze [189] serves as a classical benchmark for offline reinforcement learning, examining long-horizon planning and sample efficiency through sparse rewards and multi-scale maze configurations. Habitat [190] leverages high-fidelity reconstructions from Matterport3D [191], HM3D [192], and Gibson [193] to create realistic indoor physical environments that support point-goal navigation and interactive policy learning. JRDB [194] complements these resources with 360° panoramic annotations of pedestrian-rich scenes, facilitating research on social dynamics and socially aware navigation. In autonomous driving, KITTI [195] advances image–LiDAR perception, odometry, and 3D reconstruction through synchronized stereo, LiDAR, and GPS recordings; the WorldArena dataset [196], built on RoboTwin 2.0 [197], is a dual-arm robot manipulation dataset containing task video frames, corresponding action control sequences, and task instructions; and the Waymo Open Dataset [198] spans the full autonomous driving stack—from scene perception and multi-agent trajectory forecasting to end-to-end driving supervision. OpenDV [199] aggregates multi-city driving videos for large-scale prediction and planning research. RealEstate10K [200] contributes large-scale camera pose annotations for 3D scene understanding, while DrivingDojo [201] and DriveWorld [202] focus on world-model training by emphasizing action-conditioned physical dynamics and memory-based state-space modeling for 4D dynamic scene pretraining.
Complementing real-world datasets, procedurally generated benchmarks provide controlled environments for evaluating generalization under varied dynamics. DeepMind Lab [203] and the OpenAI Procgen Benchmark [204] construct diverse task spaces through procedural generation, enabling systematic assessment of out-of-distribution generalization. Atari [205] and BSuite [206] remain standard diagnostic suites for core reinforcement learning capabilities, with Atari100k imposing strict sample-efficiency constraints. The λ Benchmark [207] targets mobile manipulation in multi-room environments, using high-quality human demonstrations to evaluate cross-scene generalization. Collectively, these datasets span the spectrum from real to simulated environments and from low-level perception to high-level decision-making, forming a comprehensive foundation for developing physically grounded world-model-driven intelligent agents.
Table 5. Summary of benchmarks for robotic manipulation in the context of physical world model evaluation. Abbreviations: Sr—single-arm, Dr—dual-arm, F—finger, G—general-purpose, Sk—skeletal. Suffix—R/S indicates real/simulation. Tr.—trajectory, Ts—task. Modalities: S—State, J—Joint information, R—Reward, A—Action, I—Image, M—Multi-view, D—Depth map, C—Camera pose, T—Text, N—Neural signals, L—LiDAR, O—Object-level.
Name Category Modalities Composition Size Brief Summary
AgiBot-World [179] G-R JTIDC 1M Tr./217 Ts 43.8T Humanoid robot training in manipulation, tools, and collaboration
EmbodiedBench [187] G-RS JNDT None/1k Ts Evaluating embodied agents in planning and control
Open X [159] G-R 1M+ Tr./160k Ts 8.9T Multi-robot learning for cross-embodiment manipulation transfer
BEHAVIOR-1K [188] G-S JDTOI None/1k Ts 165G Everyday household activities in simulation for AI agents
DMC [172] G-S JN Gen. Tr./50+ Ts DeepMind continuous control tasks for RL in locomotion and manipulation
RoboCasa [169] G-S JITD 100k Tr./100 Ts Simulation for generalist robots in kitchens with diverse assets
RoboVerse [184] G-S MYJI 500k+ Tr./1k+ Ts 23G Unified simulation for robot learning across tasks
DexArt [170] F-S SJ None/4 Ts Dexterous manipulation of articulated objects with robotic hands
MyoSuite [178] Sk-S JN 10k Tr./204 Ts Musculoskeletal models for dexterous human-like control
ARMBench [171] Sr-R JTO 240k Tr./3 Ts Amazon warehouse pick-and-place perception and manipulation
Bridge [167] Sr-R MT 60k Tr./13 Ts 387G Multi-task manipulation from demonstration data
DROID [162] Sr-R DCT 76k Tr./86 Ts 1.7T In-the-wild manipulation from mobile robots in offices
FurnitureBench [183] Sr-R J 5k Tr./8 Ts 55G Long-horizon manipulation such as furniture assembly
RH20T [182] Sr-R JDNT 110k Tr./150+ Ts 5T Multimodal contact-rich robotic skills for one-shot learning
RoboAgent [186] Sr-R M 100k Tr./38 Ts 425G Manipulation demonstrations for task-specific learning
RoboNet [181] Sr-R J 162k Tr./None 0.8T Multi-robot transfer learning in tabletop manipulation
RT-1 [160] Sr-R T 130k+ Tr./700+ Ts 111G Visuomotor policies from large-scale robot data
Franka Kitchen [173] Sr-S J 513 Tr./22 Ts Kitchen interaction tasks with Franka robot
LIBERO [161] Sr-S TJ 1693 Tr./130 Ts Long-horizon task learning and generalization
ManiSkill2 [177] Sr-S JDSON 4M Tr./20 Ts 151G Generalizable manipulation across robots and environments
Meta-World [174] Sr-S J 2M Tr./50 Ts 46G Meta-RL manipulation for fast adaptation
MuJoCo Pusher [176] Sr-S JN 5k Tr./1 Ts Continuous control pushing task in MuJoCo
PushT [163] Sr-S J 122 Tr./1 Ts 2.8G Tabletop pushing interaction tasks
RLBench [165] Sr-S DOJ Gen. Tr./100 Ts Simulation for RL and imitation learning
RoboMM [185] Sr-S MCDTJ 70k Tr./100+ Ts Multimodal generalist manipulation model
VP2 [168] Sr-S J 310 Tr./11 Ts 182G Visual planning for object manipulation
Robomimic [180] S/Dr-S J 5.9k Tr./5 Ts 19G Offline imitation and RL for manipulation
Table 6. Summary of benchmarks for decision-making & planning in the context of physical world model evaluation. Modalities: S—State, J—Joint information, R—Reward, A—Action, I—Image, M—Multiview, D—Depth map, C—Camera pose, T—Text, N—Neural signals, L—LiDAR, O—Object-level.
Name Modalities Composition Brief Summary
RealEstate10K [200] 80k video-extracted trajectories Camera trajectories, intrinsics, and poses
Procgen Benchmark [204] Programmatically generated 16 diverse games with varying difficulty
DeepMind Lab [109] AR 80k trajectories 3D first-person navigation and control
DrivingDojo [201] ASTO 18k videos Ego-vehicle actions, multi-agent interaction, open-world driving
Atari [205] IAR Programmatically generated 50+ classic Atari games
Habitat [190] IDOSR Indoor navigation and task interaction simulation
World-in-World [208] IDMATC 4 platforms and 4 tasks Closed-loop interactive embodied tasks
KITTI [195] LTDOS 180G videos, 100k trajectories Mobile robotics: perception, SLAM, planning, and detection
Waymo Open [198] MLSO 3 video subsets Multi-city driving perception and behavior prediction
JRDB [194] MLO 54 scenes Indoor/outdoor navigation and human-robot interaction
PointMaze [189] SARI 10M states Point-mass maze navigation
Bsuite [206] SAT Programmatically generated T-maze, umbrella task, and path exploration
λ Benchmark [207] TSA 571 demonstrations Language-guided indoor navigation and interaction
OpenDV [199] TALM 3T video ∼2059 hours of real-world driving videos
WorldArena [196] AIT 500 videos, 100k trajectories Open-world embodied evaluation with planning and interaction

4.1.3. Physical Perception

Physical perception datasets help models learn physical laws, action recognition, and dynamics prediction using synthetic or real videos. Meta FAIR's IntPhys (and IntPhys 2) [209] evaluates persistence, invariance, spatiotemporal continuity, and solidity via a violation-of-expectation paradigm, on which current models achieve only around 50% accuracy, close to chance level. InfLevel [210] emphasizes memory and reasoning under variable conditions, while SSv2 [211] contains 220,847 short videos of hand–object interactions across 174 action categories, serving as a challenging benchmark for robotics and video pretraining.
Egocentric video datasets like Epic-Kitchens-100 [212] provide 100 hours of head-mounted kitchen recordings with 90,000 annotated actions for recognition, anticipation, and adaptation, including hand–object boxes. EVA-Bench [26] evaluates embodied video prediction using multimodal vision–language and diffusion models across 125 human and robotic activities. NVIDIA PhysX [213], integrated into Omniverse by 2025, enables large-scale AI physical simulation for tasks like destruction, fluids, and particle dynamics.
Regarding simulation platforms, CARLA [214] provides a customizable urban driving environment with variable weather conditions to assess the robustness of autonomous driving perception. Kubric [215], integrating PyBullet [216] with Blender [217], enables large-scale synthetic data generation (terabyte-scale) for tasks such as optical flow and object detection. HeterNS [218] extends the Navier–Stokes equations for heterogeneous fluid simulation, supporting PDE solvers like Unisolver for vortex prediction. TraySim [128] generates multi-view collision and impact scenarios in MuJoCo, supporting research that combines LLMs with physics engines for stability prediction. WorldNet [127] offers multimodal state-transition data, including outdoor and real-world subsets, specifically designed for world model training.
DeepMind’s Perception Test [219], a multimodal dataset containing 11,620 video clips, evaluates models’ memory, physical reasoning, and semantic understanding through tracking and question-answering tasks. Physion [220] includes 1,200 object-interaction videos (e.g., rolling, deformation) for visual prediction research. PHYBench [221] assesses LLMs’ physical reasoning capabilities through 500 physics problems ranging from classical mechanics to quantum phenomena, using expression edit distance as the evaluation metric. PhysBench [222] contains over 10,000 data samples combining images, videos, and text, covering object properties, object relations, scene understanding, and physical dynamics across multiple tasks, aiming to assess models’ observation and reasoning of real-world physical phenomena. SCOPE [223] provides 17,600 frames of multi-agent urban scene data across varying weather conditions, enabling research on collaborative perception.

4.1.4. Quality and Understanding of Visual Physics

Observation quality and understanding datasets assess models’ abilities in semantic parsing, spatial reasoning, and visual fidelity, crucial for accurate world-model interpretation [32,224]. VSPW [225] offers 3,536 high-resolution videos with 251,632 semantic frames across 200 scenes; Cityscapes [226] provides 5,000 urban images with instance, disparity, and sequence annotations. ShotBench [227] and ShotVL [228] evaluate cinematic QA and multimodal alignment, ManipBench [229] tests manipulation reasoning, and Matterport3D [191] supplies 194,400 RGB-D images for 3D reconstruction and semantic understanding.
Spatial understanding datasets evaluate whether world models maintain geometrically consistent representations across viewpoints and time, essential for physically grounded prediction. R2R [230] provides 22,000 natural language navigation instructions in Matterport3D; GameWorld Score [7] measures Minecraft scene quality and controllability. SAT [231] offers 218,000 QA pairs for dynamic spatial reasoning, WhatsUp [232] tests basic spatial relations, NuInteract [233] includes 150,000 driving-related image-QA pairs with 2D/3D localization, OmniDrive-nuScenes [234] evaluates 3D reasoning and counterfactuals, and Yume-Bench [27] assesses video generation quality.
Multimodal and causal reasoning datasets push evaluation toward the higher-order scene understanding that physically grounded world models ultimately require. OmniWorld [235] aggregates four-dimensional multimodal data from games and public sources for scene reconstruction and prediction. WorldPrediction [236] uses discriminative tasks to evaluate models’ causal reasoning in dynamic activity scenarios. Cosmos-Reason1 [93] provides 400 million video-text pairs specifically curated for training world models with physical commonsense reasoning in robotics and autonomous driving. PhyWorldBench [237] targets the physical fidelity of text-to-video generation models and contains approximately 1,050 physical phenomenon prompts to evaluate whether models adhere to real-world physical laws (including basic motion, collisions, energy conservation, etc.). It even includes “Anti-Physics” scenarios to test whether models can deliberately violate conventional physical logic according to instructions. WorldModelBench [238] contains 350 prompts across 7 domains and 56 subdomains, with human annotations to test whether generated videos follow real-world physics, causal consistency, and task instructions.
Table 7. A summary of commonly used visual evaluation benchmarks in world-model research.
Name Semantic Category Annotation Category Composition
VSPW [225] Indoor/Outdoor 3.5k videos
Cityscapes [226] Urban Semantic, depth & camera parameters 25k labeled videos
ShotBench [227] Photography Shot size, motion, lighting, layout 3.5k question answering pairs.
ManipBench [229] Robotics Task QA, deformation understanding 13k question answering pairs.
Matterport3D [191] Indoor Asset labels, depth, normals 90 buildings, 11k rooms
R2R [230] Indoor Room, asset, and action descriptions 7k paths
GameWorld Score [7] Games Quality evaluation, spatio-temporal consistency 1k hours labeled
SAT [231] Indoor/Outdoor Spatial QA, motion annotation 218k question answering pairs
WhatsUp [232] Indoor/Outdoor Spatial positions 820 question answering pairs
NuInteract [233] Autonomous Driving Description, spatial information, task category 850 scenes
OmniDrive [234] Autonomous Driving Lane-object and counterfactual reasoning
Sekai [239] Urban, Games Position, scene, weather, time, trajectories 5000+ hours videos
OmniWorld [235] Real/Sim. World Games, embodied, navigation, planning 600k videos
Cosmos-Reason1 [240] Real/Sim. World Spatio-temporal and physical annotations 1.7M question answering pairs
WorldPrediction [236] Third-person Healthcare, assembly, repair 810 instructional videos
PhyWorldBench [237] Real/Sim. World Physics categories (10 × 5 subcategories) 1,050 prompts
PhysBench [222] Real/Sim. World Object properties, relations, scene dynamics 10k question answering pairs
WorldModelBench [238] Real/Sim. World Comprehensive domains and disciplines 350 instances

4.2. Metrics

World model development remains fragmented, with diverse designs and evaluation metrics across tasks. Some works prioritize generative quality, others physical plausibility, often neglecting holistic performance. To address this, we systematically categorize existing evaluation methods by focus and application domain as follows:

4.2.1. Visual Physics Evaluation Metrics

Pixel and Structural Consistency: Metrics such as PSNR, SSIM [241] and Chamfer L1 Distance [242] measure the similarity between generated and ground-truth images at the pixel and structural levels.
Distributional and Perceptual Distance: Indicators including FID [243], FVD [244], Fréchet DINO (FDD) [245], LPIPS [246], DreamSim [247], and Fréchet Video Motion Distance (FVMD) [248] assess the statistical similarity between the generated and real data distributions.
Semantic and Aesthetic Metrics: CLIP Score [249], CLIP Aesthetic [250], and DINO-Score [251] reflect semantic alignment and aesthetic quality of generated results. Additionally, benchmark suites such as VBench [252] and WorldScore [253] evaluate video generation performance in terms of subject-background consistency, frame coherence, motion smoothness, image quality, and style stability.
Semantic Matching Metrics: BLEU [254], METEOR [255], ROUGE-L [256], CIDEr [257], Commonsense [238], Scene Revisit Consistency (SRC) [64], DOVER [258], and Embedding Cosine Similarity measure text-to-vision correspondence and scene reconstruction capability.
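For reference, the pixel- and structure-level metrics above can be computed as in the following sketch; it assumes uint8 RGB frames and that scikit-image is available for SSIM, and is an illustration rather than any benchmark's official implementation.

```python
# Minimal sketch of pixel/structural fidelity metrics for generated video frames.
import numpy as np
from skimage.metrics import structural_similarity

def psnr(pred, gt, data_range=255.0):
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def video_metrics(pred_video, gt_video):
    """pred_video, gt_video: (T, H, W, 3) uint8 arrays; returns per-clip averages."""
    psnrs = [psnr(p, g) for p, g in zip(pred_video, gt_video)]
    ssims = [structural_similarity(p, g, channel_axis=-1, data_range=255)
             for p, g in zip(pred_video, gt_video)]
    return {"PSNR": float(np.mean(psnrs)), "SSIM": float(np.mean(ssims))}
```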

4.2.2. Control, Planning, and Decision-Making Metrics

While specific implementations vary across robotics, navigation, and autonomous driving, most downstream evaluation logic shares common principles and dimensions. These tasks typically emphasize the overall performance across the perception–prediction–decision–control pipeline, capturing correctness, robustness, dynamic stability, and generalization capacity.
Performance and Correlation Metrics: Return [259], Rank Correlation [260], ELO Rating [261], and Human-Normalized Scores (HNS) [262] quantify policy performance and model consistency across tasks.
Statistical and Error Metrics: Precision, Recall, F1-score [263], MSE, Inter-Quartile Mean (IQM), and Normalized Hilbert-Schmidt Independence Criterion (NHSIC) [264] measure the deviation between model predictions and real-world distributions.
Dynamic and Model Error Metrics: Model Error Prediction (MEP) [265], Value Gap [266], Return Correlation [267], One-step Error (OSE) [268], Global Consistency Index (GCI) [267], and Global Prediction Error (GPE) [267] assess error propagation and stability in long-term prediction within dynamic environments.
Efficiency and Action Metrics: Sampling Rate, FPS, Time Costs, Action Accuracy [78], State Coverage [269], Discrete Action Classification (DAC) [82] and Probability of Improvement [270] evaluate computational efficiency, action generation precision, and exploration coverage.
In robotic manipulation scenarios, world models perform end-to-end mappings from high-dimensional sensory inputs to continuous control outputs. Metrics thus emphasize operational precision and task success, such as Success Rate, L1 Error, Instruction Following (IF) [87], End-effector Position Error (EEPE) [12], Mean Per-Joint Position Error (mPJPE) [271], Mean Relative-Root Position Error (mRRPE) [271] and Achieved Horizon [100], reflecting the model’s ability to perform fine-grained control and respond accurately to environmental feedback.
In control and navigation tasks (e.g., trajectory generation, autonomous driving), the evaluation focus shifts toward spatial accuracy and temporal coherence. Common metrics include Trajectory Tracking Success Rate (TTSR) [272], Absolute Trajectory Error (ATE) [273], Relative Pose/Translation/Rotation Error [273], Average Displacement Error (ADE) [274], Navigation Error [275], Collision Rate [258], Optimality Gap [276], and Time-to-Collision [277], which collectively measure geometric precision, motion smoothness, and planning optimality and safety. Furthermore, Normalized Dynamic Time Warping (NDTW) [278] and Success weighted by NDTW quantify temporal alignment and dynamic path consistency.
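A minimal sketch of several of these task-level metrics is shown below; the array shapes and episode bookkeeping are assumptions made for illustration.

```python
# Illustrative computation of common control/planning metrics: success rate over
# episodes, Average Displacement Error (ADE), and per-step collision rate.
import numpy as np

def success_rate(episode_outcomes):
    """episode_outcomes: iterable of booleans (task solved or not)."""
    outcomes = np.asarray(list(episode_outcomes), dtype=float)
    return float(outcomes.mean()) if outcomes.size else 0.0

def average_displacement_error(pred_traj, gt_traj):
    """pred_traj, gt_traj: (T, 2) or (T, 3) arrays of agent positions."""
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

def collision_rate(collision_flags):
    """collision_flags: (T,) booleans marking a collision at each step."""
    flags = np.asarray(collision_flags, dtype=float)
    return float(flags.mean()) if flags.size else 0.0
```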

4.2.3. Physically Grounded Metrics

Physically grounded perception metrics assess a world model’s ability to capture real-world physical laws, three-dimensional spatial relations, and dynamic consistency—core indicators of its depth of “world understanding.”
Geometric and Spatial Consistency: Metrics such as mIoU, Foreground mIoU (FmIoU) [279], and Foreground Adjusted Rand Index (FARI) [280] quantify structural reconstruction accuracy in segmentation and foreground recovery tasks. SPICE [281] and EVA-Score [26] evaluate semantic-level spatial expressiveness in physical reasoning and alignment tasks. Depth and surface metrics such as AbsRel [282], normal error (Median/Mean), and Accuracy Thresholds [282] further quantify geometric deviations in depth and normal estimation, providing evidence for 3D reconstruction fidelity.
Temporal and Dynamic Consistency: Flow Error [283] and Reprojection Error [253] measure projection deviations and optical flow stability in motion estimation and viewpoint reconstruction, while Revisit Error [54] tests physical consistency across repeated scene observations, verifying whether object representations remain temporally coherent.
Physical Constraint and Interaction Rationality: Physics Alignment [87] measures conformity to physical laws (e.g., gravity, collision, friction), while Control of Object Manipulation (COM) [34] and Implicit Risk Assessment (IRA) [34] capture safety and physical rationality in interactive tasks. GameWorld Score [7] offers a holistic evaluation of object consistency, scene coherence, and input control precision (e.g., keyboard, mouse) in simulation environments, indicating cross-scenario generalizability.
Representation and Feature Consistency: DINOv2 L2 Distance [68] and Structural Hamming Distance (SHD) [284] quantify the preservation of structural and causal relationships in learned feature space. Key Points Matching (KPM) [285] and Hits@k [286] assess performance in structural retrieval and correspondence, indicating adherence to physical constraints. Finally, WorldScore (3D Consistency, Object Content Control) [253] integrates 3D coherence and object-level controllability, serving as a comprehensive indicator of multimodal physical consistency.
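As an example of the geometric metrics discussed above, the following sketch computes AbsRel and the δ < 1.25 threshold accuracy for depth prediction; the validity mask and array shapes are assumptions.

```python
# Sketch of depth-oriented geometric metrics: absolute relative error (AbsRel)
# and threshold accuracy (delta < 1.25), computed on valid pixels only.
import numpy as np

def depth_metrics(pred_depth, gt_depth, eps=1e-6):
    """pred_depth, gt_depth: (H, W) arrays; only pixels with gt > 0 are scored."""
    mask = gt_depth > 0
    pred, gt = pred_depth[mask], gt_depth[mask]
    abs_rel = np.mean(np.abs(pred - gt) / np.maximum(gt, eps))
    ratio = np.maximum(pred / np.maximum(gt, eps), gt / np.maximum(pred, eps))
    delta1 = np.mean(ratio < 1.25)   # fraction of pixels within 25% of ground truth
    return {"AbsRel": float(abs_rel), "delta<1.25": float(delta1)}
```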

5. Future Directions and Discussion

5.1. Open Challenges

Deep Integration of Physics Engines and Neural Networks. Current video-driven world models primarily rely on large-scale data pretraining to implicitly learn physical laws. However, they still exhibit significant limitations in out-of-distribution generalization and in modeling complex phenomena such as contact dynamics and non-rigid deformations. Hybrid paradigms, exemplified by ContactGaussian-WM [287], integrate differentiable physics engines (e.g., MuJoCo [176], PhysX [213], Isaac [288]) with 3DGS representations, enabling more physically consistent modeling through the synergy of physical priors and neural networks. Nevertheless, several key challenges remain, including differentiability bottlenecks (e.g., discontinuities in contact events that hinder gradient propagation), computational efficiency (e.g., mismatches between simulation time scales and video frame rates), and the difficulty of accurately inferring material properties from visual observations.
Interpretability and Verifiability of World Models. Most existing world models operate as end-to-end black-box systems. While effective for perception and generation tasks, their lack of interpretability hinders deployment in safety-critical scenarios such as surgical robotics and nuclear facility inspection. Approaches such as PIWM [143] attempt to align prediction processes with explicit physical equations (e.g., Newtonian mechanics and energy conservation), incorporating physically meaningful latent variables, symbolic regression, and formal verification to enhance interpretability and verifiability. Recently, the divergence between JEPA [12] and generative paradigms has raised a fundamental question: is pixel-level prediction necessary for learning physical laws, or should models instead operate in abstract representation spaces to improve efficiency and generalization?
Cross-Domain Transfer. Section 2.5 discussed recent progress toward unified paradigms across multiple tasks, environments, and frameworks. However, achieving seamless generalization across domains—such as gaming, robotic manipulation, autonomous driving, and the real world—remains an open problem. First, general-to-specialized adaptation remains challenging. For instance, although the “pretraining–adaptation” paradigm (e.g., Waymo’s driving model [198]) has shown promise, heterogeneity in physical dynamics across domains (e.g., vehicle dynamics vs. robotic manipulators vs. human motion) limits the effectiveness of straightforward fine-tuning. Second, unified action spaces are still lacking. Cross-embodiment transfer is constrained by differences in action representations. Promising directions include the development of universal intermediate action representations, embodiment-agnostic motion primitives, and hierarchical action abstractions [156]. Third, commonsense-driven transfer remains underexplored. Works such as DisWM [136] and DINO-WM [97] have taken initial steps, but systematic transfer of physical commonsense still lacks robust evaluation protocols and unified modeling frameworks.
Standardization of Evaluation Metrics. Section 4 provided a detailed overview of existing benchmarks. However, visual realism does not guarantee physical correctness, and strong task performance does not necessarily imply genuine understanding of physical laws. Although efforts such as WorldScore [253] and WorldModelBench [238] aim to establish comprehensive evaluation frameworks, no community-wide consensus has yet been reached. Recent initiatives, including the CVPR 2025 WorldModelBench Workshop and the ICLR 2026 World Models Workshop, indicate that the community is actively moving toward standardized evaluation methodologies.
Data Bottlenecks and Synthetic Data Closed Loops. High-quality datasets must exhibit rich physical diversity and include multimodal signals such as depth, force/torque, and joint states, along with high-precision annotations. However, existing datasets remain insufficient in these aspects. Approaches such as NVIDIA Cosmos [93] and EnerVerse [25] propose a closed-loop data paradigm—“real → simulation → generation → augmentation of real data”—which improves data efficiency. Nevertheless, this paradigm introduces the risk of amplifying physical biases through generative models, highlighting the need for independent physical validation mechanisms.

5.2. Industrialization and Deployment

World models are rapidly transitioning from academic research to industrial deployment, forming a synergistic ecosystem across foundational platforms, vertical applications, and data infrastructures. According to a Frost & Sullivan white paper, over 80% of autonomous driving algorithm development pipelines have incorporated world models or simulation for training and validation.
On the platform level, NVIDIA’s Cosmos [93] has evolved into a comprehensive toolchain encompassing multimodal data management, video tokenization, pretrained models, and inference services. Google DeepMind’s Genie 3 [14] pushes world models toward real-time interaction, demonstrating general-purpose capabilities in autonomous driving and game generation. Meanwhile, World Labs [85] has productized 3D scene generation through Marble as a SaaS platform for creators, validating a viable B2C commercialization pathway. In vertical domains, Waymo’s world model [198] built upon Genie 3 demonstrates multi-sensor fusion, hierarchical control, and long-tail scenario generation, representing a benchmark implementation of the “general pretraining + domain adaptation” paradigm. Similarly, Huawei’s WEWA architecture and NIO’s Neural World Model (NWM) system emphasize embedding world models into closed-loop perception–prediction–planning pipelines, enabling online simulation and decision optimization. On the data side, large-scale robotic datasets such as AgiBot-World are lowering entry barriers for the industry.

5.3. Safety and Ethical Challenges

The International AI Safety Report 2026 [289] highlights that current AI systems exhibit “striking limitations” in physical world reasoning: while performing well within training distributions, they may fail catastrophically in edge-case physical scenarios. Representative examples include multi-vehicle collisions on icy roads in autonomous driving and millimeter-scale operational errors in surgical robotics.
Moreover, although synthetic data closed-loop training (see Section 2.3.1) improves data diversity, most existing studies overlook the risk of recursive bias amplification. Specifically, a generative model G, trained on a real dataset D, produces synthetic data D̃ = G(D) that inevitably inherits statistical biases from D, while also introducing new biases arising from the model’s inductive priors. These biases accumulate through nonlinear feedback loops, leading to systematic underrepresentation or overestimation of specific physical scenarios. For instance, in rare robotic manipulation settings—such as handling irregular soft objects, left-handed operations, or cluttered environments—synthetic data may reinforce physical inconsistencies rather than correct them if diversity is not explicitly enforced.
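A toy numerical illustration of this feedback effect, not drawn from any cited work and assuming a trivially simple Gaussian "generator" fitted by maximum likelihood, shows how estimation noise alone makes the synthetic distribution drift across generations of retraining.

```python
# Toy illustration: a "generator" that fits a Gaussian to its training data and is
# then retrained on its own samples. Estimation noise accumulates across
# generations, so the synthetic distribution drifts away from the original data,
# mirroring the need for independent physical validation in closed-loop pipelines.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100)   # original "real" dataset D

for generation in range(12):
    mu, sigma = data.mean(), data.std()           # fit generator G to the current data
    data = rng.normal(mu, sigma, size=100)        # D~ = G(D): next round trains on synthetic data
    print(f"generation {generation}: mean={mu:+.3f}, std={sigma:.3f}")
```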
Finally, unlike conventional deepfake content that often lacks strong physical consistency, the ultimate objective of world models is to construct highly realistic world simulators. At that stage, the indistinguishability of fine-grained physical details may enable malicious applications, including public opinion manipulation, scientific fraud, and data poisoning attacks targeting physical AI systems.

6. Conclusion

This paper provides a systematic overview of the rapidly evolving field of video-driven physical world models. The analysis was conducted from four perspectives: the pixel–physics gap and its core challenges, the modeling paradigms, the evaluation benchmarks, and future outlooks.
Regarding the pixel–physics gap, we identified and analyzed six core challenges: continuity, controllability, generalization, physical grounding, lightweight modeling, and universality. For each challenge, we summarized representative solutions and recent progress. These analyses show that while videos provide rich learning resources for world models, they also introduce a series of gaps that must be bridged through appropriate modeling paradigms and training strategies.
Physical consistency is not only a technical challenge but also the most critical bottleneck for world models to evolve from “video predictors” into “trustworthy physical simulators.” Overall, this article reveals current trends and several potential pathways for this evolution toward faithfully modeling the physical world.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ha, D.; Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122 2018, 2.
  2. Zhang, P.F.; Cheng, Y.; Sun, X.; Wang, S.; Zhu, L.; Shen, H.T. A Step Toward World Models: A Survey on Robotic Manipulation. arXiv preprint arXiv:2511.02097 2025.
  3. Tu, S.; Zhou, X.; Liang, D.; Jiang, X.; Zhang, Y.; Li, X.; Bai, X. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498 2025.
  4. Feng, T.; Wang, W.; Yang, Y. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260 2025.
  5. Guan, Y.; Liao, H.; Li, Z.; Hu, J.; Yuan, R.; Zhang, G.; Xu, C. World models for autonomous driving: An initial survey. IEEE Transactions on Intelligent Vehicles 2024.
  6. Li, J.; Tang, J.; Xu, Z.; Wu, L.; Zhou, Y.; Shao, S.; Yu, T.; Cao, Z.; Lu, Q. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition. arXiv preprint arXiv:2506.17201 2025.
  7. Zhang, Y.; Peng, C.; Wang, B.; Wang, P.; Zhu, Q.; Kang, F.; Jiang, B.; Gao, Z.; Li, E.; Liu, Y.; et al. Matrix-Game: Interactive World Foundation Model. arXiv preprint arXiv:2506.18701 2025.
  8. Medsker, L.R.; Jain, L.; et al. Recurrent Neural Networks: Design and Applications, 2001.
  9. Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. In Proceedings of the International Conference on Learning Representations, 2020.
  10. Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193 2020.
  11. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 2023.
  12. Assran, M.; Bardes, A.; Fan, D.; Garrido, Q.; Howes, R.; Muckley, M.; Rizvi, A.; Roberts, C.; Sinha, K.; Zholus, A.; et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 2025.
  13. OpenAI. Sora 2 is here. https://openai.com/index/sora-2/, 2025. Accessed: 2025-06-05.
  14. DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/, 2025. Accessed: 2025-06-05.
  15. Yue, J.; Huang, Z.; Chen, Z.; Wang, X.; Wan, P.; Liu, Z. Simulating the Visual World with Artificial Intelligence: A Roadmap. arXiv preprint arXiv:2511.08585 2025.
  16. Ding, J.; Zhang, Y.; Shang, Y.; Zhang, Y.; Zong, Z.; Feng, J.; Yuan, Y.; Su, H.; Li, N.; Sukiennik, N.; et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys 2024.
  17. Zhu, Z.; Wang, X.; Zhao, W.; Min, C.; Deng, N.; Dou, M.; Wang, Y.; Shi, B.; Wang, K.; Zhang, C.; et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520 2024.
  18. Lin, M.; Wang, X.; Wang, Y.; Wang, S.; Dai, F.; Ding, P.; Wang, C.; Zuo, Z.; Sang, N.; Huang, S.; et al. Exploring the evolution of physics cognition in video generation: A survey. arXiv preprint arXiv:2503.21765 2025.
  19. Liu, D.; Zhang, J.; Dinh, A.D.; Park, E.; Zhang, S.; Mian, A.; Shah, M.; Xu, C. Generative physical ai in vision: A survey. arXiv preprint arXiv:2501.10928 2025.
  20. Xie, N.; Tian, Z.; Yang, L.; Zhang, X.P.; Guo, M.; Li, J. From 2D to 3D Cognition: A Brief Survey of General World Models. arXiv preprint arXiv:2506.20134 2025.
  21. Chen, J.; Zhu, H.; He, X.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Fu, Z.; Pang, J.; et al. DeepVerse: 4D Autoregressive Video Generation as a World Model. arXiv preprint arXiv:2506.01103 2025.
  22. Chen, T.; Hu, X.; Ding, Z.; Jin, C. Learning World Models for Interactive Video Generation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  23. Wang, S.; Tian, J.; Wang, L.; Liao, Z.; lijiayi.; Dong, H.; Xia, K.; Zhou, S.; Tang, W.; Hua, G. SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  24. Xiang, J.; Gu, Y.; Liu, Z.; Feng, Z.; Gao, Q.; Hu, Y.; Huang, B.; Liu, G.; Yang, Y.; Zhou, K.; et al. PAN: A World Model for General, Interactable, and Long-Horizon World Simulation. arXiv preprint arXiv:2511.09057 2025.
  25. Huang, S.; Chen, L.; Zhou, P.; Chen, S.; Liao, Y.; Jiang, Z.; Hu, Y.; Gao, P.; Li, H.; Yao, M.; et al. EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  26. Chi, X.; Fan, C.K.; Zhang, H.; Qi, X.; Zhang, R.; Chen, A.; Chan, C.m.; Xue, W.; Liu, Q.; Zhang, S.; et al. Eva: An embodied world model for future video anticipation. arXiv preprint arXiv:2410.15461 2024.
  27. Mao, X.; Lin, S.; Li, Z.; Li, C.; Peng, W.; He, T.; Pang, J.; Chi, M.; Qiao, Y.; Zhang, K. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744 2025.
  28. Zhu, C.; Yu, R.; Feng, S.; Burchfiel, B.; Shah, P.; Gupta, A. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792 2025.
  29. Zhi, H.; Chen, P.; Zhou, S.; Dong, Y.; Wu, Q.; Han, L.; Tan, M. 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model. arXiv preprint arXiv:2506.06199 2025.
  30. Chen, B.; Martí Monsó, D.; Du, Y.; Simchowitz, M.; Tedrake, R.; Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 2024, 37, 24081–24125.
  31. Zhang, K.; Tang, Z.; Hu, X.; Pan, X.; Guo, X.; Liu, Y.; Huang, J.; Yuan, L.; Zhang, Q.; Long, X.X.; et al. Epona: Autoregressive diffusion world model for autonomous driving. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27220–27230.
  32. Alonso, E.; Jelley, A.; Micheli, V.; Kanervisto, A.; Storkey, A.J.; Pearce, T.; Fleuret, F. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems 2024, 37, 58757–58791.
  33. Ye, D.; Zhou, F.; Lv, J.; Ma, J.; Zhang, J.; Lv, J.; Li, J.; Deng, M.; Yang, M.; Fu, Q.; et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601 2025.
  34. Hassan, M.; Stapf, S.; Rahimi, A.; Rezende, P.; Haghighi, Y.; Brüggemann, D.; Katircioglu, I.; Zhang, L.; Chen, X.; Saha, S.; et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22404–22415.
  35. Koh, J.Y.; Lee, H.; Yang, Y.; Baldridge, J.; Anderson, P. Pathdreamer: A world model for indoor navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14738–14748.
  36. Tu, Y.; Luo, H.; Chen, X.; Bai, X.; Wang, F.; Zhao, H. PlayerOne: Egocentric World Simulator. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  37. Savov, N.; Kazemi, N.; Zhang, D.; Paudel, D.P.; Wang, X.; Gool, L.V. StateSpaceDiffuser: Bringing Long Context to Diffusion World Models. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  38. Huang, S.; Wu, J.; Zhou, Q.; Miao, S.; Long, M. Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv preprint arXiv:2505.14357 2025.
  39. Robine, J.; Höftmann, M.; Harmeling, S. Simple, Good, Fast: Self-Supervised World Models Free of Baggage. In Proceedings of the The Thirteenth International Conference on Learning Representations, 2025.
  40. Cui, Y.; Chen, H.; Deng, H.; Huang, X.; Li, X.; Liu, J.; Liu, Y.; Luo, Z.; Wang, J.; Wang, W.; et al. Emu3.5: Native Multimodal Models are World Learners. arXiv preprint arXiv:2510.26583 2025.
  41. Shang, Y.; Zhang, X.; Tang, Y.; Jin, L.; Gao, C.; Wu, W.; Li, Y. RoboScape: Physics-informed Embodied World Model. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  42. Li, Y.; Wei, X.; Chi, X.; Li, Y.; Zhao, Z.; Wang, H.; Ma, N.; Lu, M.; Zhang, S. ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance. arXiv preprint arXiv:2504.16464 2025.
  43. Li, S.; Yang, C.; Fang, J.; Yi, T.; Lu, J.; Cen, J.; Xie, L.; Shen, W.; Tian, Q. Worldgrow: Generating infinite 3d world. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 6433–6441. [CrossRef]
  44. Russell, L.; Hu, A.; Bertoni, L.; Fedoseev, G.; Shotton, J.; Arani, E.; Corrado, G. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523 2025.
  45. Zhang, Q.; Zhai, S.; Martin, M.A.B.; Miao, K.; Toshev, A.; Susskind, J.; Gu, J. World-consistent video diffusion with explicit 3d modeling. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21685–21695.
  46. Li, X.; Wang, T.; Gu, Z.; Zhang, S.; Guo, C.; Cao, L. FlashWorld: High-quality 3D Scene Generation within Seconds. In Proceedings of the The Fourteenth International Conference on Learning Representations, 2026.
  47. Wu, H.; Wu, D.; He, T.; Guo, J.; Ye, Y.; Duan, Y.; Bian, J. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. In Proceedings of the The Fourteenth International Conference on Learning Representations, 2026.
  48. Lu, Y.; Ren, X.; Yang, J.; Shen, T.; Wu, Z.; Gao, J.; Wang, Y.; Chen, S.; Chen, M.; Fidler, S.; et al. Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27272–27283.
  49. Yang, Y.; Liu, J.; Zhang, Z.; Zhou, S.; Tan, R.; Yang, J.; Du, Y.; Gan, C. MindJourney: Test-Time Scaling with World Models for Spatial Reasoning. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  50. Liang, D.; Zhang, D.; Zhou, X.; Tu, S.; Feng, T.; Li, X.; Zhang, Y.; Du, M.; Tan, X.; Bai, X. Seeing the Future, Perceiving the Future: A unified driving world model for future generation and perception. arXiv preprint arXiv:2503.13587 2025.
  51. Lee, J.H.; Lin, B.J.; Sun, W.F.; Lee, C.Y. EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  52. Guo, Y.; Shi, L.X.; Chen, J.; Finn, C. Ctrl-World: A Controllable Generative World Model for Robot Manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2026.
  53. Wu, T.; Yang, S.; Po, R.; Xu, Y.; Liu, Z.; Lin, D.; Wetzstein, G. Video World Models with Long-term Spatial Memory. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  54. Xiao, Z.; LAN, Y.; Zhou, Y.; Ouyang, W.; Yang, S.; Zeng, Y.; Pan, X. WorldMem: Long-term Consistent World Simulation with Memory. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  55. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research 2024.
  56. Po, R.; Nitzan, Y.; Zhang, R.; Chen, B.; Dao, T.; Shechtman, E.; Wetzstein, G.; Huang, X. Long-context state-space video world models. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8733–8744.
  57. Huang, J.; Hu, X.; Han, B.; Shi, S.; Tian, Z.; He, T.; Jiang, L. Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft. arXiv preprint arXiv:2510.03198 2025.
  58. Collu, J.; Majellaro, R.; Plaat, A.; Moerland, T.M. Slot Structured World Models. arXiv preprint arXiv:2402.03326 2024.
  59. Traub, M.; Otte, S.; Menge, T.; Karlbauer, M.; Thuemmel, J.; Butz, M.V. Learning What and Where: Disentangling Location and Identity Tracking Without Supervision. In Proceedings of the The Eleventh International Conference on Learning Representations, 2023.
  60. Elsayed, G.; Mahendran, A.; Van Steenkiste, S.; Greff, K.; Mozer, M.C.; Kipf, T. Savi++: Towards end-to-end object-centric learning from real-world videos. Advances in Neural Information Processing Systems 2022, 35, 28940–28954.
  61. Zhang, Y.; Guo, X.; Xu, H.; Long, M. Consistent World Models via Foresight Diffusion. arXiv preprint arXiv:2505.16474 2025.
  62. Hu, W.; Wen, X.; Li, X.; Wang, G. DSG-World: Learning a 3D Gaussian World Model from Dual State Videos. arXiv preprint arXiv:2506.05217 2025.
  63. Huang, T.; Zheng, W.; Wang, T.; Liu, Y.; Wang, Z.; Wu, J.; Jiang, J.; Li, H.; Lau, R.; Zuo, W.; et al. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG) 2025, 44, 1–15. [CrossRef]
  64. Zhou, S.; Du, Y.; Yang, Y.; Han, L.; Chen, P.; Yeung, D.Y.; Gan, C. Learning 3D Persistent Embodied World Models. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  65. Wang, Z.; Wang, K.; Zhao, L.; Stone, P.; Bian, J. Dyn-O: Building Structured World Models with Object-Centric Representations. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  66. Ferraro, S.; Mazzaglia, P.; Verbelen, T.; Dhoedt, B. FOCUS: object-centric world models for robotic manipulation. Frontiers in Neurorobotics 2025, 19, 1585386. [CrossRef]
  67. Barcellona, L.; Zadaianchuk, A.; Allegro, D.; Papa, S.; Ghidoni, S.; Gavves, S. Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. In Proceedings of the Greeks in AI Symposium 2025, 2025.
  68. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research Journal 2024.
  69. Huang, Y.; Zhang, J.; Zou, S.; Liu, X.; Hu, R.; Xu, K. LaDi-WM: A Latent Diffusion-Based World Model for Predictive Manipulation. In Proceedings of the Proceedings of The 9th Conference on Robot Learning; Lim, J.; Song, S.; Park, H.W., Eds. PMLR, 27–30 Sep 2025, Vol. 305, Proceedings of Machine Learning Research, pp. 1726–1743.
  70. Guo, J.; Ma, X.; Wang, Y.; Yang, M.; Liu, H.; Li, Q. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation. IEEE Robotics and Automation Letters 2026, 11, 2466–2473. [CrossRef]
  71. Bar, A.; Zhou, G.; Tran, D.; Darrell, T.; LeCun, Y. Navigation world models. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15791–15801.
  72. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205.
  73. Rigter, M.; Gupta, T.; Hilmkil, A.; Ma, C. AVID: Adapting Video Diffusion Models to World Models. In Proceedings of the Reinforcement Learning Conference, 2024.
  74. Wu, J.; Yin, S.; Feng, N.; He, X.; Li, D.; Hao, J.; Long, M. ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 2024, 37, 68082–68119.
  75. He, H.; Zhang, Y.; Lin, L.; Xu, Z.; Pan, L. Pre-trained video generative models as world simulators. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 4645–4653. [CrossRef]
  76. Zhao, B.; Tang, R.; Jia, M.; Wang, Z.; Man, F.; Zhang, X.; Shang, Y.; Zhang, W.; Wu, W.; Gao, C.; et al. AirScape: An Aerial Generative World Model with Motion Controllability. In Proceedings of the Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 12519–12528.
  77. Unitree Robotics. UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family. https://github.com/unitreerobotics/unifolm-world-model-action, 2025. Open-source world-model–action architecture spanning multiple types of robotic embodiments.
  78. Hayashi, K.; Koyama, M.; Guerreiro, J.J.A. Inter-environmental world modeling for continuous and compositional dynamics. arXiv preprint arXiv:2503.09911 2025.
  79. Durante, Z.; Gong, R.; Sarkar, B.; Wake, N.; Taori, R.; Tang, P.; Lakshmikanth, S.; Schulman, K.; Milstein, A.; Vo, H.; et al. An interactive agent foundation model. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3652–3662.
  80. Zhen, H.; Sun, Q.; Zhang, H.; Li, J.; Zhou, S.; Du, Y.; Gan, C. TesserAct: learning 4D embodied world models. arXiv preprint arXiv:2504.20995 2025.
  81. Duan, Y.; Zou, Z.; Gu, T.; Jia, W.; Zhao, Z.; Xu, L.; Liu, X.; Lin, Y.; Jiang, H.; Chen, K.; et al. LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation. arXiv preprint arXiv:2509.05263 2025.
  82. Guo, J.; Ye, Y.; He, T.; Wu, H.; Jiang, Y.; Pearce, T.; Bian, J. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388 2025.
  83. Li, J.; Tang, J.; Xu, Z.; Wu, L.; Zhou, Y.; Shao, S.; Yu, T.; Cao, Z.; Lu, Q. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition. arXiv preprint arXiv:2506.17201 2025.
  84. Lab, D. Mirage 2 — Generative World Engine. https://www.mirage2.org/, 2025. Browser-based system to generate explorable 3D worlds from images/text.
  85. World Labs. World Labs: spatial intelligence for large world models. https://www.worldlabs.ai/, 2025. Accessed: 2025-06-05.
  86. Yang, Z.; Ge, W.; Li, Y.; Chen, J.; Li, H.; An, M.; Kang, F.; Xue, H.; Xu, B.; Yin, Y.; et al. Matrix-3d: Omnidirectional explorable 3d world generation. arXiv preprint arXiv:2508.08086 2025.
  87. Jang, J.; Ye, S.; Lin, Z.; Xiang, J.; Bjorck, J.; Fang, Y.; Hu, F.; Huang, S.; Kundalia, K.; Lin, Y.C.; et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. In Proceedings of the 9th Annual Conference on Robot Learning, 2025.
  88. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139:1–139:14. [CrossRef]
  89. Zheng, R.; Wang, J.; Reed, S.; Bjorck, J.; Fang, Y.; Hu, F.; Jang, J.; Kundalia, K.; Lin, Z.; Magne, L.; et al. FLARE: Robot Learning with Implicit World Modeling. In Proceedings of the Proceedings of The 9th Conference on Robot Learning; Lim, J.; Song, S.; Park, H.W., Eds. PMLR, 27–30 Sep 2025, Vol. 305, Proceedings of Machine Learning Research, pp. 3952–3971.
  90. Qiu, Y.; Ziser, Y.; Korhonen, A.; Cohen, S.B.; Ponti, E.M. Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models. arXiv preprint arXiv:2506.06006 2025.
  91. Kadian, A.; Truong, J.; Gokaslan, A.; Clegg, A.; Wijmans, E.; Lee, S.; Savva, M.; Chernova, S.; Batra, D. Sim2real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters 2020, 5, 6670–6677. [CrossRef]
  92. Li, X.; Song, R.; Xie, Q.; Wu, Y.; Zeng, N.; Ai, Y. Simworld: A unified benchmark for simulator-conditioned scene generation via world model. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 927–934.
  93. Agarwal, N.; Ali, A.; Bala, M.; Balaji, Y.; Barker, E.; Cai, T.; Chattopadhyay, P.; Chen, Y.; Cui, Y.; Ding, Y.; et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 2025.
  94. Mo, S.; Leng, Z.; Liu, L.; Wang, W.; He, H.; Zhou, B. Dreamland: Controllable World Creation with Simulator and Generative Models. arXiv preprint arXiv:2506.08006 2025.
  95. Feng, Y.; Tan, H.; Mao, X.; Liu, G.; Huang, S.; Xiang, C.; Su, H.; Zhu, J. Vidar: Embodied video diffusion model for generalist bimanual manipulation. arXiv preprint arXiv:2507.12898 2025.
  96. Wang, Y.; Yu, R.; Wan, S.; Gan, L.; Zhan, D.C. Founder: Grounding foundation models in world models for open-ended embodied decision making. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
  97. Zhou, G.; Pan, H.; LeCun, Y.; Pinto, L. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
  98. Wang, X.; Zhu, Z.; Huang, G.; Wang, B.; Chen, X.; Lu, J. Worlddreamer: Towards general world models for video generation via predicting masked tokens. arXiv preprint arXiv:2401.09985 2024.
  99. Schiewer, R.; Subramoney, A.; Wiskott, L. Exploring the limits of hierarchical world models in reinforcement learning. Scientific Reports 2024, 14, 26856. [CrossRef]
  100. Hao, C.; Lu, W.; Xu, Y.; Chen, Y. Neural Motion Simulator Pushing the Limit of World Models in Reinforcement Learning. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27608–27617.
  101. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 2020.
  102. Mazzaglia, P.; Verbelen, T.; Dhoedt, B.; Courville, A.; Rajeswar, S. GenRL: Multimodal-foundation world models for generalization in embodied agents. Advances in neural information processing systems 2024, 37, 27529–27555.
  103. Fang, F.; Liang, W.; Wu, Y.; Xu, Q.; Lim, J.H. Improving generalization of reinforcement learning using a bilinear policy network. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 991–995.
  104. Fang, Q.; Du, W.; Wang, H.; Zhang, J. Towards Unraveling and Improving Generalization in World Models. arXiv preprint arXiv:2501.00195 2024.
  105. Gao, S.; Zhou, S.; Du, Y.; Zhang, J.; Gan, C. AdaWorld: Learning Adaptable World Models with Latent Actions. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
  106. Prasanna, S.; Farid, K.; Rajan, R.; Biedenkapp, A. Dreaming of Many Worlds: Learning Contextual World Models aids Zero-Shot Generalization. In Proceedings of the Seventeenth European Workshop on Reinforcement Learning, 2024.
  107. Baldassarre, F.; Szafraniec, M.; Terver, B.; Khalidov, V.; Massa, F.; LeCun, Y.; Labatut, P.; Seitzer, M.; Bojanowski, P. Back to the features: Dino as a foundation for video world models. arXiv preprint arXiv:2507.19468 2025.
  108. Ali, M.Q.; Sridhar, A.; Matiana, S.; Wong, A.; Al-Sharman, M. Humanoid World Models: Open World Foundation Models for Humanoid Robotics. arXiv preprint arXiv:2506.01182 2025.
  109. Chi, X.; Ge, K.; Liu, J.; Zhou, S.; Jia, P.; He, Z.; Liu, Y.; Li, T.; Han, L.; Han, S.; et al. MinD: Unified Visual Imagination and Control via Hierarchical World Models. arXiv preprint arXiv:2506.18897 2025.
  110. Chun, J.; Jeong, Y.; Kim, T. Sparse Imagination for Efficient Visual World Model Planning. In Proceedings of the The Fourteenth International Conference on Learning Representations, 2026.
  111. Cohen, L.; Wang, K.; Kang, B.; Gadot, U.; Mannor, S. Uncovering Untapped Potential in Sample-Efficient World Model Agents. arXiv preprint arXiv:2502.11537 2025.
  112. Burchi, M.; Timofte, R. Accurate and Efficient World Modeling with Masked Latent Transformers. In Proceedings of the Proceedings of the 42nd International Conference on Machine Learning; Singh, A.; Fazel, M.; Hsu, D.; Lacoste-Julien, S.; Berkenkamp, F.; Maharaj, T.; Wagstaff, K.; Zhu, J., Eds. PMLR, 13–19 Jul 2025, Vol. 267, Proceedings of Machine Learning Research, pp. 5894–5912.
  113. Zhang, H.; Yan, X.; Xue, Y.; Guo, Z.; Cui, S.; Li, Z.; Liu, B. D2-world: An Efficient World Model through Decoupled Dynamic Flow. arXiv preprint arXiv:2411.17027 2024.
  114. Pu, Y.; Niu, Y.; Tang, J.; Xiong, J.; Hu, S.; Li, H. One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning. arXiv preprint arXiv:2509.07945 2025.
  115. Li, S.; Hao, Q.; Shang, Y.; Li, Y. KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models. arXiv preprint arXiv:2509.21027 2025.
  116. Yamada, J.; Rigter, M.; Collins, J.; Posner, I. Twist: Teacher-student world model distillation for efficient sim-to-real transfer. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 9190–9196.
  117. Micheli, V.; Alonso, E.; Fleuret, F. Transformers are Sample-Efficient World Models. In Proceedings of the The Eleventh International Conference on Learning Representations, 2023.
  118. Micheli, V.; Alonso, E.; Fleuret, F. Efficient World Models with Context-Aware Tokenization. In Proceedings of the International Conference on Machine Learning. PMLR, 2024, pp. 35623–35638.
  119. Song, Q.; Wang, X.; Zhou, D.; Lin, J.; Chen, C.; Ma, Y.; Li, X. Hero: Hierarchical extrapolation and refresh for efficient world models. arXiv preprint arXiv:2508.17588 2025.
  120. Jin, B.; Li, W.; Yang, B.; Zhu, Z.; Jiang, J.; Gao, H.a.; Sun, H.; Zhan, K.; Hu, H.; Zhang, X.; et al. PosePilot: Steering camera pose for generative world models with self-supervised depth. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 8051–8058.
  121. Jeong, Y.; Chun, J.; Cha, S.; Kim, T. Object-Centric World Model for Language-Guided Manipulation. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025.
  122. Akbulut, T.; Merlin, M.; Parr, S.; Quartey, B.; Thompson, S. Sample Efficient Robot Learning with Structured World Models. arXiv preprint arXiv:2210.12278 2022.
  123. Van Den Oord, A.; Vinyals, O.; et al. Neural discrete representation learning. Advances in neural information processing systems 2017, 30.
  124. Chen, R.; Ko, Y.; Zhang, Z.; Cho, C.; Chung, S.; Giuffré, M.; Shung, D.L.; Stadie, B.C. LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models. arXiv preprint arXiv:2505.11772 2025.
  125. Zeng, B.; Zhu, K.; Hua, D.; Li, B.; Tong, C.; Wang, Y.; Huang, X.; Dai, Y.; Zhang, Z.; Yang, Y.; et al. Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks. arXiv preprint arXiv:2602.01630 2026.
  126. Xiang, J.; Liu, G.; Gu, Y.; Gao, Q.; Ning, Y.; Zha, Y.; Feng, Z.; Tao, T.; Hao, S.; Shi, Y.; et al. Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455 2024.
  127. Ge, Z.; Huang, H.; Zhou, M.; Li, J.; Wang, G.; Tang, S.; Zhuang, Y. Worldgpt: Empowering llm as multimodal world model. In Proceedings of the Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7346–7355.
  128. Cherian, A.; Corcodel, R.; Jain, S.; Romeres, D. Llmphy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027 2024.
  129. Yang, Z.; Guo, X.; Ding, C.; Wang, C.; Wu, W. Physical informed driving world model. arXiv preprint arXiv:2412.08410 2024.
  130. Jiang, H.; Hsu, H.Y.; Zhang, K.; Yu, H.N.; Wang, S.; Li, Y. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 7219–7230.
  131. Li, J.; Wan, H.; Lin, N.; Zhan, Y.L.; Chengze, R.; Wang, H.; Zhang, Y.; Liu, H.; Wang, Z.; Yu, F.; et al. SlotPi: Physics-informed Object-centric Reasoning Models. In Proceedings of the Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 2025, pp. 1376–1387.
  132. Petri, F.; Asprino, L.; Gangemi, A. Learning Local Causal World Models with State Space Models and Attention. arXiv preprint arXiv:2505.02074 2025.
  133. Yan, Z.; Dong, W.; Shao, Y.; Lu, Y.; Liu, H.; Liu, J.; Wang, H.; Wang, Z.; Wang, Y.; Remondino, F.; et al. Renderworld: World model with self-supervised 3d label. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6063–6070.
  134. Zhou, X.; Liang, D.; Tu, S.; Chen, X.; Ding, Y.; Zhang, D.; Tan, F.; Zhao, H.; Bai, X. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27817–27827.
  135. Wu, J.; Ma, H.; Deng, C.; Long, M. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. Advances in Neural Information Processing Systems 2023, 36, 39719–39743.
  136. Wang, Q.; Zhang, Z.; Xie, B.; Jin, X.; Wang, Y.; Wang, S.; Zheng, L.; Yang, X.; Zeng, W. Disentangled world models: Learning to transfer semantic knowledge from distracting videos for reinforcement learning. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2599–2608.
  137. Zhang, W.; Jelley, A.; McInroe, T.; Storkey, A. Objects matter: object-centric world models improve reinforcement learning in visually complex environments. arXiv preprint arXiv:2501.16443 2025.
  138. Wang, Y.; Wan, S.; Gan, L.; Feng, S.; Zhan, D.C. AD3: implicit action is the key for world models to distinguish the diverse visual distractors. In Proceedings of the Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 51546–51568.
  139. Wang, X.; Wu, Z.; Peng, P. LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model. arXiv preprint arXiv:2506.01546 2025.
  140. Jiang, J.; Janghorbani, S.; De Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the International Conference on Learning Representations, 2020.
  141. Zhu, H.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Chen, J.; Shen, C.; Pang, J.; He, T. Aether: Geometric-aware unified world modeling. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8535–8546.
  142. Lu, G.; Jia, B.; Li, P.; Chen, Y.; Wang, Z.; Tang, Y.; Huang, S. Gwm: Towards scalable gaussian world models for robotic manipulation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9263–9274.
  143. Mao, Z.; Ruchkin, I. Towards Physically Interpretable World Models: Meaningful Weakly Supervised Representations for Visual Trajectory Prediction. arXiv preprint arXiv:2412.12870 2024.
  144. Ross, E.; Drygala, C.; Schwarz, L.; Kaiser, S.; di Mare, F.; Breiten, T.; Gottschalk, H. When do World Models Successfully Learn Dynamical Systems? arXiv preprint arXiv:2507.04898 2025.
  145. Zhou, S.; Zhou, T.; Yang, Y.; Long, G.; Ye, D.; Jiang, J.; Zhang, C. WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
  146. Liu, X.; Tang, H. FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution. arXiv preprint arXiv:2506.03173 2025.
  147. Wang, D.; Sun, Z.; Li, Z.; Wang, C.; Peng, Y.; Ye, H.; Zarrouki, B.; Li, W.; Piccinini, M.; Xie, L.; et al. Enhancing Physical Consistency in Lightweight World Models. arXiv preprint arXiv:2509.12437 2025.
  148. Chen, D.; Moutakanni, T.; Chung, W.; Bang, Y.; Ji, Z.; Bolourchi, A.; Fung, P. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722 2025.
  149. Huh, M.; Cheung, B.; Wang, T.; Isola, P. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987 2024.
  150. Zhao, Y.; Scannell, A.; Hou, Y.; Cui, T.; Chen, L.; Büchler, D.; Solin, A.; Kannala, J.; Pajarinen, J. Generalist World Model Pre-Training for Efficient Reinforcement Learning. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025.
  151. 1X Technologies. 1X World Model. https://www.1x.tech/discover/1x-world-model, 2024. Accessed: 2025-11-14.
  152. Chen, C.; Wu, Y.F.; Yoon, J.; Ahn, S. Transdreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481 2022.
  153. Wu, H.; Guo, M.; Li, Z.; Dou, Z.; Long, M.; He, K.; Matusik, W. GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training. arXiv preprint arXiv:2602.20399 2026.
  154. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First conference on language modeling, 2024.
  155. Ye, S.; Ge, Y.; Zheng, K.; Gao, S.; Yu, S.; Kurian, G.; Indupuru, S.; Tan, Y.L.; Zhu, C.; Xiang, J.; et al. World Action Models are Zero-shot Policies. arXiv preprint arXiv:2602.15922 2026.
  156. Gao, S.; Liang, W.; Zheng, K.; Malik, A.; Ye, S.; Yu, S.; Tseng, W.C.; Dong, Y.; Mo, K.; Lin, C.H.; et al. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. arXiv preprint arXiv:2602.06949 2026.
  157. Kotar, K.; Lee, W.; Venkatesh, R.; Chen, H.; Bear, D.; Watrous, J.; Kim, S.; Aw, K.L.; Chen, L.N.; Stojanov, S.; et al. World modeling with probabilistic structure integration. arXiv preprint arXiv:2509.09737 2025.
  158. Saxena, D.; Cao, J. Generative adversarial networks (GANs) challenges, solutions, and future directions. ACM Computing Surveys (CSUR) 2021, 54, 1–42.
  159. Vuong, Q.; Levine, S.; Walke, H.R.; Pertsch, K.; Singh, A.; Doshi, R.; Xu, C.; Luo, J.; Tan, L.; Shah, D.; et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Proceedings of the Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 2023.
  160. Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. Robotics: Science and Systems XIX 2023.
  161. Liu, B.; Zhu, Y.; Gao, C.; Feng, Y.; Liu, Q.; Zhu, Y.; Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 2023, 36, 44776–44791.
  162. Khazatsky, A.; Pertsch, K.; Nair, S.; Balakrishna, A.; Dasari, S.; Karamcheti, S.; Nasiriany, S.; Srirama, M.K.; Chen, L.Y.; Ellis, K.; et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Proceedings of the Robotics: Science and Systems, 2024.
  163. Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. The International Journal of Robotics Research 2024. [CrossRef]
  164. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540 2016.
  165. James, S.; Ma, Z.; Arrojo, D.R.; Davison, A.J. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 2020, 5, 3019–3026. [CrossRef]
  166. Rohmer, E.; Singh, S.P.; Freese, M. V-REP: A versatile and scalable robot simulation framework. 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems 2013, pp. 1321–1326.
  167. Walke, H.; Black, K.; Lee, A.; Kim, M.J.; Du, M.; Zheng, C.; Zhao, T.; Hansen-Estruch, P.; Vuong, Q.; He, A.; et al. BridgeData V2: A Dataset for Robot Learning at Scale. In Proceedings of the Conference on Robot Learning (CoRL), 2023.
  168. Tian, S.; Finn, C.; Wu, J. A Control-Centric Benchmark for Video Prediction. In Proceedings of the The Eleventh International Conference on Learning Representations, 2024.
  169. Nasiriany, S.; Maddukuri, A.; Zhang, L.; Parikh, A.; Lo, A.; Joshi, A.; Mandlekar, A.; Zhu, Y. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. In Proceedings of the Robotics: Science and Systems, 2024.
  170. Bao, C.; Xu, H.; Qin, Y.; Wang, X. DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21190–21200.
  171. Mitash, C.; Wang, F.; Lu, S.; Terhuja, V.; Garaas, T.; Polido, F.; Nambi, M. ARMBench: An Object-centric Benchmark Dataset for Robotic Manipulation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9132–9139.
  172. Tunyasuvunakool, S.; Muldal, A.; Doron, Y.; Liu, S.; Bohez, S.; Merel, J.; Erez, T.; Lillicrap, T.; Heess, N.; Tassa, Y. dm_control: Software and tasks for continuous control. Software Impacts 2020, 6, 100022. [CrossRef]
  173. Gupta, A.; Kumar, V.; Lynch, C.; Levine, S.; Hausman, K. Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning. In Proceedings of the Conference on Robot Learning. PMLR, 2020, pp. 1025–1037.
  174. McLean, R.; Chatzaroulas, E.; McCutcheon, L.; Röder, F.; Yu, T.; He, Z.; Zentner, K.; Julian, R.; Terry, J.K.; Woungang, I.; et al. Meta-World+: An Improved, Standardized, RL Benchmark. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
  175. Wang, X.; Lian, L.; Yu, S.X. Unsupervised visual attention and invariance for reinforcement learning. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 6677–6687.
  176. Henderson, P.; Chang, W.D.; Shkurti, F.; Hansen, J.; Meger, D.; Dudek, G. Benchmark environments for multitask learning in continuous domains. arXiv preprint arXiv:1708.04352 2017.
  177. Gu, J.; Xiang, F.; Li, X.; Ling, Z.; Liu, X.; Mu, T.; Tang, Y.; Tao, S.; Wei, X.; Yao, Y.; et al. ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills. In Proceedings of the International Conference on Learning Representations, 2023.
  178. Caggiano, V.; Wang, H.; Durandau, G.; Sartori, M.; Kumar, V. MyoSuite – A contact-rich simulation suite for musculoskeletal motor control. https://github.com/myohub/myosuite, 2022.
  179. AgiBot World Colosseum Contributors. AgiBot World Colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024.
  180. Mandlekar, A.; Xu, D.; Wong, J.; Nasiriany, S.; Wang, C.; Kulkarni, R.; Fei-Fei, L.; Savarese, S.; Zhu, Y.; Martín-Martín, R. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2021.
  181. Dasari, S.; Ebert, F.; Tian, S.; Nair, S.; Bucher, B.; Schmeckpeper, K.; Singh, S.; Levine, S.; Finn, C. RoboNet: Large-Scale Multi-Robot Learning. In Proceedings of the CoRL 2019: Volume 100 Proceedings of Machine Learning Research, 2019, [arXiv:cs.RO/1910.11215].
  182. Fang, H.S.; Fang, H.; Tang, Z.; Liu, J.; Wang, J.; Zhu, H.; Lu, C. RH20T: A Robotic Dataset for Learning Diverse Skills in One-Shot. In Proceedings of the RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.
  183. Heo, M.; Lee, Y.; Lee, D.; Lim, J.J. FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation. In Proceedings of the Robotics: Science and Systems, 2023. [CrossRef]
  184. Geng, H.; Wang, F.; Wei, S.; Li, Y.; Wang, B.; An, B.; Cheng, C.T.; Lou, H.; Li, P.; Wang, Y.J.; et al. RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning, 2025, [arXiv:cs.RO/2504.18904].
  185. Yan, F.; Liu, F.; Huang, Y.; Guan, Z.; Zheng, L.; Zhong, Y.; Feng, C.; Ma, L. RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 13707–13718.
  186. Bharadhwaj, H.; Vakil, J.; Sharma, M.; Gupta, A.; Tulsiani, S.; Kumar, V. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking, 2023, [arXiv:cs.RO/2309.01918].
  187. Yang, R.; Chen, H.; Zhang, J.; Zhao, M.; Qian, C.; Wang, K.; Wang, Q.; Koripella, T.V.; Movahedi, M.; Li, M.; et al. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
  188. Li, C.; Zhang, R.; Wong, J.; Gokmen, C.; Srivastava, S.; Martín-Martín, R.; Wang, C.; Levine, G.; Lingelbach, M.; Sun, J.; et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Proceedings of the Conference on Robot Learning. PMLR, 2023, pp. 80–93.
  189. Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; Levine, S. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, 2020, [arXiv:cs.LG/2004.07219].
  190. Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A platform for embodied ai research. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9339–9347.
  191. Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niebner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the 2017 International Conference on 3D Vision (3DV). IEEE Computer Society, 2017, pp. 667–676. [CrossRef]
  192. Yadav, K.; Ramrakhya, R.; Ramakrishnan, S.K.; Gervet, T.; Turner, J.; Gokaslan, A.; Maestre, N.; Chang, A.X.; Batra, D.; Savva, M.; et al. Habitat-matterport 3d semantics dataset. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4927–4936.
  193. Xia, F.; Zamir, A.R.; He, Z.; Sax, A.; Malik, J.; Savarese, S. Gibson env: Real-world perception for embodied agents. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9068–9079.
  194. Martin-Martin, R.; Patel, M.; Rezatofighi, H.; Shenoi, A.; Gwak, J.; Frankel, E.; Sadeghian, A.; Savarese, S. Jrdb: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE transactions on pattern analysis and machine intelligence 2021, 45, 6748–6765. [CrossRef]
  195. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [CrossRef]
  196. Shang, Y.; Li, Z.; Ma, Y.; Su, W.; Jin, X.; Wang, Z.; Jin, L.; Zhang, X.; Tang, Y.; Su, H.; et al. WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models. arXiv preprint arXiv:2602.08971 2026.
  197. Chen, T.; Chen, Z.; Chen, B.; Cai, Z.; Liu, Y.; Li, Z.; Liang, Q.; Lin, X.; Ge, Y.; Gu, Z.; et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 2025.
  198. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454.
  199. Yang, J.; Gao, S.; Qiu, Y.; Chen, L.; Li, T.; Dai, B.; Chitta, K.; Wu, P.; Zeng, J.; Luo, P.; et al. Generalized Predictive Model for Autonomous Driving. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  200. Zhou, T.; Tucker, R.; Flynn, J.; Fyffe, G.; Snavely, N. Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG) 2018, 37, 1–12.
  201. Wang, Y.; Cheng, K.; He, J.; Wang, Q.; Dai, H.; Chen, Y.; Xia, F.; Zhang, Z. Drivingdojo dataset: Advancing interactive and knowledge-enriched driving world model. Advances in Neural Information Processing Systems 2024, 37, 13020–13034.
  202. Min, C.; Zhao, D.; Xiao, L.; Zhao, J.; Xu, X.; Zhu, Z.; Jin, L.; Li, J.; Guo, Y.; Xing, J.; et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 15522–15533.
  203. Beattie, C.; Leibo, J.Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V.; Sadik, A.; et al. Deepmind lab. arXiv preprint arXiv:1612.03801 2016.
  204. Cobbe, K.; Hesse, C.; Hilton, J.; Schulman, J. Leveraging Procedural Generation to Benchmark Reinforcement Learning. arXiv preprint arXiv:1912.01588 2019.
  205. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 2013.
  206. Osband, I.; Doron, Y.; Hessel, M.; Aslanides, J.; Sezener, E.; Saraiva, A.; McKinney, K.; Lattimore, T.; Szepesvári, C.; Singh, S.; et al. Behaviour Suite for Reinforcement Learning. In Proceedings of the International Conference on Learning Representations, 2020.
  207. Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, N.Q.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers), 2016, pp. 1525–1534.
  208. Zhang, J.; Jiang, M.; Dai, N.; Lu, T.; Uzunoglu, A.; Zhang, S.; Wei, Y.; Wang, J.; Patel, V.M.; Liang, P.P.; et al. World-in-World: World Models in a Closed-Loop World. arXiv preprint arXiv:2510.18135 2025.
  209. Bordes, F.; Garrido, Q.; Kao, J.T.; Williams, A.; Rabbat, M.; Dupoux, E. IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments. arXiv preprint arXiv:2506.09849 2025.
  210. Weihs, L.; Yuile, A.R.; Baillargeon, R.; Fisher, C.; Marcus, G.; Mottaghi, R.; Kembhavi, A. Benchmarking Progress to Infant-Level Physical Reasoning in AI. TMLR 2022.
  211. Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the Proceedings of the IEEE international conference on computer vision, 2017, pp. 5842–5850.
  212. Damen, D.; Doughty, H.; Farinella, G.M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020, 43, 4125–4141. [CrossRef]
  213. NVIDIA PhysX SDK. https://developer.nvidia.com/physx-sdk, 2025. Accessed: 2025-11-15.
  214. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An open urban driving simulator. In Proceedings of the Conference on robot learning. PMLR, 2017, pp. 1–16.
  215. Greff, K.; Belletti, F.; Beyer, L.; Doersch, C.; Du, Y.; Duckworth, D.; Fleet, D.J.; Gnanapragasam, D.; Golemo, F.; Herrmann, C.; et al. Kubric: A scalable dataset generator, 2022.
  216. PyBullet Physics Simulation. https://pybullet.org, 2025. Accessed: 2025-11-15.
  217. Blender – a 3D modelling and rendering package. https://www.blender.org, 2025. Accessed: 2025-11-15.
  218. Zhou, H.; Ma, Y.; Wu, H.; Wang, H.; Long, M. Unisolver: PDE-Conditional Transformers Towards Universal Neural PDE Solvers. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
  219. Pătrăucean, V.; Smaira, L.; Gupta, A.; Continente, A.R.; Markeeva, L.; Banarse, D.; Koppula, S.; Heyward, J.; Malinowski, M.; Yang, Y.; et al. Perception Test: A Diagnostic Benchmark for Multimodal Video Models. In Proceedings of the Advances in Neural Information Processing Systems, 2023.
  220. Bear, D.; Wang, E.; Mrowca, D.; Binder, F.J.; Tung, H.Y.; Pramod, R.; Holdaway, C.; Tao, S.; Smith, K.A.; Sun, F.Y.; et al. Physion: Evaluating Physical Prediction from Vision in Humans and Machines. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  221. Qiu, S.; Guo, S.; Song, Z.Y.; Sun, Y.; Cai, Z.; Wei, J.; Luo, T.; Yin, Y.; Zhang, H.; Hu, Y.; et al. PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models, 2025, [arXiv:cs.CL/2504.16074].
  222. Chow, W.; Mao, J.; Li, B.; Seita, D.; Guizilini, V.C.; Wang, Y. PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding. In Proceedings of the The Thirteenth International Conference on Learning Representations, 2025.
  223. Gamerdinger, J.; Teufel, S.; Schulz, P.; Amann, S.; Kirchner, J.P.; Bringmann, O. Scope: A synthetic multi-modal dataset for collective perception including physical-correct weather conditions. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2024, pp. 2622–2628.
  224. Xing, E.; Deng, M.; Hou, J.; Hu, Z. Critiques of world models. arXiv preprint arXiv:2507.05169 2025.
  225. Miao, J.; Wei, Y.; Wu, Y.; Liang, C.; Li, G.; Yang, Y. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  226. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223. [CrossRef]
  227. Liu, H.; He, J.; Jin, Y.; Zheng, D.; Dong, Y.; Zhang, F.; Huang, Z.; He, Y.; Li, Y.; Chen, W.; et al. ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models, 2025, [arXiv:cs.CV/2506.21356].
  228. Xue, W.; Qian, C.; Wu, J.; Zhou, Y.; Liu, W.; Ren, J.; Fan, S.; Zhang, Y. ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 9050–9058. [CrossRef]
  229. Zhao, E.; Raval, V.; Zhang, H.; Mao, J.; Shangguan, Z.; Nikolaidis, S.; Wang, Y.; Seita, D. ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation. In Proceedings of the Proceedings of The 9th Conference on Robot Learning; Lim, J.; Song, S.; Park, H.W., Eds. PMLR, 27–30 Sep 2025, Vol. 305, Proceedings of Machine Learning Research, pp. 3413–3462.
  230. Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.; Gould, S.; van den Hengel, A. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  231. Ray, A.; Duan, J.; Brown, E.; Tan, R.; Bashkirova, D.; Hendrix, R.; Ehsani, K.; Kembhavi, A.; Plummer, B.A.; Krishna, R.; et al. SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models, 2025, [arXiv:cs.CV/2412.07755].
  232. Kamath, A.; Hessel, J.; Chang, K.W. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9161–9175.
  233. Zhao, Z.; Fu, H.; Liang, D.; Zhou, X.; Zhang, D.; Xie, H.; Wang, B.; Bai, X. Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving. arXiv preprint arXiv:2505.08725 2025.
  234. Wang, S.; Yu, Z.; Jiang, X.; Lan, S.; Shi, M.; Chang, N.; Kautz, J.; Li, Y.; Alvarez, J.M. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Proceedings of the computer vision and pattern recognition conference, 2025, pp. 22442–22452.
  235. Zhou, Y.; Wang, Y.; Zhou, J.; Chang, W.; Guo, H.; Li, Z.; Ma, K.; Li, X.; Wang, Y.; Zhu, H.; et al. OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling, 2025, [arXiv:cs.CV/2509.12201].
  236. Chen, D.; Chung, W.; Bang, Y.; Ji, Z.; Fung, P. WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning. In Proceedings of the ICML 2025 Workshop on Assessing World Models, 2025.
  237. Gu, J.; Liu, X.; Zeng, Y.; Nagarajan, A.; Zhu, F.; Hong, D.; Fan, Y.; Yan, Q.; Zhou, K.; Liu, M.Y.; et al. "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models. arXiv preprint arXiv:2507.13428 2025.
  238. Li, D.; Fang, Y.; Chen, Y.; Yang, S.; Cao, S.; Wong, J.; Luo, M.; Wang, X.; Yin, H.; Gonzalez, J.E.; et al. WorldModelBench: Judging Video Generation Models As World Models. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
  239. Li, Z.; Li, C.; Mao, X.; Lin, S.; Li, M.; Zhao, S.; Li, X.; Feng, Y.; Sun, J.; Li, Z.; et al. Sekai: A Video Dataset towards World Exploration. In Proceedings of the The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
  240. Azzolini, A.; Bai, J.; Brandon, H.; Cao, J.; Chattopadhyay, P.; Chen, H.; Chu, J.; Cui, Y.; Diamond, J.; Ding, Y.; et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558 2025.
  241. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 2004, 13, 600–612. [CrossRef]
  242. Nacken, P.F. Chamfer metrics, the medial axis and mathematical morphology. Journal of Mathematical Imaging and Vision 1996, 6, 235–248. [CrossRef]
  243. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 2017, 30.
  244. Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 2018.
  245. Stein, G.; Cresswell, J.; Hosseinzadeh, R.; Sui, Y.; Ross, B.; Villecroze, V.; Liu, Z.; Caterini, A.L.; Taylor, E.; Loaiza-Ganem, G. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. Advances in Neural Information Processing Systems 2023, 36, 3732–3784.
  246. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  247. Fu, S.; Tamir, N.Y.; Sundaram, S.; Chai, L.; Zhang, R.; Dekel, T.; Isola, P. DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  248. Liu, J.; Qu, Y.; Yan, Q.; Zeng, X.; Wang, L.; Liao, R. Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos. In Proceedings of the First Workshop on Controllable Video Generation@ ICML24, 2024.
  249. Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the Proceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 7514–7528.
  250. Hentschel, S.; Kobs, K.; Hotho, A. CLIP knows image aesthetics. Frontiers in Artificial Intelligence 2022, 5, 976235. [CrossRef]
  251. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  252. Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21807–21818.
  253. Duan, H.; Yu, H.X.; Chen, S.; Fei-Fei, L.; Wu, J. Worldscore: A unified evaluation benchmark for world generation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27713–27724.
  254. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  255. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
  256. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text summarization branches out, 2004, pp. 74–81.
  257. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
  258. Stolcke, A.; Yoshioka, T. DOVER: A method for combining diarization outputs. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 757–763.
  259. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, 1998.
  260. Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 30, 81–93.
  261. Elo, A.E. The Rating of Chessplayers, Past and Present; Arco Publishing: New York, 1978.
  262. Badia, A.P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z.D.; Blundell, C. Agent57: Outperforming the atari human benchmark. In Proceedings of the International conference on machine learning. PMLR, 2020, pp. 507–517.
  263. Christen, P.; Hand, D.J.; Kirielle, N. A review of the F-measure: its history, properties, criticism, and alternatives. ACM Computing Surveys 2023, 56, 1–24. [CrossRef]
  264. Gretton, A.; Bousquet, O.; Smola, A.; Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the International conference on algorithmic learning theory. Springer, 2005, pp. 63–77.
  265. Botchkarev, A. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. arXiv preprint arXiv:1809.03006 2018.
  266. Stuart Jr, H.W. Value gaps and profitability. Strategy Science 2016, 1, 56–70.
  267. Shi, Z.; Liu, M.; Zhang, S.; Zheng, R.; Dong, S.; Wei, P. GAWM: Global-Aware World Model for Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2501.10116 2025.
  268. Lambert, N.; Pister, K.; Calandra, R. Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637 2022.
  269. Xu, Y.; Parker-Holder, J.; Pacchiano, A.; Ball, P.; Rybkin, O.; Roberts, S.; Rocktäschel, T.; Grefenstette, E. Learning general world models in a handful of reward-free deployments. Advances in Neural Information Processing Systems 2022, 35, 26820–26838.
  270. Qin, C.; Klabjan, D.; Russo, D. Improving the expected improvement algorithm. Advances in Neural Information Processing Systems 2017, 30.
  271. Prakash, A.; Tu, R.; Chang, M.; Gupta, S. 3d hand pose estimation in everyday egocentric images. In Proceedings of the European Conference on Computer Vision. Springer, 2024, pp. 183–202.
  272. Bento, J.; Zhu, J.J. A metric for sets of trajectories that is practical and mathematically consistent. arXiv preprint arXiv:1601.03094 2016.
  273. Sturm, J.; Burgard, W.; Cremers, D. Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. In Proceedings of the Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS), 2012, Vol. 13, p. 6.
  274. Mohamed, A.; Zhu, D.; Vu, W.; Elhoseiny, M.; Claudel, C. Social-implicit: Rethinking trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation. In Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 463–479.
  275. Perille, D.; Truong, A.; Xiao, X.; Stone, P. Benchmarking metric ground navigation. In Proceedings of the 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2020, pp. 116–121.
  276. Georgiou, T.T.; Smith, M.C. Optimal robustness in the gap metric. In Proceedings of the Proceedings of the 28th IEEE Conference on Decision and Control. IEEE, 1989, pp. 2331–2336.
  277. Ward, J.R.; Agamennoni, G.; Worrall, S.; Bender, A.; Nebot, E. Extending time to collision for probabilistic reasoning in general traffic scenarios. Transportation Research Part C: Emerging Technologies 2015, 51, 66–82.
  278. Senin, P. Dynamic time warping algorithm review. Information and Computer Science Department University of Hawaii at Manoa Honolulu, USA 2008, 855, 40.
  279. Zhang, Y.; Mehta, S.; Caspi, A. Rethinking semantic segmentation evaluation for explainability and model selection. arXiv preprint arXiv:2101.08418 2021.
  280. Steinley, D.; Brusco, M.J.; Hubert, L. The variance of the adjusted Rand index. Psychological methods 2016, 21, 261. [CrossRef]
  281. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the European conference on computer vision. Springer, 2016, pp. 382–398.
  282. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 2014, 27.
  283. Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 2018.
  284. Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning 2006, 65, 31–78. [CrossRef]
  285. Belongie, S.; Malik, J.; Puzicha, J. Shape matching and object recognition using shape contexts. IEEE transactions on pattern analysis and machine intelligence 2002, 24, 509–522. [CrossRef]
  286. Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems 2013, 26.
  287. Wang, M.; Jin, W.; Cao, K.; Xie, L.; Hong, Y. ContactGaussian-WM: Learning Physics-Grounded World Model from Videos. arXiv preprint arXiv:2602.11021 2026.
  288. Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A. Isaac Gym: High Performance GPU Based Physics Simulation For Robot Learning. arXiv preprint arXiv:2110.13563 2021. NeurIPS 2021 Track: Datasets and Benchmarks.
  289. Bengio, Y.; Clare, S.; Prunkl, C.; Andriushchenko, M.; Bucknall, B.; Murray, M.; Bommasani, R.; Casper, S.; Davidson, T.; Douglas, R.; et al. International AI safety report 2026. arXiv preprint arXiv:2602.21012 2026.
Figure 3. The data engine pipeline of EnerVerse [25] operates as follows: multi-camera observation images and anchor views are processed by a multi-view video generator to produce denoised multi-view videos. Combined with camera pose inputs, these videos are fed into 4DGS for four-dimensional scene reconstruction. The reconstructed results are then rendered into high-fidelity anchor images and iteratively refined through feedback, improving motion consistency and reconstruction accuracy, ultimately achieving geometrically consistent and high-definition outputs.
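To make the feedback loop described in Figure 3 concrete, the following minimal Python sketch outlines one plausible reading of such a data engine. The component interfaces (generator.generate, reconstructor.fit, scene_4d.render) and the fixed number of refinement rounds are hypothetical assumptions introduced purely for illustration; they do not reflect EnerVerse's actual implementation or API.

def run_data_engine(observations, anchor_views, camera_poses,
                    generator, reconstructor, num_rounds=3):
    # Hypothetical sketch: alternate multi-view video generation and 4DGS
    # reconstruction, feeding re-rendered anchor views back as conditioning.
    anchors = anchor_views
    scene_4d = None
    for _ in range(num_rounds):
        # 1) Generate denoised multi-view videos conditioned on observations and anchors.
        videos = generator.generate(observations, anchors)
        # 2) Reconstruct a four-dimensional scene from the videos and known camera poses.
        scene_4d = reconstructor.fit(videos, camera_poses)
        # 3) Re-render high-fidelity anchor images from the reconstruction and use
        #    them as conditioning for the next round.
        anchors = scene_4d.render(camera_poses)
    return scene_4d, anchors

In this reading, each round tightens the agreement between the generated videos and the reconstructed geometry, which is how the iterative feedback in the caption would improve motion consistency and reconstruction accuracy.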
Figure 4. The unified world model framework advocated by Yue et al. [125], featuring interaction, reasoning, memory, and multimodal generation.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.