Predicting the World via Video Representation: A Comprehensive Survey on Video World Models

Jiaxin Yan; Chaoning Zhang; Xudong Wang; Pengcheng Zheng; Ya Wen; Qigan Sun; Jiaxin Huang; Shuxu Chen; Yang Yang; Hyundong Shin

doi:10.20944/preprints202605.0435.v1

Submitted: 06 May 2026

Posted: 07 May 2026

Abstract

Video world models have emerged as a critical framework for modeling dynamic environments through the lens of video data, and as a key tool for understanding and predicting complex systems. While prior papers have focused on specific domains such as 3D modeling, autonomous driving, and robotics, they have largely overlooked the growing importance of the video modality in the development of future world models. These papers often concentrate on particular data representations and fail to account for how video-based representations can bridge the gap between perception, prediction, and decision-making in intelligent systems. This paper aims to fill this gap by providing a standardized and systematic classification of video world models. We introduce a comprehensive taxonomy that distinguishes between implicit state deduction, which focuses on learning compact latent representations, and explicit visual modeling, which emphasizes frame-level video processing. Additionally, we provide an in-depth review of experimental setups, specific applications, and open problems. By focusing on video world models, this paper offers a unified reference that highlights their critical role in the future of world modeling research.

Keywords: video world models; reinforcement learning; self-supervised learning; imitation learning; diffusion models

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
