Preprint
Article

This version is not peer-reviewed.

Learning the Physical World from Videos: A Prospective Study on World Models

Submitted: 14 April 2026
Posted: 16 April 2026


Abstract
World models aim to enable agents to perceive states, predict future outcomes, and reason for decision-making by simulating real-world environments, and are widely regarded as a crucial pathway toward artificial general intelligence (AGI). Video, as one of the most accessible and intuitively representative media of dynamic environments, naturally contains rich implicit representations of the physical world. Consequently, learning world models from videos has become a prominent research direction. However, a significant gap remains between video data and the real physical world: videos capture only superficial visual phenomena and lack explicit representations of three-dimensional structure, physical properties, and causal mechanisms. This limitation severely constrains the physical consistency and practical applicability of world models. Motivated by this, the present work provides a prospective study of recent research in this domain, encompassing: (1) key challenges arising from the video–physical world gap and representative solutions; (2) three major construction paradigms of physical world models; and (3) future research directions and discussions. Notably, this study is the first to systematically examine video-driven world model research from the perspective of the physical world. In contrast to prior surveys that primarily focus on generative modeling or provide broad overviews, this work emphasizes world models with tangible physical grounding, explicitly excluding generative tasks such as video synthesis or 3D/4D modeling that diverge conceptually from the goal of modeling the physical world. Adopting a problem-oriented perspective, this study aims to provide subsequent researchers with a systematic framework and decision-making guidance for understanding existing work, designing innovative methods, and facilitating the deployment of world models in real-world applications.

1. Introduction

World models refer to computational frameworks in which agents understand and predict external dynamics through internal environment simulations, thereby supporting perception, reasoning, and decision-making [1]. As a key pathway toward artificial general intelligence (AGI), world models aim to capture the physical laws, social interaction mechanisms, and environmental uncertainties inherent in the real world, enabling agents to perform efficient planning and adaptive behaviors in complex and dynamic scenarios. In practical applications, world models demonstrate broad potential: in robotics [2], they facilitate agents’ understanding and simulation of interactions with the environment; in autonomous driving [3,4,5], they enable the prediction of future traffic scenarios and potential risks; and in game AI [6,7], they support long-horizon strategy reasoning and multi-step planning.
Figure 1. This paper reviews five key challenges and three promising paradigms for learning physical world models.
Over the past decade, with the rapid advancement of computational power, data scale, and model performance, research on world models has achieved significant progress. In 2018, Ha and Schmidhuber [1] proposed a world model framework that integrates a Variational Autoencoder (VAE) with a Recurrent Neural Network (RNN) [8], marking a paradigm shift from traditional symbolic representations to data-driven approaches. Subsequently, the Dreamer series of methods [9,10,11] and Joint Embedding Predictive Architectures (JEPA) [12] introduced latent state modeling and predictive learning, demonstrating the feasibility and advantages of latent dynamics for planning and decision-making tasks. Benefiting from the rise of large-scale models, world models have entered a phase of rapid iteration since 2024, accompanied by a surge in related research publications. Representative systems such as Sora2 [13] and Genie3 [14] have significantly advanced the practical realization of learning world models directly from video data, further highlighting the potential of this research direction.
Video, as the most intuitive and readily accessible medium for capturing the dynamics of the physical world, has increasingly emerged as a natural substrate for constructing world models. It inherently encodes temporal continuity, motion patterns, and rich contextual information, thereby providing an indispensable foundation for learning structured representations of the environment. However, video is fundamentally a 2D projection of reality—it captures only superficial visual phenomena and lacks explicit representations of three-dimensional structure, physical properties (e.g., mass, friction, elasticity), contact interactions, and underlying causal mechanisms. This intrinsic limitation causes video-driven world models to remain largely confined to the level of visual pattern fitting, giving rise to the so-called “pixel-to-physics” gap [15]. Nevertheless, compared to real-world data acquisition, video data is safer, more cost-effective, and can support virtually unlimited experimentation within simulated environments. Bridging this gap is therefore a critical research direction toward developing physically grounded and practically deployable world models.
Existing surveys have provided valuable insights into the field of world models, yet they are often limited to specific application domains, generative tasks, or high-level framework overviews, and generally lack a systematic analysis from the perspective of physical consistency [16]. For instance, the surveys by Tu et al. [3], Feng et al. [4], and Guan et al. [5] on autonomous driving world models emphasize applications in perception, prediction, and planning, but treat video merely as an auxiliary input and do not specifically address biases arising from missing causal information. Ding et al. [17] and Zhu et al. [18] explore, at a broader level, the potential of extending generative models toward generalized world models, yet their focus lies primarily on delineating the boundaries of generation and simulation, overlooking in-depth analyses of physical modeling mechanisms. Some studies concentrate on physics simulation in embodied intelligence [19,20], but these are constrained to specific simulated environments or emphasize functional outcomes rather than the underlying gaps. With the rapid evolution of world model research, there is an urgent need for a survey that systematically reviews video-driven world model paradigms and key challenges, with physical consistency as the guiding principle.
Motivated by the above, this work presents a systematic survey of video-driven physical world models from a problem-oriented perspective, explicitly excluding directions that deviate from the goal of modeling the physical world (e.g., pure video generation or 3D/4D reconstruction [21]). Specifically, our discussion is organized along three key dimensions:
  • Pixel–Physics Challenges: We distill five core challenges—continuity, controllability, generalization, lightweight, and universality—and systematically summarize the sub-problems and representative solutions for each challenge through the lens of physical consistency.
  • Three Paradigms of Physical World Models: Existing approaches toward physically grounded modeling can be broadly categorized into three classes: prior injection, dynamic–static decoupling, and hierarchical abstraction.
  • Future Directions: We identify five major open problems for future research and provide a systematic discussion on industrial deployment and current safety issues.
Through this systematic organization, the survey aims to illuminate the key challenges and opportunities in bridging the gap from pixels to physics, providing a clear research framework and guidance for developing the next generation of world models with true physical understanding.

2. Challenges of Learning from Video

Videos naturally encode spatiotemporal continuity, object interactions, event evolution, and latent causal chains, making them highly valuable for capturing physical and dynamic processes. Large-scale video generation models even exhibit an implicit adherence to real-world constraints to some extent. However, as 2D projections, videos discard critical 3D geometry, object properties, and causal mechanisms, causing world models trained on them to overfit superficial visual patterns rather than underlying physical principles. This, in turn, gives rise to a series of core challenges in continuity, controllability, generalization, physical grounding, efficiency, and universality, as illustrated in Figure 2 and Table 1.

2.1. Physical Continuity

2.1.1. Temporal Continuity

The gap in temporal continuity arises because videos capture only discrete snapshots and lack explicit causal signals, causing autoregressive world models to accumulate prediction errors and break continuous state evolution. Short temporal horizons and stepwise error compounding further prevent generating long-range physically consistent sequences [22]. Existing solutions can be summarized as follows:
Autoregressive improvements. Conditioning on preceding frames (single or multiple) provides a simple yet efficient means of ensuring continuity and has been widely adopted in recent works [23,24,25]. Several improvements have been proposed: EnerVerse [26] introduces a sparse keyframe memory mechanism that predicts future video blocks based on block-level context, effectively avoiding the redundant storage and error propagation associated with continuous memory while preserving long-term dependencies. EVA [27], on the other hand, incorporates a reflection-based conditioning mechanism into a block-level autoregressive framework, enabling the model to adaptively extend video length. Yume [28] combines block-level autoregression with FramePack-based historical context compression and hierarchical sampling, achieving theoretically unlimited interactive generation and effectively mitigating temporal discontinuities in long sequences.
Diffusion schedule optimization. Most diffusion-based approaches typically assume that all frames share the same noise level [29,30]. Recent studies, however, demonstrate that optimizing noise schedules—such as diffusion forcing [31,32]—facilitates autoregressive long-horizon generation [33]. GEM [34] maintains temporal continuity in long video generation by denoising in multiple stages, conditioning on previous frames, and enforcing progressively increasing noise levels over time during training.
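To make the schedule concrete, the following minimal PyTorch sketch assigns each frame its own monotonically increasing diffusion timestep, in the spirit of diffusion forcing; the function names and the linear offset schedule are illustrative assumptions, not taken from the cited implementations.

```python
import torch

def frame_noise_levels(num_frames: int, num_steps: int = 1000) -> torch.Tensor:
    """Assign each frame an independent, monotonically increasing diffusion
    timestep so later frames are noised more heavily than earlier ones."""
    base = torch.randint(0, num_steps // 2, (1,))
    # Later frames receive strictly larger timesteps (more noise).
    offsets = torch.linspace(0, num_steps // 2 - 1, num_frames).long()
    return (base + offsets).clamp(max=num_steps - 1)

def noise_video(frames: torch.Tensor, alphas_cumprod: torch.Tensor):
    """frames: (T, C, H, W). Applies per-frame forward diffusion."""
    t = frame_noise_levels(frames.shape[0], alphas_cumprod.shape[0])
    a = alphas_cumprod[t].view(-1, 1, 1, 1)           # (T, 1, 1, 1)
    eps = torch.randn_like(frames)
    noisy = a.sqrt() * frames + (1 - a).sqrt() * eps  # standard DDPM forward
    return noisy, eps, t
```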
Conditional constraints. Imposing physical structural constraints on world models substantially enhances temporal physical continuity. Pathdreamer [35] employs a hierarchical two-stage architecture, first generating semantic and depth maps as structural contexts before predicting future frames conditioned on them. This layered strategy improves long-term fidelity since predicting structural representations grounded in physical geometry is inherently more continuous than directly generating RGB pixels. PlayerOne [36] adopts a joint reconstruction framework that simultaneously models 4D scenes and video frames, thereby ensuring physically consistent scene continuity. VRAG [23] incorporates global state information (e.g., character coordinates and poses) and retrieves relevant frames from a historical buffer, allowing the model to leverage past contexts and spatial awareness for improved temporal continuity.
Optimization-level. Regularization strategies and loss-based enforcement provide direct optimization routes to penalize physical discontinuities in learned representations. SSD [37] employs a state-space modeling framework to efficiently handle long-horizon sequential information, enabling the extraction of rich long-term contextual dependencies. Vid2World [38] modifies both architecture and training objectives—particularly through adjustments to temporal attention layers and temporal convolution weight sharing—to enable causal generation and autoregressive capability in video diffusion models, ensuring predictions depend only on past physical states. SGF [39] enforces temporal continuity by minimizing the mean squared error between consecutive observations while introducing variance–covariance regularization to prevent representation collapse.
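As an illustration of such regularization, the sketch below combines a consecutive-latent prediction MSE with a VICReg-style variance–covariance penalty, loosely following the SGF objective described above; all names and coefficients are hypothetical.

```python
import torch

def vcov_regularizer(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4):
    """Variance-covariance regularization to prevent representation
    collapse. z: (batch, dim)."""
    z = z - z.mean(dim=0)
    # Variance term: push each dimension's std above the threshold gamma.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    # Covariance term: decorrelate dimensions (off-diagonal entries -> 0).
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / z.shape[1]
    return var_loss + cov_loss

def continuity_loss(z_pred_next, z_next, lam: float = 0.1):
    """MSE between predicted and actual consecutive latents, plus an
    anti-collapse regularizer on the target representation."""
    mse = torch.nn.functional.mse_loss(z_pred_next, z_next)
    return mse + lam * vcov_regularizer(z_next)
```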
Table 2. Summary of Works on Physical Continuity Challenges. Modalities (textual abbreviations replace the original icons): Vid = video or multi-frame; Img = single frame or image; Lat = latent representation; Spa = spatial representation; Txt = text; Act = action representation; Cam = camera (pose) information; Dep = depth feature; Neu = neural signal; Obj = object-level feature; Phy = physical information; Mem = historical memory feature. Evaluations: V assesses visual quality, generation, prediction, and control in downstream tasks, including qualitative evaluations. R evaluates robotic tasks in simulation, while R̂ includes real-robot experiments. P measures physical understanding and perception. C evaluates planning and decision-making in games, tasks, and navigation.
Work | Venue | Main Solution | Method | Input / Output | Conditions | Evals
— Temporal Consistency —
EnerVerse [26] | NeurIPS'25 | Autoregression Improvements | Sparse Chunks | Vid / Vid | Txt, Spa, Act, Cam, Dep | V, R̂
EVA [27] | arXiv'25 | Autoregression Improvements | Reflection | Vid / Vid | Txt, Act | V, R̂
Yume [28] | arXiv'25 | Autoregression Improvements | FramePack | Vid / Vid | Txt, Act, Obj, Neu | V
SAMPO [24] | NeurIPS'25 | Autoregression Improvements | Scale-Wise | Vid, Act, Txt / Vid | None | V, R̂
GEM [34] | CVPR'25 | Diffusion Schedules | Increasing Noise | Vid, Img, Obj / Vid, Obj | Obj, Phy | V
Diamond [33] | NeurIPS'24 | Diffusion Schedules | Adaptive Noise | Vid, Act, Mem / Vid | None | V
Epona [32] | ICCV'25 | Diffusion Schedules | Diffusion Forcing | Vid / Vid | Act | V, C
Pathdreamer [35] | ICCV'21 | Condition Constraints | HR. Modeling | Img / Img | Spa, Dep | V, C
PlayerOne [36] | NeurIPS'25 | Condition Constraints | Rec. Constraints | Img, Spa, Act, Cam / Vid, Spa | Txt | V
VRAG [23] | NeurIPS'25 | Condition Constraints | Global Constraints | Vid, Act / Vid | None | C, V, R̂
Vid2World [38] | ICLR'26 | Optimization Level | Loss-based | Vid, Act / Vid, Act | Txt | V
SSD [37] | NeurIPS'25 | Optimization Level | State-space | Vid, Act / Vid | None | C
SGF [39] | ICLR'25 | Optimization Level | Regularization | Vid, Act / Lat | None | C
— Spatial Consistency —
RoboScape [40] | NeurIPS'25 | Implicit Alignment | HR. Modeling | Vid, Act, Dep / Img, Dep, Phy | None | V, R
WorldGrow [41] | AAAI'26 | Implicit Alignment | Block Inpainting | Spa / Spa | Txt, Img, Obj | V
WVD [42] | CVPR'25 | Explicit Alignment | Spatial Joint Modeling | Vid, Spa / Vid, Spa | None | V
FlashWorld [43] | ICLR'26 | Explicit Alignment | Dual-mode Pre-training | Img / Spa | Txt, Cam, Neu | V
Geom. Forcing [44] | ICLR'26 | Explicit Alignment | Rep. Alignment | Vid / Vid | None | V
InfiniCube [45] | ICCV'25 | Explicit Alignment | HR. Constraints | Act / Spa | Spa, Img, Obj | V
MindJourney [46] | NeurIPS'25 | Explicit Alignment | Language Guidance | Img, Txt / Img, Txt | Cam, Txt | V, C, P
UniFuture [47] | ICRA'26 | Explicit Alignment | Multi-modal | Vid, Dep / Vid, Dep | Spa | V
Edeline [48] | NeurIPS'25 | Explicit Alignment | Mem. Enhancement | Img, Act / Lat | None | C
Ctrl-World [49] | ICLR'26 | Explicit Alignment | Space Constraints | Img, Dep / Vid, Dep | Spa, Obj, Cam | V
Spatial-Mem [50] | NeurIPS'25 | Explicit Alignment | Semantic Alignment | Vid, Spa / Img | Spa, Mem | V
WorldMEM [51] | NeurIPS'25 | Memory Mechanism | Memory Bank | Vid / Vid | Cam, Neu, Mem | V
Voyager (LLM) [52] | TMLR'24 | Memory Mechanism | Skill Library | Txt / Act | Phy, Mem | C
SSM-World [53] | ICCV'25 | Memory Mechanism | State-Space Models | Vid / Vid | Act | V
— Identity Consistency —
Loci-v1 [54] | ICLR'23 | Occlusion Imagination | Tracking | Vid / Lat | Obj | C
SAVi++ [55] | NeurIPS'22 | Tracking | Identity Tracking | Vid / Dep, Obj | None | P
ForeDiff [56] | arXiv'25 | Anchors | Arch. Decoupling | Vid / Vid | Act, Txt | V, R̂

2.1.2. Spatial Continuity

Videos are 2D projections that discard crucial 3D geometry, causing world models to miss physically grounded spatial relationships. Human spatial perception uses multi-view associations and long-term memory [44], inspiring recent methods to recover 3D geometry and preserve long-range spatial dependencies through memory mechanisms.
Implicit representation alignment. These methods do not directly parameterize 3D structures but instead encode physical spatial correlations implicitly within the model through cross-modal fusion or attention mechanisms. For instance, EnerVerse [26] introduces cross-view spatial attention, leveraging camera intrinsics/extrinsics and ray direction maps to model view correspondences and enhance holistic multi-view generation. RoboScape [40] jointly learns from RGB and depth, using depth features as geometric constraints to implicitly acquire 3D physical scene priors rather than merely fitting 2D images. Similarly, GEM [34] and DeepVerse [57] integrate depth, semantics, RGB, and dynamic masks to improve spatial physical continuity. GAIA-2 [58], as a surround-view world model, ensures multi-camera spatial continuity by aligning streams via structured conditions such as environmental factors and road semantics, enabling high-resolution spatiotemporally coherent generation.
Explicit representation alignment. These approaches provide explicit physical spatial information to supervise joint distributions, spatial guidance, and alignment. WVD [42] encodes global 3D coordinates into spatial pixels, learning the joint distribution of 2D space and 3D coordinates from 6D (RGB+XYZ) video with explicit geometric supervision. Geometry Forcing [44] aligns intermediate video model representations with features from pretrained geometric foundation models, guiding the model to internalize physical geometric alignment over latent perspectives and scales. Pathdreamer [35] accumulates past observations into a 3D point cloud and reprojects it into 2D space as context, explicitly reasoning about the 3D physical geometry of the next frame. InfiniCube [45] constructs an external voxel-based “3D ground buffer” from videos, providing explicit physical grounding to mitigate spatial drift in long sequences. DSG-World [59] explicitly builds a 3D Gaussian world model from dual-state video observations, introducing dual-segmentation-aware Gaussian fields, pseudo-intermediate symmetric alignment, and cooperative pruning/pasting to maintain geometric integrity under occlusions. Voyager-DM [60] aligns RGB with depth and employs efficient world caching for long-range geometrically consistent 3D reconstruction. Spatial-Mem [50] leverages external geometry-grounded 3D world representations (e.g., point clouds) to filter dynamic regions while explicitly memorizing static spatial structures.
Memory mechanisms. Many approaches incorporate external memory banks or retrieval modules [52] to store historical physical observations, mitigating error accumulation and preserving spatial continuity. WorldMem [51] leverages a memory bank to store past visual and state features, achieving long-term 3D physical continuity through a memory attention mechanism. Spatial-Mem [50], in contrast, maintains sparse keyframes as episodic memory and dynamically expands when new regions appear, thereby enhancing the preservation of spatial relationships. Furthermore, PEWM-3D [61] continuously integrates observations into a shared 3D feature map (e.g., Plücker coordinate embeddings), which is referenced throughout the generation phase to ensure globally continuous spatial coherence. In addition, SSM-World [53] introduces structured memory along the spatial dimension via a block-wise scanning State Space Model (SSM) mechanism, partitioning the global space into controllable units. This design balances temporal memory with spatial physical continuity while incorporating frame-level local attention to enhance generation quality.
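A minimal sketch of such a memory mechanism is given below: past state features are written to an external bank and retrieved by attention over stored keys, roughly in the spirit of WorldMem's memory attention. The class and its fixed-capacity eviction policy are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Minimal external memory: stores past frame/state features and
    retrieves a context vector via attention over the stored keys."""
    def __init__(self, dim: int, capacity: int = 512):
        self.keys = torch.empty(0, dim)    # e.g. pose/state embeddings
        self.values = torch.empty(0, dim)  # e.g. visual features
        self.capacity = capacity

    def write(self, key: torch.Tensor, value: torch.Tensor):
        # Append new entries, dropping the oldest beyond capacity.
        self.keys = torch.cat([self.keys, key])[-self.capacity:]
        self.values = torch.cat([self.values, value])[-self.capacity:]

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """query: (1, dim) -> attention-weighted sum of stored values."""
        if self.keys.shape[0] == 0:
            return torch.zeros_like(query)
        attn = F.softmax(query @ self.keys.T / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values
```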

2.1.3. Identity Continuity

World models that operate directly in pixel space often fail to capture the physically invariant properties of objects across frames. Since pixel-level representations are highly sensitive to noise, lighting changes, and viewpoint variation, they cannot maintain the stable attributes — such as shape, material, and appearance — that define object identity in the physical world, leading to identity drift in long sequences. Fine-grained object-level modeling is therefore crucial for maintaining physically grounded identity continuity across frames [62]. In this regard, the GEM model [34] introduces “identity embeddings” for objects to eliminate operation ambiguities and adopts customized integration mechanisms for different types of control signals (such as self-motion, object manipulation, and human posture), ensuring strict physical identity continuity between generated content and diverse control signals. Similarly, FOCUS [63] allocates a unique one-hot identity vector to each object, from which the object latent extractor generates a stable latent representation $s_{\mathrm{obj}}^{t}$, thereby ensuring identity continuity. Loci-v1 [54] and SAVi++ [55] approach identity continuity by decomposing the scene into multiple object slots. Similarly, ForeDiff’s [56] prediction flow learns during pretraining how to precisely identify and preserve physically grounded identity-related features of objects, such as spatial location, shape, and appearance, from video frames, providing an “identity anchor.” Recently, the unprecedented success of Sora2 [13] in cross-scene multi-view identity continuity demonstrates the immense potential of learning physically consistent representations from videos.
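The following sketch illustrates the slot-identity idea in PyTorch: each object slot carries a fixed one-hot identity that conditions a shared latent extractor, so the same physical object is routed to the same slot across frames. The module is a simplified illustration, not the FOCUS architecture.

```python
import torch
import torch.nn as nn

class ObjectIdentityEncoder(nn.Module):
    """Each object slot carries a fixed one-hot identity; a shared
    extractor conditions the per-object latent on that identity."""
    def __init__(self, num_slots: int, feat_dim: int, latent_dim: int):
        super().__init__()
        self.ids = torch.eye(num_slots)  # one-hot identity per slot
        self.extract = nn.Linear(feat_dim + num_slots, latent_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (num_slots, feat_dim) -> (num_slots, latent_dim)."""
        x = torch.cat([frame_feats, self.ids.to(frame_feats.device)], dim=-1)
        return self.extract(x)
```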

2.2. Controllability

The controllability gap exists because videos passively record events and cannot represent action-conditioned responses, causal interventions, or counterfactuals. Without this, models cannot predict “what if” scenarios, limiting policy exploration and intervention. Current research addresses this limitation by introducing explicit modeling of action–outcome mappings to ensure that environmental states respond to interventions in accordance with physical laws, while also striving to construct interactive world models.

2.2.1. Semantic Control

Since videos lack explicit representations of the semantics underlying scene changes — such as object material properties, force interactions, and spatial affordances — world models struggle to ground high-level semantic instructions in physically consistent outputs. Work in this direction bridges semantic intent and physical scene evolution by integrating structured semantic inputs, text-conditioned guidance, and multimodal semantic alignment mechanisms [64], effectively preventing semantic drift and physically inconsistent generation. Models such as InfiniCube [45] and GEM [34] achieve this by directly injecting structured physical and semantic conditions—such as road semantics, high-definition maps, vehicle bounding boxes, textual prompts, or DINOv2 [65] features—into the generative model. This enables the predictable manifestation of advanced semantic instructions, including multi-agent interaction, multi-camera consistency, dynamic object insertion, weather condition adjustments, and object movement or insertion, while maintaining physical plausibility in the generated scenes.
For finer semantic interaction, Yume [28] utilizes novel sampling strategies and stochastic differential equations to enhance the controllability of text conditions, and achieves physically grounded camera control through quantized camera trajectories. LaDi-WM [66] innovatively designs an interactive diffusion process that dynamically adjusts DINO geometric latent codes and SigLip semantic latent codes, aligning geometry with semantics to accurately model the physically consistent evolution of scene semantics within world dynamics. FlowDreamer [67] adopts 3D scene flow as a universal motion representation, capturing physically grounded semantic deformations of non-rigid objects, and supports fine-grained semantic motion control that respects plausibility. Pathdreamer [35] avoids semantic ambiguity in camera modeling by directly specifying future viewpoint trajectory sequences grounded in spatial geometry. Additionally, NWM [68] implements real-time semantic planning and semantic command compliance in robotic operations using CDiT [69] architectures with semantic constraints, energy function-encoded semantic rules, and action tree-structured instructions, ensuring that semantic control signals translate into physically executable robot behaviors.

2.2.2. Interactivity

Interactivity enables world models not only to passively predict but also to actively respond to interventions and instructions, thereby supporting causal reasoning and goal-directed policy exploration [70]. Existing studies mainly focus on two directions: (I) interaction video generation driven by local trajectories or action signals (emphasizing sequence-level action conditioning, suitable for stepwise prediction in robotics or games); and (II) real-time explorable or 3D world models driven by global signals (emphasizing holistic prompts or multimodal inputs, suitable for immersive simulation).
The first line of research emphasizes conditioning video generation models on low-level control signals (e.g., robot joint actions, game inputs, or reward feedback) to enable stepwise interactive prediction. Representative works include Vid2World [38], AVID [71], iVideoGPT [72], DWS [73], AirScape [74], and UnifoLM-WMA-0 [75], which employ action-injection mechanisms to map user-provided control signals into sequences of future video frames, thereby supporting embodied interactive exploration such as robotic trajectory evolution. These approaches essentially constitute controllable video-driven world models, establishing a tight coupling between low-level actions and environmental dynamics. In addition, some works further focus on multi-task trajectory integration and generalization. For instance, WLA [76] aggregates cross-environment trajectories to achieve continuous dynamic prediction, while the IAFM [77] extracts action signals from large-scale robotic video trajectories to support cross-task and cross-domain interactive simulation.
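A minimal sketch of such action injection is shown below: a projected action embedding is added to the latent tokens of the current frame before a transformer predicts the next latent frame. The module and its dimensions are illustrative assumptions; the cited systems use considerably richer conditioning schemes.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Sequence-level action conditioning: an action embedding is added
    to every spatial token of the current latent frame before a
    transformer predicts the next latent frame."""
    def __init__(self, latent_dim: int, action_dim: int, n_heads: int = 8):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.dynamics = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, z_t: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        """z_t: (B, tokens, latent_dim), action: (B, action_dim)."""
        a = self.action_proj(action).unsqueeze(1)   # (B, 1, latent_dim)
        return self.dynamics(z_t + a)               # predicted z_{t+1}
```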
The second line of research is driven by global conditions (e.g., text, natural language instructions, or multimodal prompts), aiming to construct navigable and explorable spatial world representations (3D or 4D) and extend video generation into embodied, immersive, or panoramic scene simulation. For example, TesserAct [78] employs instruction-driven grid evolution mechanisms for controllable environment generation, whereas PlayerOne [36] models first-person-view (POV) videos to support navigation. LatticeWorld [79] addresses static 3D world modeling by seamlessly integrating with an industrial-grade computer graphics rendering engine. In virtual environments, several studies investigate open-world simulation based on game scenarios: MineWorld [80] and Hunyuan-Game [6,81] are specifically designed for game trajectories, supporting agent-controlled interactive exploration; while Voyager-DM [60], Mirage 2 [82], and Matrix Game [7] further develop real-time explorable game-world modeling. Moreover, some works explicitly incorporate 3D spatial structures into generative models: for instance, Yume [28], Marble [83], and Matrix-3D [84] combine omnidirectional video generation with interactive control, enabling real-time 360° scene exploration driven by external devices.

2.3. Generalization

Although large-scale video data exist, annotated datasets for embodied intelligence and 3D environments remain scarce. Videos capture appearances rather than underlying physical laws, causing overfitting to visual patterns and poor coverage of rare dynamics. Recent work therefore turns to data augmentation, improved architectures, and domain generalization techniques to enhance cross-scene adaptability.

2.3.1. Data

Data Synthesis and Utilization. A typical approach involves fine-tuning generative models to synthesize diverse new data. For example, DREAMGEN [85] employs a video-based world model to generate both familiar and novel tasks across diverse environments, while integrating a latent action model or inverse dynamics model to infer pseudo-action sequences, forming “neural trajectories.” A more advanced framework, Dream to Manipulate [64], introduces a learnable digital twin that combines Gaussian splatting [86] with simulators to generate new environments and action configurations with physically consistent dynamics, thereby expanding the diversity of the training distribution. As shown in Figure 3, the EnerVerse-D data engine [26] integrates world models with 4D Gaussian Splatting (4DGS) to establish a self-reinforcing data generation loop capable of producing high-quality, multi-view video data with geometrically consistent scene reconstruction.
In addition, many studies explore unsupervised or weakly supervised paradigms. GEM [34] leverages large-scale cross-domain data and pseudo-labels to generate depth maps, trajectories, and poses, improving instance-level perception and generalization. FLARE [87] enhances temporal representation learning by leveraging structured representations and action-aware future embeddings derived from unlabeled data. VLMWM [88] employs a dynamic modeling mechanism to automatically generate pseudo-labels, thereby strengthening the model’s supervisory signals for dynamics.
Sim2Real Transfer. Since simulators can explicitly model physical properties such as friction, mass, and contact dynamics, simulation-to-reality transfer offers a principled way to inject physical knowledge into world models. SRCC [89] introduces the Sim2Real Correlation Coefficient to quantify a simulator’s predictive fidelity, optimizing simulation parameters to better align dynamics with real-world performance. SimWorld [90], a simulator-conditioned scene generation engine, integrates scene construction and simulation modules to produce synthetic data and labels with physically accurate properties for world model training. Other methods exploit general semantic alignment to bridge the appearance gap. In addition, generative editability-based approaches, such as Cosmos-transfer [91] and Dreamland [92], leverage multi-conditional multimodal pretraining to achieve photorealistic domain transfer while preserving physical plausibility.
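As a rough illustration of the idea behind SRCC, the sketch below computes a Pearson correlation between paired simulation and real-world task metrics for the same set of policies; the exact formulation in [89] may differ, and the data shown are hypothetical.

```python
import numpy as np

def sim2real_correlation(sim_scores, real_scores) -> float:
    """Pearson correlation between task metrics measured in simulation
    and on the real system for the same policies/configurations.
    A value near 1 indicates the simulator ranks policies like reality."""
    sim = np.asarray(sim_scores, dtype=float)
    real = np.asarray(real_scores, dtype=float)
    sim = (sim - sim.mean()) / (sim.std() + 1e-8)
    real = (real - real.mean()) / (real.std() + 1e-8)
    return float((sim * real).mean())

# Hypothetical usage: success rates of 5 policies in sim vs. on hardware.
print(sim2real_correlation([0.9, 0.7, 0.5, 0.8, 0.3],
                           [0.8, 0.6, 0.4, 0.7, 0.2]))
```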
Foundation Model Priors. Cross-scene generalization can be enhanced by leveraging the prior knowledge encoded in pretrained video or multimodal foundation models. For example, UWM [29] and Vidar [93] achieve broad semantic understanding and efficient domain transfer by fine-tuning on only a small amount of domain-specific data. Similarly, Founder [94] maps foundation model representations into the world model latent state space, enabling world models to handle cross-domain scenarios and multimodal tasks. Notably, DINO-WM [95] trains a world model on offline behavioral data using DINOv2-based pretrained visual features, thereby enabling task-agnostic zero-shot planning grounded in physically meaningful visual representations. In addition, Geometry Forcing [44] exploits pretrained 3D foundation models for physically grounded representation learning, alleviating the modeling bottlenecks caused by limited 3D supervision.

2.3.2. Architecture Generalization

One class of approaches introduces structural modeling of visual or geometric inputs using discrete tokens, virtual intermediate states, or hierarchical structures to better capture physically invariant patterns. WorldDreamer [96] learns world dynamics through masked-token prediction, while DSG-World [59] reconciles observational discrepancies via geometric constraints and pseudo-states that encode structure. HMBRL [97] adopts hierarchical models with low-dimensional abstract actions to generalize across complex tasks. Another class emphasizes regularization and robust training to address uncertainty and drift in latent physical representations. MoSim [98] introduces residual flow penalties to avoid physically uncertain regions, while SDE-based [99] and Jacobian-regularized frameworks analyze and correct latent drift errors in physical state representations, thereby improving robustness. A third line of work focuses on multimodal conditioning and policy structures: GenRL [100] achieves task generalization with vision/language prompts, and BPN [101] separates state and task representations with bilinear fusion to enhance transferability and robustness under environmental shifts.
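The masked-token objective behind WorldDreamer-style dynamics learning can be sketched as follows; this is a generic masked-prediction training step under assumed tensor shapes, not the authors' code.

```python
import torch
import torch.nn as nn

def masked_token_loss(model: nn.Module, tokens: torch.Tensor,
                      mask_id: int, mask_ratio: float = 0.5):
    """tokens: (B, T) discrete visual token ids. Randomly mask a fraction
    and train the model to recover them -- the masked-token objective
    used to learn dynamics over tokenized video."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    inputs = tokens.masked_fill(mask, mask_id)
    logits = model(inputs)                        # (B, T, vocab)
    return nn.functional.cross_entropy(
        logits[mask], tokens[mask])               # loss only on masked slots
```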

2.3.3. Behavioral and Environmental Generalization

Behavioral Generalization. This line of work aims to enhance the adaptability of world models in cross-task and cross-platform action modeling, where distribution shifts in dynamics across embodiments pose a central challenge. RUWM [102] enhances behavioral robustness by analyzing latent representation errors and introducing regularization over physical state transitions. 3DFlowAction [30] and FlowDreamer [67] address challenges across heterogeneous robotic platforms, complex manipulation tasks, and novel objects by leveraging 3D optical flow as a physically grounded motion representation and large-scale pretraining. AdaWorld [103] employs latent action self-supervision to extract and recombine actions, thereby enabling flexible control and behavioral generalization across heterogeneous environments.
Environmental Generalization. This line of work focuses on platform differences and adaptation to new environments. Vidar [93] unifies observation spaces across different robotic platforms by integrating multi-view video data, and introduces a masked inverse dynamics mechanism to extract action features from generated videos, achieving few-shot generalization in dual-arm manipulation under varied configurations. DreamerV3 [11] leverages robustness techniques to handle multimodal returns and sparse rewards, improving adaptability in novel environments; its extension, cRSSM [104], systematically incorporates contextual information into DreamerV3, further strengthening generalization to unseen settings. AirScape [74] adopts a two-stage training scheme—intent-controllable modeling and spatiotemporal constraint learning—to address distribution mismatches in physical dynamics and limited diversity in embodied intelligence for aerial domains. MoSim [98] decouples physics modeling from policy learning via a neural motion simulator, effectively mitigating challenges posed by unseen scenarios and distribution shifts in dynamics, thus enhancing cross-task generalization.
Table 3. Hardware and configuration comparison for lightweight world-model approaches (for reference only). Dashes denote values not reported.
Work | Venue | GPUs | Batch Size | Training Steps
DINO-world [105] | arXiv'25 | H100 ×16 | 1024 | 350K iter.
HWM [106] | arXiv'25 | A6000 ×2 | 128 | —
MinD [107] | arXiv'25 | A40 ×4 | — | 9 hours
Sparse Imagin. [108] | ICLR'26 | 3090 ×4 | 32 | 100 epochs
Simulus [109] | arXiv'25 | 4090 ×1 | 8 | 100 epochs
EMERALD [110] | ICML'25 | 3090 ×1 | 16 | —
D²-World [111] | arXiv'24 | V100 ×8 | 24 | 24 epochs
AVID [71] | RLC'25 | A100 ×4 | 64 | 7 days
ScaleZero [112] | arXiv'25 | A100 ×8 | 512 | —
KeyWorld [113] | arXiv'25 | A800 ×8 | 1 | 100 epochs
TWIST [114] | ICRA'24 | 3090 ×1 | — | 500K iter.
IRIS [115] | ICLR'23 | A100 ×8 | 256 | 3.5 days
Δ-IRIS [116] | ICML'24 | A100 ×1 | 32 | 1K epochs
HERO [117] | arXiv'25 | A100 ×1 | — | —
PosePilot [118] | IROS'25 | A100 ×8 | — | —
OCWM [119] | ICLR'25 | H100 ×4 | 32 | 40 epochs

2.4. Lightweight

Lightweighting world models is crucial because their high spatial-temporal overheads limit real-time tasks such as robotics and autonomous driving. Beyond efficiency, reducing model size helps filter out visually redundant information, forcing models to capture physically meaningful dynamics rather than irrelevant details. Compact architectures act as inductive biases for learning concise, generalizable, and physics-grounded representations. Existing methods focus on maximizing computational and sample efficiency while preserving essential physical dynamics, and can be grouped into several complementary strategies:
Learning in structured latent spaces. Since physical dynamics are low-dimensional relative to raw pixel space, learning in structured latent spaces provides a natural way to separate physically meaningful state representations from visual redundancy. Akbulut et al. [120] provided early experimental evidence that structured latent representations significantly improve both efficiency and performance, achieving over a 50% gain compared with modeling directly in observation space. Recent works such as DINO-World [105], MinD [107], and HWM [106] similarly compress high-dimensional pixel streams into compact latent spaces and perform temporal modeling therein, leveraging VQ-VAE [121], denoised latent representations, or low-resolution latent features to fundamentally reduce the computational and memory overhead of sequence modeling while retaining physically relevant state information.
Shortening sequence length via tokenization, sparsification, and parallelism. Physical dynamics are often locally sparse — most video frames contain physically uninformative redundancy between keystate transitions. Methods such as Sparse Imagination (random token dropping and grouped sparse attention) [108], Simulus (modular tokenization) [109], and Δ -IRIS (reduced key-frame encoding with autoregressive optimization) [116] lessen redundant temporal dependencies through token sparsification, concentrating model capacity on significant state changes. In parallel, approaches like EMERALD (MaskGIT-style parallel prediction) [110] and D²-World (non-autoregressive single-stage occupancy prediction) [111] further reduce the cumulative cost of autoregressive training by enabling parallel or masked generation across time.
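The token-dropping idea can be sketched as follows: a random subset of spatiotemporal tokens is kept and the rest discarded before sequence modeling, concentrating compute on a sparse sample of the sequence. The keep ratio and gather-based implementation are illustrative assumptions.

```python
import torch

def drop_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25):
    """tokens: (B, N, D). Keep a random subset of spatiotemporal tokens;
    returns the kept tokens and their indices so positions can be
    restored after sequence modeling."""
    B, N, _ = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]   # random subset
    idx_sorted, _ = idx.sort(dim=1)                     # keep temporal order
    kept = torch.gather(
        tokens, 1,
        idx_sorted.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return kept, idx_sorted
```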
Parameter-efficient strategies for deployment and transfer. Models including AVID [71], PosePilot [118], and LAMP [122] adapt tasks through lightweight adapters, black-box masking, or plug-and-play modules, avoiding modification — or requiring only minimal fine-tuning — of pretrained backbones that already encode broad physical priors. ScaleZero [112] balances capacity and efficiency via dynamic parameter scaling and staged LoRA expansion. KEYWORLD [113] concentrates computation on a sparse set of semantically critical frames and employs lightweight convolutional models to synthesize intermediate frames, while TWIST [114] distills a privileged-state teacher world model into a student trained solely on domain-randomized visual observations, effectively reducing both training time and data requirements while preserving generalization.
These approaches have demonstrated substantial improvements in training/inference speed, model size, and downstream control sample efficiency. Nevertheless, they share several trade-offs rooted in the tension between fidelity and computational economy: the invertibility and fidelity of latent representations constrain the upper bound of variable recovery and fine-grained dynamics modeling; sparsification and non-autoregressive strategies face challenges in maintaining long-horizon continuity and handling multimodal uncertainty in state transitions; and black-box adaptation methods exhibit limited generalization when faced with cross-domain transfer involving substantially different environments or dynamics distributions beyond pretraining [123].
Table 4. Summary of Works on Physical Grounding Challenges. Note: HR. denotes hierarchical representation. Modalities use the same textual abbreviations as Table 2 (Vid, Img, Lat, Spa, Txt, Act, Cam, Dep, Neu, Obj, Phy, Mem). Evals and downstream apps: physical generation, question answering, interaction, understanding, and attributes are abbreviated as P_G, P_Q, P_I, P_U, and P_A; A denotes action prediction, M motion planning, and F fluid dynamics. Downstream applications: W = real world, R = robotics, D = autonomous driving, O = objects.
Work | Venue | Method | Input / Output | Conditions | Evals and Apps
— Explicit Priors & Feedback Integration —
Pandora [125] | arXiv'24 | Physical Prompts | Img / Vid | Txt | P_G; W
WorldGPT [126] | MM'24 | Modality Alignment | Lat, Act / Lat | Vid, Txt | P_G; W
LLMPhy [127] | arXiv'24 | Engine Integration | Img / Vid | Vid, Act, Txt | P_Q; O
DrivePhysica [128] | arXiv'24 | Positional Constraints | Vid / Vid | Txt, Cam, Spa, Phy | P_G; D
PhysTwin [129] | ICCV'25 | Attribute Fusion | Vid, Dep / Spa | None | A; R, O
SlotPi [130] | SIGKDD'25 | Physical Constraints | Vid / Vid | None | P_G, P_Q; F, D
S2-SSM [131] | arXiv'25 | Sparse Regularization | Vid / Img | None | P_I; O
RenderWorld [132] | ICRA'25 | Pretraining | Img, Dep, Obj / Act, Lat | Spa, Phy | M; D
DINO-WM [95] | ICML'25 | Pretraining Priors | Img, Act / Lat | None | M; O, R
HERMES [133] | ICCV'25 | Multi-view Modeling | Img, Txt / Lat, Txt | Act, Phy | A; D
Cosmos [91] | arXiv'25 | Multimodal Constraints | Vid / Vid | Txt, Obj, Dep | P_Q; W
— Disentangling Static and Dynamic Factors —
AdaWorld [103] | ICML'25 | Action Decoupling | Vid / Img, Act | Txt, Mem | P_I, M; O
Dyn-O [62] | NeurIPS'25 | Dynamic Decoupling | Img, Act / Lat | Obj | A; O
ContextWM [134] | NeurIPS'23 | Dynamic Decoupling | Vid, Act / Lat | Txt, Obj, Spa | P_I, A, M; D, R
DisWM [135] | ICCV'25 | Dynamic Decoupling | Img / Lat | None | P_A, P_U; O
DreamDojo [67] | RAL'26 | Explicit Action Modeling | Img, Dep, Act / Img, Dep | Spa | A, M; R
DreamZero [30] | arXiv'26 | Action Decoupling | Img, Txt, Cam / Spa | Act, Dep, Spa | A; R
OC-STORM [136] | arXiv'25 | Object Extraction | Img, Obj, Act / Lat, Obj | None | P_U, M; O
AD3 [137] | ICML'24 | Action Decoupling | Vid, Act / Lat | Lat | P_U, M; O
LongDWM [138] | arXiv'25 | Action Decoupling | Vid, Act / Vid | Txt, Spa, Cam | P_G; D
Vidar [93] | arXiv'25 | Action Decoupling | Vid / Vid | Txt, Neu, Img, Cam | P_G; R
DREAMGEN [85] | arXiv'25 | Pseudo Action Estimation | Vid / Vid, Act | Vid, Act, Neu | P_G; R
VLMWM [88] | arXiv'25 | Fine-tuning | Img, Act / Img, Act | None | P_G, A; W
WorldDreamer [96] | arXiv'24 | Disentangled Modeling | Vid / Vid | Act, Txt | P_G; W
Simulus [109] | arXiv'25 | Dynamic Decoupling | Img, Act / Lat | Neu, Spa | M; O
SCALOR [139] | ICLR'20 | Background Modeling | Vid / Lat, Spa | Obj | P_U, A; O
AETHER [140] | ICCV'25 | Unified Modeling | Vid, Act, Cam / Vid, Act, Dep | None | P_G, A; W
UWM [29] | RSS'25 | Action Decoupling | Vid, Act / Vid, Act | None | P_G, A; R
FLARE [87] | CoRL'25 | Unified Modeling | Img, Act / Act | Neu, Txt | A; W
— Progressive Constraints & Hierarchical Abstraction —
DWS [73] | AAAI'26 | Regularization | Img, Act / Vid | Txt | P_G, P_I; R, O
Dreamland [92] | arXiv'25 | Engine Simulation | Lat, Obj, Spa / Img | Txt | P_U, P_I; D
GWM [141] | ICCV'25 | Hierarchical Abstraction | Img, Spa / Act | Act, Spa | A; R
PIWM [142] | arXiv'24 | Interpretability | Vid, Act, Phy / Lat, Phy | None | P_U, P_I; O
Ross et al. [143] | ICLR'25 | Theoretical Framework | Lat, Mem, Phy / Lat, Phy | None | P_A; O
SimWorld [90] | arXiv'25 | Simulation-based Modeling | Img, Txt / Img | Phy, Obj, Neu, Dep | P_G, P_U; D
MoSim [98] | CVPR'25 | Multi-constraint | Lat, Act / Phy | None | A; R
WALL-E [144] | NeurIPS'25 | Rule Learning | Lat, Act, Mem / Lat, Act, Txt | Img, Txt | M; O
FOLIAGE [145] | arXiv'25 | Hierarchical Abstraction | Lat, Act / Lat | Img, Spa, Obj, Phy | P_A; O
LLMPhy [127] | arXiv'24 | Hierarchical Abstraction | Vid, Txt / Txt | Act, Img, Phy | P_A, P_I; O
V-JEPA 2 [12] | arXiv'25 | Hierarchical Pretraining | Vid / Lat | Txt, Act, Neu | A, M; R

2.5. Universality

The lack of unification arises because videos capture only limited slices of reality, lacking a shared cross-modal physical representation. Truly universal world models rely on shared physical abstractions that generalize across domains. The Platonic Representation Hypothesis [146] suggests that vision, language, and action are projections of the same underlying reality, implying a unified semantic space. Contemporary models leverage this by mapping heterogeneous modalities into shared physical representations, enabling cross-modal consistency, knowledge transfer, and multi-task robustness.
As shown in Figure 4, Yue et al.’s unified framework [124] decomposes the system into five components: interaction and perception; unified reasoning (dynamics, causality, explicit and latent reasoning); memory; environment (learnable and generative); and multimodal generation (video, image, audio, 3D, prediction). This framework exhibits important complementarities with Eq. (1)–(3) in this paper: while Eq. (1)–(3) define the formal backbone of world models, the above framework explicitly incorporates memory modules and multimodal generative capabilities, thereby providing broader coverage for emerging variants of world models.
Data pretraining and shared latent-space modeling form the core technical foundation for generality. DINO-world [105] leverages large-scale, uncurated web videos for pretraining, combined with the DINOv2 [65] visual encoder to perform latent-space predictions. Its compact Transformer autoregressive architecture supports variable frame rates and context lengths, extracting physically generalizable knowledge from diverse data and demonstrating seamless cross-task adaptability. WorldDreamer [96] frames world modeling as an unsupervised visual sequence problem, integrating text, image, and action modalities through masked token prediction, with parallel masked-token prediction enhancing efficiency and exhibiting high generalization and flexibility across tasks. V-JEPA 2 [12] performs self-supervised pretraining on Internet videos combined with limited robotic data to learn visually predictable representations of physical dynamics, enabling zero-shot planning and future prediction across downstream tasks. WPT [147] systematically leverages reward-free, non-expert, multi-robot, uncurated offline data, filling in action dimensions to pretrain a single world model across embodied agents. Cosmos [91], as a foundational world model, is pretrained on diverse video and physical scenarios to acquire universal structure and causal patterns, and after finetuning, can generate, predict, and simulate future states, handle cross-modal inputs, and produce controllable outputs, providing a unified framework for AI applications.
Modular design and multi-task shared representations provide architectural guarantees for generality. DreamerV3 [11] jointly learns environment models to predict action outcomes and plan ahead, using normalization and balancing techniques to handle diverse reward distributions and fixed hyperparameters for stable multi-task learning, demonstrating general adaptability without task-specific tuning. AETHER [140] integrates 4D dynamic reconstruction, action prediction, and visual planning within a single physical backbone, emphasizing geometric reasoning and achieving zero-shot generalization to real environments from synthetic data, while its global action representation further unifies and enhances generalization in navigation and robotic tasks. HERMES [133] unifies 3D scene understanding and future evolution via bird’s-eye representations, preserving geometric relations while integrating multi-view spatial information, and introduces a “world query” mechanism to fuse physical knowledge within a unified architecture. UniFuture [47] employs dual latent-variable sharing and multi-scale interaction to integrate future generation and depth perception, refining cross-modal features and generating temporally consistent future images and depth maps from single frames, exhibiting zero-shot generalization in unseen environments.
Flexible architectural extensions and instruction-based control further expand the applicability of generality. Simulus [109] uses a modular unified design, independently handling tokenizers, embedding tables, and prediction heads for each modality, decoupling representation learning from physical dynamics modeling and supporting various modality combinations, achieving sample-efficient generality across benchmarks. 1X World Model [148] predicts environment changes from specific action trajectories, capturing cross-task patterns and promoting generalization through multi-task training, enabling unified planning and decision-making across physical tasks. Pandora [125] integrates pretrained language and video models in a two-stage unified training, with instruction finetuning enabling real-time natural language control over video generation, demonstrating cross-domain unification and controllability in indoor, outdoor, and gaming environments. UWM [29] unifies action and video diffusion within a Transformer architecture, decoupling diffusion step execution, policy learning, and world modeling, learning causal and physical understanding from heterogeneous data and supporting flexible unified reasoning.
Figure 5. Three paradigms for explicit physical knowledge injection in world models. Left: integrating domain knowledge via explicit physical priors and simulation feedback to mitigate hallucinations in dynamic prediction. Middle: disentangling physically invariant (static) and physically variant (dynamic) latent factors for scene representation. Right: progressive constraints with hierarchical abstractions, from pixel-level fidelity to abstract causal structures.

3. Three Paradigms of Physical World Models

Physical knowledge in videos is often implicitly present and challenging for models to learn effectively. Explicit physical embedding endows world models with physical perception—the ability to understand physical laws, object properties, and causal dynamics—and is crucial for bridging the pixel–physics gap. Existing works can be broadly categorized into three paradigms of explicit injection, as follows:

3.1. Learning from Physical Priors

A prominent strategy is to incorporate physical priors and simulation feedback, anchoring the world model within classical mechanics frameworks to mitigate hallucinations in dynamic prediction. By explicitly injecting domain knowledge—such as object properties, conservation laws, or interaction constraints—models are guided toward physically plausible trajectories even under partial observability or long-horizon rollout [149]. This paradigm is particularly important in safety-critical applications, where purely data-driven models often suffer from compounding errors and unrealistic predictions.
Representative methods include: Pandora [125] and WorldGPT [126], which leverage LLMs to parse textual descriptions of physical properties (e.g., stiffness, friction) and translate them into perceptual priors within simulated environments; LLMPhy [127], which iteratively optimizes scene parameters via LLM-based program synthesis and physics engine feedback to support complex reasoning tasks; and DrivePhysica [128], which explicitly models motion understanding and spatial relationships in autonomous driving scenarios using coordinate alignment, instance flow guidance, and bounding-box conditioning. PhysTwin [129] integrates spring–mass models, sparse-to-dense optimization, and Gaussian splatting to construct digital twins of deformable objects. Other approaches, such as SlotPi [130], introduce Hamiltonian constraints into spatiotemporal reasoning modules to enforce energy-consistent predictions, while S2-SSM [131] infers causal graphs from object interactions using sparsity-regularized state-space models. RenderWorld [132] and TransDreamer [150] further enhance physical consistency via 2D–3D occupancy mapping and transformer-based long-range modeling, respectively. These methods exemplify the broader shift toward hybrid neuro-symbolic systems that combine differentiable physics priors with learned representations. Notably, GeoPT [151] proposes a synthetic-dynamics pretraining paradigm that lifts static geometry into a dynamic space, enabling models to acquire physical intuition from unlabeled particle trajectory evolution.
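As a concrete illustration of this paradigm, the sketch below augments a learned latent dynamics model with a soft energy-conservation penalty, loosely analogous to the Hamiltonian constraints used in SlotPi [130]. The network layout, the learned energy head, and the weighting coefficient are our assumptions, not the published method.

```python
import torch
import torch.nn as nn

class EnergyRegularizedDynamics(nn.Module):
    """Latent dynamics with a learned energy head; an auxiliary loss
    penalizes energy drift across predicted transitions (illustrative)."""
    def __init__(self, state_dim=64, hidden=256):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.SiLU(), nn.Linear(hidden, state_dim))
        self.energy = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, s):
        return self.dynamics(s)

def loss_fn(model, s_t, s_next, lam=0.1):
    pred = model(s_t)
    pred_loss = nn.functional.mse_loss(pred, s_next)          # data term
    # Soft conservation prior: the learned energy of the predicted next
    # state should match that of the current state (closed-system view).
    e_drift = (model.energy(pred) - model.energy(s_t)).pow(2).mean()
    return pred_loss + lam * e_drift
```

In practice, the weight lam trades off data fit against the conservation prior; a hard constraint (e.g., parameterizing the dynamics as a Hamiltonian flow) would enforce conservation exactly at the cost of flexibility.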

3.2. Learning from Action Decoupling

Another strategy for enhancing physical perception is to explicitly decompose scene representations into physically invariant (static) and physically variant (dynamic) components. This decomposition reflects a physics-inspired inductive bias: background structures and scene geometry typically follow stable constraints over time, while foreground objects evolve under dynamic forces and interactions [152]. By separating these factors, world models can reduce interference between unrelated signals, leading to improved robustness, generalization, and interpretability in complex environments.
Several recent works operationalize this idea through structured latent representations and disentangled dynamics modeling. Dyn-O [62] employs a Mamba-based state-space model [153] to disentangle object-centric features, enabling stable prediction in cluttered scenes. ContextWM [134] separates perception into a context encoder that extracts temporally invariant cues and a dynamic module that models action-dependent evolution. 3DFlowAction [30] leverages 3D optical flow to unify motion representations across different agents, providing object-centric motion signals independent of the actor identity. OC-STORM [136] extracts precise object states and fuses them with raw visual input to mitigate background noise and repeated recognition errors, while AD3 [137] explicitly isolates task-relevant dynamics from distractors in control scenarios. More recently, latent action disentanglement and embedding approaches [103,154], exemplified by DreamDojo [155], learn compact and transferable action representations directly from human videos. These methods significantly improve embodied physical reasoning by aligning latent action spaces with underlying physical causality.
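A minimal sketch of this static/dynamic split, in the spirit of ContextWM [134], is shown below: a context encoder summarizes temporally invariant scene content from a reference frame, while a recurrent dynamics module advances an action-conditioned state. Module names, dimensions, and the GRU transition are our assumptions for exposition.

```python
import torch
import torch.nn as nn

class DisentangledWorldModel(nn.Module):
    """Split the latent state into a static context c (scene/background)
    and a dynamic state z_t evolved by actions (illustrative sketch)."""
    def __init__(self, obs_dim=512, ctx_dim=128, dyn_dim=64, act_dim=8):
        super().__init__()
        self.context_enc = nn.Linear(obs_dim, ctx_dim)   # time-invariant cues
        self.dynamic_enc = nn.Linear(obs_dim, dyn_dim)   # time-varying factors
        self.transition = nn.GRUCell(dyn_dim + act_dim, dyn_dim)
        self.decoder = nn.Linear(ctx_dim + dyn_dim, obs_dim)

    def rollout(self, obs0, actions):
        c = self.context_enc(obs0)            # context from the first frame only
        z = self.dynamic_enc(obs0)
        preds = []
        for a in actions.unbind(dim=1):       # actions: (B, T, act_dim)
            z = self.transition(torch.cat([z, a], dim=-1), z)
            preds.append(self.decoder(torch.cat([c, z], dim=-1)))
        return torch.stack(preds, dim=1)      # (B, T, obs_dim)
```

Keeping the context c fixed throughout the rollout is what encodes the inductive bias: actions may only influence the dynamic factors, so background appearance cannot leak into the action-conditioned dynamics.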

3.3. Hierarchical Progressive Learning

Hierarchical modeling with progressively enforced supervision has emerged as a scalable framework for physical world modeling, enabling representations to evolve from low-level sensory fidelity to high-level causal abstraction. This paradigm reflects the multi-scale nature of physical systems, where local interactions (e.g., pixel motion or contact dynamics) aggregate into global structures and long-term dependencies. By organizing learning across multiple levels of abstraction, hierarchical models can better capture both fine-grained physical details and coarse-grained semantic regularities.
Representative approaches include RoboScape [40], which dynamically samples salient keypoints (e.g., robots or manipulable objects) to implicitly encode deformation and material properties through temporal consistency; and DWS [73], which emphasizes motion-aware supervision to prioritize dynamic over static features. Dreamland [92] introduces hierarchical and controllable world abstractions that enable precise manipulation at both pixel and object levels, while PSI [156] employs a three-phase loop—predictor training, structure extraction (e.g., optical flow and depth), and reintegration—to achieve structured and controllable representations. From a theoretical perspective, the system-theoretic framework of Ross et al. [143] formulates world-model learning as low-dimensional temporal projection and tokenization, spanning models from classical least-squares regression to generative approaches such as GANs [157].
Recent methods further integrate geometry and physics into hierarchical pipelines. GWM [141] and PIWM [142] combine Gaussian splatting propagation with physics-equation mappings to achieve interpretable 3D state prediction and reasoning. Notably, hierarchical modeling has been widely recognized for its advantages in physical interpretability, modularity, and adaptability across tasks. V-JEPA 2 [12], for instance, encodes videos into high-level semantic representations, filters out stochastic or texture-level noise, and learns stable temporal dynamics via masked prediction and multi-scale modeling. Through this progressive abstraction process, such models gradually approximate real-world state transitions and underlying physical laws within the latent space.
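The progressive-abstraction idea can be illustrated with a toy three-level predictor: the same transition is supervised at pixel, feature, and abstract levels, with a curriculum that shifts the loss weighting from low-level fidelity toward abstract structure over training. This is a generic sketch of hierarchical progressive supervision under our own assumptions, not any cited architecture.

```python
import torch
import torch.nn as nn

class HierarchicalPredictor(nn.Module):
    """Predict the future at three abstraction levels and supervise each
    (illustrative sketch of progressive hierarchical training)."""
    def __init__(self, d_pix=3072, d_feat=512, d_abs=64):
        super().__init__()
        self.enc_feat = nn.Linear(d_pix, d_feat)      # low -> mid abstraction
        self.enc_abs = nn.Linear(d_feat, d_abs)       # mid -> high abstraction
        self.pred = nn.ModuleDict({
            "pix": nn.Linear(d_pix, d_pix),
            "feat": nn.Linear(d_feat, d_feat),
            "abs": nn.Linear(d_abs, d_abs)})

    def losses(self, x_t, x_next):
        f_t, f_n = self.enc_feat(x_t), self.enc_feat(x_next)
        a_t, a_n = self.enc_abs(f_t), self.enc_abs(f_n)
        mse = nn.functional.mse_loss
        return {"pix": mse(self.pred["pix"](x_t), x_next),
                "feat": mse(self.pred["feat"](f_t), f_n.detach()),
                "abs": mse(self.pred["abs"](a_t), a_n.detach())}

def curriculum_weight(step, total):
    # Shift emphasis from pixel fidelity to abstract structure over training.
    t = step / total
    return {"pix": 1.0 - t, "feat": 1.0, "abs": t}

model = HierarchicalPredictor()
x_t, x_next = torch.randn(8, 3072), torch.randn(8, 3072)
w = curriculum_weight(step=100, total=1000)
loss = sum(w[k] * v for k, v in model.losses(x_t, x_next).items())
```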

4. Future Directions and Discussion

4.1. Open Challenges

Deep Integration of Physics Engines and Neural Networks. Current video-driven world models primarily rely on large-scale data pretraining to implicitly learn physical laws. However, they still exhibit significant limitations in out-of-distribution generalization and in modeling complex phenomena such as contact dynamics and non-rigid deformations. Hybrid paradigms, exemplified by ContactGaussian-WM [158], integrate differentiable physics engines (e.g., MuJoCo [159], PhysX [160], Isaac [161]) with 3DGS representations, enabling more physically consistent modeling through the synergy of physical priors and neural networks. Nevertheless, several key challenges remain, including differentiability bottlenecks (e.g., discontinuities in contact events that hinder gradient propagation), computational efficiency (e.g., mismatches between simulation time scales and video frame rates), and the difficulty of accurately inferring material properties from visual observations.
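The differentiability bottleneck can be seen in a one-line toy example: a hard contact force is non-smooth at the contact boundary, whereas a softplus relaxation admits well-defined gradients everywhere. The relaxation below is a generic smoothing trick under assumed constants, not the approach of any specific system cited above.

```python
import torch

def hard_contact_force(penetration, k=100.0):
    # Force is zero until contact, then linear: the derivative jumps at
    # penetration == 0, which hampers end-to-end gradient-based training.
    return k * torch.clamp(penetration, min=0.0)

def soft_contact_force(penetration, k=100.0, beta=50.0):
    # Softplus relaxation: smooth everywhere and approaches the hard model
    # as beta grows, so gradients propagate through contact events.
    return k * torch.nn.functional.softplus(beta * penetration) / beta

p = torch.linspace(-0.02, 0.02, 5, requires_grad=True)
soft_contact_force(p).sum().backward()   # well-defined gradients near contact
print(p.grad)
```

The sharpness parameter beta exposes the usual accuracy/differentiability trade-off: larger values approximate rigid contact more faithfully but yield stiffer, harder-to-optimize gradients.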
Interpretability and Verifiability of World Models. Most existing world models operate as end-to-end black-box systems. While effective for perception and generation tasks, their lack of interpretability hinders deployment in safety-critical scenarios such as surgical robotics and nuclear facility inspection. Approaches such as PIWM [142] attempt to align prediction processes with explicit physical equations (e.g., Newtonian mechanics and energy conservation), incorporating physically meaningful latent variables, symbolic regression, and formal verification to enhance interpretability and verifiability. Recently, the divergence between JEPA [12] and generative paradigms has raised a fundamental question: is pixel-level prediction necessary for learning physical laws, or should models instead operate in abstract representation spaces to improve efficiency and generalization?
Cross-Domain Transfer. Section 2.5 discussed recent progress toward unified paradigms across multiple tasks, environments, and frameworks. However, achieving seamless generalization across domains—such as gaming, robotic manipulation, autonomous driving, and the real world—remains an open problem. First, general-to-specialized adaptation remains challenging. For instance, although the "pretraining–adaptation" paradigm (e.g., Waymo's driving model [162]) has shown promise, heterogeneity in physical dynamics across domains (e.g., vehicle dynamics vs. robotic manipulators vs. human motion) limits the effectiveness of straightforward fine-tuning. Second, unified action spaces are still lacking. Cross-embodiment transfer is constrained by differences in action representations. Promising directions include the development of universal intermediate action representations, embodiment-agnostic motion primitives, and hierarchical action abstractions [155]. Third, commonsense-driven transfer remains underexplored. Works such as DisWM [135] and DINO-WM [95] have taken initial steps, but systematic transfer of physical commonsense still lacks robust evaluation protocols and unified modeling frameworks.
Standardization of Evaluation Metrics. The visual realism of world models does not guarantee physical correctness, and strong task performance does not necessarily imply genuine understanding of physical laws [163]. Although efforts such as WorldScore [164] and WorldModelBench [165] aim to establish comprehensive evaluation frameworks, no community-wide consensus has yet been reached. Recent initiatives, including the CVPR 2025 WorldModelBench Workshop and the ICLR 2026 World Models Workshop, indicate that the community is actively moving toward standardized evaluation methodologies.
Data Bottlenecks and Synthetic Data Closed Loops. High-quality datasets must exhibit rich physical diversity and include multimodal signals such as depth, force/torque, and joint states, along with high-precision annotations. However, existing datasets remain insufficient in these aspects. Approaches such as NVIDIA Cosmos [91] and EnerVerse [26] propose a closed-loop data paradigm—“real → simulation → generation → augmentation of real data”—which improves data efficiency. Nevertheless, this paradigm introduces the risk of amplifying physical biases through generative models, highlighting the need for independent physical validation mechanisms.

4.2. Industrialization and Deployment

World models are rapidly transitioning from academic research to industrial deployment, forming a synergistic ecosystem across foundational platforms, vertical applications, and data infrastructures. According to a Frost & Sullivan white paper, over 80% of autonomous driving algorithm development pipelines have incorporated world models or simulation for training and validation.
On the platform level, NVIDIA’s Cosmos [91] has evolved into a comprehensive toolchain encompassing multimodal data management, video tokenization, pretrained models, and inference services. Google DeepMind’s Genie 3 [14] pushes world models toward real-time interaction, demonstrating general-purpose capabilities in autonomous driving and game generation. Meanwhile, World Labs [83] has productized 3D scene generation through Marble as a SaaS platform for creators, validating a viable B2C commercialization pathway. In vertical domains, Waymo’s world model [162] built upon Genie 3 demonstrates multi-sensor fusion, hierarchical control, and long-tail scenario generation, representing a benchmark implementation of the “general pretraining + domain adaptation” paradigm. Similarly, Huawei’s WEWA architecture and NIO’s Neural World Model (NWM) system emphasize embedding world models into closed-loop perception–prediction–planning pipelines, enabling online simulation and decision optimization. On the data side, large-scale robotic datasets such as AgiBot-World are lowering entry barriers for the industry.

4.3. Safety and Ethical Challenges

The International AI Safety Report 2026 [166] highlights that current AI systems exhibit “striking limitations” in physical world reasoning: while performing well within training distributions, they may fail catastrophically in edge-case physical scenarios. Representative examples include multi-vehicle collisions on icy roads in autonomous driving and millimeter-scale operational errors in surgical robotics.
Moreover, although synthetic data closed-loop training (see Section 2.3.1) improves data diversity, most existing studies overlook the risk of recursive bias amplification. Specifically, a generative model $G$, trained on a real dataset $\mathcal{D}$, produces synthetic data $\tilde{\mathcal{D}} = G(\mathcal{D})$ that inevitably inherits the statistical biases of $\mathcal{D}$, while also introducing new biases arising from the model's inductive priors. These biases accumulate through nonlinear feedback loops, leading to systematic underrepresentation or overestimation of specific physical scenarios. For instance, in rare robotic manipulation settings, such as handling irregular soft objects, left-handed operations, or cluttered environments, synthetic data may reinforce physical inconsistencies rather than correct them if diversity is not explicitly enforced.
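This amplification effect can be reproduced in a toy closed loop: a deliberately misspecified generator (a single-Gaussian fit) is repeatedly retrained on data dominated by its own samples, and the mass of a rare physical regime collapses after the first round. The setup, distributions, and mixing ratio are our illustrative assumptions, not results from the cited report.

```python
import numpy as np

rng = np.random.default_rng(0)
# "Real" data: a common regime plus a rare physical regime (5% of samples).
real = np.concatenate([rng.normal(0.0, 1.0, 9_500),
                       rng.normal(6.0, 0.5, 500)])
rng.shuffle(real)

data = real
for gen in range(4):
    rare_mass = (np.abs(data - 6.0) < 1.5).mean()
    print(f"gen {gen}: rare-regime mass = {rare_mass:.4f}")
    # Misspecified generator G: a single-Gaussian fit whose inductive prior
    # cannot represent the rare mode, so G(D) smooths it away.
    mu, sigma = data.mean(), data.std()
    synthetic = rng.normal(mu, sigma, 20_000)
    # Closed loop: the next training set is real data augmented with
    # (and numerically dominated by) synthetic samples.
    data = np.concatenate([real[:2_000], synthetic])
# The rare-regime mass collapses after the first round (roughly 5% -> <1%)
# and stays suppressed unless diversity is explicitly enforced.
```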
Finally, unlike conventional deepfake content that often lacks strong physical consistency, the ultimate objective of world models is to construct highly realistic world simulators. At that stage, the indistinguishability of fine-grained physical details may enable malicious applications, including public opinion manipulation, scientific fraud, and data poisoning attacks targeting physical AI systems.

5. Conclusions

This paper provides a systematic overview of the cutting-edge field of video-driven physical world models. The in-depth analysis was conducted from three complementary perspectives: the pixel–physics gap and its representative solutions, the major modeling paradigms, and future outlooks.
Regarding the pixel–physics gap, we identified and analyzed five core challenges: continuity, controllability, generalization, lightweight, and universality. For each challenge, we summarized representative solutions and research progress. These analyses show that while videos provide rich learning resources for world models, they also introduce a series of gaps that must be bridged through appropriate modeling paradigms and training strategies.
Physical consistency is not only a technological challenge but also the most critical bottleneck preventing world models from evolving from "video predictors" into "trustworthy physical simulators." Overall, this article reveals the prevailing trends and several potential pathways toward modeling the physical world.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ha, D.; Schmidhuber, J. World models. arXiv 2018, arXiv:1803.10122. [Google Scholar]
  2. Zhang, P.F.; Cheng, Y.; Sun, X.; Wang, S.; Zhu, L.; Shen, H.T. A Step Toward World Models: A Survey on Robotic Manipulation. arXiv 2025, arXiv:2511.02097. [Google Scholar] [CrossRef]
  3. Tu, S.; Zhou, X.; Liang, D.; Jiang, X.; Zhang, Y.; Li, X.; Bai, X. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv 2025, arXiv:2502.10498. [Google Scholar] [CrossRef]
  4. Feng, T.; Wang, W.; Yang, Y. A survey of world models for autonomous driving. arXiv 2025, arXiv:2501.11260. [Google Scholar] [CrossRef]
  5. Guan, Y.; Liao, H.; Li, Z.; Hu, J.; Yuan, R.; Zhang, G.; Xu, C. World models for autonomous driving: An initial survey. IEEE Transactions on Intelligent Vehicles, 2024. [Google Scholar]
  6. Li, J.; Tang, J.; Xu, Z.; Wu, L.; Zhou, Y.; Shao, S.; Yu, T.; Cao, Z.; Lu, Q. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition. arXiv 2025, arXiv:2506.17201. [Google Scholar]
  7. Zhang, Y.; Peng, C.; Wang, B.; Wang, P.; Zhu, Q.; Kang, F.; Jiang, B.; Gao, Z.; Li, E.; Liu, Y.; et al. Matrix-Game: Interactive World Foundation Model. arXiv 2025, arXiv:2506.18701. [Google Scholar] [CrossRef]
  8. Medsker, L.R.; Jain, L.C. Recurrent Neural Networks: Design and Applications; CRC Press, 2001. [Google Scholar]
  9. Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. In Proceedings of the International Conference on Learning Representations, 2020. [Google Scholar]
  10. Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering atari with discrete world models. arXiv 2020, arXiv:2010.02193. [Google Scholar]
  11. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering diverse domains through world models. arXiv 2023, arXiv:2301.04104. [Google Scholar]
  12. Assran, M.; Bardes, A.; Fan, D.; Garrido, Q.; Howes, R.; Muckley, M.; Rizvi, A.; Roberts, C.; Sinha, K.; Zholus, A.; et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv 2025, arXiv:2506.09985. [Google Scholar]
  13. OpenAI. Sora 2 is here. 2025. Available online: https://openai.com/index/sora-2/ (accessed on 2025-06-05).
  14. DeepMind. Genie 3: A new frontier for world models. 2025. Available online: https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/ (accessed on 2025-06-05).
  15. Ma, X.; Shen, Y.; Liu, P.; Zhan, J. Recent Advances, Critical Reflections, and Future Directions in Large-Scale Group Decision-Making: A Comprehensive Survey. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2025. [Google Scholar]
  16. Yue, J.; Huang, Z.; Chen, Z.; Wang, X.; Wan, P.; Liu, Z. Simulating the Visual World with Artificial Intelligence: A Roadmap. arXiv 2025, arXiv:2511.08585. [Google Scholar] [CrossRef]
  17. Ding, J.; Zhang, Y.; Shang, Y.; Zhang, Y.; Zong, Z.; Feng, J.; Yuan, Y.; Su, H.; Li, N.; Sukiennik, N.; et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys, 2024. [Google Scholar]
  18. Zhu, Z.; Wang, X.; Zhao, W.; Min, C.; Deng, N.; Dou, M.; Wang, Y.; Shi, B.; Wang, K.; Zhang, C.; et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv 2024, arXiv:2405.03520. [Google Scholar] [CrossRef]
  19. Lin, M.; Wang, X.; Wang, Y.; Wang, S.; Dai, F.; Ding, P.; Wang, C.; Zuo, Z.; Sang, N.; Huang, S.; et al. Exploring the evolution of physics cognition in video generation: A survey. arXiv 2025, arXiv:2503.21765. [Google Scholar] [CrossRef]
  20. Liu, D.; Zhang, J.; Dinh, A.D.; Park, E.; Zhang, S.; Mian, A.; Shah, M.; Xu, C. Generative physical ai in vision: A survey. arXiv 2025, arXiv:2501.10928. [Google Scholar] [CrossRef]
  21. Xie, N.; Tian, Z.; Yang, L.; Zhang, X.P.; Guo, M.; Li, J. From 2D to 3D Cognition: A Brief Survey of General World Models. arXiv 2025, arXiv:2506.20134. [Google Scholar] [CrossRef]
  22. Hu, M.; Zhu, M.; Zhou, X.; Yan, Q.; Li, S.; Liu, C.; Chen, Q. Efficient text-driven motion generation via latent consistency training. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2026. [Google Scholar]
  23. Chen, T.; Hu, X.; Ding, Z.; Jin, C. Learning World Models for Interactive Video Generation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  24. Wang, S.; Tian, J.; Wang, L.; Liao, Z.; Li, J.; Dong, H.; Xia, K.; Zhou, S.; Tang, W.; Hua, G. SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  25. Xiang, J.; Gu, Y.; Liu, Z.; Feng, Z.; Gao, Q.; Hu, Y.; Huang, B.; Liu, G.; Yang, Y.; Zhou, K.; et al. PAN: A World Model for General, Interactable, and Long-Horizon World Simulation. arXiv 2025, arXiv:2511.09057. [Google Scholar] [CrossRef]
  26. Huang, S.; Chen, L.; Zhou, P.; Chen, S.; Liao, Y.; Jiang, Z.; Hu, Y.; Gao, P.; Li, H.; Yao, M.; et al. EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  27. Chi, X.; Fan, C.K.; Zhang, H.; Qi, X.; Zhang, R.; Chen, A.; Chan, C.m.; Xue, W.; Liu, Q.; Zhang, S.; et al. Eva: An embodied world model for future video anticipation. arXiv 2024, arXiv:2410.15461. [Google Scholar] [CrossRef]
  28. Mao, X.; Lin, S.; Li, Z.; Li, C.; Peng, W.; He, T.; Pang, J.; Chi, M.; Qiao, Y.; Zhang, K. Yume: An interactive world generation model. arXiv 2025, arXiv:2507.17744. [Google Scholar] [CrossRef]
  29. Zhu, C.; Yu, R.; Feng, S.; Burchfiel, B.; Shah, P.; Gupta, A. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv 2025, arXiv:2504.02792. [Google Scholar] [CrossRef]
  30. Zhi, H.; Chen, P.; Zhou, S.; Dong, Y.; Wu, Q.; Han, L.; Tan, M. 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model. arXiv 2025, arXiv:2506.06199. [Google Scholar]
  31. Chen, B.; Martí Monsó, D.; Du, Y.; Simchowitz, M.; Tedrake, R.; Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 2024, 37, 24081–24125. [Google Scholar]
  32. Zhang, K.; Tang, Z.; Hu, X.; Pan, X.; Guo, X.; Liu, Y.; Huang, J.; Yuan, L.; Zhang, Q.; Long, X.X.; et al. Epona: Autoregressive diffusion world model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 27220–27230. [Google Scholar]
  33. Alonso, E.; Jelley, A.; Micheli, V.; Kanervisto, A.; Storkey, A.J.; Pearce, T.; Fleuret, F. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems 2024, 37, 58757–58791. [Google Scholar]
  34. Hassan, M.; Stapf, S.; Rahimi, A.; Rezende, P.; Haghighi, Y.; Brüggemann, D.; Katircioglu, I.; Zhang, L.; Chen, X.; Saha, S.; et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 22404–22415. [Google Scholar]
  35. Koh, J.Y.; Lee, H.; Yang, Y.; Baldridge, J.; Anderson, P. Pathdreamer: A world model for indoor navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021; pp. 14738–14748. [Google Scholar]
  36. Tu, Y.; Luo, H.; Chen, X.; Bai, X.; Wang, F.; Zhao, H. PlayerOne: Egocentric World Simulator. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  37. Savov, N.; Kazemi, N.; Zhang, D.; Paudel, D.P.; Wang, X.; Gool, L.V. StateSpaceDiffuser: Bringing Long Context to Diffusion World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  38. Huang, S.; Wu, J.; Zhou, Q.; Miao, S.; Long, M. Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv 2025, arXiv:2505.14357. [Google Scholar] [CrossRef]
  39. Robine, J.; Höftmann, M.; Harmeling, S. Simple, Good, Fast: Self-Supervised World Models Free of Baggage. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]
  40. Shang, Y.; Zhang, X.; Tang, Y.; Jin, L.; Gao, C.; Wu, W.; Li, Y. RoboScape: Physics-informed Embodied World Model. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  41. Li, S.; Yang, C.; Fang, J.; Yi, T.; Lu, J.; Cen, J.; Xie, L.; Shen, W.; Tian, Q. Worldgrow: Generating infinite 3d world. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026; Vol. 40, pp. 6433–6441. [Google Scholar] [CrossRef]
  42. Zhang, Q.; Zhai, S.; Martin, M.A.B.; Miao, K.; Toshev, A.; Susskind, J.; Gu, J. World-consistent video diffusion with explicit 3d modeling. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 21685–21695. [Google Scholar]
  43. Li, X.; Wang, T.; Gu, Z.; Zhang, S.; Guo, C.; Cao, L. FlashWorld: High-quality 3D Scene Generation within Seconds. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026. [Google Scholar]
  44. Wu, H.; Wu, D.; He, T.; Guo, J.; Ye, Y.; Duan, Y.; Bian, J. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026. [Google Scholar]
  45. Lu, Y.; Ren, X.; Yang, J.; Shen, T.; Wu, Z.; Gao, J.; Wang, Y.; Chen, S.; Chen, M.; Fidler, S.; et al. Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 27272–27283. [Google Scholar]
  46. Yang, Y.; Liu, J.; Zhang, Z.; Zhou, S.; Tan, R.; Yang, J.; Du, Y.; Gan, C. MindJourney: Test-Time Scaling with World Models for Spatial Reasoning. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  47. Liang, D.; Zhang, D.; Zhou, X.; Tu, S.; Feng, T.; Li, X.; Zhang, Y.; Du, M.; Tan, X.; Bai, X. Seeing the Future, Perceiving the Future: A unified driving world model for future generation and perception. arXiv 2025, arXiv:2503.13587. [Google Scholar] [CrossRef]
  48. Lee, J.H.; Lin, B.J.; Sun, W.F.; Lee, C.Y. EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  49. Guo, Y.; Shi, L.X.; Chen, J.; Finn, C. Ctrl-World: A Controllable Generative World Model for Robot Manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2026. [Google Scholar]
  50. Wu, T.; Yang, S.; Po, R.; Xu, Y.; Liu, Z.; Lin, D.; Wetzstein, G. Video World Models with Long-term Spatial Memory. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  51. Xiao, Z.; Lan, Y.; Zhou, Y.; Ouyang, W.; Yang, S.; Zeng, Y.; Pan, X. WorldMem: Long-term Consistent World Simulation with Memory. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  52. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research, 2024. [Google Scholar]
  53. Po, R.; Nitzan, Y.; Zhang, R.; Chen, B.; Dao, T.; Shechtman, E.; Wetzstein, G.; Huang, X. Long-context state-space video world models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 8733–8744. [Google Scholar]
  54. Traub, M.; Otte, S.; Menge, T.; Karlbauer, M.; Thuemmel, J.; Butz, M.V. Learning What and Where: Disentangling Location and Identity Tracking Without Supervision. In Proceedings of the Eleventh International Conference on Learning Representations, 2023. [Google Scholar]
  55. Elsayed, G.; Mahendran, A.; Van Steenkiste, S.; Greff, K.; Mozer, M.C.; Kipf, T. Savi++: Towards end-to-end object-centric learning from real-world videos. Advances in Neural Information Processing Systems 2022, 35, 28940–28954. [Google Scholar]
  56. Zhang, Y.; Guo, X.; Xu, H.; Long, M. Consistent World Models via Foresight Diffusion. arXiv 2025, arXiv:2505.16474. [Google Scholar] [CrossRef]
  57. Chen, J.; Zhu, H.; He, X.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Fu, Z.; Pang, J.; et al. DeepVerse: 4D Autoregressive Video Generation as a World Model. arXiv 2025, arXiv:2506.01103. [Google Scholar] [CrossRef]
  58. Russell, L.; Hu, A.; Bertoni, L.; Fedoseev, G.; Shotton, J.; Arani, E.; Corrado, G. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv 2025, arXiv:2503.20523. [Google Scholar] [CrossRef]
  59. Hu, W.; Wen, X.; Li, X.; Wang, G. DSG-World: Learning a 3D Gaussian World Model from Dual State Videos. arXiv 2025, arXiv:2506.05217. [Google Scholar] [CrossRef]
  60. Huang, T.; Zheng, W.; Wang, T.; Liu, Y.; Wang, Z.; Wu, J.; Jiang, J.; Li, H.; Lau, R.; Zuo, W.; et al. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG) 2025, 44, 1–15. [Google Scholar] [CrossRef]
  61. Zhou, S.; Du, Y.; Yang, Y.; Han, L.; Chen, P.; Yeung, D.Y.; Gan, C. Learning 3D Persistent Embodied World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  62. Wang, Z.; Wang, K.; Zhao, L.; Stone, P.; Bian, J. Dyn-O: Building Structured World Models with Object-Centric Representations. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  63. Ferraro, S.; Mazzaglia, P.; Verbelen, T.; Dhoedt, B. FOCUS: object-centric world models for robotic manipulation. Frontiers in Neurorobotics 2025, 19, 1585386. [Google Scholar] [CrossRef] [PubMed]
  64. Barcellona, L.; Zadaianchuk, A.; Allegro, D.; Papa, S.; Ghidoni, S.; Gavves, S. Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. In Proceedings of the Greeks in AI Symposium, 2025. [Google Scholar]
  65. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research, 2024. [Google Scholar]
  66. Huang, Y.; Zhang, J.; Zou, S.; Liu, X.; Hu, R.; Xu, K. LaDi-WM: A Latent Diffusion-Based World Model for Predictive Manipulation. In Proceedings of The 9th Conference on Robot Learning; Lim, J., Song, S., Park, H.W., Eds.; PMLR: Proceedings of Machine Learning Research, 27–30 Sep 2025; Vol. 305, pp. 1726–1743. [Google Scholar]
  67. Guo, J.; Ma, X.; Wang, Y.; Yang, M.; Liu, H.; Li, Q. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation. IEEE Robotics and Automation Letters 2026, 11, 2466–2473. [Google Scholar] [CrossRef]
  68. Bar, A.; Zhou, G.; Tran, D.; Darrell, T.; LeCun, Y. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 15791–15801. [Google Scholar]
  69. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023; pp. 4195–4205. [Google Scholar]
  70. Chen, Y.; Li, H.; Jiang, Z.; Wen, H.; Zhao, D. Tevir: Text-to-video reward with diffusion models for efficient reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2025. [Google Scholar]
  71. Rigter, M.; Gupta, T.; Hilmkil, A.; Ma, C. AVID: Adapting Video Diffusion Models to World Models. In Proceedings of the Reinforcement Learning Conference, 2024. [Google Scholar]
  72. Wu, J.; Yin, S.; Feng, N.; He, X.; Li, D.; Hao, J.; Long, M. ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 2024, 37, 68082–68119. [Google Scholar]
  73. He, H.; Zhang, Y.; Lin, L.; Xu, Z.; Pan, L. Pre-trained video generative models as world simulators. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026; Vol. 40, pp. 4645–4653. [Google Scholar] [CrossRef]
  74. Zhao, B.; Tang, R.; Jia, M.; Wang, Z.; Man, F.; Zhang, X.; Shang, Y.; Zhang, W.; Wu, W.; Gao, C.; et al. AirScape: An Aerial Generative World Model with Motion Controllability. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025; pp. 12519–12528. [Google Scholar]
  75. Unitree Robotics. UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under the UnifoLM Family; an open-source world-model–action architecture spanning multiple types of robotic embodiments. 2025. Available online: https://github.com/unitreerobotics/unifolm-world-model-action.
  76. Hayashi, K.; Koyama, M.; Guerreiro, J.J.A. Inter-environmental world modeling for continuous and compositional dynamics. arXiv 2025, arXiv:2503.09911. [Google Scholar] [CrossRef]
  77. Durante, Z.; Gong, R.; Sarkar, B.; Wake, N.; Taori, R.; Tang, P.; Lakshmikanth, S.; Schulman, K.; Milstein, A.; Vo, H.; et al. An interactive agent foundation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 3652–3662. [Google Scholar]
  78. Zhen, H.; Sun, Q.; Zhang, H.; Li, J.; Zhou, S.; Du, Y.; Gan, C. TesserAct: learning 4D embodied world models. arXiv 2025, arXiv:2504.20995. [Google Scholar] [CrossRef]
  79. Duan, Y.; Zou, Z.; Gu, T.; Jia, W.; Zhao, Z.; Xu, L.; Liu, X.; Lin, Y.; Jiang, H.; Chen, K.; et al. LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation. arXiv 2025, arXiv:2509.05263. [Google Scholar]
  80. Guo, J.; Ye, Y.; He, T.; Wu, H.; Jiang, Y.; Pearce, T.; Bian, J. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv 2025, arXiv:2504.08388. [Google Scholar]
  81. Li, J.; Tang, J.; Xu, Z.; Wu, L.; Zhou, Y.; Shao, S.; Yu, T.; Cao, Z.; Lu, Q. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition. arXiv 2025, arXiv:2506.17201. [Google Scholar]
  82. Dynamics Lab. Mirage 2: Generative World Engine. Browser-based system to generate explorable 3D worlds from images/text. 2025. Available online: https://www.mirage2.org/.
  83. World Labs. World Labs: Spatial intelligence for large world models. 2025. Available online: https://www.worldlabs.ai/ (accessed on 2025-06-05).
  84. Yang, Z.; Ge, W.; Li, Y.; Chen, J.; Li, H.; An, M.; Kang, F.; Xue, H.; Xu, B.; Yin, Y.; et al. Matrix-3d: Omnidirectional explorable 3d world generation. arXiv 2025, arXiv:2508.08086. [Google Scholar] [CrossRef]
  85. Jang, J.; Ye, S.; Lin, Z.; Xiang, J.; Bjorck, J.; Fang, Y.; Hu, F.; Huang, S.; Kundalia, K.; Lin, Y.C.; et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. In Proceedings of the 9th Annual Conference on Robot Learning, 2025. [Google Scholar]
  86. Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, Article 139. [Google Scholar] [CrossRef]
  87. Zheng, R.; Wang, J.; Reed, S.; Bjorck, J.; Fang, Y.; Hu, F.; Jang, J.; Kundalia, K.; Lin, Z.; Magne, L.; et al. FLARE: Robot Learning with Implicit World Modeling. In Proceedings of The 9th Conference on Robot Learning; Lim, J., Song, S., Park, H.W., Eds.; PMLR: Proceedings of Machine Learning Research, 27–30 Sep 2025; Vol. 305, pp. 3952–3971. [Google Scholar]
  88. Qiu, Y.; Ziser, Y.; Korhonen, A.; Cohen, S.B.; Ponti, E.M. Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models. arXiv 2025, arXiv:2506.06006. [Google Scholar] [CrossRef]
  89. Kadian, A.; Truong, J.; Gokaslan, A.; Clegg, A.; Wijmans, E.; Lee, S.; Savva, M.; Chernova, S.; Batra, D. Sim2real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters 2020, 5, 6670–6677. [Google Scholar] [CrossRef]
  90. Li, X.; Song, R.; Xie, Q.; Wu, Y.; Zeng, N.; Ai, Y. Simworld: A unified benchmark for simulator-conditioned scene generation via world model. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE, 2025; pp. 927–934. [Google Scholar]
  91. Agarwal, N.; Ali, A.; Bala, M.; Balaji, Y.; Barker, E.; Cai, T.; Chattopadhyay, P.; Chen, Y.; Cui, Y.; Ding, Y.; et al. Cosmos world foundation model platform for physical ai. arXiv 2025, arXiv:2501.03575. [Google Scholar] [CrossRef]
  92. Mo, S.; Leng, Z.; Liu, L.; Wang, W.; He, H.; Zhou, B. Dreamland: Controllable World Creation with Simulator and Generative Models. arXiv 2025, arXiv:2506.08006. [Google Scholar] [CrossRef]
  93. Feng, Y.; Tan, H.; Mao, X.; Liu, G.; Huang, S.; Xiang, C.; Su, H.; Zhu, J. Vidar: Embodied video diffusion model for generalist bimanual manipulation. arXiv 2025, arXiv:2507.12898. [Google Scholar]
  94. Wang, Y.; Yu, R.; Wan, S.; Gan, L.; Zhan, D.C. Founder: Grounding foundation models in world models for open-ended embodied decision making. In Proceedings of the Forty-second International Conference on Machine Learning, 2025. [Google Scholar]
  95. Zhou, G.; Pan, H.; LeCun, Y.; Pinto, L. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. In Proceedings of the Forty-second International Conference on Machine Learning, 2025. [Google Scholar]
  96. Wang, X.; Zhu, Z.; Huang, G.; Wang, B.; Chen, X.; Lu, J. Worlddreamer: Towards general world models for video generation via predicting masked tokens. arXiv 2024, arXiv:2401.09985. [Google Scholar] [CrossRef]
  97. Schiewer, R.; Subramoney, A.; Wiskott, L. Exploring the limits of hierarchical world models in reinforcement learning. Scientific Reports 2024, 14, 26856. [Google Scholar] [CrossRef]
  98. Hao, C.; Lu, W.; Xu, Y.; Chen, Y. Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 27608–27617. [Google Scholar]
  99. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  100. Mazzaglia, P.; Verbelen, T.; Dhoedt, B.; Courville, A.; Rajeswar, S. GenRL: Multimodal-foundation world models for generalization in embodied agents. Advances in neural information processing systems 2024, 37, 27529–27555. [Google Scholar]
  101. Fang, F.; Liang, W.; Wu, Y.; Xu, Q.; Lim, J.H. Improving generalization of reinforcement learning using a bilinear policy network. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP); IEEE, 2022; pp. 991–995. [Google Scholar]
  102. Fang, Q.; Du, W.; Wang, H.; Zhang, J. Towards Unraveling and Improving Generalization in World Models. arXiv 2024, arXiv:2501.00195. [Google Scholar] [CrossRef]
  103. Gao, S.; Zhou, S.; Du, Y.; Zhang, J.; Gan, C. AdaWorld: Learning Adaptable World Models with Latent Actions. In Proceedings of the Forty-second International Conference on Machine Learning, 2025. [Google Scholar]
  104. Prasanna, S.; Farid, K.; Rajan, R.; Biedenkapp, A. Dreaming of Many Worlds: Learning Contextual World Models aids Zero-Shot Generalization. In Proceedings of the Seventeenth European Workshop on Reinforcement Learning, 2024. [Google Scholar]
  105. Baldassarre, F.; Szafraniec, M.; Terver, B.; Khalidov, V.; Massa, F.; LeCun, Y.; Labatut, P.; Seitzer, M.; Bojanowski, P. Back to the features: Dino as a foundation for video world models. arXiv 2025, arXiv:2507.19468. [Google Scholar] [CrossRef]
  106. Ali, M.Q.; Sridhar, A.; Matiana, S.; Wong, A.; Al-Sharman, M. Humanoid World Models: Open World Foundation Models for Humanoid Robotics. arXiv 2025, arXiv:2506.01182. [Google Scholar] [CrossRef]
  107. Chi, X.; Ge, K.; Liu, J.; Zhou, S.; Jia, P.; He, Z.; Liu, Y.; Li, T.; Han, L.; Han, S.; et al. MinD: Unified Visual Imagination and Control via Hierarchical World Models. arXiv 2025, arXiv:2506.18897. [Google Scholar] [CrossRef]
  108. Chun, J.; Jeong, Y.; Kim, T. Sparse Imagination for Efficient Visual World Model Planning. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026. [Google Scholar]
  109. Cohen, L.; Wang, K.; Kang, B.; Gadot, U.; Mannor, S. Uncovering Untapped Potential in Sample-Efficient World Model Agents. arXiv 2025, arXiv:2502.11537. [Google Scholar]
  110. Burchi, M.; Timofte, R. Accurate and Efficient World Modeling with Masked Latent Transformers. In Proceedings of the 42nd International Conference on Machine Learning; Singh, A., Fazel, M., Hsu, D., Lacoste-Julien, S., Berkenkamp, F., Maharaj, T., Wagstaff, K., Zhu, J., Eds.; PMLR: Proceedings of Machine Learning Research, 13–19 Jul 2025; Vol. 267, pp. 5894–5912. [Google Scholar]
  111. Zhang, H.; Yan, X.; Xue, Y.; Guo, Z.; Cui, S.; Li, Z.; Liu, B. D2-world: An Efficient World Model through Decoupled Dynamic Flow. arXiv 2024, arXiv:2411.17027. [Google Scholar]
  112. Pu, Y.; Niu, Y.; Tang, J.; Xiong, J.; Hu, S.; Li, H. One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning. arXiv 2025, arXiv:2509.07945. [Google Scholar] [CrossRef]
  113. Li, S.; Hao, Q.; Shang, Y.; Li, Y. KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models. arXiv 2025, arXiv:2509.21027. [Google Scholar] [CrossRef]
  114. Yamada, J.; Rigter, M.; Collins, J.; Posner, I. Twist: Teacher-student world model distillation for efficient sim-to-real transfer. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE, 2024; pp. 9190–9196. [Google Scholar]
  115. Micheli, V.; Alonso, E.; Fleuret, F. Transformers are Sample-Efficient World Models. In Proceedings of the Eleventh International Conference on Learning Representations, 2023. [Google Scholar]
  116. Micheli, V.; Alonso, E.; Fleuret, F. Efficient World Models with Context-Aware Tokenization. In Proceedings of the International Conference on Machine Learning. PMLR, 2024; pp. 35623–35638. [Google Scholar]
  117. Song, Q.; Wang, X.; Zhou, D.; Lin, J.; Chen, C.; Ma, Y.; Li, X. Hero: Hierarchical extrapolation and refresh for efficient world models. arXiv 2025, arXiv:2508.17588. [Google Scholar] [CrossRef]
  118. Jin, B.; Li, W.; Yang, B.; Zhu, Z.; Jiang, J.; Gao, H.a.; Sun, H.; Zhan, K.; Hu, H.; Zhang, X.; et al. PosePilot: Steering camera pose for generative world models with self-supervised depth. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025; pp. 8051–8058. [Google Scholar]
  119. Jeong, Y.; Chun, J.; Cha, S.; Kim, T. Object-Centric World Model for Language-Guided Manipulation. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025. [Google Scholar]
  120. Akbulut, T.; Merlin, M.; Parr, S.; Quartey, B.; Thompson, S. Sample Efficient Robot Learning with Structured World Models. arXiv 2022, arXiv:2210.12278. [Google Scholar] [CrossRef]
  121. Van Den Oord, A.; Vinyals, O.; et al. Neural discrete representation learning. Advances in neural information processing systems 2017, 30. [Google Scholar]
  122. Chen, R.; Ko, Y.; Zhang, Z.; Cho, C.; Chung, S.; Giuffré, M.; Shung, D.L.; Stadie, B.C. LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models. arXiv 2025, arXiv:2505.11772. [Google Scholar] [CrossRef]
  123. Hu, W.; Wang, H.; Wang, J. Joint Optimization Time-Slotted Computing Offloading and V2X Resource Allocation by Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2026. [Google Scholar]
  124. Zeng, B.; Zhu, K.; Hua, D.; Li, B.; Tong, C.; Wang, Y.; Huang, X.; Dai, Y.; Zhang, Z.; Yang, Y.; et al. Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks. arXiv 2026, arXiv:2602.01630. [Google Scholar] [CrossRef]
  125. Xiang, J.; Liu, G.; Gu, Y.; Gao, Q.; Ning, Y.; Zha, Y.; Feng, Z.; Tao, T.; Hao, S.; Shi, Y.; et al. Pandora: Towards general world model with natural language actions and video states. arXiv 2024, arXiv:2406.09455. [Google Scholar]
  126. Ge, Z.; Huang, H.; Zhou, M.; Li, J.; Wang, G.; Tang, S.; Zhuang, Y. Worldgpt: Empowering llm as multimodal world model. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024; pp. 7346–7355. [Google Scholar]
  127. Cherian, A.; Corcodel, R.; Jain, S.; Romeres, D. Llmphy: Complex physical reasoning using large language models and world models. arXiv 2024, arXiv:2411.08027. [Google Scholar] [CrossRef]
  128. Yang, Z.; Guo, X.; Ding, C.; Wang, C.; Wu, W. Physical informed driving world model. arXiv 2024, arXiv:2412.08410. [Google Scholar] [CrossRef]
  129. Jiang, H.; Hsu, H.Y.; Zhang, K.; Yu, H.N.; Wang, S.; Li, Y. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 7219–7230. [Google Scholar]
  130. Li, J.; Wan, H.; Lin, N.; Zhan, Y.L.; Chengze, R.; Wang, H.; Zhang, Y.; Liu, H.; Wang, Z.; Yu, F.; et al. SlotPi: Physics-informed Object-centric Reasoning Models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2025; pp. 1376–1387.
  131. Petri, F.; Asprino, L.; Gangemi, A. Learning Local Causal World Models with State Space Models and Attention. arXiv 2025, arXiv:2505.02074. [Google Scholar] [CrossRef]
  132. Yan, Z.; Dong, W.; Shao, Y.; Lu, Y.; Liu, H.; Liu, J.; Wang, H.; Wang, Z.; Wang, Y.; Remondino, F.; et al. Renderworld: World model with self-supervised 3d label. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025; pp. 6063–6070. [Google Scholar]
  133. Zhou, X.; Liang, D.; Tu, S.; Chen, X.; Ding, Y.; Zhang, D.; Tan, F.; Zhao, H.; Bai, X. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 27817–27827. [Google Scholar]
  134. Wu, J.; Ma, H.; Deng, C.; Long, M. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. Advances in Neural Information Processing Systems 2023, 36, 39719–39743. [Google Scholar]
  135. Wang, Q.; Zhang, Z.; Xie, B.; Jin, X.; Wang, Y.; Wang, S.; Zheng, L.; Yang, X.; Zeng, W. Disentangled world models: Learning to transfer semantic knowledge from distracting videos for reinforcement learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 2599–2608. [Google Scholar]
  136. Zhang, W.; Jelley, A.; McInroe, T.; Storkey, A. Objects matter: object-centric world models improve reinforcement learning in visually complex environments. arXiv 2025, arXiv:2501.16443. [Google Scholar] [CrossRef]
  137. Wang, Y.; Wan, S.; Gan, L.; Feng, S.; Zhan, D.C. AD3: implicit action is the key for world models to distinguish the diverse visual distractors. In Proceedings of the 41st International Conference on Machine Learning, 2024; pp. 51546–51568. [Google Scholar]
  138. Wang, X.; Wu, Z.; Peng, P. LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model. arXiv 2025, arXiv:2506.01546. [Google Scholar]
  139. Jiang, J.; Janghorbani, S.; De Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the International Conference on Learning Representations, 2020. [Google Scholar]
  140. Zhu, H.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Chen, J.; Shen, C.; Pang, J.; He, T. Aether: Geometric-aware unified world modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 8535–8546. [Google Scholar]
  141. Lu, G.; Jia, B.; Li, P.; Chen, Y.; Wang, Z.; Tang, Y.; Huang, S. Gwm: Towards scalable gaussian world models for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 9263–9274. [Google Scholar]
  142. Mao, Z.; Ruchkin, I. Towards Physically Interpretable World Models: Meaningful Weakly Supervised Representations for Visual Trajectory Prediction. arXiv 2024, arXiv:2412.12870. [Google Scholar] [CrossRef]
  143. Ross, E.; Drygala, C.; Schwarz, L.; Kaiser, S.; di Mare, F.; Breiten, T.; Gottschalk, H. When do World Models Successfully Learn Dynamical Systems? arXiv 2025, arXiv:2507.04898. [Google Scholar] [CrossRef]
  144. Zhou, S.; Zhou, T.; Yang, Y.; Long, G.; Ye, D.; Jiang, J.; Zhang, C. WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Google Scholar]
  145. Liu, X.; Tang, H. FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution. arXiv 2025, arXiv:2506.03173. [Google Scholar]
  146. Huh, M.; Cheung, B.; Wang, T.; Isola, P. The platonic representation hypothesis. arXiv 2024, arXiv:2405.07987. [Google Scholar] [CrossRef]
  147. Zhao, Y.; Scannell, A.; Hou, Y.; Cui, T.; Chen, L.; Büchler, D.; Solin, A.; Kannala, J.; Pajarinen, J. Generalist World Model Pre-Training for Efficient Reinforcement Learning. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025. [Google Scholar]
  148. 1X Technologies. 1X World Model. 2024. Available online: https://www.1x.tech/discover/1x-world-model (accessed on 2025-11-14).
  149. Guo, W.; Liu, G.; Zhou, Z.; Wang, J.; Tang, Y.; Wang, M. Robust training in multiagent deep reinforcement learning against optimal adversary. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2025. [Google Scholar]
  150. Chen, C.; Wu, Y.F.; Yoon, J.; Ahn, S. Transdreamer: Reinforcement learning with transformer world models. arXiv 2022, arXiv:2202.09481. [Google Scholar] [CrossRef]
  151. Wu, H.; Guo, M.; Li, Z.; Dou, Z.; Long, M.; He, K.; Matusik, W. GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training. arXiv 2026, arXiv:2602.20399. [Google Scholar] [CrossRef]
  152. Jin, C.; Lin, M.; Wu, F.; Wu, X.; Zhou, Y.; Wang, J. TVMTrailer: a text-video-music AIGC framework for film trailer generation. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2025. [Google Scholar]
  153. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First conference on language modeling, 2024. [Google Scholar]
  154. Ye, S.; Ge, Y.; Zheng, K.; Gao, S.; Yu, S.; Kurian, G.; Indupuru, S.; Tan, Y.L.; Zhu, C.; Xiang, J.; et al. World Action Models are Zero-shot Policies. arXiv 2026, arXiv:2602.15922. [Google Scholar] [CrossRef]
  155. Gao, S.; Liang, W.; Zheng, K.; Malik, A.; Ye, S.; Yu, S.; Tseng, W.C.; Dong, Y.; Mo, K.; Lin, C.H.; et al. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. arXiv 2026, arXiv:2602.06949. [Google Scholar]
  156. Kotar, K.; Lee, W.; Venkatesh, R.; Chen, H.; Bear, D.; Watrous, J.; Kim, S.; Aw, K.L.; Chen, L.N.; Stojanov, S.; et al. World modeling with probabilistic structure integration. arXiv 2025, arXiv:2509.09737. [Google Scholar] [CrossRef]
  157. Saxena, D.; Cao, J. Generative adversarial networks (GANs) challenges, solutions, and future directions. ACM Computing Surveys (CSUR) 2021, 54, 1–42. [Google Scholar] [CrossRef]
  158. Wang, M.; Jin, W.; Cao, K.; Xie, L.; Hong, Y. ContactGaussian-WM: Learning Physics-Grounded World Model from Videos. arXiv 2026, arXiv:2602.11021. [Google Scholar]
  159. Henderson, P.; Chang, W.D.; Shkurti, F.; Hansen, J.; Meger, D.; Dudek, G. Benchmark environments for multitask learning in continuous domains. arXiv 2017, arXiv:1708.04352. [Google Scholar] [CrossRef]
  160. NVIDIA PhysX SDK. 2025. Available online: https://developer.nvidia.com/physx-sdk (accessed on 2025-11-15).
  161. Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A. Isaac Gym: High Performance GPU-Based Physics Simulation for Robot Learning. In NeurIPS 2021 Datasets and Benchmarks Track. arXiv 2021, arXiv:2110.13563. [Google Scholar]
  162. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp. 2446–2454. [Google Scholar]
  163. Wang, Y.; Yan, H.; Park, J.H.; Hu, Y.; Shen, H. Asynchronous control of cyber–physical systems with Quantized measurements and stochastic multimode attacks. IEEE Transactions on Cybernetics, 2025. [Google Scholar]
  164. Duan, H.; Yu, H.X.; Chen, S.; Fei-Fei, L.; Wu, J. Worldscore: A unified evaluation benchmark for world generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 27713–27724. [Google Scholar]
  165. Li, D.; Fang, Y.; Chen, Y.; Yang, S.; Cao, S.; Wong, J.; Luo, M.; Wang, X.; Yin, H.; Gonzalez, J.E.; et al. WorldModelBench: Judging Video Generation Models As World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Google Scholar]
  166. Bengio, Y.; Clare, S.; Prunkl, C.; Andriushchenko, M.; Bucknall, B.; Murray, M.; Bommasani, R.; Casper, S.; Davidson, T.; Douglas, R.; et al. International ai safety report 2026. arXiv 2026, arXiv:2602.21012. [Google Scholar] [CrossRef]
Figure 2. Since the concept of World Models was first introduced in 2018, advancements in data availability, computational power, and model scalability have significantly accelerated progress in this field. By 2025, world model research has reached a historical milestone, with nearly a hundred academic papers published within a single year. Nevertheless, learning world models from videos continues to face five key challenges that remain unresolved.
Figure 3. The data engine pipeline of EnerVerse [26] operates as follows: multi-camera observation images and anchor views are processed by a multi-view video generator to produce denoised multi-view videos. Combined with camera pose inputs, these videos are fed into 4DGS for four-dimensional scene reconstruction. The reconstructed results are then rendered into high-fidelity anchor images and iteratively refined through feedback, improving motion consistency and reconstruction accuracy, ultimately achieving geometrically consistent and high-definition outputs.
Figure 4. The unified world model framework advocated by Yue et al. [124], featuring interaction, reasoning, memory, and multimodal generation.
Table 1. A detailed summary of the physical challenges encountered when learning world models from videos.

| Challenge | Importance | Class | The Pixel–Physics Gap | Solutions |
|---|---|---|---|---|
| Continuity | Physical continuity is fundamental to stable world models. Maintaining continuity across time, space, and object identity is crucial for reliable long-term prediction and decision-making. | Temporal (Section 2.1.1) | Videos sample continuous processes as discrete frames and lack causality, causing autoregressive models to accumulate errors and break physical continuity. | Autoregression improvements; optimization of schedules; conditional constraints; optimization-level methods |
| | | Spatial (Section 2.1.2) | Videos are 2D projections that lose key 3D geometry (e.g., depth and structure). | Implicit alignment; explicit alignment; memory mechanisms |
| | | Identity (Section 2.1.3) | Videos lack explicit object properties (e.g., mass, material, shape). | Identity perception |
| Controllability | Videos passively record events without action–outcome causality, limiting world models in controllability, causal reasoning, and goal-directed behavior. | Semantic (Section 2.2.1) | Videos lack semantic–physical alignment, hindering grounded instruction. | Semantic alignment mechanisms |
| | | Interactivity (Section 2.2.2) | Videos record past events and cannot model counterfactuals or real-time responses to interventions. | Low-level control signals; global control |
| Generalization | Videos capture appearance rather than underlying physics, leading to overfitting. Generalization requires learning physics-grounded representations transferable across scenes and tasks. | Data augmentation (Section 2.3.1) | Physically diverse, well-annotated video data are scarce, limiting coverage of rare dynamics. | Automated data pipelines; simulation-to-real engines; foundation models |
| | | Architecture (Section 2.3.2) | Naive architectures capture visual correlations rather than physical invariances. | Advanced modeling |
| | | Behavioral & environmental (Section 2.3.3) | Videos from different embodiments and environments exhibit large distribution shifts in physical dynamics. | Regularization and disentangling; context adaptation |
| Lightweight | Lightweight models save resources, enable real-time interaction, and learn compact, generalizable dynamics, balancing efficiency and performance in constrained settings. | Representation & efficiency (Section 2.4) | Videos are high-dimensional and redundant, with much irrelevant pixel data, hindering efficient extraction of physically meaningful representations. | Latent modeling; sequence optimization; parameter-efficient / transfer; training and sample efficiency |
| Universality | Videos capture a limited slice of reality and lack unified cross-modal representations. Universal world models need shared physical abstractions that generalize. | Modeling multi-task architecture (Section 2.5) | Videos’ passive, single-view, and modality-limited nature prevents learning unified physical representations. | Shared representations; multi-task learning; knowledge generalization and transfer |
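Several solution families in Table 1 (latent modeling, autoregression improvements, conditional constraints) share a common computational skeleton: encode frames into a compact latent state, roll the dynamics forward autoregressively, and periodically re-anchor the rollout on conditioning signals to curb error accumulation. The sketch below is a hypothetical minimal illustration of that skeleton; the random linear maps stand in for learned encoder and dynamics networks and are not taken from any cited system.

```python
# Minimal latent-rollout sketch (hypothetical; linear maps stand in for learned nets).
import numpy as np

rng = np.random.default_rng(0)
D_PIX, D_LAT = 64, 8                                         # pixel and latent dims

E = rng.normal(size=(D_LAT, D_PIX)) / np.sqrt(D_PIX)         # "encoder"
A = np.eye(D_LAT) + 0.01 * rng.normal(size=(D_LAT, D_LAT))   # "latent dynamics"

def rollout(frames, horizon, anchor_every=None):
    """Autoregressive latent rollout; optionally re-anchor on observed frames
    every `anchor_every` steps (a toy stand-in for conditional constraints)."""
    z = E @ frames[0]                     # encode the first observation
    preds = []
    for t in range(1, horizon):
        z = A @ z                         # one-step latent prediction
        if anchor_every and t % anchor_every == 0 and t < len(frames):
            z = E @ frames[t]             # conditional re-anchoring on ground truth
        preds.append(z)
    return np.stack(preds)

frames = [rng.normal(size=D_PIX) for _ in range(32)]
free = rollout(frames, horizon=32)                       # pure autoregression
anchored = rollout(frames, horizon=32, anchor_every=8)   # periodically constrained
# Compare terminal latent norms: the free rollout can drift as one-step errors
# compound, while periodic anchoring pulls the state back to the data scale.
print(np.linalg.norm(free[-1]), np.linalg.norm(anchored[-1]))
```

The same skeleton underlies heavier variants: replacing `E` with a learned video tokenizer addresses the efficiency row, while replacing the re-anchoring rule with action or text conditioning connects to the controllability rows.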