Submitted: 07 April 2026
Posted: 08 April 2026
Abstract
Keywords:
1. Introduction

- Pixel–Physics Challenges: We distill five core challenges—continuity, controllability, generalization, lightweight, and universality—and systematically summarize the sub-problems and representative solutions for each challenge through the lens of physical consistency.
- Three Paradigms of Physical World Models: Existing approaches to physically grounded modeling can be broadly categorized into three classes: prior injection, dynamic–static decoupling, and hierarchical abstraction.
- Benchmarks: Existing evaluation benchmarks are systematically reviewed, with an emphasis on assessment frameworks related to physical perception and dynamics prediction.
- Future Directions: We identify five major open problems for future research and provide a systematic discussion on industrial deployment and current safety issues.
2. Challenges of Learning from Video
2.1. Physical Continuity
2.1.1. Temporal Continuity

| Challenges | Importance | Classes | The Pixel–Physics Gap | Solutions |
|---|---|---|---|---|
| Continuity | Physical continuity is fundamental to stable world models. Maintaining continuity across time, space, and object identity is crucial for reliable long-term prediction and decision-making. | Temporal (Section 2.1.1) | Videos sample continuous processes as discrete frames and lack explicit causality, causing autoregressive models to accumulate temporal errors and break physical continuity. | Autoregression improvements; Optimization of schedules; Conditional constraints; Optimization-level |
| | | Spatial (Section 2.1.2) | Videos are 2D projections that lose key 3D geometry (e.g., depth and structure). | Implicit alignment; Explicit alignment; Memory mechanisms |
| | | Identity (Section 2.1.3) | Videos lack explicit object properties (e.g., mass, material, shape). | Identity perception |
| Controllability | Videos passively record events without action–outcome causality, limiting world models in controllability, causal reasoning, and goal-directed behavior. | Semantic (Section 2.2.1) | Videos lack semantic–physical alignment, hindering grounded instruction. | Semantic alignment mechanism |
| | | Interactivity (Section 2.2.2) | Videos record past events and cannot model counterfactuals or real-time responses to interventions. | Low-level control signals; Global control |
| Generalization | Videos capture appearance rather than underlying physics, leading to overfitting. Generalization requires learning physics-grounded representations transferable across scenes and tasks. | Data Augmentation (Section 2.3.1) | Physically diverse, well-annotated video data are scarce, limiting coverage of rare dynamics. | Automated data pipeline; Simulation-to-real engine; Foundation models |
| | | Architecture (Section 2.3.2) | Naive architectures capture visual correlations rather than physical invariances. | Advanced modeling |
| | | Behavioral & Environmental (Section 2.3.3) | Videos from different embodiments and environments exhibit large distribution shifts in physical dynamics. | Regularization & disentangling; Context adaptation |
| Lightweight | Lightweight models save resources, enable real-time interaction, and learn compact, generalizable dynamics, balancing efficiency and performance for constrained settings. | Representation & Efficiency (Section 2.4) | Videos are high-dimensional and redundant, with much irrelevant pixel data, hindering efficient extraction of physically meaningful representations. | Latent modeling; Sequence optimization; Parameter-efficient / transfer; Training and sample efficiency |
| Universality | Videos capture only a limited slice of reality and lack unified cross-modal representations. Universal world models need shared physical abstractions that generalize broadly. | Modeling Multi-task Architecture (Section 2.5) | Videos’ passive, single-view, and modality-limited nature prevents learning unified physical representations. | Shared representations; Multi-task learning; Knowledge generalization and transfer |
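The temporal row above is worth making concrete. Because an autoregressive predictor is trained on ground-truth context but must condition on its own outputs at rollout time, even a tiny per-step bias compounds over the horizon. A minimal sketch of this failure mode (the `step` predictor is a hypothetical stand-in, not any surveyed model):

```python
import numpy as np

def rollout(step, context, horizon):
    """Roll a one-step predictor forward autoregressively.

    `step` maps a stack of past frames to the next frame. At inference
    time each prediction is appended to the context and fed back in, so
    any per-step error is compounded rather than corrected.
    """
    frames = list(context)
    for _ in range(horizon):
        window = np.stack(frames[-len(context):])
        frames.append(step(window))  # conditions on its own outputs
    return frames[len(context):]

# Toy illustration: identity "physics" with a small systematic bias eps.
# After t steps the accumulated drift is ~ t * eps, which is exactly the
# broken temporal continuity described in the table above.
eps = 0.01
biased_step = lambda window: window[-1] + eps
preds = rollout(biased_step, [np.zeros((4, 4))], horizon=50)
print(float(preds[-1].mean()))  # ~0.5 after 50 steps
```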
Table legend. Modality icons (not reproduced in this text version) distinguish video or multi-frame input, single-frame image, latent representation, spatial representation, text modality, action representation, camera (pose) information, depth feature, neural signal, object-level feature, physical information, and historical memory feature; cells whose content was icon-only are shown as "–". Evaluations: V assesses visual quality, generation, prediction, and control in downstream tasks, encompassing qualitative evaluations; R evaluates robotic tasks in simulation, with a separate marker for real-robot experiments; P measures physical understanding and perception; C evaluates planning and decision-making in games, tasks, and navigation.

| Work | Venue | Main Solution | Method | Input / Output | Conditions | Evals |
|---|---|---|---|---|---|---|
| Temporal Consistency | | | | | | |
| EnerVerse [25] | NeurIPS’25 | Autoregression Improvements | Sparse Chunks | – | – | V |
| EVA [26] | arXiv’25 | Autoregression Improvements | Reflection | – | – | V |
| Yume [27] | arXiv’25 | Autoregression Improvements | Framepack | – | – | V |
| SAMPO [23] | NeurIPS’25 | Autoregression Improvements | Scale-Wise | – | None | V |
| Yan [33] | arXiv’25 | Diffusion Schedules | Progressive Noise | – | – | V |
| GEM [34] | CVPR’25 | Diffusion Schedules | Increasing Noise | – | – | V |
| Diamond [32] | NeurIPS’24 | Diffusion Schedules | Adaptive Noise | – | None | V |
| Epona [31] | ICCV’25 | Diffusion Schedules | Diffusion Forcing | – | – | VC |
| Pathdreamer [35] | ICCV’21 | Condition Constraints | HR. Modeling | – | – | VC |
| PlayerOne [36] | NeurIPS’25 | Condition Constraints | Rec. Constraints | – | – | V |
| VRAG [22] | NeurIPS’25 | Condition Constraints | Global Constraints | – | None | CV |
| Vid2World [38] | ICLR’26 | Optimization Level | Loss-based | – | – | V |
| SSD [37] | NeurIPS’25 | Optimization Level | State-space | – | None | C |
| SGF [39] | ICLR’25 | Optimization Level | Regularization | – | None | C |
| Emu3.5 [40] | arXiv’25 | Optimization Level | Pre-training | – | None | V |
| Spatial Consistency | | | | | | |
| RoboScape [41] | NeurIPS’25 | Implicit Representation Alignment | HR. Modeling | – | None | VR |
| ManipDreamer [42] | arXiv’25 | Implicit Representation Alignment | Action Tree | – | – | VR |
| WorldGrow [43] | AAAI’26 | Implicit Representation Alignment | Block Inpainting | – | – | V |
| DeepVerse [21] | arXiv’25 | Implicit Representation Alignment | Structural Alignment | – | – | V |
| GAIA-2 [44] | arXiv’25 | Implicit Representation Alignment | Semantic Alignment | – | – | VC |
| WVD [45] | CVPR’25 | Explicit Representation Alignment | Spatial Joint Modeling | – | None | V |
| FlashWorld [46] | ICLR’26 | Explicit Representation Alignment | Dual-mode Pre-training | – | – | V |
| Geom. Forcing [47] | ICLR’26 | Explicit Representation Alignment | Rep. Alignment | – | None | V |
| InfiniCube [48] | ICCV’25 | Explicit Representation Alignment | HR. Constraints | – | – | V |
| MindJourney [49] | NeurIPS’25 | Explicit Representation Alignment | Language Guidance | – | – | VCP |
| UniFuture [50] | ICRA’26 | Explicit Representation Alignment | Multi-modal | – | – | V |
| Edeline [51] | NeurIPS’25 | Explicit Representation Alignment | Mem. Enhancement | – | None | C |
| Ctrl-World [52] | ICLR’26 | Explicit Representation Alignment | Space Constraints | – | – | V |
| Spatial-Mem [53] | NeurIPS’25 | Explicit Representation Alignment | Semantic Alignment | – | – | V |
| WorldMEM [54] | NeurIPS’25 | Memory Mechanism | Memory Bank | – | – | V |
| Voyager (LLM) [55] | TMLR’24 | Memory Mechanism | Skill Library | – | – | C |
| SSM-World [56] | ICCV’25 | Memory Mechanism | State-Space Models | – | – | V |
| Mem. Forcing [57] | arXiv’25 | Memory Mechanism | Memory Replay | – | – | V |
| Identity Consistency | | | | | | |
| SSWM [58] | arXiv’24 | Attention | Semantic Alignment | – | – | P |
| Loci-v1 [59] | ICLR’23 | Occlusion | Imagination Tracking | – | – | C |
| SAVi++ [60] | NeurIPS’22 | Tracking | Identity Tracking | – | None | P |
| ForeDiff [61] | arXiv’25 | Anchors | Arch. Decoupling | – | – | V |
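Several of the "Diffusion Schedules" entries above (Diamond's adaptive noise, GEM's increasing noise, Epona's diffusion forcing) share one training-time idea: corrupt the conditioning frames with a randomly drawn noise level so that imperfect, self-generated context at rollout time still looks in-distribution. A framework-agnostic sketch of this context-noising trick, with illustrative names and a simplified schedule rather than any paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def noisy_context_step(model, optimizer, clip, ctx_len, max_sigma=0.3):
    """One training step with noise-augmented conditioning frames.

    clip: (B, T, C, H, W) ground-truth video. The first ctx_len frames
    serve as context and are perturbed with a per-sample noise level, so
    the model learns to predict the next frame from degraded context.
    """
    context, target = clip[:, :ctx_len], clip[:, ctx_len]
    sigma = max_sigma * torch.rand(clip.size(0), 1, 1, 1, 1)  # per-sample level
    noisy_context = context + sigma * torch.randn_like(context)
    pred = model(noisy_context)          # hypothetical next-frame predictor
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At rollout time the model sees its own slightly-off predictions as context; because it was trained on noised context, the induced distribution shift is far smaller than for a model trained only on clean frames.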
2.1.2. Spatial Continuity
2.1.3. Identity Continuity
2.2. Controllability
2.2.1. Semantic Control
2.2.2. Interactivity
2.3. Generalization
2.3.1. Data
2.3.2. Architecture Generalization
2.3.3. Behavioral and Environmental Generalization
| Work | Venue | GPUs | Batch Size | Training Steps |
|---|---|---|---|---|
| DINO-world [107] | arXiv’25 | H100*16 | 1024 | 350K Iter. |
| HWM [108] | arXiv’25 | A6000*2 | 128 | – |
| MinD [109] | arXiv’25 | A40*4 | – | 9 Hours |
| Sparse Imagin. [110] | ICLR’26 | 3090*4 | 32 | 100 Epochs |
| Simulus [111] | arXiv’25 | 4090*1 | 8 | 100 Epochs |
| EMERALD [112] | ICML’25 | 3090*1 | 16 | – |
| D2-World [113] | arXiv’24 | V100*8 | 24 | 24 Epochs |
| AVID [73] | RLC’25 | A100*4 | 64 | 7 Days |
| ScaleZero [114] | arXiv’25 | A100*8 | 512 | – |
| KeyWorld [115] | arXiv’25 | A800*8 | 1 | 100 Epochs |
| TWIST [116] | ICRA’24 | 3090*1 | – | 500K Iter. |
| IRIS [117] | ICLR’23 | A100*8 | 256 | 3.5 Days |
| Δ-IRIS [118] | ICML’24 | A100*1 | 32 | 1K Epochs |
| HERO [119] | arXiv’25 | A100*1 | – | – |
| PosePilot [120] | IROS’25 | A100*8 | – | – |
| OCWM [121] | ICLR’25 | H100*4 | 32 | 40 Epochs |
2.4. Lightweight
Table legend. Modality icons (not reproduced in this text version) follow the same notation as above; cells whose content was icon-only are shown as "–". Evals and downstream apps: physical generation (PG), question answering (Q), interaction (I), understanding (U), and attributes (A); A also refers to action prediction, M to motion planning, and F to fluid dynamics. Downstream applications: W is the real world, R is robotics, D is autonomous driving, and O is objects.

| Work | Venue | Method | Input / Output | Conditions | Evals and Apps |
|---|---|---|---|---|---|
| Explicit Priors & Feedback Integration | | | | | |
| Pandora [126] | arXiv’24 | Physical Prompts | – | – | – |
| WorldGPT [127] | MM’24 | Modality Alignment | – | – | – |
| LLMPhy [128] | arXiv’24 | Engine Integration | – | – | – |
| DrivePhysica [129] | arXiv’24 | Positional Constraints | – | – | – |
| PhysTwin [130] | ICCV’25 | Attribute Fusion | – | None | – |
| SlotPi [131] | SIGKDD’25 | Physical Constraints | – | None | – |
| S2-SSM [132] | arXiv’25 | Sparse Regularization | – | None | – |
| RenderWorld [133] | ICRA’25 | Pretraining | – | – | – |
| DINO-WM [97] | ICML’25 | Pretraining Priors | – | None | – |
| HERMES [134] | ICCV’25 | Multi-view Modeling | – | – | – |
| Cosmos [93] | arXiv’25 | Multimodal Constraints | – | – | – |
| Disentangling Static and Dynamic Factors | | | | | |
| AdaWorld [105] | ICML’25 | Action Decoupling | – | – | – |
| Dyn-O [65] | NeurIPS’25 | Dynamic Decoupling | – | – | – |
| ContextWM [135] | NeurIPS’23 | Dynamic Decoupling | – | – | – |
| DisWM [136] | ICCV’25 | Dynamic Decoupling | – | None | – |
| DreamDojo [70] | RAL’26 | Explicit Action Modeling | – | – | – |
| DreamZero [29] | arXiv’26 | Action Decoupling | – | – | – |
| OC-STORM [137] | arXiv’25 | Object Extraction | – | None | – |
| AD3 [138] | ICML’24 | Action Decoupling | – | – | – |
| LongDWM [139] | arXiv’25 | Action Decoupling | – | – | – |
| Vidar [95] | arXiv’25 | Action Decoupling | – | – | – |
| DREAMGEN [87] | arXiv’25 | Pseudo Action Estimation | – | – | – |
| VLMWM [90] | arXiv’25 | Fine-tuning | – | None | – |
| WorldDreamer [98] | arXiv’24 | Disentangled Modeling | – | – | – |
| Simulus [111] | arXiv’25 | Dynamic Decoupling | – | – | – |
| SCALOR [140] | ICLR’20 | Background Modeling | – | – | – |
| AETHER [141] | ICCV’25 | Unified Modeling | – | None | – |
| UWM [28] | RSS’25 | Action Decoupling | – | None | – |
| FLARE [89] | CoRL’25 | Unified Modeling | – | – | – |
| Progressive Constraints & Hierarchical Abstraction | | | | | |
| DWS [75] | AAAI’26 | Regularization | – | – | – |
| Dreamland [94] | arXiv’25 | Engine Simulation | – | – | – |
| GWM [142] | ICCV’25 | Hierarchical Abstraction | – | – | – |
| PIWM [143] | arXiv’24 | Interpretability | – | None | – |
| Ross et al. [144] | ICLR’25 | Theoretical Framework | – | None | – |
| SimWorld [92] | arXiv’25 | Simulation-based Modeling | – | – | – |
| MoSim [100] | CVPR’25 | Multi-constraint | – | None | – |
| WALL-E [145] | NeurIPS’25 | Rule Learning | – | – | – |
| FOLIAGE [146] | arXiv’25 | Hierarchical Abstraction | – | – | – |
| LLMPhy [128] | arXiv’24 | Hierarchical Abstraction | – | – | – |
| PILWM [147] | arXiv’25 | Soft Mask | – | None | – |
| VLWM [148] | arXiv’25 | Hierarchical Abstraction | – | – | – |
| V-JEPA 2 [12] | arXiv’25 | Hierarchical Pretraining | – | – | – |
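To make the "Disentangling Static and Dynamic Factors" block concrete: the recurring pattern is two encoders, one producing a time-invariant scene code and one producing a per-transition dynamics code (a stand-in for a latent action), with the predictor conditioned on both. A schematic sketch with illustrative module names; it mirrors the shape of these methods rather than reproducing any one of them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledWorldModel(nn.Module):
    """Static-dynamic factorized latent world model (schematic)."""

    def __init__(self, d_obs, d_static=64, d_dyn=16):
        super().__init__()
        # Time-invariant scene/appearance code from the whole clip.
        self.static_enc = nn.Sequential(
            nn.Linear(d_obs, 128), nn.ReLU(), nn.Linear(128, d_static))
        # Per-transition dynamics code from consecutive frame pairs.
        self.dyn_enc = nn.Sequential(
            nn.Linear(2 * d_obs, 64), nn.ReLU(), nn.Linear(64, d_dyn))
        # Predictor advances the observation given both codes.
        self.predictor = nn.Sequential(
            nn.Linear(d_obs + d_static + d_dyn, 128), nn.ReLU(),
            nn.Linear(128, d_obs))

    def forward(self, clip):  # clip: (B, T, d_obs) feature sequence
        static = self.static_enc(clip.mean(dim=1))          # (B, d_static)
        pairs = torch.cat([clip[:, :-1], clip[:, 1:]], -1)  # (B, T-1, 2*d_obs)
        dyn = self.dyn_enc(pairs)                           # (B, T-1, d_dyn)
        T = clip.size(1) - 1
        cond = torch.cat(
            [clip[:, :-1], static.unsqueeze(1).expand(-1, T, -1), dyn], -1)
        return self.predictor(cond)                         # predicted next obs

model = DecoupledWorldModel(d_obs=32)
clip = torch.randn(8, 10, 32)
F.mse_loss(model(clip), clip[:, 1:]).backward()
```

Because appearance flows only through the static pathway and motion only through the dynamics pathway, distractor changes in appearance can be regularized away without disturbing the learned transition structure.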
2.5. Universality

3. Three Paradigms of Physical World Models
3.1. Learning from Physical Priors
3.2. Learning from Action Decoupling
3.3. Hierarchical Progressive Learning
4. Benchmarks

4.1. Datasets
4.1.1. Robot Manipulation
4.1.2. Planning and Decision-Making
| Name | Category | Modalities | Composition | Size | Brief Summary |
|---|---|---|---|---|---|
| AgiBot-World [179] | G-R | JTIDC | 1M Tr./217 Ts | 43.8T | Humanoid robot training in manipulation, tools, and collaboration |
| EmbodiedBench [187] | G-RS | JNDT | None/1k Ts | – | Evaluating embodied agents in planning and control |
| Open X [159] | G-R | – | 1M+ Tr./160k Ts | 8.9T | Multi-robot learning for cross-embodiment manipulation transfer |
| BEHAVIOR-1K [188] | G-S | JDTOI | None/1k Ts | 165G | Everyday household activities in simulation for AI agents |
| DMC [172] | G-S | JN | Gen. Tr./50+ Ts | – | DeepMind continuous control tasks for RL in locomotion and manipulation |
| RoboCasa [169] | G-S | JITD | 100k Tr./100 Ts | – | Simulation for generalist robots in kitchens with diverse assets |
| RoboVerse [184] | G-S | MYJI | 500k+ Tr./1k+ Ts | 23G | Unified simulation for robot learning across tasks |
| DexArt [170] | F-S | SJ | None/4 Ts | – | Dexterous manipulation of articulated objects with robotic hands |
| MyoSuite [178] | Sk-S | JN | 10k Tr./204 Ts | – | Musculoskeletal models for dexterous human-like control |
| ARMBench [171] | Sr-R | JTO | 240k Tr./3 Ts | – | Amazon warehouse pick-and-place perception and manipulation |
| Bridge [167] | Sr-R | MT | 60k Tr./13 Ts | 387G | Multi-task manipulation from demonstration data |
| DROID [162] | Sr-R | DCT | 76k Tr./86 Ts | 1.7T | In-the-wild manipulation from mobile robots in offices |
| FurnitureBench [183] | Sr-R | J | 5k Tr./8 Ts | 55G | Long-horizon manipulation such as furniture assembly |
| RH20T [182] | Sr-R | JDNT | 110k Tr./150+ Ts | 5T | Multimodal contact-rich robotic skills for one-shot learning |
| RoboAgent [186] | Sr-R | M | 100k Tr./38 Ts | 425G | Manipulation demonstrations for task-specific learning |
| RoboNet [181] | Sr-R | J | 162k Tr./None | 0.8T | Multi-robot transfer learning in tabletop manipulation |
| RT-1 [160] | Sr-R | T | 130k+ Tr./700+ Ts | 111G | Visuomotor policies from large-scale robot data |
| Franka Kitchen [173] | Sr-S | J | 513 Tr./22 Ts | – | Kitchen interaction tasks with Franka robot |
| LIBERO [161] | Sr-S | TJ | 1693 Tr./130 Ts | – | Long-horizon task learning and generalization |
| ManiSkill2 [177] | Sr-S | JDSON | 4M Tr./20 Ts | 151G | Generalizable manipulation across robots and environments |
| Meta-World [174] | Sr-S | J | 2M Tr./50 Ts | 46G | Meta-RL manipulation for fast adaptation |
| MuJoCo Pusher [176] | Sr-S | JN | 5k Tr./1 Ts | – | Continuous control pushing task in MuJoCo |
| PushT [163] | Sr-S | J | 122 Tr./1 Ts | 2.8G | Tabletop pushing interaction tasks |
| RLBench [165] | Sr-S | DOJ | Gen. Tr./100 Ts | – | Simulation for RL and imitation learning |
| RoboMM [185] | Sr-S | MCDTJ | 70k Tr./100+ Ts | – | Multimodal generalist manipulation model |
| VP2 [168] | Sr-S | J | 310 Tr./11 Ts | 182G | Visual planning for object manipulation |
| Robomimic [180] | S/Dr-S | J | 5.9k Tr./5 Ts | 19G | Offline imitation and RL for manipulation |
| Name | Modalities | Composition | Brief Summary |
|---|---|---|---|
| RealEstate10K [200] | – | 80k video-extracted trajectories | Camera trajectories, intrinsics, and poses |
| Procgen Benchmark [204] | – | Programmatically generated | 16 diverse games with varying difficulty |
| DeepMind Lab [109] | AR | 80k trajectories | 3D first-person navigation and control |
| DrivingDojo [201] | ASTO | 18k videos | Ego-vehicle actions, multi-agent interaction, open-world driving |
| Atari [205] | IAR | Programmatically generated | 50+ classic Atari games |
| Habitat [190] | IDOSR | – | Indoor navigation and task interaction simulation |
| World-in-World [208] | IDMATC | 4 platforms and 4 tasks | Closed-loop interactive embodied tasks |
| KITTI [195] | LTDOS | 180G videos, 100k trajectories | Mobile robotics: perception, SLAM, planning, and detection |
| Waymo Open [198] | MLSO | 3 video subsets | Multi-city driving perception and behavior prediction |
| JRDB [194] | MLO | 54 scenes | Indoor/outdoor navigation and human-robot interaction |
| PointMaze [189] | SARI | 10M states | Point-mass maze navigation |
| Bsuite [206] | SAT | Programmatically generated | T-maze, umbrella task, and path exploration |
| Benchmark [207] | TSA | 571 demonstrations | Language-guided indoor navigation and interaction |
| OpenDV [199] | TALM | 3T video | ∼2059 hours of real-world driving videos |
| WorldArena [196] | AIT | 500 videos, 100k trajectories | Open-world embodied evaluation with planning and interaction |
4.1.3. Physical Perception
4.1.4. Quality and Understanding of Visual Physics
| Name | Semantic Category | Annotation Category | Composition |
|---|---|---|---|
| VSPW [225] | Indoor/Outdoor | – | 3.5k videos |
| Cityscapes [226] | Urban | Semantic, depth & camera parameters | 25k labeled videos |
| ShotBench [227] | Photography | Shot size, motion, lighting, layout | 3.5k question answering pairs |
| ManipBench [229] | Robotics | Task QA, deformation understanding | 13k question answering pairs |
| Matterport3D [191] | Indoor | Asset labels, depth, normals | 90 buildings, 11k rooms |
| R2R [230] | Indoor | Room, asset, and action descriptions | 7k paths |
| GameWorld Score [7] | Games | Quality evaluation, spatio-temporal consistency | 1k hours labeled |
| SAT [231] | Indoor/Outdoor | Spatial QA, motion annotation | 218k question answering pairs |
| WhatsUp [232] | Indoor/Outdoor | Spatial positions | 820 question answering pairs |
| NuInteract [233] | Autonomous Driving | Description, spatial information, task category | 850 scenes |
| OmniDrive [234] | Autonomous Driving | Lane-object and counterfactual reasoning | – |
| Sekai [239] | Urban, Games | Position, scene, weather, time, trajectories | 5000+ hours videos |
| OmniWorld [235] | Real/Sim. World | Games, embodied, navigation, planning | 600k videos |
| Cosmos-Reason1 [240] | Real/Sim. World | Spatio-temporal and physical annotations | 1.7M question answering pairs |
| WorldPrediction [236] | Third-person | Healthcare, assembly, repair | 810 instructional videos |
| PhyWorldBench [237] | Real/Sim. World | Physics categories (10 × 5 subcategories) | 1,050 prompts |
| PhysBench [222] | Real/Sim. World | Object properties, relations, scene dynamics | 10k question answering pairs |
| WorldModelBench [238] | Real/Sim. World | Comprehensive domains and disciplines | 350 instances |
4.2. Metrics
4.2.1. Visual Physics Evaluation Metrics
4.2.2. Control, Planning, and Decision-Making Metrics
4.2.3. Physically Grounded Metrics
5. Future Directions and Discussion
5.1. Open Challenges
5.2. Industrialization and Deployment
5.3. Safety and Ethical Challenges
6. Conclusion
Conflicts of Interest
References
- Ha, D.; Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122 2018, 2.
- Zhang, P.F.; Cheng, Y.; Sun, X.; Wang, S.; Zhu, L.; Shen, H.T. A Step Toward World Models: A Survey on Robotic Manipulation. arXiv preprint arXiv:2511.02097 2025.
- Tu, S.; Zhou, X.; Liang, D.; Jiang, X.; Zhang, Y.; Li, X.; Bai, X. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498 2025.
- Feng, T.; Wang, W.; Yang, Y. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260 2025.
- Guan, Y.; Liao, H.; Li, Z.; Hu, J.; Yuan, R.; Zhang, G.; Xu, C. World models for autonomous driving: An initial survey. IEEE Transactions on Intelligent Vehicles 2024. [CrossRef]
- Li, J.; Tang, J.; Xu, Z.; Wu, L.; Zhou, Y.; Shao, S.; Yu, T.; Cao, Z.; Lu, Q. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition. arXiv preprint arXiv:2506.17201 2025.
- Zhang, Y.; Peng, C.; Wang, B.; Wang, P.; Zhu, Q.; Kang, F.; Jiang, B.; Gao, Z.; Li, E.; Liu, Y.; et al. Matrix-Game: Interactive World Foundation Model. arXiv preprint arXiv:2506.18701 2025.
- Medsker, L.R.; Jain, L.; et al. Recurrent neural networks. Design and applications 2001, 5, 2.
- Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. In Proceedings of the International Conference on Learning Representations, 2020.
- Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193 2020.
- Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104 2023.
- Assran, M.; Bardes, A.; Fan, D.; Garrido, Q.; Howes, R.; Muckley, M.; Rizvi, A.; Roberts, C.; Sinha, K.; Zholus, A.; et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 2025.
- OpenAI. Sora 2 is here. https://openai.com/index/sora-2/, 2025. Accessed: 2025-06-05.
- DeepMind. Genie 3: A new frontier for world models. https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/, 2025. Accessed: 2025-06-05.
- Yue, J.; Huang, Z.; Chen, Z.; Wang, X.; Wan, P.; Liu, Z. Simulating the Visual World with Artificial Intelligence: A Roadmap. arXiv preprint arXiv:2511.08585 2025.
- Ding, J.; Zhang, Y.; Shang, Y.; Zhang, Y.; Zong, Z.; Feng, J.; Yuan, Y.; Su, H.; Li, N.; Sukiennik, N.; et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys 2024. [CrossRef]
- Zhu, Z.; Wang, X.; Zhao, W.; Min, C.; Deng, N.; Dou, M.; Wang, Y.; Shi, B.; Wang, K.; Zhang, C.; et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520 2024.
- Lin, M.; Wang, X.; Wang, Y.; Wang, S.; Dai, F.; Ding, P.; Wang, C.; Zuo, Z.; Sang, N.; Huang, S.; et al. Exploring the evolution of physics cognition in video generation: A survey. arXiv preprint arXiv:2503.21765 2025.
- Liu, D.; Zhang, J.; Dinh, A.D.; Park, E.; Zhang, S.; Mian, A.; Shah, M.; Xu, C. Generative physical ai in vision: A survey. arXiv preprint arXiv:2501.10928 2025.
- Xie, N.; Tian, Z.; Yang, L.; Zhang, X.P.; Guo, M.; Li, J. From 2D to 3D Cognition: A Brief Survey of General World Models. arXiv preprint arXiv:2506.20134 2025.
- Chen, J.; Zhu, H.; He, X.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Fu, Z.; Pang, J.; et al. DeepVerse: 4D Autoregressive Video Generation as a World Model. arXiv preprint arXiv:2506.01103 2025.
- Chen, T.; Hu, X.; Ding, Z.; Jin, C. Learning World Models for Interactive Video Generation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Wang, S.; Tian, J.; Wang, L.; Liao, Z.; Li, J.; Dong, H.; Xia, K.; Zhou, S.; Tang, W.; Hua, G. SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Xiang, J.; Gu, Y.; Liu, Z.; Feng, Z.; Gao, Q.; Hu, Y.; Huang, B.; Liu, G.; Yang, Y.; Zhou, K.; et al. PAN: A World Model for General, Interactable, and Long-Horizon World Simulation. arXiv preprint arXiv:2511.09057 2025.
- Huang, S.; Chen, L.; Zhou, P.; Chen, S.; Liao, Y.; Jiang, Z.; Hu, Y.; Gao, P.; Li, H.; Yao, M.; et al. EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Chi, X.; Fan, C.K.; Zhang, H.; Qi, X.; Zhang, R.; Chen, A.; Chan, C.m.; Xue, W.; Liu, Q.; Zhang, S.; et al. Eva: An embodied world model for future video anticipation. arXiv preprint arXiv:2410.15461 2024.
- Mao, X.; Lin, S.; Li, Z.; Li, C.; Peng, W.; He, T.; Pang, J.; Chi, M.; Qiao, Y.; Zhang, K. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744 2025.
- Zhu, C.; Yu, R.; Feng, S.; Burchfiel, B.; Shah, P.; Gupta, A. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792 2025.
- Zhi, H.; Chen, P.; Zhou, S.; Dong, Y.; Wu, Q.; Han, L.; Tan, M. 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model. arXiv preprint arXiv:2506.06199 2025.
- Chen, B.; Martí Monsó, D.; Du, Y.; Simchowitz, M.; Tedrake, R.; Sitzmann, V. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems 2024, 37, 24081–24125.
- Zhang, K.; Tang, Z.; Hu, X.; Pan, X.; Guo, X.; Liu, Y.; Huang, J.; Yuan, L.; Zhang, Q.; Long, X.X.; et al. Epona: Autoregressive diffusion world model for autonomous driving. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27220–27230.
- Alonso, E.; Jelley, A.; Micheli, V.; Kanervisto, A.; Storkey, A.J.; Pearce, T.; Fleuret, F. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems 2024, 37, 58757–58791.
- Ye, D.; Zhou, F.; Lv, J.; Ma, J.; Zhang, J.; Lv, J.; Li, J.; Deng, M.; Yang, M.; Fu, Q.; et al. Yan: Foundational interactive video generation. arXiv preprint arXiv:2508.08601 2025.
- Hassan, M.; Stapf, S.; Rahimi, A.; Rezende, P.; Haghighi, Y.; Brüggemann, D.; Katircioglu, I.; Zhang, L.; Chen, X.; Saha, S.; et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22404–22415.
- Koh, J.Y.; Lee, H.; Yang, Y.; Baldridge, J.; Anderson, P. Pathdreamer: A world model for indoor navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14738–14748.
- Tu, Y.; Luo, H.; Chen, X.; Bai, X.; Wang, F.; Zhao, H. PlayerOne: Egocentric World Simulator. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Savov, N.; Kazemi, N.; Zhang, D.; Paudel, D.P.; Wang, X.; Gool, L.V. StateSpaceDiffuser: Bringing Long Context to Diffusion World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Huang, S.; Wu, J.; Zhou, Q.; Miao, S.; Long, M. Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv preprint arXiv:2505.14357 2025.
- Robine, J.; Höftmann, M.; Harmeling, S. Simple, Good, Fast: Self-Supervised World Models Free of Baggage. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Cui, Y.; Chen, H.; Deng, H.; Huang, X.; Li, X.; Liu, J.; Liu, Y.; Luo, Z.; Wang, J.; Wang, W.; et al. Emu3.5: Native Multimodal Models are World Learners. arXiv preprint arXiv:2510.26583 2025.
- Shang, Y.; Zhang, X.; Tang, Y.; Jin, L.; Gao, C.; Wu, W.; Li, Y. RoboScape: Physics-informed Embodied World Model. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Li, Y.; Wei, X.; Chi, X.; Li, Y.; Zhao, Z.; Wang, H.; Ma, N.; Lu, M.; Zhang, S. ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance. arXiv preprint arXiv:2504.16464 2025.
- Li, S.; Yang, C.; Fang, J.; Yi, T.; Lu, J.; Cen, J.; Xie, L.; Shen, W.; Tian, Q. Worldgrow: Generating infinite 3d world. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 6433–6441. [CrossRef]
- Russell, L.; Hu, A.; Bertoni, L.; Fedoseev, G.; Shotton, J.; Arani, E.; Corrado, G. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523 2025.
- Zhang, Q.; Zhai, S.; Martin, M.A.B.; Miao, K.; Toshev, A.; Susskind, J.; Gu, J. World-consistent video diffusion with explicit 3d modeling. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21685–21695.
- Li, X.; Wang, T.; Gu, Z.; Zhang, S.; Guo, C.; Cao, L. FlashWorld: High-quality 3D Scene Generation within Seconds. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026.
- Wu, H.; Wu, D.; He, T.; Guo, J.; Ye, Y.; Duan, Y.; Bian, J. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026.
- Lu, Y.; Ren, X.; Yang, J.; Shen, T.; Wu, Z.; Gao, J.; Wang, Y.; Chen, S.; Chen, M.; Fidler, S.; et al. Infinicube: Unbounded and controllable dynamic 3d driving scene generation with world-guided video models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27272–27283.
- Yang, Y.; Liu, J.; Zhang, Z.; Zhou, S.; Tan, R.; Yang, J.; Du, Y.; Gan, C. MindJourney: Test-Time Scaling with World Models for Spatial Reasoning. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Liang, D.; Zhang, D.; Zhou, X.; Tu, S.; Feng, T.; Li, X.; Zhang, Y.; Du, M.; Tan, X.; Bai, X. Seeing the Future, Perceiving the Future: A unified driving world model for future generation and perception. arXiv preprint arXiv:2503.13587 2025.
- Lee, J.H.; Lin, B.J.; Sun, W.F.; Lee, C.Y. EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Guo, Y.; Shi, L.X.; Chen, J.; Finn, C. Ctrl-World: A Controllable Generative World Model for Robot Manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2026.
- Wu, T.; Yang, S.; Po, R.; Xu, Y.; Liu, Z.; Lin, D.; Wetzstein, G. Video World Models with Long-term Spatial Memory. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Xiao, Z.; Lan, Y.; Zhou, Y.; Ouyang, W.; Yang, S.; Zeng, Y.; Pan, X. WorldMem: Long-term Consistent World Simulation with Memory. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. Transactions on Machine Learning Research 2024.
- Po, R.; Nitzan, Y.; Zhang, R.; Chen, B.; Dao, T.; Shechtman, E.; Wetzstein, G.; Huang, X. Long-context state-space video world models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8733–8744.
- Huang, J.; Hu, X.; Han, B.; Shi, S.; Tian, Z.; He, T.; Jiang, L. Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft. arXiv preprint arXiv:2510.03198 2025.
- Collu, J.; Majellaro, R.; Plaat, A.; Moerland, T.M. Slot Structured World Models. arXiv preprint arXiv:2402.03326 2024.
- Traub, M.; Otte, S.; Menge, T.; Karlbauer, M.; Thuemmel, J.; Butz, M.V. Learning What and Where: Disentangling Location and Identity Tracking Without Supervision. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
- Elsayed, G.; Mahendran, A.; Van Steenkiste, S.; Greff, K.; Mozer, M.C.; Kipf, T. Savi++: Towards end-to-end object-centric learning from real-world videos. Advances in Neural Information Processing Systems 2022, 35, 28940–28954.
- Zhang, Y.; Guo, X.; Xu, H.; Long, M. Consistent World Models via Foresight Diffusion. arXiv preprint arXiv:2505.16474 2025.
- Hu, W.; Wen, X.; Li, X.; Wang, G. DSG-World: Learning a 3D Gaussian World Model from Dual State Videos. arXiv preprint arXiv:2506.05217 2025.
- Huang, T.; Zheng, W.; Wang, T.; Liu, Y.; Wang, Z.; Wu, J.; Jiang, J.; Li, H.; Lau, R.; Zuo, W.; et al. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG) 2025, 44, 1–15. [CrossRef]
- Zhou, S.; Du, Y.; Yang, Y.; Han, L.; Chen, P.; Yeung, D.Y.; Gan, C. Learning 3D Persistent Embodied World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Wang, Z.; Wang, K.; Zhao, L.; Stone, P.; Bian, J. Dyn-O: Building Structured World Models with Object-Centric Representations. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Ferraro, S.; Mazzaglia, P.; Verbelen, T.; Dhoedt, B. FOCUS: object-centric world models for robotic manipulation. Frontiers in Neurorobotics 2025, 19, 1585386. [CrossRef]
- Barcellona, L.; Zadaianchuk, A.; Allegro, D.; Papa, S.; Ghidoni, S.; Gavves, S. Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. In Proceedings of the Greeks in AI Symposium 2025, 2025.
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research Journal 2024.
- Huang, Y.; Zhang, J.; Zou, S.; Liu, X.; Hu, R.; Xu, K. LaDi-WM: A Latent Diffusion-Based World Model for Predictive Manipulation. In Proceedings of The 9th Conference on Robot Learning; Lim, J.; Song, S.; Park, H.W., Eds. PMLR, 27–30 Sep 2025, Vol. 305, Proceedings of Machine Learning Research, pp. 1726–1743.
- Guo, J.; Ma, X.; Wang, Y.; Yang, M.; Liu, H.; Li, Q. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation. IEEE Robotics and Automation Letters 2026, 11, 2466–2473. [CrossRef]
- Bar, A.; Zhou, G.; Tran, D.; Darrell, T.; LeCun, Y. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15791–15801.
- Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205.
- Rigter, M.; Gupta, T.; Hilmkil, A.; Ma, C. AVID: Adapting Video Diffusion Models to World Models. In Proceedings of the Reinforcement Learning Conference, 2024.
- Wu, J.; Yin, S.; Feng, N.; He, X.; Li, D.; Hao, J.; Long, M. ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems 2024, 37, 68082–68119.
- He, H.; Zhang, Y.; Lin, L.; Xu, Z.; Pan, L. Pre-trained video generative models as world simulators. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 4645–4653. [CrossRef]
- Zhao, B.; Tang, R.; Jia, M.; Wang, Z.; Man, F.; Zhang, X.; Shang, Y.; Zhang, W.; Wu, W.; Gao, C.; et al. AirScape: An Aerial Generative World Model with Motion Controllability. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 12519–12528.
- Unitree Robotics. UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family. https://github.com/unitreerobotics/unifolm-world-model-action, 2025. Open-source world-model–action architecture spanning multiple types of robotic embodiments.
- Hayashi, K.; Koyama, M.; Guerreiro, J.J.A. Inter-environmental world modeling for continuous and compositional dynamics. arXiv preprint arXiv:2503.09911 2025.
- Durante, Z.; Gong, R.; Sarkar, B.; Wake, N.; Taori, R.; Tang, P.; Lakshmikanth, S.; Schulman, K.; Milstein, A.; Vo, H.; et al. An interactive agent foundation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3652–3662.
- Zhen, H.; Sun, Q.; Zhang, H.; Li, J.; Zhou, S.; Du, Y.; Gan, C. TesserAct: learning 4D embodied world models. arXiv preprint arXiv:2504.20995 2025.
- Duan, Y.; Zou, Z.; Gu, T.; Jia, W.; Zhao, Z.; Xu, L.; Liu, X.; Lin, Y.; Jiang, H.; Chen, K.; et al. LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation. arXiv preprint arXiv:2509.05263 2025.
- Guo, J.; Ye, Y.; He, T.; Wu, H.; Jiang, Y.; Pearce, T.; Bian, J. Mineworld: a real-time and open-source interactive world model on minecraft. arXiv preprint arXiv:2504.08388 2025.
- Li, J.; Tang, J.; Xu, Z.; Wu, L.; Zhou, Y.; Shao, S.; Yu, T.; Cao, Z.; Lu, Q. Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition. arXiv preprint arXiv:2506.17201 2025.
- Dynamics Lab. Mirage 2 — Generative World Engine. https://www.mirage2.org/, 2025. Browser-based system to generate explorable 3D worlds from images/text.
- World Labs. World Labs: spatial intelligence for large world models. https://www.worldlabs.ai/, 2025. Accessed: 2025-06-05.
- Yang, Z.; Ge, W.; Li, Y.; Chen, J.; Li, H.; An, M.; Kang, F.; Xue, H.; Xu, B.; Yin, Y.; et al. Matrix-3d: Omnidirectional explorable 3d world generation. arXiv preprint arXiv:2508.08086 2025.
- Jang, J.; Ye, S.; Lin, Z.; Xiang, J.; Bjorck, J.; Fang, Y.; Hu, F.; Huang, S.; Kundalia, K.; Lin, Y.C.; et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. In Proceedings of the 9th Annual Conference on Robot Learning, 2025.
- Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 139–1. [CrossRef]
- Zheng, R.; Wang, J.; Reed, S.; Bjorck, J.; Fang, Y.; Hu, F.; Jang, J.; Kundalia, K.; Lin, Z.; Magne, L.; et al. FLARE: Robot Learning with Implicit World Modeling. In Proceedings of The 9th Conference on Robot Learning; Lim, J.; Song, S.; Park, H.W., Eds. PMLR, 27–30 Sep 2025, Vol. 305, Proceedings of Machine Learning Research, pp. 3952–3971.
- Qiu, Y.; Ziser, Y.; Korhonen, A.; Cohen, S.B.; Ponti, E.M. Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models. arXiv preprint arXiv:2506.06006 2025.
- Kadian, A.; Truong, J.; Gokaslan, A.; Clegg, A.; Wijmans, E.; Lee, S.; Savva, M.; Chernova, S.; Batra, D. Sim2real predictivity: Does evaluation in simulation predict real-world performance? IEEE Robotics and Automation Letters 2020, 5, 6670–6677. [CrossRef]
- Li, X.; Song, R.; Xie, Q.; Wu, Y.; Zeng, N.; Ai, Y. Simworld: A unified benchmark for simulator-conditioned scene generation via world model. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 927–934.
- Agarwal, N.; Ali, A.; Bala, M.; Balaji, Y.; Barker, E.; Cai, T.; Chattopadhyay, P.; Chen, Y.; Cui, Y.; Ding, Y.; et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 2025.
- Mo, S.; Leng, Z.; Liu, L.; Wang, W.; He, H.; Zhou, B. Dreamland: Controllable World Creation with Simulator and Generative Models. arXiv preprint arXiv:2506.08006 2025.
- Feng, Y.; Tan, H.; Mao, X.; Liu, G.; Huang, S.; Xiang, C.; Su, H.; Zhu, J. Vidar: Embodied video diffusion model for generalist bimanual manipulation. arXiv preprint arXiv:2507.12898 2025.
- Wang, Y.; Yu, R.; Wan, S.; Gan, L.; Zhan, D.C. Founder: Grounding foundation models in world models for open-ended embodied decision making. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Zhou, G.; Pan, H.; LeCun, Y.; Pinto, L. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Wang, X.; Zhu, Z.; Huang, G.; Wang, B.; Chen, X.; Lu, J. Worlddreamer: Towards general world models for video generation via predicting masked tokens. arXiv preprint arXiv:2401.09985 2024.
- Schiewer, R.; Subramoney, A.; Wiskott, L. Exploring the limits of hierarchical world models in reinforcement learning. Scientific Reports 2024, 14, 26856. [CrossRef]
- Hao, C.; Lu, W.; Xu, Y.; Chen, Y. Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27608–27617.
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 2020.
- Mazzaglia, P.; Verbelen, T.; Dhoedt, B.; Courville, A.; Rajeswar, S. GenRL: Multimodal-foundation world models for generalization in embodied agents. Advances in neural information processing systems 2024, 37, 27529–27555.
- Fang, F.; Liang, W.; Wu, Y.; Xu, Q.; Lim, J.H. Improving generalization of reinforcement learning using a bilinear policy network. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 991–995.
- Fang, Q.; Du, W.; Wang, H.; Zhang, J. Towards Unraveling and Improving Generalization in World Models. arXiv preprint arXiv:2501.00195 2024.
- Gao, S.; Zhou, S.; Du, Y.; Zhang, J.; Gan, C. AdaWorld: Learning Adaptable World Models with Latent Actions. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Prasanna, S.; Farid, K.; Rajan, R.; Biedenkapp, A. Dreaming of Many Worlds: Learning Contextual World Models aids Zero-Shot Generalization. In Proceedings of the Seventeenth European Workshop on Reinforcement Learning, 2024.
- Baldassarre, F.; Szafraniec, M.; Terver, B.; Khalidov, V.; Massa, F.; LeCun, Y.; Labatut, P.; Seitzer, M.; Bojanowski, P. Back to the features: Dino as a foundation for video world models. arXiv preprint arXiv:2507.19468 2025.
- Ali, M.Q.; Sridhar, A.; Matiana, S.; Wong, A.; Al-Sharman, M. Humanoid World Models: Open World Foundation Models for Humanoid Robotics. arXiv preprint arXiv:2506.01182 2025.
- Chi, X.; Ge, K.; Liu, J.; Zhou, S.; Jia, P.; He, Z.; Liu, Y.; Li, T.; Han, L.; Han, S.; et al. MinD: Unified Visual Imagination and Control via Hierarchical World Models. arXiv preprint arXiv:2506.18897 2025.
- Chun, J.; Jeong, Y.; Kim, T. Sparse Imagination for Efficient Visual World Model Planning. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026.
- Cohen, L.; Wang, K.; Kang, B.; Gadot, U.; Mannor, S. Uncovering Untapped Potential in Sample-Efficient World Model Agents. arXiv preprint arXiv:2502.11537 2025.
- Burchi, M.; Timofte, R. Accurate and Efficient World Modeling with Masked Latent Transformers. In Proceedings of the 42nd International Conference on Machine Learning; Singh, A.; Fazel, M.; Hsu, D.; Lacoste-Julien, S.; Berkenkamp, F.; Maharaj, T.; Wagstaff, K.; Zhu, J., Eds. PMLR, 13–19 Jul 2025, Vol. 267, Proceedings of Machine Learning Research, pp. 5894–5912.
- Zhang, H.; Yan, X.; Xue, Y.; Guo, Z.; Cui, S.; Li, Z.; Liu, B. D2-world: An Efficient World Model through Decoupled Dynamic Flow. arXiv preprint arXiv:2411.17027 2024.
- Pu, Y.; Niu, Y.; Tang, J.; Xiong, J.; Hu, S.; Li, H. One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning. arXiv preprint arXiv:2509.07945 2025.
- Li, S.; Hao, Q.; Shang, Y.; Li, Y. KeyWorld: Key Frame Reasoning Enables Effective and Efficient World Models. arXiv preprint arXiv:2509.21027 2025.
- Yamada, J.; Rigter, M.; Collins, J.; Posner, I. Twist: Teacher-student world model distillation for efficient sim-to-real transfer. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 9190–9196.
- Micheli, V.; Alonso, E.; Fleuret, F. Transformers are Sample-Efficient World Models. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
- Micheli, V.; Alonso, E.; Fleuret, F. Efficient World Models with Context-Aware Tokenization. In Proceedings of the International Conference on Machine Learning. PMLR, 2024, pp. 35623–35638.
- Song, Q.; Wang, X.; Zhou, D.; Lin, J.; Chen, C.; Ma, Y.; Li, X. Hero: Hierarchical extrapolation and refresh for efficient world models. arXiv preprint arXiv:2508.17588 2025.
- Jin, B.; Li, W.; Yang, B.; Zhu, Z.; Jiang, J.; Gao, H.a.; Sun, H.; Zhan, K.; Hu, H.; Zhang, X.; et al. PosePilot: Steering camera pose for generative world models with self-supervised depth. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 8051–8058.
- Jeong, Y.; Chun, J.; Cha, S.; Kim, T. Object-Centric World Model for Language-Guided Manipulation. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025.
- Akbulut, T.; Merlin, M.; Parr, S.; Quartey, B.; Thompson, S. Sample Efficient Robot Learning with Structured World Models. arXiv preprint arXiv:2210.12278 2022.
- Van Den Oord, A.; Vinyals, O.; et al. Neural discrete representation learning. Advances in neural information processing systems 2017, 30.
- Chen, R.; Ko, Y.; Zhang, Z.; Cho, C.; Chung, S.; Giuffré, M.; Shung, D.L.; Stadie, B.C. LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models. arXiv preprint arXiv:2505.11772 2025.
- Zeng, B.; Zhu, K.; Hua, D.; Li, B.; Tong, C.; Wang, Y.; Huang, X.; Dai, Y.; Zhang, Z.; Yang, Y.; et al. Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks. arXiv preprint arXiv:2602.01630 2026.
- Xiang, J.; Liu, G.; Gu, Y.; Gao, Q.; Ning, Y.; Zha, Y.; Feng, Z.; Tao, T.; Hao, S.; Shi, Y.; et al. Pandora: Towards general world model with natural language actions and video states. arXiv preprint arXiv:2406.09455 2024.
- Ge, Z.; Huang, H.; Zhou, M.; Li, J.; Wang, G.; Tang, S.; Zhuang, Y. Worldgpt: Empowering llm as multimodal world model. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7346–7355.
- Cherian, A.; Corcodel, R.; Jain, S.; Romeres, D. Llmphy: Complex physical reasoning using large language models and world models. arXiv preprint arXiv:2411.08027 2024.
- Yang, Z.; Guo, X.; Ding, C.; Wang, C.; Wu, W. Physical informed driving world model. arXiv preprint arXiv:2412.08410 2024.
- Jiang, H.; Hsu, H.Y.; Zhang, K.; Yu, H.N.; Wang, S.; Li, Y. Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 7219–7230.
- Li, J.; Wan, H.; Lin, N.; Zhan, Y.L.; Chengze, R.; Wang, H.; Zhang, Y.; Liu, H.; Wang, Z.; Yu, F.; et al. SlotPi: Physics-informed Object-centric Reasoning Models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 2025, pp. 1376–1387.
- Petri, F.; Asprino, L.; Gangemi, A. Learning Local Causal World Models with State Space Models and Attention. arXiv preprint arXiv:2505.02074 2025.
- Yan, Z.; Dong, W.; Shao, Y.; Lu, Y.; Liu, H.; Liu, J.; Wang, H.; Wang, Z.; Wang, Y.; Remondino, F.; et al. Renderworld: World model with self-supervised 3d label. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6063–6070.
- Zhou, X.; Liang, D.; Tu, S.; Chen, X.; Ding, Y.; Zhang, D.; Tan, F.; Zhao, H.; Bai, X. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27817–27827.
- Wu, J.; Ma, H.; Deng, C.; Long, M. Pre-training contextualized world models with in-the-wild videos for reinforcement learning. Advances in Neural Information Processing Systems 2023, 36, 39719–39743.
- Wang, Q.; Zhang, Z.; Xie, B.; Jin, X.; Wang, Y.; Wang, S.; Zheng, L.; Yang, X.; Zeng, W. Disentangled world models: Learning to transfer semantic knowledge from distracting videos for reinforcement learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2599–2608.
- Zhang, W.; Jelley, A.; McInroe, T.; Storkey, A. Objects matter: object-centric world models improve reinforcement learning in visually complex environments. arXiv preprint arXiv:2501.16443 2025.
- Wang, Y.; Wan, S.; Gan, L.; Feng, S.; Zhan, D.C. AD3: implicit action is the key for world models to distinguish the diverse visual distractors. In Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 51546–51568.
- Wang, X.; Wu, Z.; Peng, P. LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model. arXiv preprint arXiv:2506.01546 2025.
- Jiang, J.; Janghorbani, S.; De Melo, G.; Ahn, S. SCALOR: Generative World Models with Scalable Object Representations. In Proceedings of the International Conference on Learning Representations, 2020.
- Zhu, H.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Chen, J.; Shen, C.; Pang, J.; He, T. Aether: Geometric-aware unified world modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8535–8546.
- Lu, G.; Jia, B.; Li, P.; Chen, Y.; Wang, Z.; Tang, Y.; Huang, S. Gwm: Towards scalable gaussian world models for robotic manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9263–9274.
- Mao, Z.; Ruchkin, I. Towards Physically Interpretable World Models: Meaningful Weakly Supervised Representations for Visual Trajectory Prediction. arXiv preprint arXiv:2412.12870 2024.
- Ross, E.; Drygala, C.; Schwarz, L.; Kaiser, S.; di Mare, F.; Breiten, T.; Gottschalk, H. When do World Models Successfully Learn Dynamical Systems? arXiv preprint arXiv:2507.04898 2025.
- Zhou, S.; Zhou, T.; Yang, Y.; Long, G.; Ye, D.; Jiang, J.; Zhang, C. WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Liu, X.; Tang, H. FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution. arXiv preprint arXiv:2506.03173 2025.
- Wang, D.; Sun, Z.; Li, Z.; Wang, C.; Peng, Y.; Ye, H.; Zarrouki, B.; Li, W.; Piccinini, M.; Xie, L.; et al. Enhancing Physical Consistency in Lightweight World Models. arXiv preprint arXiv:2509.12437 2025.
- Chen, D.; Moutakanni, T.; Chung, W.; Bang, Y.; Ji, Z.; Bolourchi, A.; Fung, P. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722 2025.
- Huh, M.; Cheung, B.; Wang, T.; Isola, P. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987 2024.
- Zhao, Y.; Scannell, A.; Hou, Y.; Cui, T.; Chen, L.; Büchler, D.; Solin, A.; Kannala, J.; Pajarinen, J. Generalist World Model Pre-Training for Efficient Reinforcement Learning. In Proceedings of the ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling, 2025.
- 1X Technologies. 1X World Model. https://www.1x.tech/discover/1x-world-model, 2024. Accessed: 2025-11-14.
- Chen, C.; Wu, Y.F.; Yoon, J.; Ahn, S. Transdreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481 2022.
- Wu, H.; Guo, M.; Li, Z.; Dou, Z.; Long, M.; He, K.; Matusik, W. GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training. arXiv preprint arXiv:2602.20399 2026.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First conference on language modeling, 2024.
- Ye, S.; Ge, Y.; Zheng, K.; Gao, S.; Yu, S.; Kurian, G.; Indupuru, S.; Tan, Y.L.; Zhu, C.; Xiang, J.; et al. World Action Models are Zero-shot Policies. arXiv preprint arXiv:2602.15922 2026.
- Gao, S.; Liang, W.; Zheng, K.; Malik, A.; Ye, S.; Yu, S.; Tseng, W.C.; Dong, Y.; Mo, K.; Lin, C.H.; et al. DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. arXiv preprint arXiv:2602.06949 2026.
- Kotar, K.; Lee, W.; Venkatesh, R.; Chen, H.; Bear, D.; Watrous, J.; Kim, S.; Aw, K.L.; Chen, L.N.; Stojanov, S.; et al. World modeling with probabilistic structure integration. arXiv preprint arXiv:2509.09737 2025.
- Saxena, D.; Cao, J. Generative adversarial networks (GANs) challenges, solutions, and future directions. ACM Computing Surveys (CSUR) 2021, 54, 1–42.
- Vuong, Q.; Levine, S.; Walke, H.R.; Pertsch, K.; Singh, A.; Doshi, R.; Xu, C.; Luo, J.; Tan, L.; Shah, D.; et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Proceedings of the Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 2023.
- Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. Robotics: Science and Systems XIX 2023.
- Liu, B.; Zhu, Y.; Gao, C.; Feng, Y.; Liu, Q.; Zhu, Y.; Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 2023, 36, 44776–44791.
- Khazatsky, A.; Pertsch, K.; Nair, S.; Balakrishna, A.; Dasari, S.; Karamcheti, S.; Nasiriany, S.; Srirama, M.K.; Chen, L.Y.; Ellis, K.; et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Proceedings of the Robotics: Science and Systems, 2024.
- Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; Song, S. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. The International Journal of Robotics Research 2024. [CrossRef]
- Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540 2016.
- James, S.; Ma, Z.; Arrojo, D.R.; Davison, A.J. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 2020, 5, 3019–3026. [CrossRef]
- Rohmer, E.; Singh, S.P.; Freese, M. V-REP: A versatile and scalable robot simulation framework. 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems 2013, pp. 1321–1326.
- Walke, H.; Black, K.; Lee, A.; Kim, M.J.; Du, M.; Zheng, C.; Zhao, T.; Hansen-Estruch, P.; Vuong, Q.; He, A.; et al. BridgeData V2: A Dataset for Robot Learning at Scale. In Proceedings of the Conference on Robot Learning (CoRL), 2023.
- Tian, S.; Finn, C.; Wu, J. A Control-Centric Benchmark for Video Prediction. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
- Nasiriany, S.; Maddukuri, A.; Zhang, L.; Parikh, A.; Lo, A.; Joshi, A.; Mandlekar, A.; Zhu, Y. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. In Proceedings of the Robotics: Science and Systems, 2024.
- Bao, C.; Xu, H.; Qin, Y.; Wang, X. DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21190–21200.
- Mitash, C.; Wang, F.; Lu, S.; Terhuja, V.; Garaas, T.; Polido, F.; Nambi, M. ARMBench: An Object-centric Benchmark Dataset for Robotic Manipulation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9132–9139.
- Tunyasuvunakool, S.; Muldal, A.; Doron, Y.; Liu, S.; Bohez, S.; Merel, J.; Erez, T.; Lillicrap, T.; Heess, N.; Tassa, Y. dm_control: Software and tasks for continuous control. Software Impacts 2020, 6, 100022. [CrossRef]
- Gupta, A.; Kumar, V.; Lynch, C.; Levine, S.; Hausman, K. Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning. In Proceedings of the Conference on Robot Learning. PMLR, 2020, pp. 1025–1037.
- McLean, R.; Chatzaroulas, E.; McCutcheon, L.; Röder, F.; Yu, T.; He, Z.; Zentner, K.; Julian, R.; Terry, J.K.; Woungang, I.; et al. Meta-World+: An Improved, Standardized, RL Benchmark. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- Wang, X.; Lian, L.; Yu, S.X. Unsupervised visual attention and invariance for reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6677–6687.
- Henderson, P.; Chang, W.D.; Shkurti, F.; Hansen, J.; Meger, D.; Dudek, G. Benchmark environments for multitask learning in continuous domains. arXiv preprint arXiv:1708.04352 2017.
- Gu, J.; Xiang, F.; Li, X.; Ling, Z.; Liu, X.; Mu, T.; Tang, Y.; Tao, S.; Wei, X.; Yao, Y.; et al. ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills. In Proceedings of the International Conference on Learning Representations, 2023.
- Caggiano, V.; Wang, H.; Durandau, G.; Sartori, M.; Kumar, V. MyoSuite – A contact-rich simulation suite for musculoskeletal motor control. https://github.com/myohub/myosuite, 2022.
- AgiBot World Colosseum Contributors. AgiBot World Colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024.
- Mandlekar, A.; Xu, D.; Wong, J.; Nasiriany, S.; Wang, C.; Kulkarni, R.; Fei-Fei, L.; Savarese, S.; Zhu, Y.; Martín-Martín, R. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2021.
- Dasari, S.; Ebert, F.; Tian, S.; Nair, S.; Bucher, B.; Schmeckpeper, K.; Singh, S.; Levine, S.; Finn, C. RoboNet: Large-Scale Multi-Robot Learning. In Proceedings of the Conference on Robot Learning (CoRL). PMLR, Vol. 100, 2019, [arXiv:cs.RO/1910.11215].
- Fang, H.S.; Fang, H.; Tang, Z.; Liu, J.; Wang, J.; Zhu, H.; Lu, C. RH20T: A Robotic Dataset for Learning Diverse Skills in One-Shot. In Proceedings of the RSS 2023 Workshop on Learning for Task and Motion Planning, 2023.
- Heo, M.; Lee, Y.; Lee, D.; Lim, J.J. FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation. In Proceedings of the Robotics: Science and Systems, 2023. [CrossRef]
- Geng, H.; Wang, F.; Wei, S.; Li, Y.; Wang, B.; An, B.; Cheng, C.T.; Lou, H.; Li, P.; Wang, Y.J.; et al. RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning, 2025, [arXiv:cs.RO/2504.18904].
- Yan, F.; Liu, F.; Huang, Y.; Guan, Z.; Zheng, L.; Zhong, Y.; Feng, C.; Ma, L. RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 13707–13718.
- Bharadhwaj, H.; Vakil, J.; Sharma, M.; Gupta, A.; Tulsiani, S.; Kumar, V. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking, 2023, [arXiv:cs.RO/2309.01918].
- Yang, R.; Chen, H.; Zhang, J.; Zhao, M.; Qian, C.; Wang, K.; Wang, Q.; Koripella, T.V.; Movahedi, M.; Li, M.; et al. EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Li, C.; Zhang, R.; Wong, J.; Gokmen, C.; Srivastava, S.; Martín-Martín, R.; Wang, C.; Levine, G.; Lingelbach, M.; Sun, J.; et al. BEHAVIOR-1K: A benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In Proceedings of the Conference on Robot Learning. PMLR, 2023, pp. 80–93.
- Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; Levine, S. D4RL: Datasets for Deep Data-Driven Reinforcement Learning, 2020, [arXiv:cs.LG/2004.07219].
- Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9339–9347.
- Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Nießner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the 2017 International Conference on 3D Vision (3DV). IEEE Computer Society, 2017, pp. 667–676. [CrossRef]
- Yadav, K.; Ramrakhya, R.; Ramakrishnan, S.K.; Gervet, T.; Turner, J.; Gokaslan, A.; Maestre, N.; Chang, A.X.; Batra, D.; Savva, M.; et al. Habitat-Matterport 3D semantics dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4927–4936.
- Xia, F.; Zamir, A.R.; He, Z.; Sax, A.; Malik, J.; Savarese, S. Gibson Env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9068–9079.
- Martin-Martin, R.; Patel, M.; Rezatofighi, H.; Shenoi, A.; Gwak, J.; Frankel, E.; Sadeghian, A.; Savarese, S. JRDB: A dataset and benchmark of egocentric robot visual perception of humans in built environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021, 45, 6748–6765. [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [CrossRef]
- Shang, Y.; Li, Z.; Ma, Y.; Su, W.; Jin, X.; Wang, Z.; Jin, L.; Zhang, X.; Tang, Y.; Su, H.; et al. WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models. arXiv preprint arXiv:2602.08971 2026.
- Chen, T.; Chen, Z.; Chen, B.; Cai, Z.; Liu, Y.; Li, Z.; Liang, Q.; Lin, X.; Ge, Y.; Gu, Z.; et al. RoboTwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 2025.
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
- Yang, J.; Gao, S.; Qiu, Y.; Chen, L.; Li, T.; Dai, B.; Chitta, K.; Wu, P.; Zeng, J.; Luo, P.; et al. Generalized Predictive Model for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Zhou, T.; Tucker, R.; Flynn, J.; Fyffe, G.; Snavely, N. Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG) 2018, 37, 1–12.
- Wang, Y.; Cheng, K.; He, J.; Wang, Q.; Dai, H.; Chen, Y.; Xia, F.; Zhang, Z. DrivingDojo dataset: Advancing interactive and knowledge-enriched driving world model. Advances in Neural Information Processing Systems 2024, 37, 13020–13034.
- Min, C.; Zhao, D.; Xiao, L.; Zhao, J.; Xu, X.; Zhu, Z.; Jin, L.; Li, J.; Guo, Y.; Xing, J.; et al. DriveWorld: 4D pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15522–15533.
- Beattie, C.; Leibo, J.Z.; Teplyashin, D.; Ward, T.; Wainwright, M.; Küttler, H.; Lefrancq, A.; Green, S.; Valdés, V.; Sadik, A.; et al. DeepMind Lab. arXiv preprint arXiv:1612.03801 2016.
- Cobbe, K.; Hesse, C.; Hilton, J.; Schulman, J. Leveraging Procedural Generation to Benchmark Reinforcement Learning. arXiv preprint arXiv:1912.01588 2019.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 2013.
- Osband, I.; Doron, Y.; Hessel, M.; Aslanides, J.; Sezener, E.; Saraiva, A.; McKinney, K.; Lattimore, T.; Szepesvári, C.; Singh, S.; et al. Behaviour Suite for Reinforcement Learning. In Proceedings of the International Conference on Learning Representations, 2020.
- Paperno, D.; Kruszewski, G.; Lazaridou, A.; Pham, N.Q.; Bernardi, R.; Pezzelle, S.; Baroni, M.; Boleda, G.; Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1525–1534.
- Zhang, J.; Jiang, M.; Dai, N.; Lu, T.; Uzunoglu, A.; Zhang, S.; Wei, Y.; Wang, J.; Patel, V.M.; Liang, P.P.; et al. World-in-World: World Models in a Closed-Loop World. arXiv preprint arXiv:2510.18135 2025.
- Bordes, F.; Garrido, Q.; Kao, J.T.; Williams, A.; Rabbat, M.; Dupoux, E. IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments. arXiv preprint arXiv:2506.09849 2025.
- Weihs, L.; Yuile, A.R.; Baillargeon, R.; Fisher, C.; Marcus, G.; Mottaghi, R.; Kembhavi, A. Benchmarking Progress to Infant-Level Physical Reasoning in AI. Transactions on Machine Learning Research (TMLR) 2022.
- Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The "Something Something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850.
- Damen, D.; Doughty, H.; Farinella, G.M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. The EPIC-KITCHENS dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020, 43, 4125–4141. [CrossRef]
- NVIDIA PhysX SDK. https://developer.nvidia.com/physx-sdk, 2025. Accessed: 2025-11-15.
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning. PMLR, 2017, pp. 1–16.
- Greff, K.; Belletti, F.; Beyer, L.; Doersch, C.; Du, Y.; Duckworth, D.; Fleet, D.J.; Gnanapragasam, D.; Golemo, F.; Herrmann, C.; et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- PyBullet Physics Simulation. https://pybullet.org, 2025. Accessed: 2025-11-15.
- Blender – a 3D modelling and rendering package. https://www.blender.org, 2025. Accessed: 2025-11-15.
- Zhou, H.; Ma, Y.; Wu, H.; Wang, H.; Long, M. Unisolver: PDE-Conditional Transformers Towards Universal Neural PDE Solvers. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Pătrăucean, V.; Smaira, L.; Gupta, A.; Continente, A.R.; Markeeva, L.; Banarse, D.; Koppula, S.; Heyward, J.; Malinowski, M.; Yang, Y.; et al. Perception Test: A Diagnostic Benchmark for Multimodal Video Models. In Proceedings of the Advances in Neural Information Processing Systems, 2023.
- Bear, D.; Wang, E.; Mrowca, D.; Binder, F.J.; Tung, H.Y.; Pramod, R.; Holdaway, C.; Tao, S.; Smith, K.A.; Sun, F.Y.; et al. Physion: Evaluating Physical Prediction from Vision in Humans and Machines. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
- Qiu, S.; Guo, S.; Song, Z.Y.; Sun, Y.; Cai, Z.; Wei, J.; Luo, T.; Yin, Y.; Zhang, H.; Hu, Y.; et al. PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models, 2025, [arXiv:cs.CL/2504.16074].
- Chow, W.; Mao, J.; Li, B.; Seita, D.; Guizilini, V.C.; Wang, Y. PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Gamerdinger, J.; Teufel, S.; Schulz, P.; Amann, S.; Kirchner, J.P.; Bringmann, O. SCOPE: A synthetic multi-modal dataset for collective perception including physical-correct weather conditions. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2024, pp. 2622–2628.
- Xing, E.; Deng, M.; Hou, J.; Hu, Z. Critiques of world models. arXiv preprint arXiv:2507.05169 2025.
- Miao, J.; Wei, Y.; Wu, Y.; Liang, C.; Li, G.; Yang, Y. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223. [CrossRef]
- Liu, H.; He, J.; Jin, Y.; Zheng, D.; Dong, Y.; Zhang, F.; Huang, Z.; He, Y.; Li, Y.; Chen, W.; et al. ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models, 2025, [arXiv:cs.CV/2506.21356].
- Xue, W.; Qian, C.; Wu, J.; Zhou, Y.; Liu, W.; Ren, J.; Fan, S.; Zhang, Y. ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 9050–9058. [CrossRef]
- Zhao, E.; Raval, V.; Zhang, H.; Mao, J.; Shangguan, Z.; Nikolaidis, S.; Wang, Y.; Seita, D. ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation. In Proceedings of The 9th Conference on Robot Learning; Lim, J.; Song, S.; Park, H.W., Eds. PMLR, 27–30 Sep 2025, Vol. 305, Proceedings of Machine Learning Research, pp. 3413–3462.
- Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.; Gould, S.; van den Hengel, A. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Ray, A.; Duan, J.; Brown, E.; Tan, R.; Bashkirova, D.; Hendrix, R.; Ehsani, K.; Kembhavi, A.; Plummer, B.A.; Krishna, R.; et al. SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models, 2025, [arXiv:cs.CV/2412.07755].
- Kamath, A.; Hessel, J.; Chang, K.W. What’s “up” with vision-language models? Investigating their struggle with spatial reasoning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9161–9175.
- Zhao, Z.; Fu, H.; Liang, D.; Zhou, X.; Zhang, D.; Xie, H.; Wang, B.; Bai, X. Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving. arXiv preprint arXiv:2505.08725 2025.
- Wang, S.; Yu, Z.; Jiang, X.; Lan, S.; Shi, M.; Chang, N.; Kautz, J.; Li, Y.; Alvarez, J.M. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 22442–22452.
- Zhou, Y.; Wang, Y.; Zhou, J.; Chang, W.; Guo, H.; Li, Z.; Ma, K.; Li, X.; Wang, Y.; Zhu, H.; et al. OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling, 2025, [arXiv:cs.CV/2509.12201].
- Chen, D.; Chung, W.; Bang, Y.; Ji, Z.; Fung, P. WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning. In Proceedings of the ICML 2025 Workshop on Assessing World Models, 2025.
- Gu, J.; Liu, X.; Zeng, Y.; Nagarajan, A.; Zhu, F.; Hong, D.; Fan, Y.; Yan, Q.; Zhou, K.; Liu, M.Y.; et al. PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models. arXiv preprint arXiv:2507.13428 2025.
- Li, D.; Fang, Y.; Chen, Y.; Yang, S.; Cao, S.; Wong, J.; Luo, M.; Wang, X.; Yin, H.; Gonzalez, J.E.; et al. WorldModelBench: Judging Video Generation Models As World Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- Li, Z.; Li, C.; Mao, X.; Lin, S.; Li, M.; Zhao, S.; Li, X.; Feng, Y.; Sun, J.; Li, Z.; et al. Sekai: A Video Dataset towards World Exploration. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- Azzolini, A.; Bai, J.; Brandon, H.; Cao, J.; Chattopadhyay, P.; Chen, H.; Chu, J.; Cui, Y.; Diamond, J.; Ding, Y.; et al. Cosmos-Reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558 2025.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 2004, 13, 600–612. [CrossRef]
- Nacken, P.F. Chamfer metrics, the medial axis and mathematical morphology. Journal of Mathematical Imaging and Vision 1996, 6, 235–248. [CrossRef]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 2017, 30.
- Unterthiner, T.; Van Steenkiste, S.; Kurach, K.; Marinier, R.; Michalski, M.; Gelly, S. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 2018.
- Stein, G.; Cresswell, J.; Hosseinzadeh, R.; Sui, Y.; Ross, B.; Villecroze, V.; Liu, Z.; Caterini, A.L.; Taylor, E.; Loaiza-Ganem, G. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. Advances in Neural Information Processing Systems 2023, 36, 3732–3784.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
- Fu, S.; Tamir, N.Y.; Sundaram, S.; Chai, L.; Zhang, R.; Dekel, T.; Isola, P. DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. In Proceedings of the Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Liu, J.; Qu, Y.; Yan, Q.; Zeng, X.; Wang, L.; Liao, R. Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos. In Proceedings of the First Workshop on Controllable Video Generation @ ICML 2024, 2024.
- Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528.
- Hentschel, S.; Kobs, K.; Hotho, A. CLIP knows image aesthetics. Frontiers in Artificial Intelligence 2022, 5, 976235. [CrossRef]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
- Huang, Z.; He, Y.; Yu, J.; Zhang, F.; Si, C.; Jiang, Y.; Zhang, Y.; Wu, T.; Jin, Q.; Chanpaisit, N.; et al. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21807–21818.
- Duan, H.; Yu, H.X.; Chen, S.; Fei-Fei, L.; Wu, J. WorldScore: A unified evaluation benchmark for world generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27713–27724.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
- Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004, pp. 74–81.
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
- Stolcke, A.; Yoshioka, T. DOVER: A method for combining diarization outputs. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 757–763.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, 1998.
- Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 30, 81–93.
- Elo, A.E. The Rating of Chessplayers, Past and Present; Arco Publishing: New York, 1978.
- Badia, A.P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z.D.; Blundell, C. Agent57: Outperforming the Atari human benchmark. In Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 507–517.
- Christen, P.; Hand, D.J.; Kirielle, N. A review of the F-measure: its history, properties, criticism, and alternatives. ACM Computing Surveys 2023, 56, 1–24. [CrossRef]
- Gretton, A.; Bousquet, O.; Smola, A.; Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the International conference on algorithmic learning theory. Springer, 2005, pp. 63–77.
- Botchkarev, A. Performance metrics (error measures) in machine learning regression, forecasting and prognostics: Properties and typology. arXiv preprint arXiv:1809.03006 2018.
- Stuart Jr, H.W. Value gaps and profitability. Strategy Science 2016, 1, 56–70.
- Shi, Z.; Liu, M.; Zhang, S.; Zheng, R.; Dong, S.; Wei, P. GAWM: Global-Aware World Model for Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2501.10116 2025.
- Lambert, N.; Pister, K.; Calandra, R. Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637 2022.
- Xu, Y.; Parker-Holder, J.; Pacchiano, A.; Ball, P.; Rybkin, O.; Roberts, S.; Rocktäschel, T.; Grefenstette, E. Learning general world models in a handful of reward-free deployments. Advances in Neural Information Processing Systems 2022, 35, 26820–26838.
- Qin, C.; Klabjan, D.; Russo, D. Improving the expected improvement algorithm. Advances in Neural Information Processing Systems 2017, 30.
- Prakash, A.; Tu, R.; Chang, M.; Gupta, S. 3D hand pose estimation in everyday egocentric images. In Proceedings of the European Conference on Computer Vision. Springer, 2024, pp. 183–202.
- Bento, J.; Zhu, J.J. A metric for sets of trajectories that is practical and mathematically consistent. arXiv preprint arXiv:1601.03094 2016.
- Sturm, J.; Burgard, W.; Cremers, D. Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. In Proceedings of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, Vol. 13, p. 6.
- Mohamed, A.; Zhu, D.; Vu, W.; Elhoseiny, M.; Claudel, C. Social-implicit: Rethinking trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation. In Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 463–479.
- Perille, D.; Truong, A.; Xiao, X.; Stone, P. Benchmarking metric ground navigation. In Proceedings of the 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE, 2020, pp. 116–121.
- Georgiou, T.T.; Smith, M.C. Optimal robustness in the gap metric. In Proceedings of the 28th IEEE Conference on Decision and Control. IEEE, 1989, pp. 2331–2336.
- Ward, J.R.; Agamennoni, G.; Worrall, S.; Bender, A.; Nebot, E. Extending time to collision for probabilistic reasoning in general traffic scenarios. Transportation Research Part C: Emerging Technologies 2015, 51, 66–82.
- Senin, P. Dynamic time warping algorithm review. Technical report, Information and Computer Science Department, University of Hawaii at Manoa, Honolulu, USA, 2008.
- Zhang, Y.; Mehta, S.; Caspi, A. Rethinking semantic segmentation evaluation for explainability and model selection. arXiv preprint arXiv:2101.08418 2021.
- Steinley, D.; Brusco, M.J.; Hubert, L. The variance of the adjusted Rand index. Psychological methods 2016, 21, 261. [CrossRef]
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision. Springer, 2016, pp. 382–398.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 2014, 27.
- Tsamardinos, I.; Brown, L.E.; Aliferis, C.F. The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning 2006, 65, 31–78. [CrossRef]
- Belongie, S.; Malik, J.; Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24, 509–522. [CrossRef]
- Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; Yakhnenko, O. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems 2013, 26.
- Wang, M.; Jin, W.; Cao, K.; Xie, L.; Hong, Y. ContactGaussian-WM: Learning Physics-Grounded World Model from Videos. arXiv preprint arXiv:2602.11021 2026.
- Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A. Isaac Gym: High Performance GPU Based Physics Simulation For Robot Learning. In Proceedings of the NeurIPS 2021 Datasets and Benchmarks Track, 2021, [arXiv:cs.RO/2110.13563].
- Bengio, Y.; Clare, S.; Prunkl, C.; Andriushchenko, M.; Bucknall, B.; Murray, M.; Bommasani, R.; Casper, S.; Davidson, T.; Douglas, R.; et al. International AI Safety Report 2026. arXiv preprint arXiv:2602.21012 2026.


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



