Submitted: 13 June 2026
Posted: 17 June 2026
Abstract
Keywords:
1. Introduction
1.1. Scope of Evidence
1.2. Contributions

1.3. Organization

1.4. Answer in Brief
2. Scope and Method
2.1. Inclusion Criteria and Search Protocol
2.2. Terminology and the Shape of the Question
2.3. Limitations of This Survey
2.4. Relation to Prior Surveys
3. Background
3.1. Problem Formulations
3.2. Policy Classes in Brief
3.3. Evaluation Protocols
3.4. The Compounding-Error Debate and the Case for Interaction
4. The Model Axis
4.1. Vision-Language-Action Architectures
| System | Year | Params | Action head | Primary training data |
|---|---|---|---|---|
| RT-1 [1] | 2022 | 35M | discrete tokens | 130k teleop episodes |
| RT-2 [4] | 2023 | 12B/55B | discrete tokens | web VQA + RT-1 data |
| Diffusion Policy [6] | 2023 | ∼100M | diffusion chunks | per-task demos |
| ACT [31] | 2023 | ∼80M | chunked regression | 50 demos/task |
| Octo [60] | 2024 | 93M | diffusion chunks | 800k OXE trajectories |
| OpenVLA [5] | 2024 | 7B | discrete tokens | 970k OXE trajectories |
| π0 [7] | 2024 | 3.3B | flow matching | cross-embodiment collection + OXE |
| RDT-1B [33] | 2024 | 1.2B | diffusion | 1M+ multi-robot episodes |
| CogACT [63] | 2024 | 7B+ | diffusion | OXE subset |
| OpenVLA-OFT [32] | 2025 | 7B | parallel continuous | LIBERO / ALOHA fine-tunes |
| GR00T N1 [34] | 2025 | 2B | diffusion transformer | robot + human video + synthetic |
| π0.5 [47] | 2025 | – | flow + subtask tokens | heterogeneous co-training |
| Gemini Robotics [18] | 2025 | – | – | proprietary |
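As an illustration of the "discrete tokens" action head listed in the table above, the following minimal sketch maps a continuous action vector to per-dimension bin indices and back. The uniform binning, the 256-bin vocabulary, the [-1, 1] action range, and the function names `discretize` and `undiscretize` are assumptions made for this example; the systems in the table may normalize and tokenize actions differently.

```python
from typing import List, Sequence


def discretize(action: Sequence[float], low: float = -1.0,
               high: float = 1.0, n_bins: int = 256) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, n_bins - 1].

    Assumption: uniform bins over a fixed [low, high] range.
    """
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)          # clamp out-of-range values
        frac = (clipped - low) / (high - low)     # position in [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: float = -1.0,
                 high: float = 1.0, n_bins: int = 256) -> List[float]:
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    action = [-1.0, 0.0, 0.37, 1.0]
    tokens = discretize(action)
    print(tokens)
    print(undiscretize(tokens))
```

Under these assumptions the round-trip error per dimension is at most half a bin width, here 1/256.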
4.2. World Models
4.3. World Models Beyond Manipulation: Evidence of a General Program
4.4. Generative Video and 3D Scene World Models
4.5. Hierarchical and Agentic Systems
4.6. Instruction Following and Language Grounding
4.7. Reinforcement-Learning Post-Training
| Method | Algorithm | Substrate | Base policy | Headline evidence |
|---|---|---|---|---|
| VLA-RL [142] | PPO + process reward | simulator | OpenVLA | gains over SFT on LIBERO |
| SimpleVLA-RL [10] | GRPO-style | simulator | OpenVLA-OFT | 17.3→91.7% from 1 demo/task |
| RIPT-VLA [17] | interactive, dyn. sampling | simulator | OpenVLA-OFT | low-data stabilization |
| GRAPE [145] | trajectory preference | offline | OpenVLA | success + safety objectives |
| [143] | on-policy for flow heads | simulator | π0-class | RL for flow policies |
| RobustVLA [144] | robustness-aware RL | simulator | OpenVLA-class | perturbation-targeted reward |
| V-GPS [146] | value re-ranking | inference only | Octo/OpenVLA | gains without weight updates |
| World4RL [40] | RL in diffusion WM | learned model | pretrained policies | refinement without simulator |
| World-Gymnast [93] | RL in video WM | learned model | BC base | physical-interaction-free RL |
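The "GRPO-style" entry above refers to group-relative advantages: several rollouts are sampled for the same task, and each rollout's reward is standardized against its own group rather than against a learned value function. A minimal sketch, assuming binary success rewards and a hypothetical helper name `group_relative_advantages`; it illustrates the general idea and is not the implementation of any listed method.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts of one task: two successes, two failures (made-up data).
    print(group_relative_advantages([1, 0, 0, 1]))
    # A group with identical outcomes carries no relative signal.
    print(group_relative_advantages([1, 1, 1, 1]))
```

A group in which every rollout succeeds or every rollout fails yields all-zero advantages and therefore contributes no gradient signal.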
4.8. Cross-Embodiment Transfer and Action-Space Unification
4.9. Robustness Mechanisms
4.10. Efficiency as a Model-Axis Variable
4.11. What the Model Axis Buys
5. The Data Axis
5.1. Real-Robot Demonstration Datasets
| Dataset | Year | Scale | Embodiments | Collection | Documented weakness |
|---|---|---|---|---|---|
| Real-robot datasets | | | | | |
| RoboNet [50] | 2019 | 15M frames | 7 arms | scripted | weak action semantics |
| RT-1 [1] | 2022 | 130k traj | 1 (Everyday Robot) | teleop | single site, single emb |
| BridgeData V2 [169] | 2023 | 60k traj | 1 (WidowX) | teleop | toy-scale objects, 1 emb |
| RH20T [171] | 2023 | 110k+ traj, 40+h | 4 arms | teleop + force | short horizons |
| OXE [2] | 2023 | 1M+ traj | 22 | aggregation | heterogeneous quality, schema drift |
| DROID [170] | 2024 | 76k traj, 350h, 564 scenes | 1 (Franka) | federated teleop | policy results initially weak in-distribution |
| RoboMIND [151] | 2024 | ∼107k traj | 4 | teleop | lab scenes |
| AgiBot World [3] | 2025 | 1M+ traj, 217 tasks | 1 fleet | factory teleop | single platform family |
| RoboCOIN [152] | 2025 | bimanual, multi-emb | many | consortium teleop | recency, uneven density |
| Kaiwu [173] | 2025 | multimodal episodes | 1 cell | teleop + tactile/audio | scale |
| Open-H [174] | 2026 | large, medical | surgical | consortium | domain-specific |
| Human and egocentric video | | | | | |
| Ego4D [175] | 2022 | 3,670h | human | worn cameras | no actions, no robot morphology |
| Ego-Exo4D [176] | 2024 | 1,286h skilled | human | multi-view | same |
| EgoDex [177] | 2025 | ∼800h + 3D hands | human | AVP capture | retargeting gap |
| EgoLive [178] | 2026 | large, task-oriented | human | head-mounted | recency |
| Simulation suites and generated datasets | | | | | |
| Meta-World [179] | 2019 | 50 tasks | 1 (Sawyer) | scripted | state obs, no language |
| RLBench [180] | 2019 | 100 tasks | 1 (Franka) | planner | rendering realism |
| CALVIN [51] | 2022 | 34 tasks, play data | 1 (Franka) | teleop play | 4 fixed scenes |
| ManiSkill2/3 [181,182] | 2023–24 | 20+ families, 2k+ objects | several | planner/RL | object-centric, short tasks |
| LIBERO [12] | 2023 | 130 tasks × 50 demos | 1 (Franka) | teleop | one scene/instruction per task; memorization-prone [13] |
| MimicGen [183] | 2023 | 50k demos from ∼200 | several | auto-synthesis | inherits seed-demo biases |
| BEHAVIOR-1K [184] | 2024 | 1,000 activities | several | sampled | evaluation cost |
| RoboCasa [185] | 2024 | 100 tasks, 2.5k+ assets | several | MimicGen-expanded | kitchen-domain bound |
| DexMimicGen [186] | 2025 | bimanual dexterous | humanoid hands | auto-synthesis | same |
5.2. Human and Egocentric Video
5.3. Simulation
5.4. Lessons from the Flagship Collections
5.5. Documentation, Licensing, and the Missing Metadata
5.6. Generated and Augmented Data
5.7. Collection Economics
5.8. Data Scaling Evidence

5.9. Quality Versus Quantity
5.10. What the Data Axis Buys
6. The Evaluation Crisis
6.1. Benchmark Inflation and the Memorization Diagnosis

6.2. Why Standard Protocols Could Not Detect It
6.3. Sim-to-Real Validity and Learned Evaluators
6.4. Statistical Power and the Reproducibility Layer
| Instrument | Substrate | Perturbation axes | Validity evidence |
|---|---|---|---|
| LIBERO [12] | sim (MuJoCo) | none (standard protocol) | exposed by [13] |
| CALVIN [51] | sim | env split A–D | long-horizon chains |
| RoboCasa [185] | sim | scene/object sampling | – |
| THE COLOSSEUM [15] | sim (RLBench) | 14 factors | sim-real ranking correlation |
| LIBERO-PRO [13] | sim | 4–5 axes | memorization probes |
| LIBERO-Plus [14] | sim | 7 factors | factor decomposition |
| LIBERO-X [209] | sim | hierarchical, cumulative | – |
| GenManip [201] | sim (Isaac) | LLM-generated scenes | human-in-loop audit |
| SimplerEnv [52] | calibrated sim | visual matching | explicit real correlation |
| World-model evaluators [41,42] | learned | arbitrary in principle | uncalibrated |
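The "sim-real ranking correlation" and "explicit real correlation" entries can be made concrete with a rank correlation between per-policy success rates measured in simulation and on hardware. The sketch below computes Spearman's ρ using only the standard library; the success rates are made-up numbers for illustration, and `average_ranks` and `spearman` are hypothetical helper names.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical success rates of four policies (not from any cited work).
    sim_success = [0.90, 0.70, 0.50, 0.30]
    real_success = [0.80, 0.60, 0.65, 0.20]
    print(spearman(sim_success, real_success))
```

A ρ near 1 indicates that the simulated instrument orders policies as the hardware does, even if absolute success rates differ.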
6.5. What Good Evaluation Requires
7. Evidence Synthesis: Data Versus Models
7.1. The Evidence Matrix
| Finding | Supports | Sources | Scope limitation |
|---|---|---|---|
| Generalization scales as a power law in environment/object diversity; per-environment volume saturates | data (diversity) | [205] | UMI tasks; single embodiment |
| Cross-embodiment aggregation improves underrepresented domains ∼50% | data (pooling) | [2] | evaluation on contributing labs |
| Mixture reweighting changes success without new data | data (composition) | [154] | OXE-scale only |
| Demonstration quality outweighs quantity on fixed tasks | data (quality) | [207,208] | small-scale, pre-VLA |
| Action-interface redesign lifts LIBERO 76.5→97.1% on identical data | model (architecture) | [32] | standard protocol; gap unprobed |
| Web-pretrained backbones transfer semantics to control | model (representation) | [4,5] | semantic, not motoric, transfer |
| RL post-training improves robustness over SFT on matched data; PPO > DPO/GRPO | model (objective) | [10,11] | LIBERO-family sims |
| One demo + RL reaches 91.7% where SFT yields 17.3% | model (objective) | [10] | needs competent base; sim |
| 3D structure buys viewpoint robustness | model (inductive bias) | [14,132,158] | axis-specific |
| Hierarchy buys semantic-perturbation retention only | model (decomposition) | [14,201] | motor brittleness persists |
| ≥90% standard success collapses to 0% under perturbation | evaluation | [13,15] | train/test coincidence is the cause |
| Co-training mixtures yield open-world generalization | confounded | [34,47] | data and model changed together |
| Generated data lifts unseen-verb/environment success | confounded | [43,194] | generator is itself a model |
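The power-law finding in the first row can be checked on one's own measurements with a straight-line fit in log-log space. The sketch below assumes a relation of the form y = a·x^b, where x is a diversity count and y is a positive performance-related quantity; the choice of y, the synthetic data, and the name `fit_power_law` are assumptions for illustration and are not taken from [205].

```python
import math
from typing import Sequence, Tuple


def fit_power_law(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x**b via linear regression on logs.

    Returns (a, b). All x and y values must be strictly positive.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sxx  # slope
    a = math.exp(my - b * mx)                                   # intercept
    return a, b


if __name__ == "__main__":
    # Synthetic data generated from y = 2 * x**-0.5 (illustrative only).
    envs = [1, 2, 4, 8, 16]
    score = [2.0 * e ** -0.5 for e in envs]
    a, b = fit_power_law(envs, score)
    print(a, b)
```

On real measurements the fit should be accompanied by residuals or a held-out point, since a straight line in log-log space over a narrow range does not by itself establish a power law.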

7.2. Identified Confounds
7.3. The Conditional Answer
7.4. A Failure Taxonomy and Likely Interventions
7.5. A Decision Framework for Practitioners
7.6. Threats to the Validity of This Synthesis
8. Bottlenecks
8.1. Data
8.2. Models
8.3. Evaluation
8.4. Process
9. Future Directions
| # | Direction | Hypothesis under test | Instrument | Cost |
|---|---|---|---|---|
| 1 | World-model isolation experiment | predictive objectives buy shift retention | LIBERO-PRO/Plus, held-out axes | low |
| 2 | Counterfactual dataset structure | layout-instruction confounds cause memorization | retrofitted LIBERO; new datasets | low–mid |
| 3 | Diversity-adjusted cost model | diversity, not volume, prices generalization | collection logs + [205] | low |
| 4 | Sealed-axis benchmarks | public perturbations can be overfit | rotating perturbation pools | mid |
| 5 | Calibrated learned evaluators | world models can rank policies faithfully | SimplerEnv-style correlation | mid |
| 6 | Generated-data attribution | generator artifacts propagate to policies | attribution tooling | mid |
| 7 | Unified world-action training | joint predictor-policy beats both alone | factor-controlled manipulation | high |
| 8 | Counterfactual training monitors | loss of language sensitivity is observable online | instruction-divergence probes | low |
| 9 | Contact-data scaling study | tactile diversity scales like visual diversity | force-instrumented collection | high |
| 10 | Safety-aware perturbation evaluation | robustness gains may trade against safety | joint success-safety reporting | mid |
10. Conclusions
References
- Brohan, A.; et al. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv 2022, arXiv:2212.06817. [Google Scholar]
- O’Neill, A.; et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. [Google Scholar]
- Bu, Q.; et al. AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems. arXiv 2025, arXiv:2503.06669. [Google Scholar]
- Brohan, A.; et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Proceedings of the Conference on Robot Learning (CoRL), 2023. [Google Scholar]
- Kim, M.J.; et al. OpenVLA: An Open-Source Vision-Language-Action Model. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Chi, C.; et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of the Robotics: Science and Systems (RSS), 2023. [Google Scholar]
- Black, K.; et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv 2024, arXiv:2410.24164. [Google Scholar]
- Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering Diverse Control Tasks through World Models. Nature 2025, 640, 647–653. [Google Scholar] [CrossRef] [PubMed]
- Assran, M.; et al. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv 2025, arXiv:2506.09985. [Google Scholar]
- Li, H.; et al. SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning. arXiv 2025, arXiv:2509.09674. [Google Scholar]
- Liu, J.; et al. What Can RL Bring to VLA Generalization? An Empirical Study. arXiv 2025, arXiv:2505.19789. [Google Scholar]
- Liu, B.; et al. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks, 2023. [Google Scholar]
- Zhou, X.; et al. LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization. arXiv 2025, arXiv:2510.03827. [Google Scholar]
- Fei, S.; et al. LIBERO-Plus: In-Depth Robustness Analysis of Vision-Language-Action Models. arXiv 2025, arXiv:2510.13626. [Google Scholar]
- Pumacay, W.; Singh, I.; Duan, J.; Krishna, R.; Thomason, J.; Fox, D. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Hu, A.; et al. GAIA-1: A Generative World Model for Autonomous Driving. arXiv 2023, arXiv:2309.17080. [Google Scholar]
- Tan, S.; Dou, K.; Zhao, Y.; Krähenbühl, P. Interactive Post-Training for Vision-Language-Action Models. arXiv 2025, arXiv:2505.17016. [Google Scholar]
- Gemini Robotics Team; et al. Gemini Robotics: Bringing AI into the Physical World. arXiv 2025, arXiv:2503.20020. [Google Scholar]
- Hu, Y.; et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. arXiv 2023, arXiv:2312.08782. [Google Scholar]
- Firoozi, R.; et al. Foundation Models in Robotics: Applications, Challenges, and the Future. Int. J. Robot. Res. 2024. [Google Scholar] [CrossRef]
- Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A Survey on Vision-Language-Action Models for Embodied AI. arXiv 2024, arXiv:2405.14093. [Google Scholar]
- Kawaharazuka, K.; et al. Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications. arXiv 2025, arXiv:2510.07077. [Google Scholar]
- Yu, Z.; Wang, B.; Zeng, P.; et al. A Survey on Efficient Vision-Language-Action Models. arXiv 2025, arXiv:2510.24795. [Google Scholar]
- Karcini, E.; Mehrban, F.; Ajoudani, A.; et al. Robots Need More Than VLAs and World Models. arXiv 2026, arXiv:2606.06556. [Google Scholar]
- Ha, D.; Schmidhuber, J. Recurrent World Models Facilitate Policy Evolution. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2018. [Google Scholar]
- Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. In Proceedings of the International Conference on Learning Representations (ICLR), 2020. [Google Scholar]
- Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering Atari with Discrete World Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. [Google Scholar]
- Hansen, N.; Su, H.; Wang, X. TD-MPC2: Scalable, Robust World Models for Continuous Control. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. [Google Scholar]
- NVIDIA; Agarwal, N.; et al. Cosmos World Foundation Model Platform for Physical AI. arXiv 2025, arXiv:2501.03575. [Google Scholar]
- Pertsch, K.; et al. FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv 2025, arXiv:2501.09747. [Google Scholar]
- Zhao, T.Z.; Kumar, V.; Levine, S.; Finn, C. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of the Robotics: Science and Systems (RSS), 2023. [Google Scholar]
- Kim, M.J.; Finn, C.; Liang, P. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv 2025, arXiv:2502.19645. [Google Scholar]
- Liu, S.; et al. RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation. arXiv 2024, arXiv:2410.07864. [Google Scholar]
- NVIDIA; Bjorck, J.; et al. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv 2025, arXiv:2503.14734. [Google Scholar]
- Jang, E.; et al. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning. In Proceedings of the Conference on Robot Learning (CoRL), 2021. [Google Scholar]
- Shridhar, M.; Manuelli, L.; Fox, D. CLIPort: What and Where Pathways for Robotic Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2021. [Google Scholar]
- Reed, S.; et al. A Generalist Agent. Transactions on Machine Learning Research 2022. [Google Scholar] [CrossRef] [PubMed]
- Bousmalis, K.; et al. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. Transactions on Machine Learning Research 2023. [Google Scholar]
- Zhou, G.; Pan, H.; LeCun, Y.; Pinto, L. DINO-WM: World Models on Pre-Trained Visual Features Enable Zero-Shot Planning. arXiv 2024, arXiv:2411.04983. [Google Scholar]
- Jiang, Z.; et al. World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation. arXiv 2025, arXiv:2509.19080. [Google Scholar]
- Guo, Y.; Shi, L.X.; Chen, J.; Finn, C. Ctrl-World: A Controllable Generative World Model for Robot Manipulation. arXiv 2025, arXiv:2510.10125. [Google Scholar]
- Quevedo, J.; et al. WorldGym: World Model as an Environment for Policy Evaluation. arXiv 2025, arXiv:2506.00613. [Google Scholar]
- Jang, J.; et al. DreamGen: Unlocking Generalization in Robot Learning through Video World Models. arXiv 2025, arXiv:2505.12705. [Google Scholar]
- Ahn, M.; et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In Proceedings of the Conference on Robot Learning (CoRL), 2022. [Google Scholar]
- Belkhale, S.; Ding, T.; et al. RT-H: Action Hierarchies Using Language. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Physical Intelligence; Shi, L.X.; et al. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models. arXiv 2025, arXiv:2502.19417. [Google Scholar]
- Physical Intelligence; Black, K.; et al. π0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv 2025, arXiv:2504.16054. [Google Scholar]
- Ye, S.; et al. World Action Models are Zero-Shot Policies. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhang, Z.; et al. Towards Practical World Model-Based Reinforcement Learning for Vision-Language Action Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Dasari, S.; et al. RoboNet: Large-Scale Multi-Robot Learning. In Proceedings of the Conference on Robot Learning (CoRL), 2019. [Google Scholar]
- Mees, O.; Hermann, L.; Rosete-Beas, E.; Burgard, W. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. IEEE Robotics and Automation Letters 2022. [Google Scholar] [CrossRef]
- Li, X.; et al. Evaluating Real-World Robot Manipulation Policies in Simulation. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Liu, Y.; et al. World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Bredis, G.; Balagansky, N.; Gavrilov, D.; Rakhimov, R. Next Embedding Prediction Makes World Models Stronger. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zawalski, M.; et al. Robotic Control via Embodied Chain-of-Thought Reasoning. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Zhang, X. What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Arghal, R.; et al. A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Farid, K. What Drives Compositional Generalization? The Importance of Continuous Training Objectives in Visual Generative Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Li, X.; et al. Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models. arXiv 2024, arXiv:2412.14058. [Google Scholar]
- Octo Model Team; Ghosh, D.; et al. Octo: An Open-Source Generalist Robot Policy. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Wang, L.; Chen, X.; Zhao, J.; He, K. Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2024. [Google Scholar]
- Doshi, R.; Walke, H.; Mees, O.; Dasari, S.; Levine, S. Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Li, Q.; et al. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. arXiv 2024, arXiv:2411.19650. [Google Scholar]
- Qu, D.; et al. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models. arXiv 2025, arXiv:2501.15830. [Google Scholar]
- Zheng, R.; et al. TraceVLA: Visual Trace Prompting Improves Spatial-Temporal Awareness for Generalist Robotic Policies. arXiv 2024, arXiv:2412.10345. [Google Scholar]
- Wu, P.; Escontrela, A.; Hafner, D.; Abbeel, P.; Goldberg, K. DayDreamer: World Models for Physical Robot Learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022. [Google Scholar]
- Micheli, V.; Alonso, E.; Fleuret, F. Transformers are Sample-Efficient World Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. [Google Scholar]
- Schrittwieser, J.; et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef] [PubMed]
- Bruce, J.; et al. Genie: Generative Interactive Environments. In Proceedings of the International Conference on Machine Learning (ICML), 2024. [Google Scholar]
- Parker-Holder, J.; et al. Genie 2: A Large-Scale Foundation World Model. Google DeepMind Blog 2024. [Google Scholar] [CrossRef]
- Yang, M.; et al. Learning Interactive Real-World Simulators. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. [Google Scholar]
- Valevski, D.; Leviathan, Y.; Arar, M.; Fruchter, S. Diffusion Models are Real-Time Game Engines. arXiv 2024, arXiv:2408.14837. [Google Scholar]
- Alonso, E.; et al. Diffusion for World Modeling: Visual Details Matter in Atari. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2024. [Google Scholar]
- Kanervisto, A.; et al. World and Human Action Models towards Gameplay Ideation. Nature 2025, 638, 656–663. [Google Scholar] [CrossRef] [PubMed]
- Du, Y.; et al. Learning Universal Policies via Text-Guided Video Generation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023. [Google Scholar]
- Black, K.; Nakamoto, M.; et al. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. [Google Scholar]
- Wu, H.; et al. Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. [Google Scholar]
- Cheang, C.L.; et al. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv 2024, arXiv:2410.06158. [Google Scholar]
- Kang, B.; et al. How Far is Video Generation from World Model: A Physical Law Perspective. arXiv 2024, arXiv:2411.02385. [Google Scholar]
- Wu, S.; et al. RigidBench: Evaluating Rigid-Body Physics in Video Generation Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Deng, Y.; et al. Rethinking Video Generation Model for the Embodied World. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Assran, M.; et al. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [Google Scholar]
- Bardes, A.; et al. Revisiting Feature Prediction for Learning Visual Representations from Video. Transactions on Machine Learning Research 2024. [Google Scholar] [CrossRef] [PubMed]
- LeCun, Y. A Path Towards Autonomous Machine Intelligence. OpenReview Prepr. 2022. [Google Scholar]
- Mur-Labadia, L.; et al. V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning. arXiv 2026, arXiv:2603.14482. [Google Scholar]
- Huang, W.; Chao, Y.W.; Mousavian, A.; Liu, M.Y.; Fox, D.; Mo, K.; Fei-Fei, L. PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wang, Y.; et al. WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotics. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Abdulsalam, A. LaMO: A Latent Motion World Model for Long-Horizon Prediction. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Kim, H.; et al. Hierarchical Latent Action Model. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Xiang, C.; et al. Consistent Video World Model With Geometry-Aware Rotary Position Embedding. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Sharma, R. Cross-View World Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Chen, W.; et al. H-WM: Robotic Task and Motion Planning Guided by Hierarchical World Model. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Kumar, A.; et al. World-Gymnast: Training Robots with Reinforcement Learning in a World Model. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Li, C.; et al. Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Kolling, A.H.; et al. Evidential Latent World Models for Safe Model-Based Reinforcement Learning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Deb, R. Model Predictive Control with Differentiable World Models for Offline Reinforcement Learning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Levy, J.; et al. Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Ennadir, S. Understanding Early Collapse in Predictive World-Model Pretraining. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Yang, K. Temporal Reversal Asymmetry: A Physics Inspired Metric for Evaluating World Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Mairukh, N.; et al. PhysLang: A Small Diagnostic Framework for Language-Grounded World Modeling. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhu, Y. Do LLMs Build Spatial World Models? Evidence From Grid-World Maze Tasks. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Maes, L.; Lidec, Q.L.; Haramati, D.; Massaudi, N.; Scieur, D.; LeCun, Y.; Balestriero, R. stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Yang, Z.; et al. Physical Informed Driving World Model. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Scott, D.; et al. Coherence-Validated Causal World Models for Multi-Scale Alzheimer’s Disease Progression and Pharmacologic Reversal. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Scott, D.; et al. Reinforcement Learning with World Models for Optimizing Alzheimer’s Disease Treatment Timing and Dosing. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Memon, Z.; et al. Toward World Models for Epidemiology. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Supreeth, M.; et al. World Models as Execution Simulators for Automated Program Repair. In Proceedings of the International Conference on Learning Representations (ICLR), 2026. [Google Scholar]
- Guan, Y.; et al. Computer-Using World Model. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhang, Y.; et al. Cognitive Digital Twin Framework: Modeling and Real-Time Decision Making. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Terver, B.; et al. A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhang, Q.; et al. Gridwm-Judge: Evaluating Vision-Language Model Judges in Grid Worlds via World Model Deficits. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Imagination, L. Latent Imagination Thinking: Beyond Recursive Models for Reasoning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Niimi, J. The Mouth is Not the Brain: Bridging Energy Based World Models and Language Generation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wu, J.; et al. Visual Generation Unlocks Human-Like Reasoning Through Multimodal World Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wang, Y.; Bigelow, E.; Ullman, T.; Tang, Y.; Risi, S. Integrating Simulation and Chain-Of-Thought Reasoning in Multimodal-Language Models For Physical Reasoning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Yu, X.; et al. DYNA-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Kobanda, A.; Radji, W. Intrinsic-Energy Joint Embedding Predictive Architectures Induce Quasimetric Spaces. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Facury, L.; et al. Learning Navigable World Models via Latent Energy Shaping. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhang, J.; et al. Reward-Forcing: Autoregressive Video Generation with Reward Feedback. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Meng, X.; et al. Identity-GRPO: Optimizing Multi-Human Identity-Preserving Video Generation via Reinforcement Learning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Cudlenco, N.; et al. GEST-Engine: Controllable Multi-Actor Video Synthesis with Perfect Spatiotemporal Annotations. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wang, J.; et al. EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Lee, A. Dexsim: Real-Time Dexterous Simulation with Unified Causal Video Diffusion. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Liu, S.; Wu, Z.; Yu, H.; Gao, J.; Alvarez, J.M. Structure From Diffusion: Taming Video Diffusion Models for Camera Pose Estimation in Dynamic Videos. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Mun, H.; Jin, I.H.; Kim, S.; Kong, K. Fluidworld: Fluid-Like Interactive Dynamics for 4D Worlds. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Duong, C.; et al. Toward Pixel-Grounded World Models for Powered Descent: A Rocket Landing Benchmark and Expert Baseline. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Khalid, S.; et al. Latentgs: Probabilistic Densification for Efficient, Compact, and Faster 3D Gaussian Splatting. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Vaezpour, E.; Javadi, A.; Javidi, T. Active World-Model with 4D-informed Retrieval for Exploration and Awareness. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Can, T.; et al. SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Ma, W.; Wang, C.; Yuan, R.; Chen, H.; Dai, N.; Yang, Y.; Qian, C.; Wang, Z.Y.; Yuille, A.; Chen, J. CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhang, X.; et al. Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Ze, Y.; et al. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Huang, W.; et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Proceedings of the Conference on Robot Learning (CoRL), 2022. [Google Scholar]
- Liang, J.; et al. Code as Policies: Language Model Programs for Embodied Control. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023. [Google Scholar]
- Huang, W.; et al. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. In Proceedings of the Conference on Robot Learning (CoRL), 2023. [Google Scholar]
- Fang, K.; Liu, F.; Abbeel, P.; Levine, S. MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Driess, D.; et al. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the International Conference on Machine Learning (ICML), 2023. [Google Scholar]
- Zeng, X.; et al. Tree of Options: Temporally Extended World Modeling, Planning, and Execution with Large Language Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Feng, Y.; et al. Environment Maps: Structured Environmental Representations for Long-Horizon Agents. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wang, Q.; Huang, W.; Zhou, Y.; Yin, H.; Bao, T.; Lyu, J.; Liu, W.; Zhang, R.; Wu, J.; Fei-Fei, L.; et al. ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Zhang, J.; et al. ProgressLM: Towards Progress Reasoning in Vision-Language Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Lu, G.; et al. VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning. arXiv 2025, arXiv:2505.18719. [Google Scholar]
- Chen, K.; et al. πRL: Online RL Fine-Tuning for Flow-Based Vision-Language-Action Models. arXiv 2025, arXiv:2510.25889. [Google Scholar]
- Wang, H.; et al. RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models. arXiv 2025, arXiv:2511.01331. [Google Scholar]
- Zhang, Z.; Zheng, K.; Chen, Z.; et al. GRAPE: Generalizing Robot Policy via Preference Alignment. arXiv 2024, arXiv:2411.19309. [Google Scholar]
- Nakamoto, M.; Mees, O.; Kumar, A.; Levine, S. Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- RLinf Team. RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI. 2025. Available online: https://github.com/RLinf/RLinf.
- Physical Intelligence. openpi: Open-Source Robot Foundation Models. 2025. Available online: https://github.com/Physical-Intelligence/openpi.
- Lazzati, F. Robustness in the Face of Partial Identifiability in Reward Learning Problems. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Gupta, P.; Gupta, V. Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wu, K.; et al. RoboMIND: Benchmark on Multi-Embodiment Intelligence Normative Data for Robot Manipulation. arXiv 2024, arXiv:2412.13877. [Google Scholar]
- Wu, S.; et al. RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation. arXiv 2025, arXiv:2511.17441. [Google Scholar]
- Dexterous, E. D(r,o) Grasp: a Unified Representation of Robot and Object Interaction for Cross-Embodiment Dexterous Grasping. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Hejna, J.; Bhateja, C.; Jiang, Y.; Pertsch, K.; Sadigh, D. Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Wang, R.; et al. Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance. arXiv 2026, arXiv:2605.24203. [Google Scholar]
- Multi-Object, P. Mask2Act: Predictive Multi-Object Tracking as Video Pre-Training for Robot Manipulation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Ke, T.W.; Gkanatsios, N.; Fragkiadaki, K. 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Goyal, A.; et al. RVT: Robotic View Transformer for 3D Object Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2023. [Google Scholar]
- Shridhar, M.; Manuelli, L.; Fox, D. Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2022. [Google Scholar]
- Imitation, C. Sacil: Size-aware Contrastive Imitation Learning for Language-conditioned Multi-task Robotics. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; Gupta, A. R3M: A Universal Visual Representation for Robot Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2022. [Google Scholar]
- Ma, Y.J.; et al. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. [Google Scholar]
- Radosavovic, I.; Xiao, T.; James, S.; Abbeel, P.; Malik, J.; Darrell, T. Real-World Robot Learning with Masked Visual Pre-Training. In Proceedings of the Conference on Robot Learning (CoRL), 2022. [Google Scholar]
- Majumdar, A.; et al. Where are We in the Search for an Artificial Visual Cortex for Embodied Intelligence? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023. [Google Scholar]
- Karamcheti, S.; et al. Language-Driven Representation Learning for Robotics. In Proceedings of the Robotics: Science and Systems (RSS), 2023. [Google Scholar]
- Ziakas, C. Grounding Generated Videos in Feasible Plans via World Models. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wu, Z.; et al. Speedup Patch: Learning a Plug-And-Play Policy to Accelerate Embodied Manipulation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Mestha, H. Block Mamba: Efficient Scalable Structured Sparsity for Mamba. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Walke, H.; et al. BridgeData V2: A Dataset for Robot Learning at Scale. In Proceedings of the Conference on Robot Learning (CoRL), 2023. [Google Scholar]
- Khazatsky, A.; et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Fang, H.S.; et al. RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. [Google Scholar]
- Fu, Z.; Zhao, T.Z.; Finn, C. Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Jiang, S.; et al. Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning. IEEE Robotics and Automation Letters 2025. [Google Scholar]
- Open-H-Embodiment Consortium; Nelson, N.; et al. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics. arXiv 2026, arXiv:2604.21017. [Google Scholar]
- Grauman, K.; et al. Ego4D: Around the World in 3,000 Hours of Egocentric Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [Google Scholar]
- Grauman, K.; et al. Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [Google Scholar]
- Hoque, R.; et al. EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video. arXiv 2025, arXiv:2505.11709. [Google Scholar]
- Li, Y.; et al. EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks. arXiv 2026, arXiv:2604.23570. [Google Scholar]
- Yu, T.; et al. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning. In Proceedings of the Conference on Robot Learning (CoRL), 2019. [Google Scholar]
- James, S.; Ma, Z.; Arrojo, D.R.; Davison, A.J. RLBench: The Robot Learning Benchmark and Learning Environment. IEEE Robotics and Automation Letters 2020. [Google Scholar] [CrossRef]
- Gu, J.; et al. ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills. In Proceedings of the International Conference on Learning Representations (ICLR), 2023. [Google Scholar]
- Tao, S.; et al. ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI. arXiv 2024, arXiv:2410.00425. [Google Scholar]
- Mandlekar, A.; et al. MimicGen: A Data Generation System for Scalable Robot Learning Using Human Demonstrations. In Proceedings of the Conference on Robot Learning (CoRL), 2023. [Google Scholar]
- Li, C.; et al. BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation. arXiv 2024, arXiv:2403.09227. [Google Scholar]
- Nasiriany, S.; et al. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Jiang, Z.; et al. DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2025. [Google Scholar]
- Chi, C.; et al. Universal Manipulation Interface: In-the-Wild Robot Teaching Without In-the-Wild Robots. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Wang, C.; et al. DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation. In Proceedings of the Robotics: Science and Systems (RSS), 2024. [Google Scholar]
- Cheng, X.; et al. Open-TeleVision: Teleoperation with Immersive Active Visual Feedback. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Fu, Z.; Zhao, Q.; Wu, Q.; Wetzstein, G.; Finn, C. HumanPlus: Humanoid Shadowing and Imitation from Humans. In Proceedings of the Conference on Robot Learning (CoRL), 2024. [Google Scholar]
- Tobin, J.; et al. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017. [Google Scholar]
- Peng, X.B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2018. [Google Scholar]
- Akkaya, I.; et al. Solving Rubik’s Cube with a Robot Hand. arXiv 2019, arXiv:1910.07113. [Google Scholar]
- Xue, Z.; Deng, S.; Chen, Z.; Wang, Y.; Yuan, Z.; Xu, H. DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning. In Proceedings of the Robotics: Science and Systems (RSS), 2025. [Google Scholar]
- Yu, T.; et al. Scaling Robot Learning with Semantically Imagined Experience. In Proceedings of the Robotics: Science and Systems (RSS), 2023. [Google Scholar]
- Chen, Z.; Kiami, S.; Gupta, A.; Kumar, V. GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation. In Proceedings of the Robotics: Science and Systems (RSS), 2023. [Google Scholar]
- Bharadhwaj, H.; et al. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024. [Google Scholar]
- Wang, L.; et al. GenSim: Generating Robotic Simulation Tasks via Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), 2024. [Google Scholar]
- Wang, Y.; et al. RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation. In Proceedings of the International Conference on Machine Learning (ICML), 2024. [Google Scholar]
- Ha, H.; Florence, P.; Song, S. Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition. In Proceedings of the Conference on Robot Learning (CoRL), 2023. [Google Scholar]
- Gao, N.; et al. GenManip: LLM-Driven Simulation for Generalizable Instruction-Following Manipulation. arXiv 2025, arXiv:2506.10966. [Google Scholar]
- Gu, C.; et al. IGen: Scalable Data Generation for Robot Learning from Open-World Images. arXiv 2025, arXiv:2512.01773. [Google Scholar]
- Ghosh, R.; et al. Action Shapley: A Training Data Selection Metric for Training World Models for Reinforcement Learning. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Wu, X.; et al. Motion Attribution for Video Generation. In Proceedings of the ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026. [Google Scholar]
- Lin, F.; Hu, Y.; Sheng, P.; Wen, C.; You, J.; Gao, Y. Data Scaling Laws in Imitation Learning for Robotic Manipulation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. [Google Scholar]
- Pearce, T.; et al. Scaling Laws for Pre-Training Agents and World Models. arXiv 2024, arXiv:2411.04434. [Google Scholar]
- Mandlekar, A.; et al. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2021. [Google Scholar]
- Belkhale, S.; Cui, Y.; Sadigh, D. Data Quality in Imitation Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023. [Google Scholar]
- Wang, G.; et al. LIBERO-X: Robustness Litmus for Vision-Language-Action Models. arXiv 2026, arXiv:2602.06556. [Google Scholar]
- Guruprasad, P.; Sikka, H.; Song, J.; Wang, Y.; Liang, P.P. Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks. arXiv 2024, arXiv:2411.05821. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).