Appendix A. Supplementary Material
This appendix expands the roadmap and taxonomy used in the main paper. Because of the page limit, the main text focuses on the conceptual transition from LLM-based world knowledge to Physical AI; this appendix supplies the additional context needed to make the roadmap verifiable and less dependent on tables alone. In particular, we clarify the boundaries of the survey, contrast our organizing lens with existing survey perspectives, expand the taxonomy of roadmap stages, provide a more detailed world-model taxonomy, summarize evaluation protocols, list representative frontier systems, and discuss failure modes that motivate the challenges in the main paper.
The supplementary material is designed to serve two purposes. First, it makes explicit what is included and excluded in our survey. Since Physical AI overlaps with robotics, control, simulation, embodied AI, cyber-physical systems, multimodal learning, and model-based reinforcement learning, an unrestricted survey would be too broad and would obscure our central contribution. We therefore define the scope around the pathway from LLM-based world knowledge to grounded perception, grounded action, predictive world modeling, policy learning, and embodied deployment. Second, it provides additional references and categorizations that are not central enough to fit into the eight-page main paper but are useful for readers who want to trace the roadmap in more detail.
Across all tables, we use the same organizing principle: each component is described by its representational interface, its role in Physical AI, and its limitations. This makes the appendix complementary to the main paper rather than a separate literature catalogue. The tables are not intended to rank methods or claim that the listed systems are exhaustive. Instead, they provide representative anchors for the conceptual categories used throughout the survey.
Appendix A.1. Survey Scope and Boundary
Our survey is not intended to be an exhaustive review of all robotics, control, simulation, or cyber-physical systems. Instead, it studies Physical AI through the roadmap from LLM-based world knowledge to multimodal grounding, action grounding, world modeling, policy learning, and embodied deployment. This distinction is important because the term Physical AI is increasingly used across different communities with different assumptions: robotics work often emphasizes embodiment and control, vision work often emphasizes physical perception and generation, cyber-physical work often emphasizes deployment and sensing infrastructure, while language-centered work emphasizes reasoning, grounding, and agentic coordination.
The scope of this survey is therefore defined by whether a line of work contributes to the grounding of world knowledge into physical perception, prediction, planning, and action. For example, we include VLMs and MLLMs when they support spatial, temporal, or affordance grounding, but we do not attempt to cover all image captioning or visual question answering systems. Similarly, we include world models when they support prediction, simulation, planning, or policy learning for Physical AI, but we do not attempt to cover all video generation or all model-based reinforcement learning. This scope boundary is intended to reduce ambiguity for readers and reviewers: the paper is a roadmap survey centered on LLM-derived world knowledge, not a comprehensive encyclopedia of all physical intelligence.
Table A1 summarizes the intended scope. The middle column lists the categories we treat as part of the roadmap, while the right column identifies neighboring areas that are related but not exhaustively reviewed. This boundary also explains why some classical robotics, control, hardware, tactile sensing, and simulation topics are discussed only when they directly interact with foundation-model-based grounding or predictive modeling.
Table A1. Scope boundary of this survey. We organize Physical AI as a roadmap from LLM-based world knowledge to grounded perception, action, world modeling, and embodied deployment, rather than as an exhaustive survey of all robotics or physical intelligence.

| Roadmap Component | Included in This Survey | Not Exhaustively Covered |
| --- | --- | --- |
| LLM-based world knowledge | Semantic, commonsense, procedural, causal, spatial, and affordance priors encoded in LLMs [6,7,8,9,140] | General factual recall, knowledge editing, or memory analysis unrelated to physical reasoning |
| Multimodal grounding | VLMs/MLLMs that ground language-derived knowledge into images, videos, regions, objects, spatial relations, and affordances [2,10,11,12,50] | Generic captioning, VQA, or multimodal dialogue not tied to physical grounding or interaction |
| Action grounding | VLA models, action representations, policy learning, and language-conditioned embodied control [16,17,18,19,81] | Classical robot control, motion planning, or manipulation methods without foundation-model grounding |
| World models | Video, latent, interactive, and action-conditioned models that support prediction, simulation, planning, or policy learning [20,56,95,96,98] | All video generation, all simulators, or all model-based RL methods outside the Physical AI roadmap |
| Embodied systems | Systems that close the loop between perception, planning, execution, recovery, and evaluation [88,107,108,141] | Hardware-specific robot engineering, robot design, and domain-specific control stacks |
The table should be read as a boundary rather than a separation. Many excluded areas remain important to Physical AI, but they are not the organizing focus of this paper. For instance, low-level manipulation control and hardware design are indispensable for deployment, yet our discussion emphasizes how foundation models and world models interface with such systems. Likewise, video generation is relevant when it becomes temporally consistent, controllable, and action-conditioned, but generic video synthesis is not equivalent to physical world modeling.
Appendix A.2. Comparison with Existing Survey Perspectives
Existing survey perspectives cover important parts of the Physical AI landscape, but they usually begin from different assumptions. Broad Physical AI or PAI surveys often define the field from cyber-physical systems, robotics, sensing, and industrial deployment. Vision-centric generative Physical AI surveys emphasize physically grounded visual generation, physics-aware simulation, and visual understanding. VLA and robot foundation model studies focus on action spaces, robot policies, demonstrations, and embodiment-specific control. World-model-centered studies emphasize dynamics prediction, latent imagination, model-based planning, and simulation.
Our survey is complementary to these lines, but it starts from a different question: how can world knowledge encoded in LLMs be progressively grounded into perception, action, prediction, and deployment? This LLM-centered lens matters because many recent Physical AI systems use language models not merely as text interfaces, but as sources of semantic priors, task decompositions, commonsense constraints, tool orchestration, and agentic reasoning. At the same time, LLMs cannot model dense physical dynamics by themselves, which motivates the later transition toward VLA models and world models.
Table A2 clarifies this distinction. The goal of the comparison is not to claim that prior surveys are incomplete in their own scope. Rather, it shows that their organizing axes differ from ours. By making LLM-based world knowledge explicit, our survey connects the NLP and multimodal reasoning literature to Physical AI in a way that is not captured by purely robotics-centric, vision-centric, or world-model-only discussions.
Table A2. Comparison with existing perspectives. Our survey is distinguished by using LLM-based world knowledge as the organizing lens and connecting it to multimodal grounding, VLA-style action interfaces, world models, policy learning, and deployable Physical AI systems.

| Perspective | Main Focus | LLM World Knowledge | VLA / Action Grounding | World Models | Closed Systems |
| --- | --- | --- | --- | --- | --- |
| Broad Physical AI / PAI surveys [142,143] | Concepts, applications, industrial systems, and cyber-physical perspectives | Limited | Partial | Limited | Partial |
| Vision-centric Generative Physical AI [57] | Physics-aware generation, visual simulation, and physically grounded computer vision | Limited | Limited | Partial | Partial |
| VLA / robot foundation model studies [18,144,145] | Robot policies, action representations, and embodied control | Partial | Strong | Limited | Partial |
| World-model-centered studies [20,56,96,98] | Prediction, latent dynamics, model-based planning, simulation, and policy learning | Limited | Partial | Strong | Partial |
| Ours | Roadmap from LLM-based world knowledge to Physical AI | Strong | Strong | Strong | Strong |
The comparison also motivates why a roadmap structure is more appropriate than a flat taxonomy. A flat taxonomy would list LLMs, VLMs, VLAs, world models, and agents as independent families. Our view instead treats them as progressively more physically grounded interfaces: language priors are grounded into perception, perception and language are grounded into action, and action must ultimately be supported by predictive models and closed-loop deployment.
Appendix A.3. Extended Roadmap Taxonomy
The roadmap in the main paper compresses a large amount of literature into a small number of stages. Table A3 expands this roadmap by identifying the dominant representation at each stage, its role in Physical AI, and representative works. The table is intended to make explicit the hidden continuity between fields that are often discussed separately: NLP world knowledge, multimodal representation learning, robot action modeling, model-based prediction, policy learning, and embodied deployment.
A key design choice in this taxonomy is to organize methods by the interface through which knowledge becomes physically useful. LLM-based world knowledge is primarily textual and parametric. Multimodal grounding introduces visual and spatial representations. Action grounding introduces action tokens, trajectories, chunks, skills, and continuous controls. World modeling introduces future states, latent dynamics, rewards, values, or action-conditioned transitions. Policy learning then converts these representations into executable behavior, and embodied deployment tests whether the full stack can operate under feedback, noise, and real-world constraints.
This taxonomy also explains why no single model family currently solves Physical AI. LLMs provide broad priors but not dense physical state. VLMs and MLLMs ground perception but often remain language-mediated. VLAs provide an action-facing interface but struggle with cross-embodiment generalization and long-horizon prediction. World models provide predictive and simulative mechanisms but must still be connected to semantic goals, action interfaces, and reliable policies. Physical AI therefore emerges from the composition of these layers rather than from any one layer alone.
Table A3. Extended taxonomy of the roadmap from LLM-based world knowledge to Physical AI.

| Stage | Main Representation | Role in Physical AI | Representative Works |
| --- | --- | --- | --- |
| LLM-based world knowledge | Textual and parametric knowledge | Provides semantic, commonsense, procedural, causal, spatial, and affordance priors | LAMA, closed-book QA, factual recall, procedural knowledge [6,7,8,140] |
| Multimodal grounding | Image/video-language representations | Grounds world knowledge into objects, scenes, spatial relations, temporal events, and affordances | CLIP, Flamingo, BLIP-2, LLaVA, Gemini [2,10,11,12,50] |
| Action grounding | Action tokens, trajectories, chunks, skills, or continuous controls | Maps perception and language instructions to executable actions | PaLM-E, RT-2, OpenVLA, π-series [16,17,18,19,81] |
| World modeling | Future pixels, latent states, rewards, values, or action-conditioned transitions | Predicts and simulates possible futures for planning, policy learning, and counterfactual reasoning | World Models, Dreamer, MuZero, Genie, Cosmos, V-JEPA [20,56,96,98,101,103] |
| Policy learning | Learned policies, action experts, diffusion/flow policies, or controllers | Converts perception, reasoning, and prediction into behavior | ACT, FAST, RDT-1B, GR00T N1 [79,80,83,88] |
| Embodied deployment | Closed-loop systems with sensing, planning, execution, verification, and recovery | Tests whether models can reliably act in real or interactive environments | Gemini Robotics, RoboCasa, LIBERO, EmbodiedBench [107,108,128,130,135] |
The representative works in the last column are selected as anchors rather than exhaustive lists. Many systems occupy multiple stages: for instance, a VLA system may combine multimodal grounding, action tokenization, policy learning, and real-world evaluation. We place each representative work according to its most salient role in the roadmap, while acknowledging that frontier systems increasingly blur these boundaries.
Appendix A.4. Extended Taxonomy of World Models
World models are used differently across reinforcement learning, video generation, robotics, autonomous driving, and Physical AI. In model-based reinforcement learning, a world model often refers to a learned transition or reward model used for planning. In video generation, the term is increasingly used for models that generate plausible future frames or interactive visual environments. In robotics and embodied AI, a world model should support action-conditioned prediction, counterfactual reasoning, recovery, and policy learning. These meanings overlap but are not identical.
To avoid treating all generative video models or all model-based policies as the same type of world model, Table A4 organizes world models by prediction target and function in the Physical AI roadmap. This organization is important for the main paper’s argument: world models are the stage where Physical AI begins to move beyond language-mediated priors and toward directly learned predictive or simulative knowledge about physical dynamics.
Table A4. Extended taxonomy of world models for Physical AI. The categories are organized by prediction target and their function in the roadmap from LLM-based world knowledge to embodied action.

| Category | Prediction Target | Role in Physical AI | Representative Works |
| --- | --- | --- | --- |
| Classical / model-based RL world models | Future states, rewards, values, or policy-relevant quantities | Planning, latent imagination, decision making, and policy improvement | World Models, PlaNet, Dreamer, DreamerV3, MuZero [20,95,96,97,98] |
| Video-space world models | Future pixels, frames, or video tokens | Visual imagination, future scene prediction, synthetic data, and simulated experience | GAIA-1, UniSim, Genie, Cosmos [56,99,100,101] |
| Latent / representation-space world models | Future latent states, embeddings, or masked spatiotemporal representations | Efficient long-horizon prediction, compact planning, and control-relevant representation learning | PlaNet, Dreamer, I-JEPA, V-JEPA, V-JEPA 2 [95,96,102,103,104] |
| Interactive / action-conditioned world models | Future observations or latent states conditioned on candidate actions | Counterfactual reasoning, simulation-based policy learning, safety evaluation, and recovery | MuZero, UniSim, Genie, GAIA-1, V-JEPA 2, Cosmos [56,98,99,100,101,104] |
| World foundation models for Physical AI | General-purpose predictive or generative world representations | Adaptable substrate for robotics, autonomous driving, embodied agents, and synthetic data | Cosmos, Genie-style models, V-JEPA-style models [56,101,103,104] |
Appendix A.4.1. Classical and Decision-Centric World Models.
Classical world models are rooted in model-based reinforcement learning and planning. They learn transition, reward, value, or policy-relevant predictions that allow agents to plan before acting. World Models, PlaNet, Dreamer, DreamerV3, and MuZero establish this foundation by showing that agents can learn compact internal models and use them for imagination, planning, and policy improvement [20,95,96,97,98]. For Physical AI, this line provides the decision-making substrate: agents should not only react to observations, but also evaluate possible futures.
Appendix A.4.2. Video-Space World Models.
Video-space world models predict future visual observations. They are attractive for Physical AI because videos expose motion, temporal evolution, scene changes, and possible future outcomes. GAIA-1 models autonomous-driving futures from video, text, and action inputs [99]; UniSim learns an interactive real-world simulator from heterogeneous data and uses it for policy training [100]; Genie learns generative interactive environments from unlabelled videos [101]; and Cosmos positions world foundation models as adaptable world models for Physical AI [56]. The main limitation is that visual realism does not guarantee physical correctness. A generated rollout may look plausible while violating object permanence, contact constraints, controllability, or causal consistency.
Appendix A.4.3. Latent and Representation-Space World Models.
Latent world models predict in compact representation spaces rather than pixel space. This makes them more efficient for planning and control because they can focus on task-relevant dynamics instead of reconstructing every visual detail. JEPA-style models further argue that predictive modeling should happen in representation space rather than through full generative reconstruction [21,102,103]. V-JEPA 2 extends this idea to video-scale learning and post-trains an action-conditioned latent world model for robot planning [104]. For Physical AI, latent prediction is especially useful when the agent needs fast rollouts, uncertainty-aware planning, or long-horizon reasoning under limited computational budget.
Appendix A.4.4. Interactive and Action-Conditioned World Models.
Physical AI requires models that respond to actions, not only models that passively predict future frames. Interactive and action-conditioned world models estimate counterfactual futures under candidate actions, enabling planning, safety checking, policy learning, and recovery. This requirement separates physical world models from generic video generators: a Physical AI world model should be controllable, temporally consistent, action-conditioned, and useful for closed-loop decision making. Such models also make it possible to evaluate actions before executing them in the real world, reducing reliance on costly or unsafe trial-and-error deployment.
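As a minimal sketch of what "evaluating actions before executing them" can look like, the following toy planner samples candidate action sequences, rolls each one through an action-conditioned model, discards rollouts that violate a safety limit, and keeps the best remaining one (a random-shooting scheme). Everything here is an illustrative assumption rather than a method from the surveyed works: `toy_dynamics` is a hand-written 1-D point-mass stand-in for a learned world model, and the names `rollout_cost`, `plan`, `limit`, and `candidates` are hypothetical.

```python
import random


def toy_dynamics(state, action):
    """Hypothetical action-conditioned model: 1-D position and velocity."""
    position, velocity = state
    velocity = velocity + 0.1 * action   # the action acts as an acceleration
    position = position + 0.1 * velocity
    return (position, velocity)


def rollout_cost(state, actions, goal, limit):
    """Roll the model forward and return (final distance to goal, is_safe)."""
    safe = True
    for action in actions:
        state = toy_dynamics(state, action)
        if abs(state[0]) > limit:        # safety check on every predicted state
            safe = False
    return abs(state[0] - goal), safe


def plan(state, goal, horizon=20, candidates=200, limit=2.0, seed=0):
    """Random shooting: sample action sequences, keep the best safe rollout.

    Returns (None, inf) if no sampled sequence is predicted to be safe.
    """
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost, safe = rollout_cost(state, actions, goal, limit)
        if safe and cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, cost = plan(state=(0.0, 0.0), goal=1.0)
    print(f"predicted final distance to goal: {cost:.3f}")
```

In a closed-loop setting, only the first action of the selected sequence would typically be executed before replanning from the newly observed state; the quality of the result is bounded by how faithful the model is, which is exactly the concern raised above about visually plausible but physically incorrect rollouts.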
Appendix A.4.5. Relation to LLM-Based World Knowledge.
LLM-based world knowledge and world models are complementary. LLMs provide semantic, commonsense, procedural, and causal priors; world models provide predictive and simulative mechanisms for physical dynamics. The former tells an agent what actions may be meaningful; the latter estimates what is likely to happen if the agent acts. This complementarity explains why world models occupy a central position in the roadmap from LLM-based world knowledge to deployable Physical AI. In practice, future systems may combine LLMs for high-level goals, instructions, and commonsense constraints with world models for action-conditioned rollout, physical feasibility checking, and policy optimization.
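The division of labor described above can be sketched as a propose-then-verify loop. This is a toy illustration under strong assumptions, not an implementation from any cited system: `propose_plans` is a hard-coded stand-in for an LLM, `predict` is a symbolic stand-in for an action-conditioned world model, and the skill strings such as `grasp(cup)` are hypothetical.

```python
def propose_plans(goal):
    """Stand-in for an LLM: candidate skill sequences for a language goal."""
    # Hard-coded for illustration; a real system would query a language model.
    return [
        ["place(cup, shelf)"],
        ["grasp(cup)", "place(cup, shelf)"],
        ["grasp(cup)", "grasp(plate)", "place(cup, shelf)"],
    ]


def predict(state, step):
    """Stand-in for a world model: next state, or None if the step is infeasible."""
    name, args = step.rstrip(")").split("(")
    obj = args.split(",")[0].strip()
    if name == "grasp":
        if state["holding"] is not None:   # gripper already occupied
            return None
        return {"holding": obj, "placed": state["placed"]}
    if name == "place":
        if state["holding"] != obj:        # cannot place what is not held
            return None
        return {"holding": None, "placed": state["placed"] | {obj}}
    return None                            # unknown skill


def first_feasible_plan(goal_object, plans):
    """Return the first proposed plan whose predicted rollout reaches the goal."""
    for plan in plans:
        state = {"holding": None, "placed": frozenset()}
        for step in plan:
            state = predict(state, step)
            if state is None:
                break
        if state is not None and goal_object in state["placed"]:
            return plan
    return None


if __name__ == "__main__":
    plans = propose_plans("put the cup on the shelf")
    print(first_feasible_plan("cup", plans))
```

The proposer supplies what might be meaningful; the predictor rejects sequences whose rollout is infeasible. A learned world model would replace the symbolic preconditions with predicted physical states, but the interface between the two components would be similar.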
Appendix A.5. Benchmarks and Evaluation Protocols
Evaluation is a central difficulty for Physical AI because the roadmap spans several different kinds of competence. Static language benchmarks can test whether a model encodes commonsense or procedural knowledge, but they cannot determine whether that knowledge is grounded in a physical state. Vision-language benchmarks can test perception and grounding, but they often stop at recognition or description. VLA benchmarks can test whether actions are predicted from observations and instructions, but open-loop action accuracy does not fully capture closed-loop execution. World-model benchmarks can test prediction, but prediction quality must ultimately be judged by whether it supports planning and control.
Table A5 summarizes evaluation protocols along the roadmap. The key shift is from static recognition or offline prediction to closed-loop task completion, robustness, recovery, and cross-embodiment generalization. This is aligned with the main paper’s argument that Physical AI should be evaluated by what a system can reliably do in the world, not only by what it can answer or predict from fixed inputs.
Table A5. Evaluation protocols along the roadmap. Physical AI evaluation should shift from static recognition or offline prediction to closed-loop task completion, robustness, safety, recovery, and cross-embodiment generalization.

| Roadmap Stage | Benchmark / Evaluation Type | What to Evaluate | Representative Works |
| --- | --- | --- | --- |
| LLM world knowledge | Physical commonsense, tool understanding, factual/procedural knowledge | Whether LLMs encode usable semantic, commonsense, procedural, and causal priors | PHYBench, PhySense, PhysToolBench [44,45,49] |
| VLM/MLLM grounding | Spatial, temporal, affordance, and physical reasoning benchmarks | Whether language-derived knowledge is grounded into perception, spatial relations, and physical states | BLINK, Video-MME, PhysBench, QuantiPhy, MASS-Bench [72,75,76,77,78] |
| VLA / action grounding | Robot manipulation, navigation, and action-prediction benchmarks | Whether models can map instructions and observations to executable actions | RT-2, OpenVLA, LIBERO, LIBERO-Pro [17,18,130,132] |
| World models | Video prediction, latent prediction, action-conditioned simulation, planning evaluation | Whether models predict physically plausible, controllable, temporally consistent futures | World Models, Dreamer, Genie, Cosmos, V-JEPA 2 [20,56,96,101,104] |
| Embodied agents | Closed-loop simulated or real-world tasks | Whether systems complete tasks, recover from errors, and generalize across environments and embodiments | BEHAVIOR, EAI, EmbodiedBench, RoboSuite, RoboCasa [123,127,128,134,135] |
| Closed frontier systems | Black-box or product-level evaluation | Capability, reliability, reproducibility, safety, and transparency under limited disclosure | Gemini Robotics, Gemini Robotics 1.5, GR00T N1, π-series systems [19,81,88,107,108,141] |
A useful evaluation suite should therefore include both stage-specific and system-level metrics. Stage-specific metrics diagnose where a system fails: factual or physical commonsense, perceptual grounding, action prediction, world-model rollout, or closed-loop execution. System-level metrics evaluate whether these components work together under deployment constraints. For example, a strong VLA may still fail if its actions accumulate error, if its world model produces visually plausible but physically inconsistent futures, or if its controller cannot recover from perturbations. This is why task success, intervention count, robustness, safety, and recovery should be reported alongside conventional accuracy or prediction metrics.
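To make the system-level reporting concrete, the sketch below aggregates per-episode logs into the kinds of metrics named above. The `Episode` fields and the metric names are assumptions chosen for illustration; they do not correspond to the logging format of any benchmark listed in Table A5.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical log record for one closed-loop evaluation episode."""
    success: bool        # task completed
    interventions: int   # number of human interventions
    perturbed: bool      # a perturbation was injected during the episode
    recovered: bool      # system recovered after the perturbation


def summarize(episodes):
    """Aggregate episode logs into system-level metrics."""
    if not episodes:
        raise ValueError("at least one episode is required")
    n = len(episodes)
    perturbed = [e for e in episodes if e.perturbed]
    m = len(perturbed)
    return {
        "task_success_rate": sum(e.success for e in episodes) / n,
        "mean_interventions": sum(e.interventions for e in episodes) / n,
        # Robustness and recovery are only defined on perturbed episodes.
        "success_under_perturbation":
            sum(e.success for e in perturbed) / m if m else None,
        "recovery_rate":
            sum(e.recovered for e in perturbed) / m if m else None,
    }


if __name__ == "__main__":
    logs = [
        Episode(True, 0, False, False),
        Episode(True, 1, True, True),
        Episode(False, 2, True, False),
        Episode(True, 0, False, False),
    ]
    for name, value in summarize(logs).items():
        print(f"{name}: {value}")
```

Reporting the perturbed-episode metrics separately keeps a high overall success rate from hiding brittle recovery behavior.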
Appendix A.6. Frontier Systems and Closed Models
Many influential Physical AI systems are released as frontier products, platforms, or partially documented technical reports rather than fully open academic artifacts. This creates a gap between real-world usage and academic evaluation. Closed or partially disclosed systems may demonstrate important capabilities, shape the terminology of the field, and influence user expectations, but their training data, architecture details, evaluation protocols, and failure cases are often unavailable.
Table A6 summarizes representative examples and their roles in the roadmap. We include them not as endorsements or as exhaustive comparisons, but because they represent the kinds of systems that motivate black-box evaluation, product-level benchmarking, and reproducibility discussions. In a survey of Physical AI, ignoring such systems would leave out a major part of the current landscape; however, treating them like fully open academic models would also be misleading. We therefore categorize them by role and openness.
Table A6. Representative frontier systems and closed or partially disclosed models. These systems motivate product-level and black-box evaluation protocols in addition to conventional academic benchmarks.

| System | Category | Openness / Citation Type | Role in the Roadmap |
| --- | --- | --- | --- |
| GPT-4 / ChatGPT-style agents [1] | LLM / agentic assistant | Closed / technical report or product documentation | High-level world knowledge, planning, tool use, and task decomposition |
| Claude-style assistants [3] | LLM / agentic assistant | Closed / product documentation | Reasoning, tool use, coding, and agentic orchestration |
| Gemini Robotics and Gemini Robotics 1.5 [107,108] | Robotics foundation model | Closed or partially disclosed / technical report | Multimodal reasoning, embodied control, and real-world robot interaction |
| Cosmos [56] | World foundation model platform | Partially open / technical report | World modeling, synthetic data, simulation, autonomous driving, and robotics |
| GR00T N1 [88] | Humanoid foundation model | Partially open / technical report | Generalist humanoid policies and cross-embodiment action learning |
| π-series systems [19,81,141] | Generalist VLA / robot foundation models | Partially disclosed / technical reports | Action grounding, open-world generalization, policy learning, and embodied deployment |
The main challenge posed by closed systems is not only that they are difficult to reproduce. It is also that they may combine several roadmap stages into a single product-level stack, making ablation and attribution difficult. A system may appear to have strong physical reasoning because of its LLM prior, its perception module, its action policy, its retrieval system, its simulator, or its human-feedback pipeline. Without transparent interfaces and standardized black-box tests, it is difficult to identify which component contributes to success or failure. This motivates evaluation protocols that separate capability testing, robustness testing, safety testing, and reproducibility reporting.
Appendix A.7. Failure Modes Along the Roadmap
The roadmap is useful not only because it organizes progress, but also because it localizes failures. A Physical AI system may fail at the level of world knowledge, perception, action, prediction, policy learning, deployment, or evaluation. These failures are qualitatively different. An LLM hallucination produces a plausible but ungrounded plan; a VLM grounding failure misidentifies the state of the world; a VLA failure maps a correct goal to the wrong action; a world-model failure predicts an implausible future; and a deployment failure can arise from sensing, latency, calibration, or controller mismatch.
Table A7 summarizes representative failure modes. These failures motivate our deployment-oriented discussion of challenges in the main paper. The table also clarifies why Physical AI cannot be evaluated by a single benchmark: each roadmap stage requires different diagnostics, and end-to-end task success alone may hide the source of failure.
Table A7. Representative failure modes along the roadmap from LLM-based world knowledge to Physical AI.

| Component | Typical Failure Mode | Why It Matters for Physical AI |
| --- | --- | --- |
| LLMs | Hallucinated or ungrounded physical knowledge; overconfident plans; missing metric state | The model may propose plausible language plans that violate geometry, contact, force, or object-state constraints |
| VLMs / MLLMs | Correct semantic description but weak dense grounding | The model may identify objects but fail to estimate pose, depth, uncertainty, reachability, or action-conditioned dynamics |
| VLAs | Poor cross-embodiment generalization; data-dependent policies; brittle recovery | The same instruction may require different grasps, trajectories, or control strategies across robots and environments |
| World models | Visually plausible but physically inconsistent futures | Photorealistic generation may still violate object permanence, contact, gravity, controllability, or causal dynamics |
| Policy learning | Offline success but closed-loop failure | A model may predict correct actions under dataset states but fail under compounding errors or real-time perturbations |
| Embodied systems | Sensor, calibration, latency, controller, or hardware failures | Physical performance depends on the full system stack, not only model accuracy |
| Closed frontier systems | Limited reproducibility and incomplete disclosure | Strong product-level systems can shape the field while being difficult to benchmark, ablate, or compare fairly |
The failure-mode view also suggests a practical debugging strategy. If a system fails before action, the issue may lie in world knowledge or perceptual grounding. If it fails during action, the issue may lie in action representation, embodiment transfer, or controller design. If it fails after several steps, the issue may lie in world modeling, compounding error, memory, or recovery. If it succeeds in simulation but fails in the real world, the issue may lie in sim-to-real transfer, sensing, calibration, latency, or hidden deployment assumptions. This decomposition turns the roadmap into an evaluation tool rather than just a taxonomy.
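The triage rules in the preceding paragraph can be written down as a small lookup, which is sometimes convenient when annotating failure logs. This is only a sketch of that paragraph: the `Symptom` names and the `triage` helper are hypothetical, and the suspect lists simply restate the text.

```python
from enum import Enum


class Symptom(Enum):
    """When the failure is observed (hypothetical labels)."""
    BEFORE_ACTION = "fails before action"
    DURING_ACTION = "fails during action"
    AFTER_SEVERAL_STEPS = "fails after several steps"
    SIM_ONLY_SUCCESS = "succeeds in simulation, fails in the real world"


# Candidate causes per symptom, restating the debugging strategy in the text.
SUSPECTS = {
    Symptom.BEFORE_ACTION: ["world knowledge", "perceptual grounding"],
    Symptom.DURING_ACTION: [
        "action representation", "embodiment transfer", "controller design",
    ],
    Symptom.AFTER_SEVERAL_STEPS: [
        "world modeling", "compounding error", "memory", "recovery",
    ],
    Symptom.SIM_ONLY_SUCCESS: [
        "sim-to-real transfer", "sensing", "calibration", "latency",
        "hidden deployment assumptions",
    ],
}


def triage(symptoms):
    """Return candidate causes for the observed symptoms, without duplicates."""
    suspects = []
    for symptom in symptoms:
        for cause in SUSPECTS[symptom]:
            if cause not in suspects:
                suspects.append(cause)
    return suspects


if __name__ == "__main__":
    observed = [Symptom.AFTER_SEVERAL_STEPS, Symptom.SIM_ONLY_SUCCESS]
    print(triage(observed))
```

The output is a list of places to look, not a diagnosis: each candidate still needs a stage-specific test to confirm or rule it out.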
Appendix A.8. Extended Discussion of Challenges and Future Directions
The main paper summarizes the challenges along the roadmap with a small number of representative citations. Here we provide a more detailed discussion of the evidence behind each challenge and connect it to related work. The central point is that each stage in the roadmap exposes a different interface mismatch: LLMs expose world knowledge through sparse language; VLMs ground language into perception but often remain semantic; VLAs output actions but are tied to embodiment-specific action spaces; world models provide prediction but must be controllable and physically faithful; deployed systems must integrate all components under closed-loop constraints.
Appendix A.8.1. Implicit World Knowledge and Dense Physical Grounding.
LLM-based world knowledge is broad, but it is not an explicit symbolic database. It is stored as parametric regularities and exposed through prompting, context, plans, programs, or tool calls. Knowledge-probing and closed-book QA studies show that language models can store factual and relational knowledge in parameters [6,7], while later studies analyze long-tail factual acquisition, factual recall, and pretraining dynamics [8,9,32]. Materialization and mechanistic analyses further show that parts of such knowledge can be extracted or traced through model computations [33,146,147]. For Physical AI, however, these priors must be converted into dense physical variables such as pose, reachability, contact, force, friction, uncertainty, and temporal dynamics. Procedural knowledge and task-level planning provide useful priors [34,35,37,38,39,40,41,140], but they remain insufficient without perceptual grounding and physical verification.
Appendix A.8.2. Multimodal Grounding and Physically Faithful Perception.
VLMs and MLLMs provide the first major bridge from language-mediated priors to perceptual observations. Contrastive and multimodal pretraining align images or videos with language [2,10,11,12,50], enabling models to connect objects, scenes, and instructions to visual inputs. However, Physical AI requires more than captioning or visual QA. It requires spatial grounding, temporal grounding, affordance estimation, quantitative reasoning, and dense frame-level understanding. Recent benchmarks and evaluations expose gaps in low-level visual perception, long-video understanding, physical reasoning, quantitative physics, and motion-aware spatiotemporal grounding [72,75,76,77,78]. Related work on dense physical perception and intermediate-feature grounding suggests that VLM representations may be reused for downstream Physical AI tasks, but their outputs must be transformed into action-relevant or prediction-relevant representations [14,15,144,148].
Appendix A.8.3. VLA Generalization and Action-Interface Bottlenecks.
VLA models connect visual observations and language instructions to action outputs, making them a key interface between multimodal reasoning and embodied control. Representative systems such as PaLM-E, RT-2, OpenVLA, and the π-series demonstrate the promise of transferring web-scale or foundation-model knowledge into robotic action [16,17,18,19,81]. Nevertheless, VLA policies face three persistent bottlenecks. First, action spaces vary across embodiments, including action tokens, end-effector poses, trajectories, action chunks, and continuous controls. Second, robot data remain much smaller and more heterogeneous than language or vision data. Third, imitation-trained policies can be brittle under distribution shift and may lack recovery behavior. Recent work on VLA learning, action abstraction, and generalist robot policies explores scalable action representations, data mixtures, and richer policy architectures [79,80,83,88,141,144,145]. A promising direction is to augment VLA policies with memory, LLM agents, or world models so that policies do not only map observations to actions, but also reason over goals, histories, and possible futures.
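To make the action-interface point concrete, the sketch below shows one simple way continuous controls can be turned into discrete action tokens: uniform per-dimension binning. It is an illustrative assumption, not the tokenizer of any system cited above; the function names, the shared `[low, high]` range, and the default of 256 bins are all hypothetical choices.

```python
from typing import List, Sequence


def tokenize(action: Sequence[float], low: float, high: float,
             n_bins: int = 256) -> List[int]:
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for x in action:
        x = min(max(x, low), high)            # clip to the assumed range
        frac = (x - low) / (high - low)       # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize(tokens: Sequence[int], low: float, high: float,
               n_bins: int = 256) -> List[float]:
    """Map bin indices back to bin-center continuous values."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    # Hypothetical 3-D end-effector delta, normalized to [-1, 1].
    tokens = tokenize([-0.73, 0.2, 0.999], low=-1.0, high=1.0)
    print(tokens, detokenize(tokens, low=-1.0, high=1.0))
```

The round trip loses at most half a bin width per dimension. The range and bin count would also have to be chosen per embodiment, which is one small instance of the embodiment-specific action spaces described above.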
Appendix A.8.4. World Models Beyond Video Generation.
World models provide the predictive and simulative substrate that VLAs often lack. Classical and decision-centric world models learn transition, reward, value, or policy-relevant quantities for planning and policy improvement [20,95,96,97,98]. Video-space world models generate or predict future visual observations and can support visual imagination, synthetic data, and interactive simulation [56,99,100,101]. Latent and JEPA-style world models instead predict in representation space, trading pixel-level reconstruction for efficiency and planning-relevant abstraction [21,102,103,104]. This distinction is important: photorealistic generation does not guarantee physical correctness, while latent prediction may be efficient but difficult to interpret or use directly without task heads, decoders, or policy interfaces. For Physical AI, a useful world model should be temporally consistent, action-conditioned, controllable, physically plausible, and useful for planning or closed-loop decision making. Recent work on physically grounded world-model evaluation and language-guided latent prediction provides early steps toward this goal [139,148].
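The sketch below illustrates what "action-conditioned and useful for planning" means operationally. It is a toy under stated assumptions: `toy_model` is a hand-written 1-D point-mass transition standing in for a learned world model, and random shooting is used only as a simple generic planner. Neither is attributed to the systems cited above, and all names and parameters are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]  # (position, velocity) of a toy 1-D point mass
Model = Callable[[State, float], State]


def toy_model(state: State, action: float, dt: float = 0.1) -> State:
    """Stand-in for a learned transition model: next state from state and action."""
    pos, vel = state
    vel = vel + action * dt   # the action is an acceleration command
    pos = pos + vel * dt
    return (pos, vel)


def rollout(model: Model, state: State, actions: Sequence[float]) -> List[State]:
    """Predict the state sequence produced by an action sequence."""
    states = [state]
    for a in actions:
        states.append(model(states[-1], a))
    return states


def plan_random_shooting(model: Model, state: State, goal: float,
                         horizon: int = 10, n_candidates: int = 200,
                         seed: int = 0) -> Tuple[List[float], float]:
    """Sample action sequences and keep the one whose predicted final
    position is closest to the goal."""
    rng = random.Random(seed)
    best_actions: List[float] = []
    best_cost = float("inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        final_pos, _ = rollout(model, state, actions)[-1]
        cost = abs(final_pos - goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


if __name__ == "__main__":
    actions, cost = plan_random_shooting(toy_model, (0.0, 0.0), goal=0.3)
    print(f"predicted distance to goal: {cost:.4f}")
```

A planner of this kind is only as good as the model's predictions under the sampled actions, which is why photorealism alone is not sufficient: the rollout must respond correctly to the action argument.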
Appendix A.8.5. Deployment and System-Level Evaluation.
Even strong models can fail when deployed as Physical AI systems. World models and simulators can support training and planning, but real-world deployment introduces sensor noise, latency, calibration errors, embodiment mismatch, sim-to-real transfer, safety constraints, and recovery requirements. Robotics and embodied benchmarks such as LIBERO, RoboCasa, BEHAVIOR, RoboSuite, and EmbodiedBench evaluate different parts of this system-level challenge [123,128,130,134,135]. Frontier systems such as Gemini Robotics, GR00T N1, and π-series models further show that Physical AI is becoming a product-level systems category, but many such systems are closed or only partially disclosed [19,81,88,107,108,141]. Future evaluation should therefore report closed-loop task completion, recovery, intervention counts, robustness, safety, reproducibility, and cross-embodiment generalization rather than relying only on static model-level metrics.
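As a rough illustration of system-level reporting, the sketch below aggregates several of the quantities listed above from per-episode logs. The `Episode` schema and the metric definitions are hypothetical assumptions made for this example. They are not a protocol taken from any of the cited benchmarks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """Hypothetical log of one closed-loop rollout."""
    success: bool               # task completed
    interventions: int          # human interventions during the episode
    failures_encountered: int   # perturbations or errors the system faced
    failures_recovered: int     # of those, how many it recovered from
    safety_violations: int      # count of safety-constraint violations


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate closed-loop metrics over a set of episodes."""
    n = len(episodes)
    if n == 0:
        raise ValueError("at least one episode is required")
    faced = sum(e.failures_encountered for e in episodes)
    recovered = sum(e.failures_recovered for e in episodes)
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "mean_interventions": sum(e.interventions for e in episodes) / n,
        # Undefined (NaN) when no failures were encountered at all.
        "recovery_rate": recovered / faced if faced else float("nan"),
        "safety_violation_rate": sum(e.safety_violations > 0 for e in episodes) / n,
    }


if __name__ == "__main__":
    logs = [Episode(True, 0, 1, 1, 0), Episode(False, 2, 2, 1, 1)]
    print(summarize(logs))
```

Reporting these alongside a model-level metric makes it visible when a policy with high offline accuracy still needs frequent intervention or fails to recover.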
Table A8.
Extended analysis of challenges and future directions along the roadmap. The main paper summarizes these challenges with a small number of representative citations; this table provides additional evidence and connects each challenge to the corresponding interface mismatch.
| Challenge | Interface Mismatch | Future Direction | Representative Evidence |
|---|---|---|---|
| Implicit world knowledge | LLM priors are language-mediated and difficult to convert into metric physical state | Extract, align, and ground semantic, procedural, and causal priors into dense physical representations | Knowledge probing, factual recall, procedural knowledge, language planning [6,7,8,37,140] |
| Physical perception | VLMs often output semantic descriptions rather than dense physical state | Move toward spatial, temporal, affordance, quantitative, and action-relevant grounding | VLM/MLLM grounding and physical reasoning benchmarks [12,50,72,75,76,77] |
| Generalist VLA policies | Actions are embodiment-specific and robot data are limited | Develop scalable action representations, cross-embodiment transfer, and policies augmented with memory or world models | PaLM-E, RT-2, OpenVLA, π-series, FAST, GR00T N1 [16,17,18,19,79,88] |
| Predictive world modeling | Video realism does not imply physical correctness; latent models may require task heads or policy interfaces | Build action-conditioned, controllable, efficient, and physically plausible world models | Dreamer, MuZero, Genie, UniSim, Cosmos, V-JEPA [56,96,98,100,101,103] |
| Deployment | Model-level accuracy does not capture sensing, control, latency, recovery, safety, or sim-to-real robustness | Evaluate integrated systems with closed-loop task completion, recovery, safety, and reproducibility | LIBERO, RoboCasa, EmbodiedBench, Gemini Robotics, GR00T N1 [88,107,128,130,135] |
Appendix A.9. Terminology
We use several terms throughout the survey whose meanings vary across communities. Table A9 records the definitions used in this paper. These definitions are intentionally functional: they describe the role each concept plays in the roadmap rather than attempting to settle all terminology debates in Physical AI, robotics, or model-based learning.
In particular, we distinguish world knowledge from world models. World knowledge refers to implicit priors about objects, actions, environments, and likely consequences, often encoded in the parameters of LLMs and exposed through prompting or agentic reasoning. A world model, by contrast, is a predictive or simulative mechanism that estimates how observations, latent states, rewards, values, or action consequences evolve. This distinction is central to our argument: LLMs help an agent reason about what is meaningful or plausible, while world models help estimate what is likely to happen under physical dynamics.
Table A9.
Terminology used throughout the survey.
| Term | Definition in This Survey |
|---|---|
| World knowledge | Semantic, commonsense, procedural, causal, spatial, and affordance priors about objects, agents, actions, environments, and likely consequences. |
| LLM-based world knowledge | Language-mediated world priors stored as parametric regularities in LLMs and exposed through prompting, context, or agentic reasoning. |
| World model | A predictive or simulative model that estimates future observations, latent states, rewards, values, or action consequences from current states and possible actions. |
| VLA model | A model that maps visual observations, language instructions, and sometimes embodiment states into executable actions or action-relevant representations. |
| Physical AI | AI systems that ground world knowledge into multimodal perception, physical prediction, simulation, planning, policy learning, and real-world or interactive action. |