Preprint
Review

This version is not peer-reviewed.

Medical World Model: From Passive Prediction to Active Simulation in Medicine

Submitted: 28 April 2026
Posted: 30 April 2026


Abstract
Clinical care is interventional. Physicians must decide how a patient's trajectory is likely to change under competing actions, not only estimate risk under the status quo. Most deployed medical artificial intelligence, however, remains optimized for classification or passive forecasting. We argue that the useful next abstraction is the medical world model, a learned system that represents patient state, models how that state evolves over time, accepts interventions such as drugs, doses, and procedures, and rolls trajectories forward under those interventions. Progress toward this goal is currently fragmented across digital twins, disease-trajectory models, surgical simulators, and generative electronic health record forecasting, with each community addressing a subset of the necessary ingredients. We organize the field with a capability ladder spanning representation, forecasting, single-arm projection, comparative treatment evaluation, and planning. Across imaging, physiology, longitudinal electronic health records, and surgical simulation, a consistent maturity pattern emerges. Representation and forecasting are widespread, narrow treatment-conditioned simulators are appearing, credible counterfactual comparison remains scarce, and validated treatment planners are absent. Once a model simulates what would happen under alternative treatments, causal validity becomes the binding constraint. Scaling data and generative modeling alone will not solve this. Credible medical world models also require explicit action definitions, causal design, and staged clinical validation with regulatory oversight. In this paper, the medical world model is a claims-to-evidence framework for simulation that can inform clinical decisions.

1. Introduction

Consider a physician seeing a patient with newly diagnosed type 2 diabetes who is choosing between first-line metformin and a GLP-1 receptor agonist. The wrong initial choice can mean months of poor glycaemic control before any adjustment is made, after which the physician escalates to a more effective regimen given the updated data. Conventional risk models can estimate the probability of complications such as cardiovascular disease or nephropathy, but they cannot project the patient-specific differences in HbA1c, weight, side effects, and treatment adherence that depend on which drug is started. This limitation points to the need for a different class of model, one that moves beyond a static risk estimator toward a simulator able to project how the patient’s condition will evolve under each option. Such a model would learn how these variables change over time in response to treatment, generating trajectories that let the physician compare options before committing to a regimen. That capacity, generating and comparing plausible futures, is the defining function of a world model [1].
Formally, a world model is a learned transition function, often written as ŝ_{t+1} = f(s_t, a_t), where s_t is the current state (in medicine, a representation of the patient), a_t is the action taken (such as a treatment decision), and f predicts how that state changes in response. The idea was introduced by Ha and Schmidhuber [2] and has since been adopted widely in reinforcement learning. Dreamer [3], for example, learns a compact simulator of its environment and uses it to plan actions without acting in the real world. The analogy to medicine is partial. In reinforcement learning, the world model is one piece of a training loop in which an agent acts in an environment, receives reward, and updates a policy by imagining rollouts inside the simulator. Medicine provides none of those cleanly. There is no direct reward signal; clinical benefit is a negotiated combination of survival, quality of life, side effects, and preference. There is no interactive data collection; experimenting on patients to improve a simulator is ruled out by consent and safety. And a policy trained inside a learned clinical simulator would optimize against the simulator’s confounded training distribution rather than against reality. We therefore borrow the learned transition function and multi-step rollout from this literature, not the training loop that usually surrounds them. The concept extends to generative settings as well. Video generation models [4,5] have shown that predicting future states from current ones is itself a form of world modeling. The timescales differ, however. Robotic world models typically plan over seconds, whereas medical applications require projections over months or years.
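The transition-and-rollout structure above can be sketched compactly. The following toy example is ours, not drawn from any cited system: hand-set linear dynamics and a hypothetical drug effect stand in for a fitted transition function f, and the state holds just two illustrative variables.

```python
import numpy as np

# Toy sketch of s_{t+1} = f(s_t, a_t) with multi-step rollout.
# State: [HbA1c (%), weight (kg)]; action: 0 = no drug, 1 = a
# hypothetical drug. All coefficients are illustrative assumptions
# standing in for a learned transition function.

A = np.array([[0.98, 0.001],   # slow decay of HbA1c, weak coupling to weight
              [0.00, 0.995]])  # weight carries over almost unchanged
B = np.array([-0.15, -0.40])   # assumed per-step drug effect on each variable

def transition(state, action):
    """One application of the transition function f."""
    return A @ state + B * action

def rollout(state, actions):
    """Iteratively apply f under a specified action sequence (multi-step rollout)."""
    trajectory = [state]
    for a in actions:
        state = transition(state, a)
        trajectory.append(state)
    return np.array(trajectory)

s0 = np.array([8.5, 95.0])          # starting HbA1c 8.5%, weight 95 kg
traj_drug = rollout(s0, [1] * 12)   # twelve monthly steps on the drug
traj_none = rollout(s0, [0] * 12)   # same starting state, no treatment
```

Because both rollouts begin from the same state, their divergence previews the kind of side-by-side comparison discussed later in the paper; whether such a comparison is causally valid is exactly the question the remainder of this perspective takes up.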
Most medical AI systems in clinical use today are discriminative. They detect findings in images or estimate risk from patient records [6,7,8]. These systems do not model how a patient’s state will change over time under competing treatments. Causal inference methods address part of this gap by estimating what outcome would follow an alternative intervention [9]. But a causal effect estimate is a summary. It captures the average or conditional effect of one treatment relative to another, at a single endpoint. It does not produce a trajectory. A world model goes further. It learns how patient state evolves over time and generates full temporal paths under specified actions. The result is a reusable simulator, not a single effect estimate. This difference matters in practice. Forecasting one treatment-conditioned trajectory is a lower bar than comparing counterfactual alternatives for the same patient (Figure 1), and conflating the two overstates what current systems can support. For treatment-selection problems, where physicians need to weigh competing options for a specific patient, the gap between prediction and simulation is where medical AI falls short.
We define a medical world model as a learned system that satisfies four criteria simultaneously (Box 1; Figure 1). It must learn a compressed patient representation, model how that representation changes over time, accept clinical actions as inputs that influence the predicted trajectory, and roll the simulation forward over multiple steps. These criteria are cumulative. Each adds a capability that the previous ones lack. They also draw a clear boundary. Static risk scores and classifiers make predictions at a single time point but do not model temporal change. Retrieval-based decision support systems and purely mechanistic simulators lack learned patient representations. Each falls short for a different reason, and recognizing precisely where each system falls short is more useful than labeling it a partial world model.
Box 1. Four criteria for a medical world model.
1.
Learned state representation. The model learns a compressed representation of the patient, organ system, or clinical environment from data, rather than relying only on hand-engineered features.
2.
Temporal dynamics (state transitions). The model learns how the patient’s state evolves over time by modeling transitions between states, rather than making isolated, single-timepoint predictions.
3.
Intervention interface. The model accepts clinically meaningful actions (drugs, procedures, device settings) as inputs that causally influence state transitions, enabling simulation of alternative treatment choices rather than reflecting historical correlations alone.
4.
Multi-step rollout. The model can iteratively apply its transition dynamics under specified interventions to generate forward trajectories over multiple time steps (i.e., simulations), which can be assessed for plausibility, calibration, and clinical utility.
In this paper, patient state refers to the variables captured or inferable for a given clinical task, not to every latent determinant of health.
Several research communities are already building pieces of this simulator, though without a shared vocabulary. Digital twins [10], disease-trajectory models [11,12], surgical simulation [13], and generative EHR forecasting [14] each address a subset of the four criteria. Foundation models such as Med-PaLM [15] encode broad clinical knowledge but lack patient-specific dynamics and quantitative sensitivity to intervention choice. We assess their role as components of medical world models in Section 3.5. Our claim is that these building blocks are now sufficient to clearly define the space, separate precursors from full simulators, and identify the steps needed to move from passive prediction to clinically useful simulation.
This perspective, therefore, has four goals. First, we define medical world models operationally and place them on a capability ladder that separates representation, forecasting, single-arm projection, comparative treatment evaluation, and planning. Second, we use that ladder to assess where current systems in imaging, physiology, clinical trajectories, and surgery actually sit. Third, we argue that once counterfactual comparison becomes the target, the central challenge shifts from generative realism to causal validity, making methods from the causal inference and dynamic treatment regime literatures directly relevant. Finally, we outline how these models could be tested, validated, and deployed in stages without overstating their maturity. Throughout, our claim is narrower than the rhetoric often surrounding “digital twins” or “world models”. The value of this framework lies in its ability to discipline what current systems can credibly claim. Figure 2 illustrates this progression concretely using the type 2 diabetes patient, tracing how each paradigm, from rule-based clinical guidelines through discriminative ML, counterfactual causal models, and world models, answers a progressively more demanding clinical question about the same patient.

2. A Five-Level Capability Ladder for Medical World Models

The practical issue for a physician is which clinical question an AI system can answer, whether it uses a transformer, a diffusion model, or a neural ODE. Earlier work [16] introduced a four-level capability ladder, but treated counterfactual reasoning as a single stage and did not link capabilities to clinical implications or a concrete claims-to-evidence framework. Our proposed five-level capability ladder (Figure 3) ranks systems on this basis. Each level defines a clinical question, and higher levels impose stricter evidentiary requirements on any system that claims to answer it. The first two levels are precursor capabilities; only at L3 and above does a system model intervention-conditioned futures, and only at L4 does it recommend action. This is a conceptual synthesis rather than a systematic review; rung placement reflects our assessment of whether a system meets the four criteria in Box 1. To keep the abstraction concrete, we trace the diabetes patient from Section 1 through each level.

2.1. Five Levels of Capability

Representing the patient state (L1).

At the first level, a system learns to compress patient data (imaging, waveforms, laboratory time series, diagnostic codes) into a structured representation that captures clinically relevant variation. For the diabetes patient, an L1 system would encode blood sugar, weight, kidney function, and medication history into a compact latent state without projecting how those variables will change. These systems are the perceptual foundation. Without a good representation, no downstream simulation is possible, yet representation alone is not simulation.

Forecasting without intervention (L2).

L2 systems add temporal dynamics by forecasting from observed history, though they do not treat interventions as actions that can be meaningfully changed [17]. Many predictive models include medications or procedures as inputs; that alone does not make them interventional. Changing a treatment variable in an observational predictor may reflect historical correlation rather than a treatment effect. For a diabetes patient, an L2 system could forecast blood sugar and weight over the next year from the recorded trajectory, yet it could not reliably predict what would happen if the physician changed the drug. Disease-progression models, longitudinal imaging forecasters, and EHR sequence predictors, including recurrent and dynamic memory-based architectures such as DeepCare [18], live at this level. They meet criteria 1, 2, and 4, but they lack a causal intervention interface, one in which changing a treatment input reflects a genuine effect rather than an observational correlation. Without this, they cannot answer the physician’s most pressing question: what happens if I change the treatment? Answering that question requires a model whose training or structure can separate treatment effects from the correlations present in observational data.

Projecting one treatment path (L3a).

L3a is the first genuinely interventional level. An L3a system accepts a specified action and forecasts the trajectory that would follow. For the diabetes patient, it could project blood sugar levels under metformin, or separately under the GLP-1 receptor agonist, but would not simultaneously compare the two. It answers one branch at a time. Validation requires showing that forecasts align with held-out data from patients who actually received the indexed treatment.

Comparing alternative interventions (L3b).

L3b compares predicted trajectories under two or more interventions that were not all observed for the same patient. For the diabetes patient, an L3b system would generate side-by-side trajectories under metformin and the GLP-1 receptor agonist, showing divergent weight loss, blood sugar control, and cardiovascular risk for the same individual. Mechanically, this extends L3a by running intervention-conditioned rollouts for each option and comparing the results. The difficulty is evaluation. Only one branch of reality is ever observed for a given patient, so standard held-out accuracy no longer applies. Credible evidence at this level requires one of three strategies: randomized data, causal inference methods applied to observational data, or benchmarking against published trial results through trial emulation. This evidentiary challenge is not unique to world models (it is the central problem of causal inference), but it becomes acute here because generative trajectory models can produce detailed, plausible-looking futures that mask the absence of causal grounding. A large literature on individualized treatment effect estimation, from causal forests to neural network approaches such as TARNet and its variants [19], addresses the counterfactual comparison problem for point outcomes. These methods operate at the same conceptual level as L3b but differ in a key respect. They estimate a treatment effect at a single endpoint rather than generating multi-step patient trajectories under competing actions. Systems that claim L3b capability without randomized, causal, or trial-emulation evidence should be treated as provisional.

Planning over time (L4).

The final level adds optimization. Instead of simulating one branch at a time and leaving the comparison to the physician, an L4 system searches across possible treatment sequences and recommends an adaptive strategy that adjusts at each decision point. The optimization target is not simply efficacy; it may balance glycaemic control against side effects, cost, adherence burden, or comorbidity risk, depending on how the objective is specified. For the diabetes patient, an L4 system might propose starting with metformin, then recommend escalation to the GLP-1 receptor agonist at a specific timepoint if projected blood sugar remains above target. It would continuously revise this plan as new observations arrive. Reinforcement-learning approaches to sepsis management [20] are early signals of this direction, though such retrospective studies remain methodologically debated [21], precisely because the reward, data, and policy-training assumptions of standard RL do not transfer cleanly to clinical care. These efforts build on a longer tradition of dynamic treatment regimes in biostatistics, which formalized sequential treatment optimization under uncertainty well before ML approaches [22,23]. To our knowledge, no L4 system has yet been validated prospectively in clinical care.
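The search component of L4 can be illustrated with a deliberately minimal sketch. Exhaustive enumeration over four-step plans inside a toy one-variable simulator stands in for the reinforcement-learning or dynamic-treatment-regime machinery a real planner would require; the dynamics, effect sizes, and burden weights below are our assumptions for exposition, not clinical estimates.

```python
import itertools

# Toy L4-style planning sketch: enumerate short treatment sequences in a
# one-variable simulator and score each plan on a composite objective
# (projected HbA1c plus a per-step treatment burden). All numbers are
# illustrative assumptions.

EFFECT = {0: 0.0, 1: 0.12, 2: 0.25}   # 0 = none, 1 = metformin-like, 2 = GLP-1-like
BURDEN = {0: 0.0, 1: 0.02, 2: 0.10}   # assumed side-effect / cost penalty per step
DRIFT = 0.05                          # assumed untreated worsening per step

def score(plan, hba1c=8.5):
    """Roll the plan forward; lower score = better control net of burden."""
    cost = 0.0
    for action in plan:
        hba1c += DRIFT - EFFECT[action]
        cost += BURDEN[action]
    return hba1c + cost

plans = list(itertools.product([0, 1, 2], repeat=4))  # all 81 four-step plans
best_plan = min(plans, key=score)
```

In this toy objective the stronger drug wins at every step; changing the burden weights shifts the recommended sequence, which is the point made above: an L4 recommendation is only as defensible as its objective specification and the simulator beneath it.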
Table 1 maps each level to the clinical question it addresses, the allowable claim a system at that level can make, the minimum evidence required to support that claim, and a representative system. We call this a claims-to-evidence framework. Intervention-conditioned claims require matching evidence; without it, they should be considered provisional rather than validated.
Readers familiar with Pearl’s causal hierarchy [24] of association, intervention, and counterfactuals will notice a structural resemblance. The ladder nevertheless adds distinctions that the hierarchy does not foreground. It requires a learned state representation as a prerequisite (L1), treats temporal rollout as a standing requirement above the first level, separates projecting one treatment path from comparing unobserved alternatives (the L3a/L3b split), and includes a planning level concerned with adaptive treatment sequences rather than a single contrast. The ladder is best read as a clinical claims framework built on causal ideas, not as a restatement of the causal hierarchy.

2.2. Clinical Domains and Boundaries

The ladder in Figure 3 has two organizing axes, with capability levels as rows and primary clinical domains as columns. We organize the landscape review in Section 3 around four domains, namely imaging and anatomical models, physiological and organ-system models, clinical trajectory models, and surgical and procedural models. Foundation models cut across all four and are discussed separately at the end of that section.
We exclude molecular and cellular simulations not because they are unimportant, but because they fail different parts of the Box 1 definition. AlphaFold [25] predicts static structure without action-conditioned temporal evolution (failing criteria 2 and 3), classical molecular dynamics [26] lacks a learned patient-state representation and clinical intervention interface (failing criteria 1 and 3), and many signaling models remain mechanistic rather than learned (failing criterion 1). These areas may eventually meet the full definition if coupled to organ- or patient-level simulators, but today they sit at the boundary. Moreover, recent proposals of virtual cells [27] extend this ambition to cell-level simulation, but current systems remain premature with respect to the criteria in Box 1, lacking validated intervention-conditioned dynamics and multi-step rollout.
Figure 3. A capability ladder for medical world models. Systems are organized by capability level (rows) and clinical domain (columns). Levels L1 and L2, below the dashed precursor line, are precursor capabilities that do not satisfy all four criteria of the medical world model definition (Box 1). At L3a and above, systems accept clinical interventions as inputs and qualify as full medical world models. Maturity is indicated by fill color and border style: green with solid border, clinically validated with prospective randomized evidence; peach with solid border, proof-of-concept with published evaluation; white with dashed border, early-stage prototype. The figure shows representative systems rather than an exhaustive catalogue; additional systems occupy similar positions within their respective domains. The type 1 diabetes digital twin is the only system with randomized clinical trial evidence. TRIALSCOPE is placed at the L3a/L3b boundary because it supports population-level trial emulation, but its classification as patient-level counterfactual simulation remains open (Section 3.3). That distinction is why L3b remains the least populated rung on the ladder. MedDreamer is placed at L4 as a directional prototype without prospective validation. Empty cells represent open research gaps. The dashed row below the matrix shows molecular and cellular simulation as an adjacent frontier that does not yet meet the full definition.

3. The Current Landscape of Medical World Models

With the ladder in hand, we now evaluate where current systems actually sit. We treat this evaluation as a stress test of the claims-to-evidence framework. Does it expose consistent gaps across the field (Figure 3)? The review covers four clinical domains; foundation models are treated separately at the end of this section. The ladder reveals the field’s maturity pattern directly. Learned state representation and unconditional forecasting are widespread at L1 and L2 across all four domains; credible treatment-conditioned systems at L3a are few and narrow in scope; patient-level counterfactual comparison at L3b remains largely undemonstrated; and no domain has produced a prospectively validated L4 planner. The review below provides evidence for this pattern.

3.1. Imaging Offers Clear Prototypes with Narrow Scope

MeWM [28] conditions tumor-response simulation on treatment and evaluates outputs for radiological plausibility, while CLARITY [29] models disease trajectories in a learned latent space. Both are architecturally L3a in that they accept a treatment as input and generate a conditioned trajectory. Their reported evaluations measure proxies for L3a validity: for MeWM, radiologist Turing tests and agreement with the administered protocol; for CLARITY, survival concordance and treatment-planning F1. Neither reports the L3a evidentiary standard in Table 1, which is trajectory fidelity against observed outcomes in held-out patients who received the indexed treatment. Treating them as L3a proofs-of-concept rather than evidentially validated L3a systems is the honest reading. Their narrowness is equally informative. Each generates one indexed trajectory at a time, over a tightly defined disease process, and neither has been validated as a patient-level comparator across alternative regimens. Adjacent work such as CheXWorld [30] and TaDiff [31] helps define the boundary. World-model-style pretraining and treatment-aware longitudinal generation can improve representation learning and disease-specific forecasting without yet supporting comparative counterfactual claims. The gap to L3b is therefore substantial. These systems forecast under a specified treatment, but they do not yet compare alternative treatments for the same patient.

3.2. Physiology: The Deepest Roots, the Strongest Clinical Evidence

Physiological modeling is the most mature substrate for medical world models [32]. Established governing equations for cardiac electrophysiology, glucose-insulin regulation, and pharmacokinetic compartments constrain what the learned components can produce, and that structural advantage is what sets this domain apart. Hybrid approaches such as Med-Real2Sim, a physics-informed digital twin that combines mechanistic priors with self-supervised learning, further illustrate how structured constraints can stabilize learned dynamics in clinical settings [33]. A neural network embedded in such a framework cannot generate a heart that contracts without depolarizing or a glucose trajectory that violates mass balance. This kind of constraint has no parallel in clinical trajectory modeling, where the state space is defined only by whatever codes and laboratory values appear in the EHR.
The strongest prospective evidence for any system discussed in this paper comes from diabetes. Two pilot-scale randomized trials now exist. Builes-Montaño and colleagues reported a 7 percentage-point improvement in time in range over four weeks (n = 28) [34]. A separate trial using digital-twin replay simulation for automated insulin delivery showed a 5 percentage-point improvement sustained over six months (n = 72) [35]. Both are small and short, but their convergence on a comparable effect size in a tightly bounded physiological setting is the strongest prospective signal in this review. That result is notable not only for the effect size but for what made validation tractable. The glucose-insulin system is well characterized, the action space is narrow (insulin dose and timing), and the outcome (continuous glucose) is measured densely enough to evaluate trajectory accuracy rather than relying on a single endpoint. Generative simulators that produce realistic glucose-insulin trajectories [36] provided the modeling foundation for this line of work.
In cardiology, whole-heart electromechanical surrogates built on neural ordinary differential equations compress expensive biophysical simulations while retaining the dynamics that matter for clinical phenotyping [37]. Large-scale cardiac digital twin populations have been constructed for thousands of individuals [38], though none has yet been tested in a prospective interventional study. Pharmacokinetic neural ODEs have shown the ability to extrapolate across dosing regimens not seen during training [39], a property that depends on the compartmental structure baked into the model rather than on the flexibility of the network alone. More broadly, pharmacokinetic (PK) and pharmacodynamic (PD) modeling represents decades of prior art in what this paper calls intervention-conditioned simulation. PK/PD models have been used in regulatory drug development submissions [40] for far longer than the medical world model framing has existed, and the validation challenges they have encountered, including extrapolation across dosing regimens, population heterogeneity, and model-informed decision-making under uncertainty, closely parallel those discussed in Section 4.
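The compartmental structure referred to above can be made concrete with a minimal one-compartment sketch. The parameters (elimination rate, volume of distribution) and the instantaneous-absorption simplification are illustrative assumptions, not values from any cited model; a neural ODE would replace the fixed elimination term with a learned function while keeping this mass-balance scaffold.

```python
import math

# Minimal one-compartment pharmacokinetic rollout under a dosing
# schedule: concentration jumps with each dose and decays by first-order
# elimination. Parameters are illustrative assumptions, not fitted values.

def pk_rollout(doses, ke=0.1, vd=40.0, hours=24):
    """doses: {hour: dose_mg}; returns hourly plasma concentration (mg/L)."""
    conc, trajectory = 0.0, []
    for t in range(hours):
        conc += doses.get(t, 0.0) / vd   # instantaneous absorption (simplification)
        conc *= math.exp(-ke)            # first-order elimination over one hour
        trajectory.append(conc)
    return trajectory

single = pk_rollout({0: 400.0})             # one 400 mg dose at hour 0
split = pk_rollout({0: 400.0, 12: 400.0})   # second 400 mg dose at hour 12
```

Extrapolation to a dosing regimen not seen during training, as the neural ODE work above reports, is credible precisely because this compartmental scaffold constrains the model's behavior between observed doses.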
The limitation common to all three areas is scope. Each operates on a single organ system with a carefully bounded action space. The physician in our opening example, choosing between metformin and a GLP-1 receptor agonist, needs a model that captures interactions across glucose metabolism, cardiovascular risk, renal function, and weight, variables governed by different equations and rarely observed together at the temporal resolution these models require. Cross-organ coupling is where the mechanistic advantage begins to erode, because the governing equations for different organ systems do not compose straightforwardly, and the longitudinal multimodal data needed to learn those couplings are seldom collected in a single cohort. The near-term opportunity is hybrid systems that preserve mechanistic structure where equations are trusted and substitute learned dynamics where clinical data outstrip first-principles knowledge.

3.3. Clinical Trajectories: Rich Data, Limited Causal Footing

Clinical trajectory modeling is where the data are richest and where the L3b gap matters most. Foresight [11] and DT-GPT [12] show that predictive AI models can learn broad temporal structure from EHRs, operating on the messy substrate physicians actually use, including sparse diagnoses, irregular laboratory values, and medication histories. MedDreamer [41] takes a further step by using an AI agent that internally imagines future patient states to guide sepsis management, but it remains a research prototype rather than a validated clinical planner.
The conceptual challenge is that good next-state forecasting does not automatically imply counterfactual validity. Most EHR trajectory systems remain L2 because they forecast what tends to happen, not what would happen under an explicitly different action. TRIALSCOPE [42] is the clearest current example relevant to L3b. It simulated eleven advanced non-small cell lung cancer trials and matched the published hazard ratio in all nine trials for which a reference was available. Even so, TRIALSCOPE operates at the population level, emulating aggregate trial outcomes rather than branching individual patient trajectories. World models can in principle operate at multiple scales, from populations down to individual patients, and across time horizons from hours to years. In this perspective, we focus on patient-level simulation over clinical decision horizons (weeks to months), because that is the scale at which treatment-selection problems arise. The gap between population-level emulation and patient-level counterfactual branching is why L3b remains the least populated and most consequential rung on the ladder.

3.4. Surgery: Visual Realism Without Physical Grounding

Surgical world models illustrate a different barrier to L3b. Systems such as SurgWM [43] and Cosmos-H-Surgical [44] can generate plausible future surgical scenes conditioned on instrument actions, but benchmarks built around expert assessment [45] reveal a large plausibility gap. Visually realistic videos often fail where surgeons care most, at instrument-tissue causality and physically consistent deformation. Without patient-specific anatomy and physically grounded validation, these systems approach L3a in the visual sense but cannot support the counterfactual comparisons (e.g., alternative surgical approaches for the same patient) that L3b would require.

3.5. Foundation Models: Provisional Bridges, not Grounded Simulators

Large language models are best understood here as bridges rather than pillars. DT-GPT [12] uses an LLM as the forecasting engine for clinical trajectories, and work on emergent internal representations [46] suggests that sequence-trained models can internalize temporal structure. Yet current medical LLMs remain weakly grounded in quantitative physiology and vulnerable to hallucination [47]. Without explicit state-space dynamics, they function more as stochastic text generators than grounded simulators. The more defensible near-term role is hybridization (Figure 4a), in which the LLM acts as an agentic interface layer that proposes candidate actions and summarizes uncertainty, while a domain-specific world model supplies the grounded dynamics, and the physician rather than a learned policy selects among them. This division of labor keeps the generative flexibility of LLMs where it is useful (natural language interaction, option enumeration) and constrains it where reliability matters (trajectory simulation, uncertainty quantification). Foundation models should be evaluated against the same claims-to-evidence framework as any other system; their fluency does not exempt them from the evidentiary requirements of the level they claim to operate at.

4. Barriers to Reliable Medical World Models

The barriers to reliable medical world models are connected, and their ordering matters for where the field should focus its effort. They range from unobserved confounding in retrospective EHRs, which limits movement from L2 to L3b, to computational scalability and clinical workflow integration. We focus here on the subset that most immediately constrains progress toward credible simulation.

4.1. Validating Unobserved Futures

The core methodological obstacle is that the most interesting output of a medical world model is often unobservable. We can observe what happened after the treatment that was given, but not what would have happened under a different choice. That makes validation harder than in ordinary prediction. Held-out longitudinal data are sufficient for L3a systems, but L3b claims require stronger evidence from causal inference [9], including trial emulation [42,48], established G-methods [49,50], causal machine-learning approaches [19], synthetic environments with known ground truth, or prospective studies [34]. The formal apparatus for this problem exists in the causal inference literature. Pearl’s do-calculus distinguishes observational associations from intervention effects; Robins’s G-methods handle time-varying confounding in longitudinal data. Neither has yet been systematically imported into world model development, and extending these tools to govern multi-step generative dynamics rather than point outcomes remains an open methodological problem. Agreement with published trial-level effect estimates is supportive but not sufficient for individual-level counterfactual simulation. The field still lacks standardized evaluation protocols that separate observed-trajectory accuracy, interventional validity, and downstream clinical utility. Existing verification and validation frameworks for computational models in medical devices, such as ASME V&V 40, offer a starting point because they were designed for exactly this problem, evaluating simulators whose outputs cannot be directly observed. Semi-synthetic benchmarks in which ground-truth treatment effects are known by construction can provide intermediate validation between retrospective evaluation and prospective trials.
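A minimal semi-synthetic benchmark of this kind can be sketched as follows. The data-generating process, effect size, and stratification scheme are all illustrative assumptions: the true treatment effect is fixed by construction, a naive observed-outcome contrast is biased because sicker patients are treated more often, and a confounder-adjusted estimate recovers the truth, which is exactly the property such benchmarks exploit to score interventional claims.

```python
import math
import random

random.seed(0)
TRUE_EFFECT = -1.0  # ground-truth effect, known by construction

def generate_patient():
    severity = random.gauss(0, 1)                        # confounder
    p_treat = 1 / (1 + math.exp(-2 * severity))          # sicker -> treated
    treated = random.random() < p_treat
    outcome = (2.0 * severity                            # severity worsens outcome
               + (TRUE_EFFECT if treated else 0.0)
               + random.gauss(0, 0.5))
    return severity, treated, outcome

data = [generate_patient() for _ in range(20000)]

# Naive contrast: compare observed outcomes between treated and untreated.
treated = [y for _, t, y in data if t]
control = [y for _, t, y in data if not t]
naive = sum(treated) / len(treated) - sum(control) / len(control)

def stratified_effect(data, width=0.25):
    """Adjust for the confounder by comparing within narrow severity bins."""
    diffs, weights = [], []
    for lo in (i * width for i in range(-12, 12)):
        t = [y for s, tr, y in data if tr and lo <= s < lo + width]
        c = [y for s, tr, y in data if not tr and lo <= s < lo + width]
        if t and c:
            diffs.append(sum(t) / len(t) - sum(c) / len(c))
            weights.append(len(t) + len(c))
    return sum(d * w for d, w in zip(diffs, weights)) / sum(weights)

adjusted = stratified_effect(data)
print(f"true {TRUE_EFFECT:+.2f}  naive {naive:+.2f}  adjusted {adjusted:+.2f}")
```

A world model trained on the confounded trajectories can be scored against `TRUE_EFFECT` directly, providing intermediate evidence between retrospective fit and prospective trial.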

4.2. Failure Modes of Medical World Models

A world model that produces anatomically impossible tumor regression, implausible glucose responses, or impossible surgical states is not merely inaccurate. It is potentially dangerous. The risk is sharper still if such a simulator is used not only for display but as a training environment for a downstream policy. Any bias in the learned transition function, whether from confounding, missing covariates, or action mis-specification, becomes a target the policy learns to exploit rather than correct, because the policy sees only the simulator and not reality. A surgical model, for example, may generate a convincing video while deforming tissue in ways that are anatomically impossible or inconsistent with the applied instrument forces. Safety therefore depends on more than mean predictive performance. These systems must distinguish uncertainty caused by limited knowledge from uncertainty inherent to clinical variability [51]. The first should trigger abstention. The second should be communicated as prediction intervals. They also need to detect unfamiliar patients or scenarios, abstain conservatively, and make clear which outputs are simulated rather than observed. This is especially important for generative foundation-model components [52], which can produce fluent but false trajectories unless they are grounded in explicit state dynamics. Hybrid mechanistic-neural models partially mitigate these risks by constraining the output space through governing equations, which is one reason the strongest prospective evidence to date comes from physiology-based systems where such constraints are available.
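The abstention logic described above can be sketched with a toy ensemble. The models, thresholds, and numbers are hypothetical: disagreement across ensemble members proxies uncertainty from limited knowledge and triggers abstention, while the pooled within-model spread is reported as a prediction interval.

```python
import random
import statistics

random.seed(1)

def make_ensemble_member(bias):
    """Each member stands in for one independently trained transition model."""
    def predict(state, n=200):
        return [state + bias + random.gauss(0, 0.4) for _ in range(n)]
    return predict

def simulate_with_uncertainty(state, ensemble, abstain_threshold=0.5):
    member_samples = [m(state) for m in ensemble]
    means = [statistics.mean(s) for s in member_samples]
    epistemic = statistics.pstdev(means)   # disagreement across members
    if epistemic > abstain_threshold:
        return {"abstain": True, "epistemic": epistemic}
    pooled = sorted(x for s in member_samples for x in s)
    return {"abstain": False,               # report aleatoric spread instead
            "interval": (pooled[len(pooled) // 20],
                         pooled[-len(pooled) // 20])}

# Familiar input: members agree, so an interval is reported.
familiar = [make_ensemble_member(b) for b in (0.0, 0.05, -0.05)]
# Unfamiliar input: members disagree sharply, so the system abstains.
unfamiliar = [make_ensemble_member(b) for b in (0.0, 1.5, -1.5)]
print(simulate_with_uncertainty(10.0, familiar))
print(simulate_with_uncertainty(10.0, unfamiliar))
```
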

4.3. Data Gaps in Patient Modeling

The data problem is equally foundational. Real patients are described by partially observed, multimodal, irregularly sampled signals, including scans, vitals, laboratory tests, notes, medication histories, devices, and sometimes genomics. Most current systems simplify aggressively to one modality or one organ system. The datasets required to train and validate patient-level simulators are also hard to assemble because privacy constraints, consent limits, and institutional governance rules often restrict linkage, sharing, and external validation across sites. Beyond these structural barriers, the data themselves reflect practice variation that confounds causal claims. Coding habits differ across institutions, insurance and formulary rules influence what gets prescribed and tested, and recording inconsistencies mean that the same clinical event may appear differently in different systems. These differences are not merely noise for a machine learning model to smooth over. They shape the observed transition dynamics and therefore the world model itself. Action spaces are likewise under-specified. A treatment is rarely just “drug A”. It is drug, dose, timing, combination therapy, adherence, and co-treatment context. Without richer state and action representations, many apparent counterfactual claims will remain thin. Federated learning and privacy-enhancing technologies [53] offer one path forward because they can expand training coverage across institutions without requiring full centralization of patient records, though they do not remove the need for harmonized state and action definitions across sites.
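One way to make the action under-specification concrete is a richer action schema. The following sketch is a hypothetical illustration, not a proposed standard; every field name is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TreatmentAction:
    """Hypothetical action schema: an intervention is more than 'drug A'."""
    drug: str                    # e.g. a drug name or RxNorm-style code
    dose_mg: float               # amount per administration
    frequency_per_day: int       # dosing schedule
    start_day: int               # timing relative to the trajectory
    duration_days: int
    adherence: float = 1.0       # expected fraction of doses taken
    co_treatments: tuple = ()    # concurrent therapies, same type

action_a = TreatmentAction("metformin", 500.0, 2, 0, 90)
action_b = TreatmentAction("metformin", 500.0, 2, 0, 90,
                           adherence=0.7,
                           co_treatments=(TreatmentAction(
                               "semaglutide", 0.5, 1, 14, 76),))

# Two actions that a coarse "drug A" encoding would conflate:
print(action_a == action_b)  # differ in adherence and co-treatment
```

A simulator conditioned only on the drug name would assign these two actions identical dynamics, which is precisely how thin counterfactual claims arise.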

4.4. Generalizability Across Populations and Settings

Generalization is not a cosmetic concern. A simulator trained in one hospital system, geography, or the majority population may generate systematically wrong trajectories for underrepresented groups or unfamiliar settings. In a generative model, the risk can be larger than in an ordinary classifier. A classifier that underperforms on a subgroup produces a wrong label. A simulator trained on historically biased treatment patterns may instead generate an entire future that reflects those same inequities. For disadvantaged populations, it may project worse outcomes simply because they received less aggressive care in the training data. The same mechanism that caused a widely used risk-prediction algorithm to underestimate illness severity in Black patients [54] could, in a world-model setting, shape whole simulated trajectories. Recent work on scaling medical AI across clinical contexts [55] underscores the scale of the problem. At the same time, a simulator that is well validated for a specific population remains clinically useful for that population, provided the intended-use specification is clear about where the model does and does not apply. The more pragmatic path may be to demonstrate that medical world models work in well-characterized settings with dense data capture first, accept the initial limitation in generalizability, and then expand to broader populations as evidence and data accumulate. Medical world models will need explicit strategies for transportability, subgroup auditing, and recalibration across institutions, but demanding universal generalizability before any deployment risks stalling progress that could benefit the populations for which validation is already feasible.

4.5. Prioritizing the Barriers

Counterfactual validation is the most fundamental because, without it, the other barriers are moot. Data richness is the most tractable near-term target because it depends on infrastructure and governance decisions rather than unsolved methodology. Safety and uncertainty quantification are the barriers most likely to block deployment even when the science is ready, because regulators and physicians will not accept simulators that cannot communicate their own limitations. Equity and transportability require sustained attention from the start of system design, not as an afterthought once a model is already trained on a narrow population.

5. Translating Medical World Models to the Clinic

5.1. Integrating Simulation into Clinical Workflows

From the physician’s perspective, the key question is where simulation enters real care. Three near-term placements are plausible. Pre-visit or tumor-board planning allows offline comparison of candidate strategies and can tolerate minutes of computation. Encounter-time decision support shows the expected effect of changing therapy or dose but must return results within the pace of a clinical conversation, favoring lightweight surrogates or precomputed scenario libraries. Trial design and synthetic control arm construction use world models to emulate trial outcomes or generate comparator arms when randomized data are incomplete or infeasible [42]. The encounter-time latency constraint becomes sharper when systems must roll out multiple candidate futures over several time steps at inference time rather than produce a single static prediction.
These workflow placements also imply different minimum capability levels. Encounter-time decision support can be useful at L3a if the system forecasts the likely trajectory under a specified action, whereas tumor-board planning and comparative treatment review more often require L3b because physicians need to weigh multiple plausible alternatives for the same patient. Trial design and real-world evidence generation sit closest to L3b as population-level counterfactual emulation tasks rather than full patient-level planning.
Benefit will depend not only on model accuracy but also on interface design. Physicians should be shown simulated futures as bounded, uncertainty-aware scenarios rather than as single authoritative answers, with clear separation between observed data and model-generated trajectories, side-by-side comparison of candidate actions, and explicit cues for when the model is extrapolating beyond its training experience.
These use cases imply the staged validation pathway shown in Figure 4b: retrospective accuracy, physician plausibility testing [28], silent prospective deployment, and only then interventional testing. The type 1 diabetes digital twin trial [34] remains notable precisely because such prospective evidence is still rare.

5.2. Recognizing When Simulation Is Unreliable

Clinical deployment depends on systems knowing when they should not be trusted. For medical world models specifically, the reliability problem differs from ordinary classification in three ways. First, errors compound across rollout steps, so that a small bias in the transition function can produce a trajectory that diverges substantially from reality over a clinically relevant horizon. Second, uncertainty should grow with the length of the rollout, and a system that reports constant confidence across a 12-month simulation is almost certainly miscalibrated. Third, a simulated trajectory can look internally coherent, with plausible vital signs, laboratory values, and clinical events, while being causally wrong because the treatment effect was learned from confounded observational data. A usable medical world model must therefore output uncertainty intervals that widen with rollout length, flag patients or scenarios unlike those seen during training, and make clear which parts of the output are simulated rather than observed.
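The first two properties can be illustrated with a toy rollout; the per-step bias and noise values are arbitrary. Empirical interval width necessarily grows with horizon, so a simulator reporting constant confidence over a long rollout is suspect.

```python
import random

random.seed(2)

def rollout(state, steps, bias=0.02, noise=0.1):
    """One simulated trajectory: a small per-step bias plus noise."""
    for _ in range(steps):
        state += bias + random.gauss(0, noise)
    return state

widths = []
for horizon in (1, 3, 6, 12):
    finals = sorted(rollout(100.0, horizon) for _ in range(2000))
    lo, hi = finals[100], finals[-100]   # empirical 90% interval
    widths.append(hi - lo)
    print(f"{horizon:2d}-step interval width: {hi - lo:.2f}")

# Width grows roughly with the square root of the horizon for i.i.d.
# step noise, and faster still when errors are correlated across steps.
```
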
If a world model informs treatment decisions, it would likely be classified as software-as-a-medical-device. Under the EU Medical Device Regulation [56], Rule 11 places software that informs therapeutic decisions in Class IIa or higher. The EU Artificial Intelligence Act [57] treats AI embedded in regulated medical devices as high-risk, requiring risk management, data governance, and human oversight. In the United States, the FDA’s guidance on predetermined change control plans [58] offers one pathway for AI-enabled devices that are updated over time. These frameworks were built mainly for discriminative models with observable outputs. World models raise a harder regulatory question because their core product is not a label or a score but a simulated future that cannot be directly verified at the time of use. Regulators will need documentation that distinguishes observed-trajectory accuracy from interventional validity, clarifies the intended clinical use and action scope, and specifies how models will be monitored, recalibrated, and when necessary, withdrawn.

5.3. Near-Term Opportunities Without New Model Training

The most urgent near-term need is evaluation that distinguishes world-model capabilities from ordinary forecasting. The field needs a benchmark suite organized around the capability ladder. At L2, such a benchmark would test trajectory prediction accuracy on held-out patients. At L3a, it would evaluate whether a model’s treatment-conditioned forecasts match outcomes in patients who actually received that treatment. At L3b, it would compare simulated treatment contrasts against known effect sizes from randomized trials or well-designed trial emulations. For the diabetes patient introduced in Section 1, an L3a benchmark might ask whether a model can predict glucose trajectories under a specified insulin regimen, while an L3b benchmark would ask whether the model can recover the known difference in time-in-range between two dosing strategies. The field also needs shared longitudinal, multimodal benchmark datasets with explicit state, action, and outcome definitions across EHR systems. Without such shared schemas, results will remain hard to compare, reproduce, and transport across institutions. In parallel, existing frontier LLMs should be tested as implicit clinical world models without new training, using intervention-conditioned trajectory tasks (e.g., given this patient’s history, what happens to HbA1c if treatment changes from metformin to a GLP-1 agonist?) to establish where prompting alone is sufficient and where grounded dynamics remain indispensable.
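An L3b-style benchmark check of the kind described here might look as follows. The model, cohort, dosing-strategy labels, and trial numbers are all invented for illustration: the test asks only whether the model's simulated between-arm contrast falls within the confidence interval of a randomized trial's published effect estimate.

```python
def simulated_contrast(model, cohort, action_a, action_b):
    """Mean difference in simulated outcome between two candidate arms."""
    outcomes_a = [model(p, action_a) for p in cohort]
    outcomes_b = [model(p, action_b) for p in cohort]
    return (sum(outcomes_a) - sum(outcomes_b)) / len(cohort)

def l3b_check(contrast, trial_effect, trial_ci):
    """Compare a simulated contrast against a trial's reported CI."""
    lo, hi = trial_ci
    return {"simulated": contrast, "trial": trial_effect,
            "within_trial_ci": lo <= contrast <= hi}

# Toy stand-in model: time-in-range gain depends on dosing strategy.
def toy_model(patient, action):
    gain = {"basal-bolus": 8.0, "standard": 3.0}[action]
    return patient["baseline_tir"] + gain

cohort = [{"baseline_tir": 55.0 + i} for i in range(50)]
contrast = simulated_contrast(toy_model, cohort, "basal-bolus", "standard")
# Hypothetical trial result: +5 pp time-in-range, 95% CI [2, 8].
print(l3b_check(contrast, 5.0, (2.0, 8.0)))
```

Passing such a check supports, but does not establish, individual-level counterfactual validity, consistent with the caveat in Section 4.1 that trial-level agreement is necessary rather than sufficient.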

5.4. Building Composite and Hybrid Systems

The medium-term agenda is about composition. Multimodal patient world models should integrate EHR trajectories, imaging, notes, and waveforms into a unified patient representation rather than forcing each domain into a separate simulator. At the same time, hybrid mechanistic-neural models should become a first-class design pattern, especially in cardiology [37,38], glucose control [36], and pharmacology [39], where trusted equations already exist. Surgical rehearsal is another tractable target: current video world models [43,44] should be pushed toward patient-specific procedural planning by linking visual generation to anatomy and physics rather than treating surgery as generic video prediction.

5.5. Toward Causal Simulation and Closed-Loop Care

Longer term, the field must move from correlation-rich forecasting to causal medical world models [19,49,59] that remain reliable when treatments or patient populations shift. Methods such as G-Net [49], which uses g-computation to generate counterfactual predictions under dynamic treatment regimes, represent the closest current bridge between the causal inference and world model literatures. They share the temporal structure and intervention conditioning that define a world model but differ in their primary goal, which is effect estimation rather than building a reusable patient simulator. Sharpening that boundary, understanding when effect estimation is sufficient and when a full simulator adds value, is an open methodological question. The scientific target is a whole-patient simulator in which drug effects propagate across organ systems and comorbidities interact over clinically meaningful time horizons. Reaching that goal will require intermediate milestones, including multi-organ models that couple two or three systems, interaction layers that propagate side effects between organ simulators, and standardized interfaces for combining domain-specific models. The operational target is L4, closed-loop clinical AI that combines a grounded world model with treatment-strategy optimization [20]. That goal remains distant, in part because off-policy evaluation in critical care is methodologically difficult [21]. MedDreamer [41] is an early signal, but the safety and governance burden for genuine clinical control will be far higher than for retrospective prediction.
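The g-computation idea behind methods such as G-Net can be sketched in a few lines. The transition function below is a hand-written stand-in for a learned model, and all parameters are illustrative: a regime maps the current state to a treatment decision, and Monte Carlo rollouts average the simulated outcome under each regime, including dynamic regimes that no patient in the data may have followed exactly.

```python
import random

random.seed(3)

def transition(state, treat):
    """Stand-in for a learned one-step transition model (illustrative)."""
    drift = -0.2 if treat else 0.3   # treatment lowers severity
    return state + drift + random.gauss(0, 0.1)

def g_compute(initial_state, regime, horizon=12, n_mc=1000):
    """Monte Carlo g-computation: mean final state under a regime."""
    finals = []
    for _ in range(n_mc):
        s = initial_state
        for _ in range(horizon):
            s = transition(s, regime(s))
        finals.append(s)
    return sum(finals) / n_mc

never   = g_compute(5.0, regime=lambda s: False)
always  = g_compute(5.0, regime=lambda s: True)
dynamic = g_compute(5.0, regime=lambda s: s > 6.0)   # treat when severe
print(f"never {never:.2f}  always {always:.2f}  dynamic {dynamic:.2f}")
```

The same machinery, with `transition` replaced by a learned network, is what connects effect estimation to reusable simulation: the estimator already contains a rollout, and the open question is when that rollout deserves to be treated as a patient simulator in its own right.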
This agenda is not only for model builders. Physicians should demand uncertainty-aware trajectory displays, subgroup performance reporting, and opportunities to participate in plausibility review and silent prospective validation before routine deployment. Regulators will need evaluation standards that separate observed-trajectory accuracy from interventional validity and require ongoing post-deployment monitoring. Health systems, in turn, will need governed access to longitudinal multimodal data, infrastructure for secure model updating and audit, and workflow integration that allows simulations to be reviewed, challenged, and overridden rather than passively accepted.
Box 2. What medical world models mean for physicians.
Unlike risk scores or classifiers, a medical world model generates forward trajectories of how a patient’s condition may evolve under different treatment choices, supporting the kind of comparative reasoning that physicians already perform mentally. However, the vast majority of current systems remain at the level of observation or unconditional prediction (L1/L2). Only a handful of systems condition on interventions (L3a), and credible counterfactual comparisons (L3b) have been demonstrated in very few settings; a digital twin for type 1 diabetes insulin dosing remains one of the clearest examples of randomized prospective evaluation.
For physicians considering how these tools might enter practice, two points deserve emphasis. First, uncertainty communication matters as much as predictive accuracy: a useful world model must convey when its simulations are reliable and when they are speculative, through calibrated confidence intervals and clear abstention signals. Second, world models trained on historically biased data can generate entire simulated trajectories that encode and perpetuate disparities, making subgroup auditing and equity-aware evaluation essential before any clinical deployment.

6. From Prediction to Clinical Simulation

This perspective has argued that medicine does not primarily need more accurate predictions; it needs models whose simulated futures are aligned with the decisions they are meant to support. The capability ladder proposed here makes that alignment testable by separating representation, passive forecasting, treatment-conditioned rollout, counterfactual comparison, and planning, and by tying each level to a corresponding evidentiary standard. Read against this ladder, the field has not yet moved beyond the representation and forecasting rungs at scale.
The framework generates a concrete near-term prediction. If causal validity rather than generative realism is the binding constraint, then the first systems to reach validated L3b status should come from settings with narrow action spaces and well-measured outcome endpoints, where state dynamics are partly understood from prior physiology or pharmacology. Physiology-based digital twins for glucose control, pharmacokinetic dose adjustment, and trial-emulation frameworks for bounded oncology comparisons fit this description. Image-based tumor-trajectory simulators generate visually plausible futures but cannot yet support patient-level comparative validation, and will not cross the L3b threshold until the evaluation problem, rather than the generative one, is solved. If the first validated L3b systems instead come from generative imaging, the framework will need revision.
The underlying normative principle is plainer. The strength of a clinical claim must be matched by the strength of the evidence behind it. That applies to researchers building systems, to developers deploying them, and to regulators approving them. Without a shared language for stating that match, the field risks rhetorical inflation around systems whose evidence does not support their claims.
The capability ladder proposed here offers that language. Its value is not in the systems it describes but in the claims it disciplines. If the field adopts an evidentiary standard commensurate with the decisions these models are meant to support, medical AI may move from passive prediction toward simulation that clinicians can actually use. If it does not, the language of world models will continue to outpace their clinical reality.

References

  1. LeCun, Y. A Path Towards Autonomous Machine Intelligence. OpenReview Preprint 2022. Version 0.9.2, 2022-06-27.
  2. Ha, D.; Schmidhuber, J. Recurrent world models facilitate policy evolution. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  3. Hafner, D.; Pasukonis, J.; Ba, J.; Lillicrap, T. Mastering diverse control tasks through world models. Nature 2025, 640, 647–653. [Google Scholar] [CrossRef]
  4. Brooks, T.; Peebles, B.; Holmes, C.; DePue, W.; Guo, Y.; Jing, L.; Schnurr, D.; Taylor, J.; Luhman, T.; Luhman, E.; et al. Video generation models as world simulators. OpenAI Blog 2024, 1, 1. [Google Scholar]
  5. Parker-Holder, J.; Ball, P.; Bruce, J.; Dasagi, V.; Holsheimer, K.; Kaplanis, C.; Moufarek, A.; Scully, G.; Shar, J.; Shi, J.; et al. Genie 2: A large-scale foundation world model. Google DeepMind Blog, 2024. https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model
  6. Topol, E.J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56. [Google Scholar] [CrossRef]
  7. Esteva, A.; et al. A guide to deep learning in healthcare. Nat. Med. 2019, 25, 24–29. [Google Scholar] [CrossRef] [PubMed]
  8. Rajkomar, A.; et al. Scalable and accurate deep learning with electronic health records. npj Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef]
  9. Hernán, M.A.; Robins, J.M. Causal Inference: What If; Chapman & Hall/CRC: Boca Raton, 2020. [Google Scholar]
  10. Sadée, C.; et al. Medical digital twins: enabling precision medicine and medical artificial intelligence. Lancet Digit. Health 2025, 7, 100864. [Google Scholar] [CrossRef]
  11. Kraljevic, Z.; et al. Foresight: a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit. Health 2024, 6, e281–e290. [Google Scholar] [CrossRef]
  12. Makarov, N.; et al. Large language models forecast patient health trajectories enabling digital twins. npj Digit. Med. 2025, 8, 588. [Google Scholar] [CrossRef] [PubMed]
  13. Long, Y.; Lin, A.; Kwok, D.H.C.; Zhang, L.; Yang, Z.; Shi, K.; Song, L.; Fu, J.; Lin, H.; Wei, W.; et al. Surgical embodied intelligence for generalized task autonomy in laparoscopic robot-assisted surgery. Sci. Robot. 2025, 10, eadt3093. [Google Scholar] [CrossRef]
  14. Yang, Z.; et al. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat. Commun. 2023, 14, 7857. [Google Scholar] [CrossRef] [PubMed]
  15. Singhal, K.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  16. Qazi, M.A.; Nadeem, M.; Yaqub, M. Beyond Generative AI: World Models for Clinical Prediction, Counterfactuals, and Planning. In Proceedings of the NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025.
  17. Tomašev, N.; et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 2019, 572, 116–119. [Google Scholar] [CrossRef]
  18. Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Deepcare: A deep dynamic memory model for predictive medicine. In Proceedings of the Pacific-Asia conference on knowledge discovery and data mining. Springer, 2016, pp. 30–41.
  19. Shalit, U.; Johansson, F.D.; Sontag, D. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the International conference on machine learning. PMLR, 2017, pp. 3076–3085.
  20. Komorowski, M.; Celi, L.A.; Badawi, O.; Gordon, A.C.; Faisal, A.A. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 2018, 24, 1716–1720. [Google Scholar] [CrossRef]
  21. Gottesman, O.; Johansson, F.; Komorowski, M.; Faisal, A.; Sontag, D.; Doshi-Velez, F.; Celi, L.A. Guidelines for reinforcement learning in healthcare. Nat. Med. 2019, 25, 16–18. [Google Scholar] [CrossRef]
  22. Murphy, S.A. Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 2003, 65, 331–355. [Google Scholar] [CrossRef]
  23. Chakraborty, B.; Murphy, S.A. Dynamic Treatment Regimes. Annu. Rev. Stat. Its Appl. 2014, 1, 447–464. [Google Scholar] [CrossRef]
  24. Pearl, J. Causality; Cambridge university press, 2009. [Google Scholar]
  25. Jumper, J.; Evans, R.; Pritzel, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef]
  26. Hollingsworth, S.A.; Dror, R.O. Molecular Dynamics Simulation for All. Neuron 2018, 99, 1129–1143. [Google Scholar] [CrossRef]
  27. Cheng, X.; Li, P.; Guo, H.; Liang, Y.; Gong, J.; de Vazelhes, W.; Gou, C.; Xie, P.; Song, L.; Xing, E.P. Harnessing AI to Build Virtual Cells. bioRxiv 2026. [Google Scholar] [CrossRef]
  28. Yang, Y.; Wang, Z.Y.; Liu, Q.; Sun, S.; Wang, K.; Chellappa, R.; Zhou, Z.; Yuille, A.; Zhu, L.; Zhang, Y.D.; et al. Medical world model: Generative simulation of tumor evolution for treatment planning. arXiv preprint arXiv:2506.02327 2025. [Google Scholar]
  29. Ding, T.; Zou, Y.; Chen, C.; Shah, M.; Tian, Y. CLARITY: Medical World Model for Guiding Treatment Decisions by Modeling Context-Aware Disease Trajectories in Latent Space. arXiv preprint arXiv:2512.08029 2025. [Google Scholar]
  30. Yue, Y.; Wang, Y.; Tao, C.; Liu, P.; Song, S.; Huang, G. CheXWorld: Exploring image world modeling for radiograph representation learning. In Proceedings of the Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 20778–20788.
  31. Liu, Q.; Fuster-Garcia, E.; Hovden, I.T.; MacIntosh, B.J.; Grødem, E.O.; Brandal, P.; Lopez-Mateu, C.; Sederevičius, D.; Skogen, K.; Schellhorn, T.; et al. Treatment-aware diffusion probabilistic model for longitudinal MRI generation and diffuse glioma growth prediction. IEEE Trans. Med. Imaging 2025, 44, 2449–2462. [Google Scholar] [CrossRef]
  32. Corral-Acero, J.; et al. The `Digital Twin’ to enable the vision of precision cardiology. Eur. Heart J. 2020, 41, 4556–4564. [Google Scholar] [CrossRef] [PubMed]
  33. Kuang, K.; Dean, F.; Jedlicki, J.B.; Ouyang, D.; Philippakis, A.; Sontag, D.; Alaa, A. Med-real2sim: Non-invasive medical digital twins using physics-informed self-supervised learning. Adv. Neural Inf. Process. Syst. 2024, 37, 5757–5788. [Google Scholar]
  34. Builes-Montaño, C.E.; et al. A digital twin-enhanced decision support system improves time-in-range in type 1 diabetes: a randomized clinical trial. Sci. Rep. 2025, 15, 39738. [Google Scholar] [CrossRef]
  35. Kovatchev, B.P.; Colmegna, P.; Pavan, J.; Diaz Castañeda, J.L.; Villa-Tamayo, M.F.; Koravi, C.L.; Santini, G.; Alix, C.; Stumpf, M.; Brown, S.A. Human-machine co-adaptation to automated insulin delivery: a randomised clinical trial using digital twin technology. npj Digit. Med. 2025, 8, 253. [Google Scholar] [CrossRef] [PubMed]
  36. Mujahid, O.; et al. Generative deep learning for the development of a type 1 diabetes simulator. Commun. Med. 2024, 4, 51. [Google Scholar] [CrossRef]
  37. Salvador, M.; Strocchi, M.; Regazzoni, F.; Augustin, C.M.; Dede’, L.; Niederer, S.A.; Quarteroni, A. Whole-heart electromechanical simulations using latent neural ordinary differential equations. npj Digit. Med. 2024, 7, 90. [Google Scholar] [CrossRef]
  38. Qian, S.; et al. Developing cardiac digital twin populations powered by machine learning provides electrophysiological insights in conduction and repolarization. Nat. Cardiovasc. Res. 2025, 4, 624–636. [Google Scholar] [CrossRef] [PubMed]
  39. Lu, J.; Deng, K.; Zhang, X.; Liu, G.; Guan, Y. Neural-ODE for Pharmacokinetics Modeling and Its Advantage to Alternative Machine Learning Models in Predicting New Dosing Regimens. iScience 2021, 24, 102804. [Google Scholar] [CrossRef]
  40. Mould, D.R.; Upton, R.N. Basic concepts in population modeling, simulation, and model-based drug development. CPT Pharmacomet. Syst. Pharmacol. 2012, 1, 1–14. [Google Scholar] [CrossRef]
  41. Xu, Q.; Habib, G.; Wu, F.; Perera, D.; Feng, M. MedDreamer: Model-Based Reinforcement Learning with Latent Imagination on Complex EHRs for Clinical Decision Support. arXiv preprint arXiv:2505.19785 2025. [Google Scholar]
  42. González, J.; Ueno, R.; Wong, C.; Gero, Z.; Bagga, J.; Chien, I.; Oravkin, E.; Kiciman, E.; Nori, A.; Weerasinghe, R.; et al. TRIALSCOPE—A framework for clinical trial simulation from real-world data. NEJM AI 2025, 2, AIoa2400859. [Google Scholar] [CrossRef]
  43. Koju, S.; Bastola, S.; Shrestha, P.; Amgain, S.; Shrestha, Y.R.; Poudel, R.P.; Bhattarai, B. Surgical vision world model. In Proceedings of the MICCAI Workshop on Data Engineering in Medical Imaging. Springer, 2025, pp. 1–10.
  44. He, Y.; Guo, P.; Xu, M.; Li, Z.; Myronenko, A.; Imans, D.; Liu, B.; Yang, D.; Gu, M.; Ji, Y.; et al. Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling. arXiv preprint arXiv:2512.23162 2025. [Google Scholar]
  45. Chen, Z.; Xu, Q.; Wu, J.; Yang, B.; Zhai, Y.; Guo, G.; Zhang, J.; Ding, Y.; Navab, N.; Luo, J. How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment. arXiv preprint arXiv:2511.01775 2025. [Google Scholar]
  46. Li, K.; Hopkins, A.K.; Bau, D.; Viégas, F.; Pfister, H.; Wattenberg, M. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. ICLR 2023. [Google Scholar]
  47. Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef]
  48. Hernán, M.A.; Robins, J.M. Using big data to emulate a target trial when a randomized trial is not available. Am. J. Epidemiol. 2016, 183, 758–764. [Google Scholar] [CrossRef]
  49. Li, R.; Hu, S.; Lu, M.; Utsumi, Y.; Chakraborty, P.; Sow, D.M.; Madan, P.; Li, J.; Ghalwash, M.; Shahn, Z.; et al. G-net: a recurrent network approach to g-computation for counterfactual prediction under a dynamic treatment regime. In Proceedings of the Machine Learning for Health. PMLR, 2021, pp. 282–299.
  50. Robins, J.M.; Hernan, M.A.; Brumback, B. Marginal structural models and causal inference in epidemiology. Epidemiology 2000, 11, 550–560. [Google Scholar] [CrossRef]
  51. Abdar, M.; et al. A Review of Uncertainty Quantification in Deep Learning: Techniques, Applications and Challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
  52. Teo, Z.L.; et al. Generative Artificial Intelligence in Medicine. Nat. Med. 2025, 31, 3270–3282. [Google Scholar] [CrossRef]
  53. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The future of digital health with federated learning. npj Digit. Med. 2020, 3, 119. [Google Scholar] [CrossRef] [PubMed]
  54. Obermeyer, Z.; Powers, B.; Vogeli, C.; Mullainathan, S. Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations. Science 2019, 366, 447–453. [Google Scholar] [CrossRef] [PubMed]
  55. Li, M.M.; Reis, B.Y.; Rodman, A.; Cai, T.; Dagan, N.; Balicer, R.D.; Loscalzo, J.; Kohane, I.S.; Zitnik, M. Scaling Medical AI across Clinical Contexts. Nat. Med. 2026, 32, 439–448. [Google Scholar] [CrossRef]
  56. Regulation (EU) 2017/745 of the European Parliament and of the Council on medical devices. Official Journal of the European Union, 2017.
  57. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence. Official Journal of the European Union, 2024.
  58. U.S. Food and Drug Administration. Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence-Enabled Device Software Functions. FDA Guidance Document, 2024.
  59. Bica, I.; et al. From Real-World Patient Data to Individualized Treatment Effects Using Machine Learning: Current and Future Methods to Address Underlying Challenges. Clin. Pharmacol. Ther. 2021, 109, 87–100. [Google Scholar] [CrossRef]
Figure 1. The four criteria defining a medical world model and the capability each unlocks. Each row introduces one criterion and isolates its unique contribution to the patient model. Gray elements are inherited from previous rows; colored elements are unique to that row. C1 (teal) learns a latent patient state. C2 (blue) adds a transition function f(s) that projects the state forward in time. C3 (amber) adds a clinical action argument, extending f(s) to f(s, a), so that the trajectory is conditioned on clinical actions. C4 (purple) applies f(s, a) repeatedly, producing diverging trajectories from the same patient under competing actions a and b.
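The rollout mechanism described in the Figure 1 caption can be made concrete with a minimal sketch. The toy transition function, the treatment names "a" and "b", and the linear drift values below are all illustrative assumptions; a real medical world model would learn f(s, a) from patient data rather than hard-code it.

```python
import numpy as np

def step(state, action, rng):
    """Hypothetical transition function f(s, a): one discrete time step.

    Toy linear dynamics with assumed per-step treatment effects; a stand-in
    for a learned neural transition model, not an actual clinical model.
    """
    drift = {"a": -0.05, "b": 0.02}[action]  # assumed treatment effects
    noise = rng.normal(0.0, 0.01, size=state.shape)
    return state + drift + noise

def rollout(state0, action, horizon, seed=0):
    """Apply f(s, a) repeatedly (criterion C4) to produce one trajectory."""
    rng = np.random.default_rng(seed)
    traj = [state0]
    for _ in range(horizon):
        traj.append(step(traj[-1], action, rng))
    return np.stack(traj)

s0 = np.array([1.0])                   # latent patient state (C1)
traj_a = rollout(s0, "a", horizon=12)  # trajectory under treatment a
traj_b = rollout(s0, "b", horizon=12)  # trajectory under treatment b
# Comparing traj_a and traj_b from the same initial state s0 is the
# counterfactual divergence illustrated in Figure 1 (C4).
```

The key design point is that both trajectories start from the identical latent state and diverge only because the action argument differs, which is exactly the capability that criterion C3 adds over an action-free forecaster.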
Figure 2. A capability progression for clinical AI. Each stage represents a generation of medical AI and the clinical question it can answer for the same patient. Rule-based systems encode expert knowledge as conditional logic. Discriminative ML classifies observations and forecasts near-term outcomes but does not model how patients change in response to treatment choices. Counterfactual causal models estimate what would happen under an alternative intervention by learning treatment effect functions. World models simulate how a patient’s state evolves under specified treatments, enabling comparison of trajectories before committing to a decision. These paradigms coexist in practice; later paradigms build on earlier ones rather than replacing them.
Figure 4. A hybrid architecture for clinical world models and a staged pathway to validation. (a) In a hybrid design, patient data (electronic health records, imaging, waveforms, device data) feeds both the domain-specific world model and the LLM interface layer; the physician also reviews the raw patient data directly. The world model simulates patient trajectories under candidate interventions. The LLM acts as an agentic interface layer, proposing candidate actions and summarizing options. The physician reviews simulated outcomes alongside the original patient data, uncertainty estimates, and model confidence before making a treatment decision. (b) Before clinical deployment, world-model-based decision support should pass through five stages of validation with increasing rigor. Because world models generate unobserved futures rather than classifying observed inputs, each stage must address challenges specific to simulation. Retrospective evaluation on held-out data must assess trajectory plausibility, not just endpoint accuracy. Expert plausibility assessment requires physicians to judge whether simulated disease courses and treatment responses are clinically coherent. Silent prospective deployment must compare predicted trajectories against outcomes as they unfold, not only at a single follow-up point. Interventional testing through randomized or controlled studies must evaluate whether model-informed decisions improve outcomes relative to standard care. Post-deployment surveillance must monitor for distributional drift in both patient populations and treatment patterns, triggering recalibration when the model’s training assumptions no longer hold. To date, only the type 1 diabetes digital twin trial has reached stage four (asterisk in panel b).
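The five-stage validation pathway in Figure 4b is an ordered gating protocol: a system's allowable claim is capped by the first stage it has not passed. A minimal sketch of that gating logic follows; the stage and claim labels are illustrative paraphrases of the caption, not terms defined by the paper.

```python
from enum import IntEnum

class ValidationStage(IntEnum):
    """Illustrative labels for the five stages described in Figure 4b."""
    RETROSPECTIVE = 1       # trajectory plausibility on held-out data
    EXPERT_REVIEW = 2       # clinician plausibility assessment
    SILENT_PROSPECTIVE = 3  # shadow deployment, trajectories vs. outcomes
    INTERVENTIONAL = 4      # randomized/controlled model-informed care
    SURVEILLANCE = 5        # post-deployment drift monitoring

def max_allowable_claim(completed: set) -> str:
    """Map completed stages to the strongest claim a system may make.

    Stages must be passed in order; a gap in the sequence caps the claim
    at the last contiguously completed stage.
    """
    reached = 0
    for stage in ValidationStage:
        if stage in completed:
            reached = int(stage)
        else:
            break
    return {
        0: "no validated claim",
        1: "retrospectively plausible",
        2: "clinically coherent",
        3: "prospectively consistent",
        4: "improves outcomes",
        5: "monitored in deployment",
    }[reached]
```

Under this reading, a system that passed retrospective evaluation and silent prospective deployment but skipped expert review would still be capped at the retrospective claim, reflecting the caption's point that each stage adds rigor the earlier ones cannot substitute for.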
Table 1. A claims-to-evidence framework for medical world models. Each level defines a clinical question, the strongest claim a system at that level can support, the minimum evidence required to substantiate that claim, and a representative system. Systems listed are illustrative rather than exhaustive. The framework is normative: systems whose evidence falls below the stated minimum for their claimed level should be treated as provisional.
| Level | Clinical question | Allowable claim | Minimum evidence | Representative system |
| --- | --- | --- | --- | --- |
| L1 | What is the patient's current state? | Patient state can be meaningfully compressed | Reconstruction fidelity on held-out data | Self-supervised imaging encoders |
| L2 | What will happen next? | Future trajectory can be forecast from history | Held-out temporal accuracy | Foresight, DT-GPT |
| L3a | What happens under treatment A? | Trajectory under a specified treatment is plausible | Held-out trajectory evaluation against observed outcomes in patients who received the indexed treatment | MeWM, T1D digital twin |
| L3b | How do treatments A and B compare? | Comparative treatment effect is credible | Trial emulation, causal inference, or randomized data | TRIALSCOPE (partial) |
| L4 | What treatment sequence is optimal? | Recommended strategy improves outcomes | Prospective randomized study | None yet validated |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.