4. Discussion
SteeraMed Core is the molecular-network implementation layer of SteeraMed, designed to generate auditable N-of-1 molecular evidence chains rather than validated treatment recommendations. The following discussion contextualizes the results within this framing.
4.1. Evidence-Chain Standardization for N-of-1 Individualized Intervention Reasoning
The primary implication of this work is not automated mechanism-alignment ranking. Rather, SteeraMed Core illustrates how N-of-1 individualized intervention reasoning can be organized as a standardized evidence-generation process. In current practice, individualized interventions are often selected using population-level associations, expert preference, or fragmented biomarkers. These approaches are difficult to audit, compare, or iteratively improve. A Steerable Biomedical World Model framework provides a different structure: define the individual biological state, represent candidate interventions as actions, and estimate plausible state-transition hypotheses. Response is then measured longitudinally, and the next intervention cycle is updated accordingly.
Mechanism-alignment ranking is a demonstration task in this framework. The broader target is evidence standardization for N-of-1 individualized intervention reasoning. The current study uses DNA methylation, PPI modules, and FDA-approved compound targets because they provide a tractable and auditable demonstration domain. Known-drug recovery is used as a retrospective positive-control task, not as the final purpose of the system.
4.2. FDA PMF-Inspired Evidence Categories, Not Regulatory Sufficiency
SteeraMed Core should not be interpreted as meeting FDA PMF evidentiary requirements. Inspired by FDA's Plausible Mechanism Framework (PMF), we organize individualized evidence into five conceptual layers: molecular abnormality, intervention-mechanism linkage, reference-state context, target engagement, and clinical or biomarker response. The framework serves only as an epistemic template: SteeraMed Core borrows the structure of mechanism-based individualized evidence to guide computational evidence-chain design, without implementing the FDA requirements themselves.
FDA's Plausible Mechanism Framework provides a useful regulatory-science analogy for organizing individualized mechanism evidence, particularly around abnormality definition, mechanistic linkage, reference context, target engagement, and clinical or biomarker response.
Figure 9 summarizes how SteeraMed Core's evidence-chain components map to these PMF-inspired evidence categories as a conceptual reference. SteeraMed Core generates computational artifacts corresponding to the first two PMF-inspired evidence categories: molecular-state characterization and intervention-mechanism linkage. These artifacts are not regulatory evidence and do not establish target engagement or clinical benefit. Reference-state context is approximated through retrospective age/sex-matched controls, though FDA's intent likely refers to prospective natural history studies. Target engagement and clinical or biomarker response require prospective experimental and clinical evidence beyond SteeraMed Core's computational scope.
FDA individualized-therapy and PMF discussions provide a useful precedent. They emphasize that individualized interventions require a defined abnormality, a mechanistically justified action, reference context, target engagement, and longitudinal evidence of response. N-of-1 individualized intervention reasoning faces a similar epistemic challenge: how to make individualized interventions explainable, testable, comparable, and updatable.
4.3. SteeraMed Core as the Current Implementation of a Steerable Biomedical World Model
In a companion paper, we outlined the general SteeraMed architecture as a steerability-oriented framework for biomedical world models [10]. The present study reports SteeraMed Core, the molecular-network implementation layer of that architecture. SteeraMed Core implements only a subset of the full SteeraMed system: molecular state representation through promoter-level methylation deltas, action representation through compound-target profiles, and mechanism alignment through target localization within perturbed PPI modules (Supplementary Table S2). It does not yet implement a learned transition model, multi-step planning, dose-response modeling, or prospective quality-control feedback.
The world model framing thus serves as an aspirational architecture rather than a fully realized planning system. It generates mechanism-alignment and transition hypotheses rather than validated transition predictions. Prospective paired pre/post-treatment methylome datasets, target-engagement measurements, and clinical outcome data will be required to upgrade SteeraMed from a static molecular evidence-chain generator into a calibrated transition model. This is consistent with the broader trajectory of world models in AI, where progress required large-scale (state, action, next_state) datasets that are not yet available in biomedicine.
4.4. Boundary Conditions and Signal Determinants
The variation in Recall-10 across the three diseases (20.0% to 51.7%) is not noise but a systematic pattern reflecting four biological and technical factors. First, positive drug coverage in STITCH: RA had 14 drugs with verified STITCH CIDs, whereas BC had only 6; more positive drugs increase the probability of at least one hit per patient. Second, disease methylation effect size: RA produces large methylation changes in immune-related genes (mean delta ~0.02 in top PPI modules), yielding stronger alignment scores, while BC and depression show moderate effect sizes. Third, positive control count and candidate pool size: depression had 32 positive controls among 966 candidates (baseline ~28.4%), the highest baseline of the three diseases, versus RA (~8.9% with 14 positives) and BC (~3.9% with 6 positives). Fourth, tissue relevance: RA showed the strongest absolute signal, consistent with blood being the disease-relevant tissue and with strong immune-related methylation signals. BC showed moderate signal, limited by the fact that blood methylation may not fully represent tumor tissue biology and by STITCH gaps for key BC drugs (tamoxifen, letrozole). Depression showed a complex pattern: the combined Recall-10 (25.3%) did not exceed the random baseline (~28.4%), but nutraceutical positive controls revealed above-baseline enrichment in specific age-sex sub-cohorts (Section 3.3), suggesting that depression's molecular heterogeneity requires stratified analysis rather than aggregate evaluation.
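The dependence of the random baseline on the number of positive controls and the candidate pool size can be illustrated with a short calculation. The sketch below assumes that the baseline is the chance that a uniformly random ranking places at least one positive control in the top 10, which is a hypergeometric "at least one hit" probability; the exact procedure used in the study is not restated here, and the function names are illustrative. Under this assumption, the depression figures (32 positives among 966 candidates) give a chance level close to the reported baseline.

```python
from math import comb


def random_recall_at_k(n_positive: int, n_candidates: int, k: int = 10) -> float:
    """Chance that a uniformly random ranking puts >= 1 positive in the top k.

    Assumption (not taken from the paper): the random baseline is the
    hypergeometric probability of at least one hit among k draws.
    """
    if not 0 <= n_positive <= n_candidates:
        raise ValueError("need 0 <= n_positive <= n_candidates")
    k = min(k, n_candidates)
    # P(no positive in top k) = C(N - P, k) / C(N, k)
    p_miss = comb(n_candidates - n_positive, k) / comb(n_candidates, k)
    return 1.0 - p_miss


def fold_enrichment(observed_recall: float, baseline_recall: float) -> float:
    """Observed Recall-K divided by the random-ranking baseline."""
    return observed_recall / baseline_recall


# Depression figures quoted in the text: 32 positives, 966 candidates.
baseline = random_recall_at_k(n_positive=32, n_candidates=966, k=10)
print(f"random baseline: {baseline:.3f}")
print(f"fold enrichment at 25.3% observed: {fold_enrichment(0.253, baseline):.2f}")
```

Because the baseline rises quickly with the number of positives, raw Recall-10 values are not comparable across diseases without this normalization.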
These differences should be interpreted as boundary conditions rather than failures or universal performance estimates. They suggest that SteeraMed performance depends on sample relevance, tissue-disease alignment, target database coverage, and metadata quality. This context-dependence is itself informative: it indicates where the current evidence-chain architecture is most applicable and where future improvements in data infrastructure will be most impactful.
The signal attenuation experiment (Section 3.2) provides a mechanistic explanation for this context-dependence. Under strong-signal conditions (established RA), single-gene and PPI-module representations performed within about 5 percentage points of each other (18.7% vs 24.0% Recall-10), suggesting that when disease signal is abundant, module-level aggregation provides modest additional benefit. However, as signal-to-noise ratio decreased, single-gene representations collapsed below the random baseline (3.0% at α = 0.2), while PPI-module representations retained above-baseline performance (11.8%). This divergence indicates that module-level aggregation becomes increasingly important as molecular perturbations become subtler, which is precisely the regime relevant to early disease detection and healthy aging. Notably, curated pathway gene sets showed structural bias, performing below baseline at clean signal but above baseline under noise, likely reflecting preferential targeting of disease-relevant pathways by known RA drugs rather than genuine signal recovery.
Statistical interpretation of module-level aggregation. PPI modules are correlated but partially independent units: genes within a module share functional context (correlation), but different modules capture different biological processes (partial independence). Module-level aggregation therefore differs from both single-gene analysis (no aggregation, high variance) and pathway-level analysis (large, overlapping gene sets with potential structural bias). The SA score's Welch-type contrast statistic treats each module as a unit of analysis, comparing target-gene deltas against non-target-gene deltas within that module. This within-module comparison naturally controls for module-wide methylation shifts (e.g., global hypomethylation in immune cells), while the across-module voting aggregates independent alignment signals. The resulting N-of-1 evidence chain reflects the convergence of multiple partially independent module-level rankings rather than a single aggregate statistic, which is why it remains informative even when individual module signals are weak. The contrast statistic is used for ranking only; its magnitude should not be interpreted as a calibrated effect size or inferential p-value, because genes within modules are not independent.
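The within-module contrast and across-module voting described above can be sketched as follows. This is a minimal illustration, not the study's SA score: it assumes the contrast has the standard Welch form (difference of means over the unpooled standard error), and the voting rule used here (each module casts one vote for the compound with the largest absolute contrast) is a hypothetical stand-in for the pipeline's actual aggregation. All names and the toy data are illustrative.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def welch_contrast(target_deltas, background_deltas):
    """Welch-type statistic: target vs non-target gene deltas in one module."""
    if len(target_deltas) < 2 or len(background_deltas) < 2:
        return 0.0  # too few genes to estimate a variance; no evidence
    se2 = (variance(target_deltas) / len(target_deltas)
           + variance(background_deltas) / len(background_deltas))
    if se2 == 0:
        return 0.0
    return (mean(target_deltas) - mean(background_deltas)) / sqrt(se2)


def vote_across_modules(module_deltas, compound_targets):
    """Rank compounds by the number of modules in which they score highest.

    module_deltas: {module: {gene: methylation delta}}
    compound_targets: {compound: set of target genes}
    Voting rule and use of the absolute value are assumptions.
    """
    votes = Counter()
    for deltas in module_deltas.values():
        scores = {}
        for compound, targets in compound_targets.items():
            inside = [d for g, d in deltas.items() if g in targets]
            outside = [d for g, d in deltas.items() if g not in targets]
            scores[compound] = abs(welch_contrast(inside, outside))
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            votes[best] += 1
    return votes.most_common()


# Toy data: two modules, two hypothetical compounds.
modules = {
    "M1": {"A": 0.30, "B": 0.28, "C": 0.01, "D": -0.01, "E": 0.00},
    "M2": {"A": 0.25, "B": 0.27, "F": 0.02, "G": 0.00, "H": -0.02},
}
compounds = {"drugX": {"A", "B"}, "drugY": {"C", "D", "F", "G"}}
print(vote_across_modules(modules, compounds))
```

Because each score compares genes inside the same module, a shift that moves every gene in the module by the same amount cancels out of the numerator, which is the control for module-wide shifts noted above. The statistic serves only to order compounds; as stated in the text, it is not a calibrated effect size.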
4.5. Future Applications to Healthy Aging and Longevity Medicine
Healthy aging and longevity medicine remain important future application domains for SteeraMed, but they are not directly validated by the present disease-cohort analyses. Longevity interventions differ from disease treatment in objectives, time horizon, outcome measures, and acceptable evidence standards. A longevity-specific implementation would require healthy-aging molecular cohorts, aging- or resilience-associated modules, geroprotector intervention vocabularies, and paired pre- and post-intervention molecular and functional measurements.
The relevance of SteeraMed to longevity medicine therefore lies in its evidence architecture rather than in the specific disease-cohort results. Future longevity studies should replace disease-associated perturbation modules with aging-, resilience-, intrinsic-capacity-, or functional-decline-associated modules [20], and should replace therapeutic positive controls with geroprotective or health-optimization intervention vocabularies. Such studies will be necessary before SteeraMed can be claimed as a validated framework for N-of-1 longevity medicine. The aging cohort analysis (Section 3.4) provides preliminary, exploratory support for this direction: using phenotype-calibrated parameters (PPI 20-800, CHEM >= 5, matched to the lower STITCH target counts of geroprotectors), SteeraMed Core achieved 1.8-fold enrichment over random baseline and identified niacin (NAD+ precursor) and colchicine (anti-senescence) as top-ranked compounds, both with independent published geroprotective evidence. Notably, while per-patient voting recovered geroprotectors (15.9% Recall-10), group-level averaging failed (bootstrap fold 0.02x), highlighting the high inter-individual heterogeneity of aging signals and the importance of N-of-1 approaches in longevity applications. This result supports the core design principle that PPI module size should match drug target profiles: geroprotectors with fewer STITCH targets naturally pair with smaller PPI modules, just as high-target-count RA drugs pair with larger modules.
The present study does not validate SteeraMed for longevity medicine. Instead, it establishes a general molecular evidence-chain framework for N-of-1 individualized intervention reasoning, with longevity medicine treated as a future application domain requiring prospective healthy-aging datasets.
4.6. Claims and Evidence Boundaries
To clarify what the present study does and does not establish, we summarize the evidence boundaries in Table 2.
Table 2. Claims and Evidence Boundaries.
| Claim | Supported by current study? | Evidence | Boundary |
| --- | --- | --- | --- |
| SteeraMed Core can represent individual molecular state | Yes | Methylation deltas mapped to PPI modules | Whole-blood promoter methylation only; composite signal |
| SteeraMed Core can generate auditable evidence chains | Yes | Four-layer patient-level reports | Mechanism hypotheses only |
| SteeraMed Core enriches known interventions in RA and BC | Partially yes | Above-baseline positive-control recovery | Disease-informed feature selection and STITCH coverage bias |
| Depression shows aggregate positive-control enrichment | No for combined controls | Combined Recall-10 below random baseline | Stratified nutraceutical signals are exploratory |
| Aging analysis validates geroprotector discovery | No | Literature-convergent exploratory candidates | No clinical ground truth or longitudinal validation |
| SteeraMed Core predicts clinical efficacy | No | Not tested | Requires prospective intervention studies |
| SteeraMed Core is a complete world model | No | Static mechanism-alignment only | No learned state-action-next-state transition model |
| SteeraMed Core satisfies FDA PMF | No | Conceptual mapping only | Not regulatory evidence |
4.7. Prospective N-of-1 Roadmap
We envision SteeraMed's development in four phases:
Phase 1 (completed): Retrospective positive-control evaluation. Using public GEO data, SteeraMed's molecular evidence chains ranked at least one known therapeutic drug among the top-10 candidates for 51.7% of RA patients and 20.0% of BC patients (primary pipeline). Depression (GSE128235, N=324 cases) showed a combined Recall-10 of 25.3% that did not exceed the random baseline, but revealed a drug-nutraceutical divergence with nutraceutical enrichment in specific sub-cohorts. Variable but above-baseline enrichment stability was observed for RA and BC. All data, code, and results are available for reproduction.
Phase 2 (near-term): Prospective observational N-of-1 registry. The key step is establishing a longitudinal registry that collects baseline methylome, proteome, metabolome, wearable data, functional measurements, and intervention exposure records from individuals engaged in health optimization or longevity programs. With current EPIC v2 array turnaround times, this is technically feasible.
Phase 3 (medium-term): Prospective intervention-cycle studies. This phase would collect pre- and post-intervention measurements for lifestyle, nutrition, exercise, supplement, pharmacological, and cellular interventions. Each intervention cycle generates a (state, action, next_state) observation that can be used to test SteeraMed's mechanism-alignment predictions against actual molecular responses.
Phase 4 (long-term): Calibrated transition learning. This phase would learn state-action-next-state models from repeated N-of-1 intervention cycles, converting SteeraMed from a static evidence-chain generator into a calibrated transition model that can predict post-intervention molecular states.
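To make the (state, action, next_state) observation of Phases 3 and 4 concrete, the sketch below shows one possible record layout and a simple check of a predicted direction of change against the observed shift. Everything here is hypothetical: the class name, the use of module-level deltas as the state, and the sign-agreement measure are illustrative assumptions, not part of the current SteeraMed Core implementation.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class InterventionCycle:
    """One N-of-1 intervention cycle (hypothetical record layout)."""
    state: Mapping[str, float]       # pre-intervention value per module
    action: str                      # intervention identifier
    next_state: Mapping[str, float]  # post-intervention value per module


def observed_shift(cycle: InterventionCycle) -> Dict[str, float]:
    """Post minus pre value for every module measured at both time points."""
    return {
        module: cycle.next_state[module] - pre
        for module, pre in cycle.state.items()
        if module in cycle.next_state
    }


def direction_agreement(cycle: InterventionCycle,
                        predicted_sign: Mapping[str, int]) -> float:
    """Fraction of modules whose observed shift has the predicted sign.

    predicted_sign maps module -> +1 or -1. Modules without a prediction
    or with a zero observed shift are ignored.
    """
    shifts = observed_shift(cycle)
    scored = [(shifts[m], s) for m, s in predicted_sign.items()
              if m in shifts and shifts[m] != 0]
    if not scored:
        return 0.0
    matches = sum(1 for shift, sign in scored if (shift > 0) == (sign > 0))
    return matches / len(scored)


cycle = InterventionCycle(
    state={"M1": 0.05, "M2": -0.03},
    action="example_intervention",
    next_state={"M1": 0.01, "M2": -0.04},
)
print(direction_agreement(cycle, {"M1": -1, "M2": +1}))
```

Accumulating such records across repeated cycles is what would supply the training data for the transition model envisioned in Phase 4.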
For investigational pharmacological interventions, relevant regulatory pathways may apply. However, the broader SteeraMed architecture is intended as an evidence-standardization framework across N-of-1 individualized interventions, not as a single-patient IND pathway.
4.8. Limitations
1. Study design and clinical evidence: All evaluation is retrospective and uses public GEO data; prospective clinical evaluation is needed. Known therapeutic drugs are included in the chemical pool used for ranking. This reflects SteeraMed's intended use case (ranking all FDA-approved compounds) but means the evaluation is not fully blind. Leave-one-drug-out analysis shows no single drug drives results. SteeraMed provides computational rationale, not evidence of clinical benefit.
2. Feature selection and scoring assumptions: The top-100 PPI modules used in scoring are selected based on one-sample t-tests against the case-control delta distribution, introducing feature selection bias intrinsic to the pipeline design. Leave-one-drug-out analysis shows removing any single positive drug reduces Recall-10 by only 0.5 percentage points on average. Additionally, the steerability alignment score uses a Welch-type contrast statistic comparing target vs. non-target gene methylation deltas within PPI modules, but genes within the same module are not independent (connected through PPI and potentially co-regulated). The contrast statistic is used as a ranking feature, not for inferential p-value interpretation. Its magnitude should not be interpreted as a calibrated effect size, and no multiple-comparison correction is applied to individual scores.
3. Methylation as a proxy for drug-target engagement: SteeraMed uses whole-blood promoter methylation deltas as its molecular-state representation, but the chain from DNA methylation to drug mechanism is long and indirect. (a) Promoter methylation is an imperfect proxy for gene expression: methylation-expression correlations are context-dependent and vary by gene, tissue, and CpG location. (b) Methylation deltas may reflect cell-type composition shifts rather than cell-intrinsic epigenetic changes, particularly in immune-mediated diseases like RA. (c) Whole blood has limited tissue relevance for breast cancer (tumor tissue), depression (brain), and aging (multi-tissue). (d) Drug targets are proteins, and protein activity is not directly captured by promoter methylation of the encoding gene. (e) The exploratory methylation-state concordance analysis (Figure 5A) showed that inflammation-related drug targets tend to show coordinated methylation shifts in RA blood, but this does not establish pharmacodynamic directionality, gene-expression changes, or drug-target engagement. Without matched gene-expression, protein, cell-composition, or target-engagement data, promoter methylation cannot establish whether a drug target is transcriptionally active, inhibited, or pharmacologically engaged.
4. Statistical assessment limitations: (a) The per-patient permutation test was applied to 20 of 354 RA patients (5.6%), with 7/20 reaching significance; the sampling rate limits generalizability. (b) Individual compound rankings showed moderate bootstrap stability (known RA drugs: up to 15% of resamples), consistent with high-dimensional biomarker selection instability, though aggregate Recall-10 remains stable. (c) The permutation test randomly samples (PPI-module, compound) pairs from all eligible pairs rather than recomputing alignment scores from shuffled case/control labels; a more conservative approach would recompute from shuffled labels but is computationally prohibitive (see Section 2.8).
5. Database coverage, confounds, and cross-disease comparability: Many established drugs (tamoxifen, anastrozole for BC; certain DMARDs for RA) are absent from STITCH or have insufficient target annotations. Multi-target drugs (e.g., piroxicam: 212 targets) may be preferentially ranked; target-count matched baselines should be investigated. Recall-K results for different diseases used slightly different matching strategies; a harmonized pipeline yielded complementary results for RA and BC (Supplementary Table S5). We did not adjust for blood cell-type composition; part of the SteeraMed signal may reflect disease-associated immune-cell shifts rather than cell-intrinsic methylation changes, particularly relevant for RA.
6. Depression-specific limitations: (a) Post-hoc sub-cohort selection: bootstrap was performed on the mid-age sub-cohort (36-55 years) identified from the same exploratory heterogeneity analysis used for validation, potentially inflating apparent stability. The bootstrap yielded 1.3x fold-enrichment (52% significant), substantially weaker than RA (3.1x) and BC (20.7x). This warrants independent replication with pre-specified age-stratified analysis. (b) Two of 17 depression positive-control drugs (fluvoxamine, clomipramine) have FDA primary approval for OCD rather than MDD; future sensitivity analysis excluding these would assess their impact.
7. No comparison with existing methods: Direct comparison with CMap/L1000 signature matching or network-based drug repurposing methods is beyond the scope of this paper, as SteeraMed uses a fundamentally different data modality (DNA methylation + PPI networks) than these methods (transcriptomics or chemical structures).
8. Signal attenuation experiment scope: The noise robustness comparison was conducted on RA only, using additive Gaussian noise. The relative advantage of PPI-module over single-gene representations may differ across diseases and under non-Gaussian noise. The PPI modules were pre-selected based on disease-relevant signal, which may favor PPI-module performance. Generalizability to other diseases and noise models requires independent evaluation.
9. Exploratory aging extension: The aging analysis used age-stratified groups as a phenotypic proxy, fundamentally different from the disease case-control design. The positive-control list (13 geroprotectors) is smaller and less established, and there is no true clinical ground truth for anti-aging interventions. The convergence of top-ranked compounds (niacin, colchicine) with published evidence constitutes literature-level validation, not prospective confirmation. The GSE40279 cohort is cross-sectional, precluding causal inference. This analysis should be interpreted as an exploratory extension demonstrating potential applicability beyond disease.
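The Recall-K metric and the leave-one-drug-out check referred to in limitations 1 and 2 can be illustrated with a short sketch. Recall-K is taken here as the fraction of patients with at least one positive-control drug among their top-K candidates, following the description in the roadmap above. How the study removes a drug is not specified in this section; the sketch assumes the drug is dropped from the positive-control set only and stays in the candidate pool. Function names and toy data are illustrative.

```python
def recall_at_k(rankings, positives, k=10):
    """Fraction of patients with >= 1 positive control in their top-k list.

    rankings: {patient_id: list of compounds, best first}
    positives: set of positive-control compounds
    """
    if not rankings:
        raise ValueError("rankings must contain at least one patient")
    hits = sum(1 for ranked in rankings.values()
               if positives & set(ranked[:k]))
    return hits / len(rankings)


def leave_one_drug_out(rankings, positives, k=10):
    """Drop in Recall-K when each positive drug is excluded in turn.

    Assumption: the excluded drug is removed from the positive-control
    set only, not from the ranked candidate lists.
    """
    full = recall_at_k(rankings, positives, k)
    return {
        drug: full - recall_at_k(rankings, positives - {drug}, k)
        for drug in sorted(positives)
    }


# Toy example with three patients and k = 2.
toy_rankings = {
    "p1": ["a", "b", "c"],
    "p2": ["c", "d", "a"],
    "p3": ["d", "e", "f"],
}
toy_positives = {"a", "c"}
print(recall_at_k(toy_rankings, toy_positives, k=2))
print(leave_one_drug_out(toy_rankings, toy_positives, k=2))
```

A small average drop across drugs, such as the 0.5 percentage points reported above, indicates that aggregate recall does not hinge on any single positive control.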
4.9. Future Directions
The greatest bottleneck for translating this core implementation into a calibrated world model for N-of-1 individualized intervention reasoning is not algorithmic; it is data. Current public repositories are dominated by cross-sectional disease-vs-control studies and lack the paired pre- and post-intervention molecular datasets needed to learn state-action-next-state transitions. This data gap mirrors the challenge faced by world models in AI, where progress required large-scale (state, action, next_state) datasets.
We call for three data infrastructure investments: (1) systematic collection of paired pre/post-treatment DNA methylation data across diverse interventions, feasible with current EPIC v2 array turnaround times; (2) saliva-based methylation as a scalable, non-invasive sampling modality suitable for remote data collection; and (3) open interventional methylome repositories. We estimate that 500-1,000 paired samples across 10-20 interventions would enable the first systematic evaluation of computational steerability predictions.
Additional technical directions include: multi-omics integration (incorporating RNA-seq and proteomics data to strengthen target engagement evidence), stronger null models (target-count matched baselines, label permutation with recomputed features, leave-one-patient-out feature selection), and real-time SteeraMed Core pipeline development (automated pipeline from methylation array to evidence chain report, deployable in clinical laboratories).