Submitted: 25 May 2026
Posted: 27 May 2026
Abstract

Keywords:
1. Introduction
1.1. Static Evaluation in a Dynamic Domain
1.2. Prior Work on Uncertainty and Sequential Reasoning
1.3. The Present Research
- A longitudinal trajectory construction framework for ICU sepsis patients with cumulative context windows and event-based temporal segmentation.
- A stochastic agentic reasoning pipeline that simulates sequential clinical decision-making under identical inputs with temperature-sampled decoding.
- Three novel metrics—Trajectory Divergence Score (TDS), Trajectory Entropy (TE), and Temporal Consistency Score (TCS)—that capture distinct aspects of trajectory instability.
- Empirical characterization of reasoning instability across 550 sepsis patients, demonstrating instability amplification, near-zero intervention agreement, and confidence-stability decoupling.
- Perturbation experiments on a 50-patient subset revealing patient-specific sensitivity to minimal input modifications.
2. Methods
2.1. Dataset and Cohort Definition
2.2. Trajectory Construction
- T0 (0h): Admission baseline
- T1 (2h): Early ICU assessment
- T2 (12h): Diagnostic evolution
- T3 (24h): Treatment escalation
- T4 (48h): Response phase
- T5 (72h): Outcome
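The six timepoints above, combined with the cumulative context windows mentioned in Section 1.3, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `Event` record, its field names, and the inclusive cutoff rule are assumptions introduced here.

```python
from dataclasses import dataclass

# Timepoint offsets in hours after ICU admission, as listed in Section 2.2.
TIMEPOINTS = {"T0": 0, "T1": 2, "T2": 12, "T3": 24, "T4": 48, "T5": 72}


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative only."""
    hours_since_admission: float
    kind: str   # e.g. "note", "lab", "vital"
    text: str


def cumulative_windows(events):
    """Map each timepoint label to every event observed up to that hour.

    Assumption: the cutoff is inclusive, so an event at exactly 2h is
    visible at T1.
    """
    ordered = sorted(events, key=lambda e: e.hours_since_admission)
    return {
        label: [e for e in ordered if e.hours_since_admission <= cutoff]
        for label, cutoff in TIMEPOINTS.items()
    }
```

Because each window is a superset of the previous one, the model at T3 sees everything it saw at T0 through T2 plus the events recorded up to 24h.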
2.3. Agentic Clinical Reasoning Model
- Clinical state summary: narrative summary of current patient condition
- Active problems: list of identified clinical problems
- Infection source: inferred site of infection
- Severity classification: low, moderate, high, or septic shock
- Recommended interventions: list of treatment recommendations
- Confidence: self-reported confidence score (0–100)
- Rationale: explanatory reasoning for the clinical assessment
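One way to hold the seven output fields listed above is a small validated record. The sketch below is an assumption about representation, not the schema used in the study: the class name `ReasoningStep` and all field names are hypothetical, and only the severity levels and the 0–100 confidence range come from the list.

```python
from dataclasses import dataclass

# Severity levels as listed in Section 2.3.
SEVERITY_LEVELS = ("low", "moderate", "high", "septic shock")


@dataclass
class ReasoningStep:
    """Hypothetical container for one structured model output."""
    state_summary: str
    active_problems: list
    infection_source: str
    severity: str
    interventions: list
    confidence: float
    rationale: str

    def __post_init__(self):
        # Reject values outside the ranges stated in the field list.
        if self.severity not in SEVERITY_LEVELS:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not 0 <= self.confidence <= 100:
            raise ValueError(f"confidence out of range: {self.confidence}")
```

Validating at construction time means a malformed generation fails loudly before it can enter any downstream metric.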
2.4. Stochastic Trajectory Sampling
2.5. Trajectory Divergence Score (TDS)
- Cosine distance between sentence embeddings of the state summary and rationale
- Jaccard distance between intervention sets
- Jaccard distance between active problem sets
- Categorical distance over severity and infection source
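The four component distances can be sketched in a few lines. The component weights and the exact combination rule are not reproduced in this text, so the equal weights and the weighted sum below are placeholders, and the dictionary keys are hypothetical. Embeddings are taken as plain lists of floats produced by any sentence encoder.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def categorical_distance(x, y, keys=("severity", "infection_source")):
    """Fraction of categorical fields on which the two outputs disagree."""
    return sum(x[k] != y[k] for k in keys) / len(keys)


# Placeholder weights (assumption): the published values are not shown here.
WEIGHTS = {"semantic": 0.25, "interventions": 0.25,
           "problems": 0.25, "categorical": 0.25}


def step_divergence(x, y, weights=WEIGHTS):
    """Weighted sum of the four component distances for two outputs."""
    parts = {
        "semantic": cosine_distance(x["embedding"], y["embedding"]),
        "interventions": jaccard_distance(x["interventions"], y["interventions"]),
        "problems": jaccard_distance(x["problems"], y["problems"]),
        "categorical": categorical_distance(x, y),
    }
    return sum(weights[name] * value for name, value in parts.items())
```

With weights that sum to 1, the result stays in [0, 1] whenever the embeddings have non-negative cosine similarity, which matches the range of the TDS values reported in the tables.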
2.6. Trajectory Entropy (TE)
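The definition of TE is not reproduced in this text. Given that the reference list cites Shannon's entropy and that reported TE values lie between 0 and 1, one plausible reading is a normalized Shannon entropy over the outputs of repeated runs. The sketch below illustrates that reading only; the choice of what is counted (here, arbitrary hashable labels) and the normalization by the number of runs are assumptions.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    Assumption: normalization divides by log2(n), the maximum entropy
    reached when all n runs produce distinct labels.
    """
    counts = Counter(labels)
    n = len(labels)
    if len(counts) <= 1:
        return 0.0  # all runs agree (or there is at most one run)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)
```

Under this reading, a value near 1 means that almost every sampled run produced a different output, and 0 means all runs agreed.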
2.7. Temporal Consistency Score (TCS)
- Infection coherence: an infection source that disappears and reappears without explanation
- Severity monotonicity: abrupt severity jumps (e.g., septic shock → low) without intermediate stages
- Intervention timing: vasopressors recommended at low severity, or no antibiotics/fluids at septic shock
- Diagnostic consistency: active problems that resolve and reappear without new clinical evidence
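Three of the checks above lend themselves to simple rule-based detectors. The sketch below only flags violations; how flags are aggregated into the TCS value is not shown in this text and is not attempted here. The function names, the one-level jump threshold, the keyword matching, and the reading of "no antibiotics/fluids" as "neither is recommended" are all assumptions.

```python
# Ordinal ranks for the severity levels listed in Section 2.3.
SEVERITY_RANK = {"low": 0, "moderate": 1, "high": 2, "septic shock": 3}


def abrupt_severity_jumps(severities, max_step=1):
    """Indices i where severity changes by more than max_step levels
    between timepoint i-1 and timepoint i."""
    ranks = [SEVERITY_RANK[s] for s in severities]
    return [i for i in range(1, len(ranks))
            if abs(ranks[i] - ranks[i - 1]) > max_step]


def reappearing_values(values):
    """Values that are replaced by something else and later return,
    e.g. an infection source following the pattern A, B, A."""
    dropped, flagged, previous = set(), [], None
    for value in values:
        if value != previous:
            if value in dropped and value not in flagged:
                flagged.append(value)
            if previous is not None:
                dropped.add(previous)
            previous = value
    return flagged


def timing_violation(severity, interventions):
    """True if vasopressors appear at low severity, or if neither
    antibiotics nor fluids appear at septic shock (keyword match)."""
    text = " ".join(interventions).lower()
    if severity == "low" and "vasopressor" in text:
        return True
    if severity == "septic shock" and "antibiotic" not in text and "fluid" not in text:
        return True
    return False
```

For example, `abrupt_severity_jumps(["septic shock", "low", "moderate"])` flags index 1, the direct drop from septic shock to low that the severity monotonicity check describes.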
2.8. Confidence Volatility (CV)
2.9. Perturbation Experiments
1. Remove note: deletion of one randomly selected clinical note
2. Remove lab: deletion of one randomly selected laboratory result
3. Reorder events: swap of two randomly selected events of the same type
4. Noise labs: Gaussian noise injection ( value) into numerical lab results
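The four perturbations can be sketched as pure functions over a list of event dictionaries. The event layout (`"type"` and `"value"` keys) is hypothetical, and the noise scale is a placeholder: the value used in the study did not survive in this text, so `rel_sigma=0.05` below is an assumption chosen only to make the example concrete.

```python
import copy
import random


def remove_one(events, kind, rng):
    """Drop one randomly chosen event of the given type ("note" or "lab")."""
    candidates = [i for i, e in enumerate(events) if e["type"] == kind]
    if not candidates:
        return list(events)
    drop = rng.choice(candidates)
    return [e for i, e in enumerate(events) if i != drop]


def reorder_same_type(events, rng):
    """Swap the positions of two randomly chosen events of the same type."""
    out = list(events)
    by_type = {}
    for i, e in enumerate(out):
        by_type.setdefault(e["type"], []).append(i)
    groups = [idx for idx in by_type.values() if len(idx) >= 2]
    if not groups:
        return out
    i, j = rng.sample(rng.choice(groups), 2)
    out[i], out[j] = out[j], out[i]
    return out


def noise_labs(events, rng, rel_sigma=0.05):
    """Add zero-mean Gaussian noise to numeric lab values.

    rel_sigma is a placeholder: the standard deviation is taken as a
    fraction of each value's magnitude.
    """
    out = copy.deepcopy(events)
    for e in out:
        if e["type"] == "lab" and isinstance(e.get("value"), (int, float)):
            e["value"] += rng.gauss(0.0, rel_sigma * abs(e["value"]))
    return out
```

Passing an explicit `random.Random(seed)` keeps each perturbation reproducible, so that any change in trajectory divergence can be attributed to the perturbation rather than to a different random draw.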
2.10. Experimental Setup
- RQ1: How stable are clinical reasoning trajectories under stochastic inference?
- RQ2: How does instability evolve over the course of sepsis progression?
- RQ3: Do small perturbations induce disproportionate trajectory divergence?
- RQ4: Is model confidence calibrated with reasoning stability?
3. Results
3.1. Data Reduction and Analysis Plan
3.2. RQ1: Trajectory Instability Under Stochastic Inference

3.3. RQ2: Temporal Evolution of Instability

3.4. Intervention Instability
3.5. RQ3: Perturbation Sensitivity
3.6. RQ4: Confidence-Stability Decoupling
4. Discussion
4.1. Revisiting the Central Question
4.2. Implications for Clinical AI Evaluation
4.3. Instability Amplification as a Dynamical Property
4.4. Confidence-Stability Decoupling
4.5. Patient-Specific Perturbation Sensitivity
4.6. Limitations
4.7. Future Directions
5. Conclusion
Funding
Conflicts of Interest
References
- Rudd, K.E.; Johnson, S.C.; Agesa, K.M.; et al. Global, regional, and national sepsis incidence and mortality, 1990–2017: analysis for the Global Burden of Disease Study. The Lancet 2020, 395(10219), 200–211.
- Singhal, K.; Azizi, S.; Tu, T.; et al. Large language models encode clinical knowledge. Nature 2023, 620(7972), 172–180.
- Yu, Y.; Gomez-Cabello, C.A.; Makarova, S.; et al. Using large language models to retrieve critical data from clinical processes and business rules. Bioengineering 2025, 12(1), 17.
- Yu, Y. Agentic AI in healthcare: Bridging the gap between computational promise and clinical evidence. Res. Sq. 2026.
- Yu, Y. Clinical reality vs. computational promise: Scoping review of agentic AI systems in healthcare. HAL. Sci. 2026, 1–51.
- Moor, M.; Banerjee, O.; Abad, Z.S.H.; et al. Foundation models for generalist medical artificial intelligence. Nature 2023, 619(7968), 266–273.
- Jin, D.; Pan, E.; Oufattole, N.; et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 2021, 11(14), 6421.
- Jin, Q.; Wang, Z.; Floudas, C.S.; et al. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019; pp. 2567–2577.
- Yu, Y. From prediction to agency: A constrained decision framework and governance stack for agentic AI in clinical diagnostics. Preprints.org. 2026, 7(23), 1–18.
- Topol, E.J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 2019, 25(1), 44–56.
- Yu, Y. Trustworthy LLM-embedding clinical prediction: Calibrating confidence and transparency for foundation model-based disease risk scores. Res. Sq. 2026.
- Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017; pp. 1321–1330.
- Yu, Y. Variability-aware trust in agentic clinical AI: Modeling and measuring trajectory-level uncertainty. Res. Sq. 2026, 1–27.
- Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 6402–6413.
- Yu, Y. From multipoint correlations to multi-step reasoning: A trajectory-based framework for agentic intelligence. Res. Sq. 2026, 1–28.
- Wang, X.; Wei, J.; Schuurmans, D.; et al. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press, 2018.
- Yu, Y. A reasoning pathway explanation framework for clinical AI: Methods and evaluation. Res. Sq. 2026, 1–22.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. (NeurIPS) 2017, 4765–4774.
- Yu, Y. A causal explanation framework for clinical AI: Methods and technical evaluation. HAL. Sci. 2026, 1–22.
- Johnson, A.E.W.; Bulgarelli, L.; Shen, L.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10(1), 1.
- Singer, M.; Deutschman, C.S.; Seymour, C.W.; et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016, 315(8), 801–810.
- Vincent, J.L.; Moreno, R.; Takala, J.; et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive Care Med. 1996, 22(7), 707–710.
- Croskerry, P. A universal model of diagnostic reasoning. Acad. Med. 2009, 84(8), 1022–1028.
- Yang, A.; Yang, B.; Zhang, B.; et al. Qwen2.5 technical report. arXiv 2025, arXiv:2412.15115.
- Lin, J.; Tang, J.; Tang, H.; et al. AWQ: Activation-aware weight quantization for LLM compression and acceleration. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- Kwon, W.; Li, Z.; Zhuang, S.; et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), 2023; pp. 611–626.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27(3), 379–423.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019; pp. 3982–3992.
- Yu, Y. Backcasting the trust gap: A strategic road map for clinician adoption of AI diagnostics by 2040. J. Med. Internet Res. 2026, 28, e94234.
- Yu, Y. C-RLM: Schema-enforced recursive synthesis for auditable, long-context clinical documentation. medRxiv 2026.
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
- Strogatz, S.H. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering, 2nd ed.; CRC Press, 2018.


| Metric | Mean | SD | Min | Max | Median |
|---|---|---|---|---|---|
| TDS | 0.558 | 0.035 | 0.436 | 0.655 | 0.556 |
| TE | 0.920 | 0.020 | 0.865 | 0.960 | 0.929 |
| TCS | 0.918 | 0.041 | 0.767 | 1.000 | 0.920 |
| CV | 9.12 | 3.15 | 2.67 | 19.51 | 8.89 |

| Metric | T0 | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|---|
| TDS | 0.483 ± 0.095 | 0.517 ± 0.087 | 0.575 ± 0.048 | 0.586 ± 0.046 | 0.593 ± 0.045 | 0.593 ± 0.046 |
| TE | 0.920 ± 0.043 | 0.921 ± 0.043 | 0.915 ± 0.048 | 0.920 ± 0.043 | 0.921 ± 0.043 | 0.920 ± 0.043 |
| Confidence | 31.6 ± 23.7 | 32.3 ± 20.7 | 55.5 ± 25.3 | 65.0 ± 23.0 | 66.3 ± 21.3 | 66.0 ± 23.2 |

| Perturbation | Mean TDS | ΔTDS vs. baseline | Amplified (%) |
|---|---|---|---|
| Baseline (unperturbed) | 0.561 | — | — |
| Remove note | 0.558 | −0.003 | 48% |
| Remove lab | 0.556 | −0.005 | 52% |
| Reorder events | 0.558 | −0.003 | 48% |
| Noise labs | 0.556 | −0.005 | 52% |
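
The third column of the perturbation table is consistent with a simple difference between each perturbed mean TDS and the unperturbed baseline. The short check below recomputes it from the tabulated means; the variable names are illustrative, and the interpretation of the column as "perturbed minus baseline" is inferred from the numbers rather than stated in this text.

```python
# Mean TDS values copied from the perturbation table.
baseline = 0.561
perturbed = {
    "Remove note": 0.558,
    "Remove lab": 0.556,
    "Reorder events": 0.558,
    "Noise labs": 0.556,
}

# Difference from baseline, rounded to the table's three decimals.
deltas = {name: round(value - baseline, 3) for name, value in perturbed.items()}

# Largest shift expressed as a fraction of the baseline mean.
max_relative_shift = max(abs(d) for d in deltas.values()) / baseline

for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
print(f"largest shift relative to baseline: {max_relative_shift:.1%}")
```

The recomputed differences match the table, and the largest mean shift is under 1% of the baseline, so the mean alone does not separate the perturbed conditions from the unperturbed one.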

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).