Submitted: 13 June 2026
Posted: 15 June 2026
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Reporting and Ethics
2.2. The Benchmark
2.3. The Verifier
2.4. Comparison Methods
2.5. Metrics and Statistical Analysis
3. Results
3.1. Detection, Soundness, and Cost
3.2. The Blind Spot of Sampling, by Depth
3.3. Human-Expert Panel
3.4. Counterexample Case Studies
3.5. Application to Published Clinical Guidelines
4. Discussion
4.1. Principal Findings
4.2. The Distinguishing Axis is the Class of Evidence
4.3. A Tiered Assurance Framework
4.4. Comparison with Prior Work
4.5. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References

| Method | Detection % [95% CI] | False alarm % [95% CI] | Witness validity % | Abstentions | Unsound | Mean time (s/item) |
|---|---|---|---|---|---|---|
| SMT verification | 100.0 [99.4, 100.0] | 0.0 [0.0, 1.7] | 100.0 | 0 | 0 | 0.003 |
| Unit testing (1000 samples) | 93.0 [90.7, 94.7] | 0.0 [0.0, 1.7] | 100.0 | 0 | 43 | 0.004 |
| Language-model judge (frontier, GPT-5.5) | 100.0 [99.4, 100.0] | 0.0 [0.0, 1.7] | 100.0 | 0 | 0 | 5.9 |
| Language-model judge (open-weights, Qwen3-8B) | 98.4 [97.0, 99.1] | 5.0 [2.8, 8.7] | 93.1 | 0 | 17 | 30.2 |
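
The bracketed 95% intervals in the table above are the kind of bounds a Wilson score interval gives for a binomial proportion. The sketch below shows that calculation under stated assumptions: this excerpt does not say which interval the authors used, and it does not give the item counts, so `wilson_interval` and the counts in the usage note are hypothetical illustrations rather than the paper's procedure or data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a 95% interval.
    The function name and signature are illustrative assumptions.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")

    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

With hypothetical counts of 11 events in 220 items, `wilson_interval(11, 220)` returns roughly `(0.028, 0.087)`. The interval stays informative at the extremes too: `wilson_interval(0, 220)` has an upper bound near 0.017 rather than collapsing to zero width, which is why a method with no observed false alarms can still carry a non-zero upper bound.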

| Method | depth ≤4 | depth 6 | depth 8 | depth 10 | depth 12 |
|---|---|---|---|---|---|
| SMT verification | 100 | 100 | 100 | 100 | 100 |
| Unit testing (1000 samples) | 100 | 56.5 | 100 | 50.0 | 35.7 |
| Language-model judge (frontier, GPT-5.5) | 100 | 100 | 100 | 100 | 100 |
| Language-model judge (open-weights, Qwen3-8B) | 100 | 95.2 | 85.7 | 85.7 | 78.6 |
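
To show why a fixed budget of random tests can miss violations that sit behind many nested conditions, here is a minimal toy sketch. It is not the paper's benchmark or verifier. The rule `is_unsafe` is hypothetical and fires only when all `depth` binary conditions hold. Exhaustive enumeration stands in for symbolic search, which it is not, and the model is not meant to reproduce the figures in the table above.

```python
import itertools
import random


def is_unsafe(bits: tuple[int, ...]) -> bool:
    """Hypothetical toy rule: unsafe only when every nested condition is true."""
    return all(bits)


def sample_for_violation(depth: int, n_samples: int, rng: random.Random):
    """Uniform random testing: return a violating input if one is drawn, else None."""
    for _ in range(n_samples):
        bits = tuple(rng.randint(0, 1) for _ in range(depth))
        if is_unsafe(bits):
            return bits
    return None


def search_for_violation(depth: int):
    """Exhaustive search over all 2**depth inputs; finds a violation whenever one exists."""
    for bits in itertools.product((0, 1), repeat=depth):
        if is_unsafe(bits):
            return bits
    return None


def hit_probability(depth: int, n_samples: int) -> float:
    """Chance that uniform sampling hits the single violating input at least once."""
    return 1.0 - (1.0 - 2.0 ** -depth) ** n_samples


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs behave the same
    for depth in (4, 8, 12):
        sampled = sample_for_violation(depth, 1000, rng)
        print(depth, sampled is not None, round(hit_probability(depth, 1000), 3))
```

In this toy model the chance that 1000 uniform samples hit the single violating input is essentially 1 at depth 4 but only about 0.22 at depth 12, while `search_for_violation` returns the violating input at every depth. Real rule sets are not uniform binary trees, so the numbers only convey the general trend.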
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.