Which Decisions Live in the Provable Layer? Formally Verified Safety Constraints for Agentic Clinical AI, with a Whole-Person Longitudinal Benchmark

Sanjay Basu; Parth Sheth; Bhairavi Muralidharan; John Morgan; Rajaie Batniji

doi:10.20944/preprints202606.1091.v1

Submitted: 13 June 2026

Posted: 15 June 2026

Abstract

Agentic clinical artificial-intelligence systems act across successive encounters and across both medical and social domains of care, and their safety is enforced by statistical guardrails that report a pass rate over a sampled test set, not a guarantee over the input space. This study formalises that gap and releases CIV-Bench: 832 clinical rule sets and state machines with safety properties across eight whole-person domains, in single-encounter and longitudinal forms, plus a computational stress tier, with independent ground truth. Satisfiability-modulo-theories (SMT) verification is compared with the methods used in practice: random unit testing, language-model judges spanning an open-weights and a frontier model, and a blinded physician panel. Verification was sound and complete, detecting all 612 violations with no false alarm and no unsound verdict. The frontier judge matched its detection but gave a sample statistic with no coverage guarantee at far higher cost; the open-weights judge produced seventeen unsound verdicts on the stress tier; and unit testing and the physician panel failed on the deep, cross-encounter properties of longitudinal care. Formal verification is distinguished not by a higher detection rate but by the class of evidence it returns: a proof over the whole input space, a replayable counterexample, or an explicit abstention. The guarantee holds by construction, conditional on the specification.

Keywords: formal verification; safety constraints; clinical decision support; large language models; AI safety; satisfiability modulo theories; agentic systems; longitudinal care; trustworthy AI; benchmark

Subject: Medicine and Pharmacology - Medicine and Pharmacology

1. Introduction

Large language models embedded in clinical workflows for documentation, triage, and decision support produce two recognised categories of safety failure. The first is commission, the confident assertion of unsupported clinical information; the second is omission, the failure to surface safety-critical workup, contraindications, or higher-acuity differentials [1]. Such models are vulnerable to adversarial hallucination in clinical decision support [1], healthcare-specific guardrails remain an area of need and difficulty [2], and the assurance of clinical AI systems has been established as a discipline in its own right [3]. To mitigate these failures, deployed systems combine overlapping guardrail components: a clinical-safety system prompt, a learned input-output safeguard such as a fine-tuned classifier [4], programmable dialogue rails [5], and retrieval grounding. Each of these components is itself a statistical system. Learned guardrails are bypassable, with character-injection and adversarial techniques reducing detection accuracy by tens of percentage points [6], and a systematisation of jailbreak guardrails reports no single guardrail robust across attack families [7].

As these systems move from single-turn assistants to agents that carry state across a course of care, the object to be assured changes. Primary care is not a single episode but a sequence of encounters, the setting of longitudinal care management, in which an obligation raised at one visit must persist until it is resolved, a stated patient preference must be honoured at the next contact, and a documented social or behavioural need must not be silently dropped. Safety in this setting is a property of executions over time, not of a single response, and the social and behavioural elements of care, which include mental health, substance use, unmet social needs, and the maintenance of patient trust, are as consequential to outcomes as the choice of diagnosis. An assurance method that samples single-encounter inputs does not test the properties that hold a course of care together.

The recurring objection to any study of this kind is that the next model release will close the gap. Three independent results make that unlikely for the model class. A computability argument shows that for any computable language model there exists a computable ground-truth function on which it errs, by diagonalisation, with hallucination eliminable only for restricted, computably enumerable problem classes [8]. A statistical argument shows that a calibrated model has a hallucination rate lower-bounded by the fraction of facts seen once in training, so calibration and factuality are in tension on rare facts [9]; in a longitudinal record the facts that matter, such as a patient’s current medication or a specific prior reaction, are by their nature such singletons. A peer-reviewed incentive argument reduces generative error to binary classification error and shows that benchmarks that penalise abstention reward confident guessing [10]; a further undecidability argument has been advanced and is treated here as supporting context only [11]. Reasoning limits compound where clinical logic lives: fixed-depth log-precision transformers are contained in the circuit class TC^0 and cannot express certain compositional problems regardless of scale [12], a chain of thought relaxes but does not remove this limit for a bounded number of steps [13], and compositional accuracy decays empirically with problem depth [14]. Verifying the trained network instead is intractable at scale, since deciding a reachability or robustness property of a rectified-linear network is worst-case NP-complete [15], and exact verification does not reach the scale of the largest models.

Two boundaries frame the design space. On one side, every non-trivial semantic property of the partial function a program computes is undecidable for arbitrary programs (Rice’s theorem) [16], and exact neural-network verification is worst-case intractable [15]. A benchmark’s finite rule sets are not arbitrary programs, which is why their properties are decidable; the boundary marks the general case the symbolic layer is chosen to avoid. On the other side, a finite and explicit rule layer over finite-domain variables is a decidable object whose properties a satisfiability-modulo-theories solver decides, returning a proof over the entire space or a counterexample [17]. The guaranteed-safe AI programme articulates this macro-architecture, a world model, a safety specification, and a verifier that emits an auditable proof certificate [18,19], and recent systems have applied formal and runtime methods to language-model agents. A 2026 systematic review of formal methods for safety-critical machine learning reaches a consonant conclusion from the verification side [20]. What this literature does not yet contain is a public benchmark of clinical safety invariants that spans medical and social-behavioural care and both single-encounter and longitudinal properties, with independently established ground truth; a direct comparison of SMT verification against the probabilistic methods on such rule sets; or a characterisation of where each method fails. This paper supplies these elements through four contributions. It formalises the assurance gap; it introduces the CIV-Bench benchmark, which has a longitudinal and whole-person scope and is released for independent reproduction; it reports the comparative measurement; and it proposes a tiered evaluation framework for autonomous clinical systems that assigns content-safety guardrails, runtime monitors, verified rule layers, and theorem-proved cores to distinct evidence classes. The guarantee claimed is bounded. 
It consists of machine-checked proofs of stated invariants over the full input space of a symbolic layer, with counterexamples otherwise, conditional on the specification, whose validity is itself the residual risk [21,22]. Accordingly, this work makes no claim of provably safe artificial intelligence.

2. Materials and Methods

2.1. Reporting and Ethics

The study is a methods-and-benchmark evaluation; it reports computational experiments on synthetic rule sets and a head-to-head comparison of detection methods, including a blinded human-expert panel. No reporting guideline for predictive-model development applies directly; the evaluation follows a benchmark-and-comparison protocol pre-registered in the repository before any baseline was run, and departures from it are recorded. The human-expert panel was conducted under WCG IRB Tracking ID 20253751, as methods research on synthetic, de-identified decision logic with no patient data, with a waiver of informed consent. The benchmark is synthetic and involves no human subjects and no patient data.

2.2. The Benchmark

CIV-Bench is a set of machine-readable items, each comprising a rule set, a safety property over that rule set, and metadata, and each validating against a published JSON schema. A rule set is either a decision rule set, a total function from finite-domain inputs to outputs under priority or first-match conflict resolution, or a finite labelled transition system with events, guards, and updates that models a course of care across encounters. Properties are invariants, reachability conditions, mutual-exclusion constraints, monotonicity constraints, or linear-temporal properties.

The benchmark covers eight whole-person clinical domains plus a computational stress tier in 832 items (612 violated, 220 holds). A medical set establishes the within-encounter properties. Partial-context triage encodes that marking a critical datum unknown must not lower assigned acuity (a monotonicity property formalising that missing data is not reassuring). Medication safety encodes that a contraindicated pair is never jointly recommended (mutual exclusion) and that no higher-priority rule re-enables a drug a safety guard has disabled (an invariant). Workup completeness and differential breadth encode that a sentinel presentation’s mandatory workup, and a red-flag presentation’s can’t-miss differential, are not omitted under rule interaction. A social and behavioural set establishes the properties specific to whole-person care: a self-harm indicator must always escalate and never be out-ranked by a routine task; a documented unmet social need, a stated treatment preference, and an opioid-use-disorder maintenance therapy must each persist as an open obligation across successive encounters until resolved by an appropriate event, never silently dropped. The social and behavioural properties are encoded both as single-encounter decision invariants and as longitudinal transition systems over a chain of encounters; the domain categories follow goal categories used in operational care-management systems. Items are graded by interaction depth, the number of conditions that must coincide for a within-encounter violation, and by encounter depth, the number of encounters a longitudinal violation must span.
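The partial-context triage property can be made concrete with a small sketch. The rule set, the acuity scale, and the input names below are hypothetical and are not taken from CIV-Bench; the check enumerates every known input alongside its "unknown" perturbation, which is the two-copy idea in explicit form.

```python
from itertools import product

# Hypothetical ordered acuity scale (not taken from CIV-Bench).
ACUITY = {"routine": 0, "urgent": 1, "emergent": 2}

def triage_faulty(spo2, chest_pain):
    """First-match rule set; the bug is that an 'unknown' SpO2 falls through."""
    if spo2 == "low":
        return "emergent"
    if chest_pain:
        return "urgent"
    return "routine"

def triage_fixed(spo2, chest_pain):
    """Treat an unknown SpO2 conservatively, as if it were low."""
    if spo2 in ("low", "unknown"):
        return "emergent"
    if chest_pain:
        return "urgent"
    return "routine"

def monotonicity_violations(triage):
    """Compare each known input with the copy whose SpO2 is marked unknown."""
    found = []
    for spo2, chest_pain in product(["normal", "low"], [False, True]):
        known = triage(spo2, chest_pain)
        masked = triage("unknown", chest_pain)
        if ACUITY[masked] < ACUITY[known]:  # missing data lowered the acuity
            found.append(((spo2, chest_pain), known, masked))
    return found

print(monotonicity_violations(triage_faulty))
print(monotonicity_violations(triage_fixed))
```

In the faulty rule set, hiding a low oxygen saturation drops the patient from emergent to a lower tier, which is exactly the "missing data is not reassuring" failure the property is meant to exclude.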

A labelled computational stress tier is included to probe the methods at the boundary of what reasoning, as opposed to solving, can decide: rule sets in which a single contraindication override is buried among many distractor rules, and integer-feasibility items that ask whether any in-range combination of physiologic parameters reaches a contraindicated recommendation. These items are reported as a stress test rather than as clinically typical; on pilot evaluation a frontier reasoning model solved them, so no claim is made that they defeat such a model.

Ground truth is established independently of the system under test. For a decision rule set the input space is enumerated exhaustively; for a transition system the reachable states are explored by breadth-first search to a bound exceeding the diameter of the finite state space. Neither uses a solver. A violated item carries a witness, a concrete input or event sequence, confirmed by executing the rule set on it and observing the violated property; this concrete replay is what makes a benchmark that scores verifiers non-circular. The generators are deterministic, with seeds derived from a stable hash; the released benchmark is the version-controlled suite.
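A minimal sketch of such a solver-free oracle for a decision rule set follows. The domains, rule names, and the medication property are illustrative assumptions, not benchmark content; the point is only that exhaustive enumeration plus concrete replay yields a witness without any solver in the loop.

```python
from itertools import product

# Hypothetical finite input domains.
DOMAINS = {
    "anticoagulated": [False, True],
    "pain": [0, 1, 2, 3],
    "nsaid_allergy": [False, True],
}
DEFAULT = {"nsaid": False}

# Rules in priority order, first match wins. The first rule is the seeded bug:
# it outranks the anticoagulation guard and re-enables the drug.
RULES = [
    ("severe_pain_override", lambda x: x["pain"] == 3 and not x["nsaid_allergy"], {"nsaid": True}),
    ("anticoag_guard", lambda x: x["anticoagulated"], {"nsaid": False}),
    ("allergy_guard", lambda x: x["nsaid_allergy"], {"nsaid": False}),
    ("moderate_pain", lambda x: x["pain"] >= 2, {"nsaid": True}),
]

def run(rules, x):
    """Execute the rule set concretely on one input."""
    for _name, guard, out in rules:
        if guard(x):
            return out
    return DEFAULT

def prop(x, out):
    """Safety property: never recommend the NSAID to an anticoagulated patient."""
    return not (x["anticoagulated"] and out["nsaid"])

def oracle(rules):
    """Enumerate the whole input space and collect every violating input."""
    names = list(DOMAINS)
    witnesses = []
    for values in product(*(DOMAINS[n] for n in names)):
        x = dict(zip(names, values))
        if not prop(x, run(rules, x)):
            witnesses.append(x)
    return ("violated", witnesses) if witnesses else ("holds", [])

FIXED = [RULES[1], RULES[0]] + RULES[2:]  # guard moved above the override
print(oracle(RULES))
print(oracle(FIXED))
```

Replaying a returned witness through `run` and `prop` is the confirmation step described above; enumeration of this kind is feasible only because the domains are small and finite.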

2.3. The Verifier

The verifier compiles a rule set to satisfiability-modulo-theories constraints with Z3 [17]. A decision rule set compiles to one output term per output variable, a nested conditional over the rules in resolution order. An invariant is checked by asserting the negation of the property over the input-domain constraints: an unsatisfiable result is a proof that the property holds over the full input space, a satisfiable result is a counterexample model rendered as a concrete input. Reachability, mutual exclusion, and monotonicity compile to this form, monotonicity by a two-copy encoding over the inputs and a perturbed copy. A transition system is checked by bounded model checking [23], unrolling the transition relation to the bound and asking whether a reachable state violates the property; a k-induction step [24] is attempted for an unbounded result and is time-boxed. A solver result of unknown is reported as an abstention, never as a proof of safety. Every counterexample is replayed by concrete execution to confirm it exhibits the violation.
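The three-valued verdict discipline for transition systems can be illustrated without a solver. The sketch below is an explicit-state stand-in written for this illustration, not the Z3 encoding the verifier uses: it unrolls a hypothetical social-need state machine to a bound, returns a trace when a bad state is reached, returns holds only when the reachable set has reached a fixpoint (a simpler substitute for the k-induction step, not an implementation of it), and otherwise abstains.

```python
EVENTS = ("document_need", "routine_visit", "resolve")
INIT = ("none", 0)  # (obligation status, routine visits since documentation)

def step_faulty(state, event):
    status, n = state
    if event == "document_need" and status == "none":
        return ("open", 0)
    if event == "resolve" and status == "open":
        return ("resolved", 0)
    if event == "routine_visit" and status == "open":
        # Bug: a maintenance path closes the need after three routine visits.
        return ("dropped", 0) if n + 1 >= 3 else ("open", n + 1)
    return state

def step_fixed(state, event):
    status, _n = state
    if event == "document_need" and status == "none":
        return ("open", 0)
    if event == "resolve" and status == "open":
        return ("resolved", 0)
    return state  # routine visits never close an open need

def bad(state):
    """The obligation was closed without a resolution event."""
    return state[0] == "dropped"

def check(step, bound):
    """Return ('violated', trace), ('holds', None) or ('unknown', None)."""
    seen = {INIT}
    frontier = [(INIT, ())]
    for _depth in range(bound):
        nxt = []
        for state, trace in frontier:
            for ev in EVENTS:
                s2, t2 = step(state, ev), trace + (ev,)
                if bad(s2):
                    return "violated", t2
                if s2 not in seen:
                    seen.add(s2)
                    nxt.append((s2, t2))
        if not nxt:
            return "holds", None  # fixpoint: every reachable state was checked
        frontier = nxt
    return "unknown", None  # bound exhausted before a fixpoint: abstain

print(check(step_faulty, 10))
print(check(step_faulty, 3))
print(check(step_fixed, 10))
```

With a bound of three the faulty machine yields unknown rather than holds, because the violating trace needs four events; reporting that case as safe would be the unsound verdict the abstention rule exists to prevent.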

In this paper, SMT verification has a precise meaning: on the finite decision-rule fragment the procedure is a decision procedure, always returning a decisive proof or counterexample; on transition systems it is bounded model checking with a time-boxed k-induction step, so it is abstention-capable rather than complete, returning unknown when it cannot decide within its resource bound. Its soundness rests on the SMT encoding faithfully refining the rule-set semantics; the encoding is trusted code, validated empirically by agreement with the independent oracle on every item and by replaying every counterexample. The benchmark variables are boolean, ordered enum, and bounded integer, which keeps each item decidable; a real clinical rule layer that uses real-valued thresholds, doses, or time intervals would be encoded in linear-real or bit-vector theories that remain decidable but enlarge the state space, and open-vocabulary inputs fall outside the symbolic layer entirely and are not the object of this verification. The safety properties were specified by a physician author; they are illustrative invariants, and real triage routing, contraindications, and preference exceptions are frequently conditional rather than absolute, which the specification must capture and which is part of the residual specification risk discussed in Section 4.5.

2.4. Comparison Methods

Each method receives the identical items and returns holds, violated, or, for the verifier, unknown. The unit-test method draws one thousand uniform random inputs or random event sequences per item with a fixed seed and returns violated if any sampled point violates the property, else holds; it represents current quality-assurance practice. Reasoning models spanning a capability range act as judges, an open-weights model run locally and a frontier proprietary model (the strongest available at the time of evaluation): each is presented a clean rendering of the rule set and property with neutralised rule and transition identifiers and no item identifier, so that no label or difficulty information leaks, and returns a structured verdict and, for a violation, a witness; one independent query per item, no access to ground truth, with renderings and per-item outputs archived. The named models are reported in the released results; the argument does not depend on the model version, because the absence of a soundness guarantee is a property of learned, sampling-based detection rather than of any particular model. A blinded panel of three board-certified physicians (one each in internal medicine, emergency medicine, and hospital medicine) audited a 30-item clinician-readable subset of the benchmark, balanced across the whole-person domains and across depth, under the same neutral-identifier protocol, independently and using clinical reasoning only; it is reported as a complementary human baseline on a matched subset rather than as a row in the 832-item automated head-to-head. The two content-safety guardrails positioned as runtime monitors require, respectively, a model backend and gated model weights; they were not run in this environment, which is recorded as such, and their results were never imputed.
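Why a fixed budget of random samples loses deep violations can be seen in a toy version of the unit-test method. The item below is hypothetical: it has a single violating input that requires every one of `depth` inputs to take one specific value, and the domain size of four is an assumption chosen for illustration, so the probabilities are not the paper's measured rates.

```python
import random

def violates(x):
    """Hypothetical property: violated only when every input equals 0."""
    return all(v == 0 for v in x)

def unit_test(violates_fn, depth, domain_size=4, n_samples=1000, seed=0):
    """Draw uniform random inputs with a fixed seed; report the first violation."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = tuple(rng.randrange(domain_size) for _ in range(depth))
        if violates_fn(x):
            return "violated", x
    return "holds", None  # a statement about the sample, not the input space

def miss_probability(depth, domain_size=4, n_samples=1000):
    """Chance that all samples miss the single violating input."""
    return (1 - domain_size ** -depth) ** n_samples

for depth in (2, 6, 12):
    print(depth, unit_test(violates, depth)[0], round(miss_probability(depth), 4))
```

Under these assumptions the miss probability is negligible at depth two, roughly 0.78 at depth six, and above 0.9999 at depth twelve, so a holds verdict from the sampler carries less and less information as the number of coinciding conditions grows.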

2.5. Metrics and Statistical Analysis

The primary endpoint is the detection rate, the proportion of seeded violations a method returns as violated. Secondary endpoints are the false-alarm rate on items that hold, the witness validity (the proportion of returned counterexamples that replay to a concrete violation), the abstention rate, the count of unsound verdicts (a decisive verdict contradicting ground truth), detection by interaction and encounter depth, and wall-clock time per item. Proportions are reported with 95% Wilson score confidence intervals [25]; the pre-specified inferential test is a paired McNemar exact comparison of SMT verification against unit testing on per-item detection of violations [26]. Inter-rater agreement on the physician panel is reported with Fleiss kappa [27]. Analysis is computed from the raw per-item outputs by a single script and contains no hand-entered number. Generative AI was used in preparing this work: the analysis and verification code and a first draft of the text were produced with a large language model under the authors’ direction, and the authors reviewed and edited all output and take responsibility for it; all reported numbers derive from the committed code and outputs.
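The two named statistics are short enough to sketch directly. The functions below are a standard-library illustration, not the released analysis script; the counts passed to them in the usage lines are the ones stated in Section 3.

```python
from math import comb, sqrt

Z95 = 1.959963984540054  # two-sided 95% standard-normal quantile

def wilson(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def mcnemar_exact(b, c):
    """Two-sided exact binomial test on the discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def pct(interval):
    return tuple(round(100 * v, 1) for v in interval)

print(pct(wilson(612, 612)))   # detection, 612 of 612
print(pct(wilson(0, 220)))     # false alarms, 0 of 220
print(pct(wilson(569, 612)))   # unit-test detection, 569 of 612
print(mcnemar_exact(43, 0))    # 43 discordant pairs, all in one direction
```

With 43 discordant pairs all favouring one method, the exact two-sided P value reduces to 2 x 0.5^43, about 2.3 x 10^-13.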

3. Results

3.1. Detection, Soundness, and Cost

On the 832-item benchmark, SMT verification detected all 612 violations (100%, 95% CI 99.4 to 100.0), raised no false alarm on the 220 holds items (0%, 95% CI 0.0 to 1.7), abstained on none, and returned no verdict that contradicted ground truth; every counterexample replayed to a concrete violation (witness validity 100%), at a mean 3 ms per item (Table 1). The 95% upper bound on its false-safe rate is below 1%. The frontier reasoning model acting as a judge also detected all 612 violations (100%, 95% CI 99.4 to 100.0) with no false alarm and 100% witness validity, and returned no false-safe verdict on this benchmark, including the computational stress tier, at a mean 5.9 seconds per item, three orders of magnitude above the verifier. An open-weights judge (Qwen3-8B, run locally) detected 602 of 612 (98.4%, 95% CI 97.0 to 99.1) but was unsound, with 17 decisive errors (6 false-safe and 11 false alarms) and 93.1% witness validity; all 17 errors fell on the computational stress tier, and its detection declined with depth, while on the eight realistic whole-person domains it matched the frontier exactly (528 of 528, no false-safe, no false alarm). Random unit testing detected 569 of 612 violations (93.0%, 95% CI 90.7 to 94.7) with no false alarm but returned 43 false-safe verdicts.

3.2. The Blind Spot of Sampling, by Depth

Considered as a failure analysis of the assurance methods, unit testing returned 43 false-safe verdicts spanning both tiers: 27 of 528 realistic longitudinal violations, where a documented social need, treatment-preference continuity, mental-health escalation persistence, or opioid-use-disorder maintenance therapy is dropped only after a specific sequence of encounters, and 16 of 84 computational stress violations. Detection by the unit-test method was complete through interaction depth four and declined with depth thereafter, to 56.5% at depth six (95% CI 44.1 to 68.1) and 35.7% at depth twelve (95% CI 16.3 to 61.2), with an identified logistic depth slope of -0.59 (95% CI -0.71 to -0.46); verification and the frontier judge did not vary with depth, whereas the open-weights judge also declined with depth (Table 2, Figure 1). In the pre-specified paired comparison, verification detected 43 violations the unit-test method missed and the reverse never occurred (McNemar exact P = 2.3 x 10^-13). Verification and the frontier judge were perfect on both the realistic and the computational stress tiers; the unit-test method and the open-weights judge failed on the stress and deep items, while the realistic single-encounter domains were detected by every method.

3.3. Human-Expert Panel

On a 30-item clinician-readable subset drawn from the benchmark (20 violated, 10 holds), the blinded physician panel detected 16 of 20 violations by majority vote, with one false alarm among the 10 holds items and Fleiss kappa 0.22 across the three raters. Panel detection was complete through interaction depth three, 3 of 4 at depth four, and 1 of 4 at depth six: human experts, like random testing, miss deep interactions, and unlike an automated verifier they cannot examine the entire input space. On the identical 30 systems, SMT verification and the language-model judge each detected all 20 violations and random testing detected 18, locating the human panel below all three automated detectors on the matched subset.
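For readers who want to recompute the agreement statistic on their own ratings, a minimal sketch of Fleiss kappa follows. The two rating tables are made up for illustration and are not the panel's data.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

# Made-up ratings: four items, three raters, categories (violated, holds).
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(unanimous), fleiss_kappa(split))
```

Unanimous ratings give a kappa of 1, while the evenly split table gives a negative value, meaning agreement below what the marginal proportions alone would produce.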

3.4. Counterexample Case Studies

Two counterexamples illustrate the longitudinal mechanism (full traces in the repository). In the first, a social need is documented at an early encounter and a faulty maintenance path clears the open obligation after several subsequent encounters without a resolution event; the documented need is then recorded as closed though it was never addressed. SMT verification returns this event-sequence counterexample and proves the corrected rule set has no such trace by k-induction; the unit-test method returns holds, a false statement of safety, because random event sequences essentially never reproduce the exact closing sequence. In the second, within a single encounter, a medication rule set recommends a contraindicated drug for an anticoagulated patient when a higher-priority rule re-enables it past a safety guard, a mechanism mirroring the documented association of a non-selective anti-inflammatory with an anticoagulant and an increase in haemorrhage-related hospitalisation; verification, the frontier judge, and random testing return violated on this shallow item, and verification additionally proves that no other input recommends the contraindicated combination, a guarantee over the full input space that the detection-only methods do not provide.

3.5. Application to Published Clinical Guidelines

To test the method on clinical logic not authored for this benchmark, we encoded the decision logic of three published guidelines across specialties and verified each. For every guideline, verification proved the faithful encoding satisfies its safety property over the entire finite input space, and on an encoding carrying a realistic transcription error it returned a concrete, replayable patient counterexample; in no case is the published guideline itself in error.

First, the 2021 American Academy of Pediatrics guideline for the well-appearing febrile infant aged 8 to 60 days [28]: an infant 21 days or younger is admitted for full evaluation, and an infant of any age in range with a high-risk inflammatory marker (temperature above 38.5 degrees Celsius, C-reactive protein above 20 mg/L, absolute neutrophil count above 4000 per cubic millimetre, or procalcitonin at or above 0.5 ng/mL) or a positive urinalysis is not discharged. Over all 1,696 age-and-marker combinations the faithful encoding was proved safe; an encoding whose discharge guard omitted the absolute-neutrophil-count criterion yielded a counterexample, a 22-day-old with an ANC above 4000 and otherwise normal markers discharged home.
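The febrile-infant check can be sketched by exhaustive enumeration. The encoding below is an assumption reconstructed from the prose above: age in whole days from 8 to 60 and five boolean flags, which gives 53 x 32 = 1,696 combinations and so coincides with the reported count, although the repository encoding may differ. The disposition labels are hypothetical, and the logic is a simplification of the guideline, not a clinical tool.

```python
from itertools import product

AGES = range(8, 61)  # age in days, inclusive (assumed discretisation)
MARKERS = ("temp_high", "crp_high", "anc_high", "pct_high", "ua_positive")

def faithful(age, m):
    if age <= 21:
        return "admit"
    if m["temp_high"] or m["crp_high"] or m["anc_high"] or m["pct_high"] or m["ua_positive"]:
        return "evaluate"  # hypothetical label for any non-discharge pathway
    return "discharge"

def transcription_error(age, m):
    if age <= 21:
        return "admit"
    # Seeded bug: the ANC criterion is missing from the guard.
    if m["temp_high"] or m["crp_high"] or m["pct_high"] or m["ua_positive"]:
        return "evaluate"
    return "discharge"

def must_not_discharge(age, m):
    """Safety property as described in the text."""
    return age <= 21 or any(m.values())

def verify(encoding):
    """Check every age-and-marker combination; stop at the first counterexample."""
    checked = 0
    for age, flags in product(AGES, product([False, True], repeat=len(MARKERS))):
        m = dict(zip(MARKERS, flags))
        checked += 1
        if must_not_discharge(age, m) and encoding(age, m) == "discharge":
            return "violated", (age, m), checked
    return "holds", None, checked

print(verify(faithful))
print(verify(transcription_error))
```

Under this enumeration order the first counterexample for the faulty encoding is a 22-day-old whose only abnormal flag is the ANC, the same patient the text describes.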

Second, the American College of Obstetricians and Gynecologists guidance on acute-onset severe hypertension in pregnancy [29]: a sustained severe-range blood pressure (systolic at or above 160 or diastolic at or above 110) requires urgent antihypertensive therapy. The faithful encoding was proved to treat every such patient; an encoding whose trigger required both thresholds (a logical “and” in place of “or”) yielded a counterexample, a patient with an isolated severe-range reading in one dimension left untreated.

Third, the American Heart Association/American Stroke Association acute-ischemic-stroke guideline [30]: intravenous alteplase is contraindicated when blood pressure is at or above 185/110 or an intracranial haemorrhage is present. The faithful encoding was proved never to recommend alteplase under those conditions; an encoding whose contraindication guard omitted the diastolic threshold yielded a counterexample, a patient with a contraindicated diastolic pressure who would receive thrombolysis.

In each case verification certified the faithful encoding over its whole input space and localised the realistic transcription error to a specific, checkable patient, on clinical logic published by national bodies rather than constructed here.

4. Discussion

4.1. Principal Findings

This study measured where SMT verification and the methods used to assure clinical AI in practice each fail, on rule sets with known safety properties spanning medical and social-behavioural care and both single-encounter and longitudinal forms. Verification was sound and complete on the finite layer, with no verdict that contradicted ground truth. A frontier reasoning model, acting as a judge, matched it on detection across all 612 violations, including the longitudinal and computational stress items, with valid replayable witnesses; this is consistent with the architectural-limits results rather than contrary to them, since those results are worst-case and existence statements about what cannot be guaranteed, not claims about average-case detection. The methods that failed on detection were the two in routine use. Random testing returned 43 false statements of safety, and a blinded physician panel detected only 1 of 4 of the deepest violations on a matched subset; both failures fell on the deep and cross-encounter properties that sampling and unaided reasoning do not reach. The judge results across a capability range make the point concrete: the open-weights 8B model matched the frontier model exactly on the realistic clinical domains but produced 17 unsound verdicts on the computationally hard tier, where the frontier model produced none, and its detection declined with depth. Detection on realistic clinical logic is therefore capability-robust, while the soundness gap widens as model capability falls and as inputs harden. This pattern is capability-invariant in the sense that matters: the absence of a soundness guarantee is a property of learned, sampling-based detection rather than of any model version, a stronger model narrows the observed error rate without ever reaching a guarantee, and verification’s guarantee is independent of model capability entirely.

4.2. The Distinguishing Axis Is the Class of Evidence

The pre-registered hypothesis that SMT verification would detect more violations than the probabilistic methods held for unit testing but not for the language-model judge, which matched verification on detection. This result is reported directly. To rule out that the judge’s detection was an artifact of label exposure, the items were rendered with content-neutral identifiers and neutralised rule names that encode neither the label nor the difficulty; detection was unchanged under this rendering, so it reflects reasoning over the rule logic. The property that separates SMT verification is therefore not a higher detection rate but the class of evidence it returns: a proof over the entire input space, a replayable counterexample, or an explicit abstention, with zero unsound verdicts by construction. A pass rate over a sample, however high, is a statement about the inputs that were drawn, not the inputs that were not; the 43 false statements of safety from random testing illustrate that distinction, and a language model’s verdict, though accurate and witness-backed here, is a sample statistic with no coverage or soundness guarantee, obtained at roughly three orders of magnitude more wall-clock time per item. During the evaluation, an unsoundness in the verifier was found and fixed, in which a solver timeout was briefly reported as holds; that this could occur, and was caught by the soundness metric, is the same point applied to the verifier itself.

4.3. A Tiered Assurance Framework

These results motivate a maturity model, a position rather than an evaluated artifact, in which assurance methods occupy distinct evidence classes rather than competing as substitutes. Content-safety guardrails and language-model judges are probabilistic monitors, useful for breadth and for content not expressible as a rule, but providing no guarantee; unit testing adds concrete evidence for failures it happens to sample; runtime monitors enforce properties during execution; an SMT-verified rule layer provides a proof over the full input space where the state space is tractable; and a theorem-proved core provides a machine-checked proof for the smallest, most critical decisions, at the cost of the greatest specification and modelling effort. This study evaluated two of these tiers directly, unit testing and the SMT-verified rule layer, together with one probabilistic monitor, the language-model judge, and one human baseline; the runtime-monitor and theorem-proved tiers, and deployed content-safety guardrails, are positioned but not measured. The design question for a clinical AI system is which decisions are placed in the provable layer. A model release changes how good the model is; it does not change the decidability of a finite symbolic layer or the undecidability of the general case, because those are properties of the objects rather than of the model.

4.4. Comparison with Prior Work

The guaranteed-safe AI programme provides the macro-architecture this work instantiates for clinical rule sets [18,19], the intractability of neural-network verification is its premise [15], and recent agent-verification systems share its move of placing safety in a verifiable layer but evaluate general agent behaviour rather than clinical rule sets and run no head-to-head against probabilistic methods on a public clinical benchmark with longitudinal scope. Regulatory frameworks for AI-enabled medical software emphasise lifecycle management and predetermined change control [31,32], and an assurance standard for health AI has been articulated by a multi-stakeholder body [33]; an evidence class that is a machine-checked proof or an explicit abstention is directly auditable against such frameworks. A parallel literature pursues clinical-AI trustworthiness through post-hoc explainability, in which feature-attribution methods such as Shapley additive explanations are layered onto a predictive model to indicate which inputs drove a given output [34]. Such an explanation accounts for one output on one input; it is not a statement over the input space, and it is therefore complementary to, rather than a substitute for, a machine-checked proof that an invariant holds for every input. The work also differs from conversational clinical-AI benchmarks that grade a model’s free-text answers against physician rubrics over sampled dialogues, such as HealthBench [35]: those measure how good an answer is on the cases drawn, whereas the present benchmark verifies whether a control layer’s safety properties can be violated by any input, tests obligations that must persist across encounters, and returns a guarantee rather than a graded score; the two are complementary, evaluating the model’s outputs and the guardrails around it respectively. 
The present work differs from prior verification studies in object of study, the symbolic control layer of a longitudinal whole-person workflow, and releases the benchmark on which its measurement is reproducible.

4.5. Limitations

The content-safety guardrails were not run in this environment, so the language-model judge stands in for the learned-guard class; a judge given a clean rendering of the rule set is a different and more capable artifact than a content-safety filter operating on conversational text, so the comparison should be read as against an upper bound of that class rather than against deployed products. The judge arm comprises only two models, one open-weights (Qwen3-8B) and one frontier (GPT-5.5); a replication path for open-weights models is provided in the released harness for laboratories without commercial-API access. The benchmark rule sets are synthetic and finite over boolean, enum, and bounded-integer variables, whereas real clinical rule layers are larger and use real-valued and temporal quantities. The premise that a safety-critical clinical control layer reduces to such a decidable object is demonstrated here on constructed artifacts and on three published guidelines across specialties (Section 3.5), but not on a live deployed system. Verification cost also grows with the state space. The physician panel covered a 30-item subset of the benchmark and is reported as a complementary human baseline rather than over the full 832 items. What is not proved is therefore explicit: not the clinical correctness of the specification, not the faithfulness of the abstraction to a deployed system, and nothing outside the finite symbolic layer; the guarantee is conditional on the specification, and specification error is the residual risk [21,22].

5. Conclusions

Probabilistic assurance methods are statistical systems and cannot supply a proof over the input space. The dominant quality-assurance practice, unit testing, reports false safety when a violation’s witness is rare, and a blinded physician panel shows the same limit; on this benchmark both failed on the deep and cross-encounter properties that characterise longitudinal whole-person care. A frontier reasoning model, while accurate at detection here, returns a sample statistic with no guarantee and at substantially higher cost, and that distinction is capability-invariant rather than tied to a model version. SMT verification of a finite symbolic clinical rule layer returns a different class of evidence. It proves the invariant over the whole input space, exhibits a replayable counterexample, or abstains, and on this benchmark, under a discipline that reports a solver timeout as an abstention rather than as safety, it returned no verdict that contradicted ground truth. A machine-checked proof or an explicit abstention is directly auditable, a property relevant to the governance and deployment readiness of agentic clinical systems. The benchmark, verifier, and analysis are released so that any group can run the measurement on its own rule sets. The design question for clinical AI is which decisions live in the provable layer.
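As an illustration of the three evidence classes named above, the following minimal Python sketch checks a toy rule against a safety property and returns a proof by exhaustion, a replayable counterexample, or an abstention. It is not the released verifier: exhaustive enumeration over a small finite domain stands in for the SMT solver, an evaluation budget stands in for the solver timeout, and the names `verify`, `Verdict`, `buggy_rule`, `fixed_rule`, and `never_when_pregnant`, together with the toy variables and the threshold, are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Optional, Sequence

Assignment = Dict[str, int]


@dataclass(frozen=True)
class Verdict:
    status: str  # "HOLDS", "VIOLATED", or "ABSTAIN"
    counterexample: Optional[Assignment] = None


def verify(
    domains: Dict[str, Sequence[int]],
    rule: Callable[[Assignment], bool],
    safe: Callable[[Assignment, bool], bool],
    budget: int,
) -> Verdict:
    """Enumerate the finite input space (a stand-in for an SMT query).

    Returns HOLDS only if every input was checked, VIOLATED with the first
    violating input found, and ABSTAIN if the budget runs out first.
    """
    names = list(domains)
    checked = 0
    for values in product(*(domains[n] for n in names)):
        if checked >= budget:
            return Verdict("ABSTAIN")  # never report an unfinished search as safe
        checked += 1
        state = dict(zip(names, values))
        if not safe(state, rule(state)):
            return Verdict("VIOLATED", state)
    return Verdict("HOLDS")


# Toy input space (hypothetical): a pregnancy flag and a bounded integer lab value.
DOMAINS = {"pregnant": (0, 1), "egfr": tuple(range(0, 121))}


def buggy_rule(s: Assignment) -> bool:
    """Recommend hypothetical drug X; forgets to check pregnancy."""
    return s["egfr"] >= 30


def fixed_rule(s: Assignment) -> bool:
    """Recommend hypothetical drug X only when not pregnant."""
    return s["egfr"] >= 30 and not s["pregnant"]


def never_when_pregnant(s: Assignment, recommend: bool) -> bool:
    """Toy safety property: drug X is never recommended in pregnancy."""
    return not (s["pregnant"] and recommend)


if __name__ == "__main__":
    print(verify(DOMAINS, buggy_rule, never_when_pregnant, budget=10_000))
    print(verify(DOMAINS, fixed_rule, never_when_pregnant, budget=10_000))
    print(verify(DOMAINS, fixed_rule, never_when_pregnant, budget=10))
```

The counterexample is an ordinary input assignment, so it can be replayed through the rule to confirm the violation independently of the checker. When the budget is exhausted before the space is covered, the sketch reports ABSTAIN rather than HOLDS, mirroring the discipline of never reporting a timeout as safety.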

Author Contributions

Conceptualization, S.B. and R.B.; methodology, S.B.; software, S.B.; formal analysis, S.B. and P.S.; investigation, S.B., B.M., and J.M.; data curation, S.B. and B.M.; writing—original draft preparation, S.B.; writing—review and editing, P.S., B.M., J.M., and R.B.; visualization, S.B. and P.S.; supervision, R.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The human-expert panel was reviewed under WCG IRB Tracking ID 20253751 as methods research on synthetic, de-identified decision logic with no patient data, with a waiver of informed consent. The benchmark is synthetic and involves no human subjects.

Informed Consent Statement

Not applicable; the study used synthetic decision logic and no patient data.

Data Availability Statement

The benchmark, verifier, baseline runners, and analysis code are openly available at https://github.com/sanjaybasu/clinical-formal-verification under Apache-2.0 (code) and CC-BY-4.0 (benchmark), and are archived at Zenodo (https://doi.org/10.5281/zenodo.20671955). The per-item run outputs and results tables reported here are regenerated deterministically by that code from the released benchmark.

Acknowledgments

During the preparation of this manuscript the authors used a large language model for code generation, analysis, and a first text draft; the authors reviewed and edited the output and take responsibility for the content of this publication.

Conflicts of Interest

The authors are employees of Waymark, a public benefit organization that provides free social and medical services for patients receiving Medicaid.

References

1. Omar, M.; Sorin, V.; Collins, J.D.; et al. Large Language Models Are Highly Vulnerable to Adversarial Hallucination Attacks in Clinical Decision Support: A Multi-Model Assurance Analysis. medRxiv 2025.
2. Gangavarapu, A. Enhancing Guardrails for Safe and Secure Healthcare AI. arXiv 2024, arXiv:2409.17190.
3. Festor, P.; Jia, Y.; Gordon, A.C.; Faisal, A.A.; Habli, I.; Komorowski, M. Assuring the Safety of AI-based Clinical Decision Support Systems: A Case Study of the AI Clinician for Sepsis Treatment. BMJ Health Care Inform. 2022, 29, e100549.
4. Inan, H.; Upasani, K.; Chi, J.; et al. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv 2023, arXiv:2312.06674.
5. Rebedea, T.; Dinu, R.; Sreedhar, M.N.; Parisien, C.; Cohen, J. NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. In Proceedings of EMNLP 2023: System Demonstrations; ACL: Singapore, 2023; pp. 431–445.
6. Hackett, W.; Birch, L.; Trawicki, S.; Suri, N.; Garraghan, P. Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems. In Proceedings of the First Workshop on LLM Security (LLMSEC); ACL: Vienna, 2025; pp. 101–114.
7. Wang, X.; Ji, Z.; Wang, W.; Li, Z.; Wu, D.; Wang, S. SoK: Evaluating Jailbreak Guardrails for Large Language Models. arXiv 2025, arXiv:2506.10597.
8. Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models. arXiv 2024 (rev. 2025), arXiv:2401.11817.
9. Kalai, A.T.; Vempala, S.S. Calibrated Language Models Must Hallucinate. In Proceedings of STOC 2024; ACM, 2024; pp. 160–171.
10. Kalai, A.T.; Nachum, O.; Vempala, S.S.; Zhang, E. Evaluating Large Language Models for Accuracy Incentivizes Hallucinations. Nature 2026.
11. Banerjee, S.; Agarwal, A.; Singla, S. LLMs Will Always Hallucinate, and We Need to Live with This. In Intelligent Systems and Applications (IntelliSys 2025); LNNS; Springer: Cham, 2025; pp. 624–648.
12. Merrill, W.; Sabharwal, A. The Parallelism Tradeoff: Limitations of Log-Precision Transformers. Trans. Assoc. Comput. Linguist. 2023, 11, 531–545.
13. Merrill, W.; Sabharwal, A. The Expressive Power of Transformers with Chain of Thought. In Proceedings of ICLR 2024; 2024.
14. Dziri, N.; Lu, X.; Sclar, M.; et al. Faith and Fate: Limits of Transformers on Compositionality. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023); 2023; pp. 70293–70332.
15. Katz, G.; Barrett, C.; Dill, D.L.; Julian, K.; Kochenderfer, M.J. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification (CAV 2017); LNCS 10426; Springer, 2017; pp. 97–117.
16. Rice, H.G. Classes of Recursively Enumerable Sets and Their Decision Problems. Trans. Am. Math. Soc. 1953, 74, 358–366.
17. de Moura, L.; Bjørner, N. Z3: An Efficient SMT Solver. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2008); LNCS 4963; Springer, 2008; pp. 337–340.
18. Dalrymple, D.; Skalse, J.; Bengio, Y.; Russell, S.; Tegmark, M.; Seshia, S.; et al. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems. arXiv 2024, arXiv:2405.06624.
19. Tegmark, M.; Omohundro, S. Provably Safe Systems: The Only Path to Controllable AGI. arXiv 2023, arXiv:2309.01933.
20. Newcomb, A.; Ochoa, O. Formal Methods for Safety-Critical Machine Learning: A Systematic Literature Review. Front. Artif. Intell. 2026, 9.
21. De Millo, R.A.; Lipton, R.J.; Perlis, A.J. Social Processes and Proofs of Theorems and Programs. Commun. ACM 1979, 22, 271–280.
22. Clarke, E.M.; Wing, J.M. Formal Methods: State of the Art and Future Directions. ACM Comput. Surv. 1996, 28, 626–643.
23. Biere, A.; Cimatti, A.; Clarke, E.M.; Zhu, Y. Symbolic Model Checking without BDDs. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS 1999); LNCS 1579; Springer, 1999; pp. 193–207.
24. Sheeran, M.; Singh, S.; Stålmarck, G. Checking Safety Properties Using Induction and a SAT-Solver. In Formal Methods in Computer-Aided Design (FMCAD 2000); LNCS 1954; Springer, 2000; pp. 108–125.
25. Wilson, E.B. Probable Inference, the Law of Succession, and Statistical Inference. J. Am. Stat. Assoc. 1927, 22, 209–212.
26. McNemar, Q. Note on the Sampling Error of the Difference between Correlated Proportions or Percentages. Psychometrika 1947, 12, 153–157.
27. Fleiss, J.L. Measuring Nominal Scale Agreement among Many Raters. Psychol. Bull. 1971, 76, 378–382.
28. Pantell, R.H.; Roberts, K.B.; Adams, W.G.; et al. Evaluation and Management of Well-Appearing Febrile Infants 8 to 60 Days Old. Pediatrics 2021, 148, e2021052228.
29. American College of Obstetricians and Gynecologists. Committee Opinion No. 767: Emergent Therapy for Acute-Onset, Severe Hypertension During Pregnancy and the Postpartum Period. Obstet. Gynecol. 2019, 133, e174–e180.
30. Powers, W.J.; Rabinstein, A.A.; Ackerson, T.; et al. Guidelines for the Early Management of Patients With Acute Ischemic Stroke: 2019 Update. Stroke 2019, 50, e344–e418.
31. U.S. Food and Drug Administration. Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations (Draft Guidance). Docket FDA-2024-D-4488, 90 Fed. Reg. 1356; 2025.
32. U.S. Food and Drug Administration. Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence-Enabled Device Software Functions (Final Guidance). 2025.
33. Coalition for Health AI. Assurance Standards Guide and Assurance Reporting Checklist. 2024.
34. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017); 2017; pp. 4768–4777.
35. Arora, R.K.; Wei, J.; Soskin Hicks, R.; et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv 2025, arXiv:2505.08775.

Figure 1. Detection rate by depth for the four assurance methods on the 832-item benchmark, where depth is interaction depth for decision items, encounter depth for longitudinal items, and computation depth for the stress tier. SMT verification and the frontier language-model judge detect every violation at every depth, whereas random unit testing and the open-weights judge decline as depth increases. Deeper bins contain fewer items and are correspondingly noisier; the unit-test recovery at depth eight reflects the differing item composition of that bin rather than a reversal of the trend.

Table 1. Detection, false alarm, witness validity, abstention, soundness, and per-item time on the 832-item benchmark (612 violated, 220 holds). Proportions carry 95% Wilson score intervals. Unsound verdicts are decisive verdicts that contradict ground truth. The judge arm spans a capability range, from an open-weights 8B model to a frontier proprietary model.

Method	Detection % [95% CI]	False alarm % [95% CI]	Witness validity %	Abstentions	Unsound	Mean s/item
SMT verification	100.0 [99.4, 100.0]	0.0 [0.0, 1.7]	100.0	0	0	0.003
Unit testing (1000 samples)	93.0 [90.7, 94.7]	0.0 [0.0, 1.7]	100.0	0	43	0.004
Language-model judge (frontier, GPT-5.5)	100.0 [99.4, 100.0]	0.0 [0.0, 1.7]	100.0	0	0	5.9
Language-model judge (open-weights, Qwen3-8B)	98.4 [97.0, 99.1]	5.0 [2.8, 8.7]	93.1	0	17	30.2
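
The Wilson score intervals in Table 1 [25] can be recomputed from counts with a few lines of Python. The sketch below is illustrative and is not taken from the released analysis code; the function name `wilson_interval` is hypothetical, and the count of 569 detections for unit testing is inferred here from the reported 93.0% of 612 violated items rather than stated in the table.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    cases = [
        ("detection, 612 of 612", 612, 612),
        ("false alarm, 0 of 220", 0, 220),
        ("unit testing, 569 of 612 (inferred count)", 569, 612),
    ]
    for label, k, n in cases:
        lo, hi = wilson_interval(k, n)
        print(f"{label}: [{100 * lo:.1f}, {100 * hi:.1f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays informative when the observed proportion is exactly 0 or 1, which is why a method with no observed misses still carries a lower bound below 100%.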

Table 2. Detection rate (%) by depth (interaction depth for decision items, encounter depth for longitudinal items, computation depth for stress items). Verification and the frontier judge are 100% at every depth; unit testing and the open-weights judge decline with depth (unit-test logistic slope -0.59, 95% CI -0.71 to -0.46). Deeper bins carry fewer items, so individual deep cells are noisier than the trend.

Method	depth ≤4	depth 6	depth 8	depth 10	depth 12
SMT verification	100	100	100	100	100
Unit testing (1000 samples)	100	56.5	100	50.0	35.7
Language-model judge (frontier, GPT-5.5)	100	100	100	100	100
Language-model judge (open, Qwen3-8B)	100	95.2	85.7	85.7	78.6
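
The decline of random unit testing with depth can be illustrated with a simple model, offered as an assumption rather than as the benchmark's construction. If a violation is triggered only when d independent boolean inputs all take one specific value, a uniformly drawn input is a witness with probability 2^-d, and N independent samples find at least one witness with probability 1 - (1 - 2^-d)^N. The sketch below evaluates this for N = 1000; the function name is hypothetical, and the model is not expected to reproduce the Table 2 cells, whose bins differ in item composition.

```python
def detection_probability(witness_fraction: float, samples: int) -> float:
    """Chance that at least one of `samples` uniform draws hits a witness."""
    if not 0.0 <= witness_fraction <= 1.0:
        raise ValueError("witness_fraction must lie in [0, 1]")
    return 1.0 - (1.0 - witness_fraction) ** samples


if __name__ == "__main__":
    n_samples = 1000  # matches the unit-testing arm's sample count
    for depth in (4, 6, 8, 10, 12):
        # Assumed toy model: one violating assignment among 2**depth inputs.
        q = 2.0 ** -depth
        print(f"depth {depth:2d}: P(detect) = {detection_probability(q, n_samples):.3f}")
```

Under this toy model the detection probability falls as depth grows for a fixed sample count, whereas an exhaustive or symbolic check is unaffected by how rare the witness is.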


Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.