Submitted: 02 June 2026
Posted: 02 June 2026
Abstract
Keywords:
1. Introduction
1.1. The Deployment Trust Gap in Agentic AI
1.2. Why Existing Solutions Are Insufficient
1.3. HB-Eval as a Reliability Operating System
1.4. Contributions
2. Related Work
2.1. Agentic AI Benchmarking Under Nominal Conditions
2.2. Reliability-Focused Benchmarks
2.3. Constraint Satisfaction Failures
2.4. Adversarial Robustness and Fault Tolerance
2.5. Safety Engineering Standards and AI Certification
2.6. Agent Observability and Monitoring Platforms
3. Mathematical Framework
3.1. Foundational Constructs
3.2. Metric 1: Failure Resilience Rate (FRR)
3.3. Metric 2: Planning Efficiency Index (PEI)
3.4. Metric 3: Intentional Recovery Score (IRS)
3.5. Metric 4: Traceability Index (TI)
3.6. Metric 5: Consistency Stability Index (CSI)
3.7. Unified SIL/ASIL Certification Table
4. Triple-Methodology Validation Design
4.1. Shared Design Principles
4.2. Methodology A: Behavioural Trajectory Analysis
4.2.1. Model Selection
4.2.2. Domain and Task Design
4.2.3. Success Criterion
4.3. Methodology B: Three-Layer Constraint Verification
4.3.1. Model Selection
4.3.2. Layer 1: JSON Extraction
4.3.3. Layer 2: Deterministic Constraint Verification
4.3.4. Layer 3: Safety Judge and Composite Score
4.4. Methodology C: Closed-Weight Validation with Independent Judge
4.4.1. Model Access
4.4.2. Independent Judge Design
4.4.3. Gemini 2.5 Flash Note
5. Experimental Results
5.1. Methodology A: Behavioural Trajectory Analysis
| Model | Params | Reliability | 95% CI | FRR | IRS | TI |
|---|---|---|---|---|---|---|
| Llama-3.3-70B | 70B | 42.2% | ±3.06% | 0.45 | 0.21 | 3.82 |
| Llama-3.1-8B | 8B | 35.5% | ±2.97% | 0.37 | 0.19 | 3.61 |
| Gemma-2-9B | 9B | 30.8% | ±2.86% | 0.33 | 0.18 | 3.54 |
| DeepSeek-R1-70B | 70B | 36.2% | ±2.98% | 0.39 | 0.20 | 3.71 |
| Llama-3.1-70B | 70B | 36.2% | ±2.98% | 0.38 | 0.20 | 3.68 |
| Mixtral-8x7B | 47B* | 36.2% | ±2.98% | 0.37 | 0.20 | 3.66 |
| Aggregate | — | 36.2% | ±1.49% | 0.38 | 0.20 | 3.67 |

*Effective parameters; total is 56B across all experts.
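
The per-model 95% CI values above are consistent with a normal-approximation (Wald) interval for a binomial proportion. The sketch below reproduces them under the assumption of 1,000 evaluations per model, that is, the 6,000 Methodology A evaluations split evenly across six models. The even split and the helper name `wald_half_width` are illustrative assumptions, not details stated in the text.

```python
import math


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


# Reliability values copied from the table above.
reported_reliability = {
    "Llama-3.3-70B": 0.422,
    "Llama-3.1-8B": 0.355,
    "Gemma-2-9B": 0.308,
    "DeepSeek-R1-70B": 0.362,
}

N_PER_MODEL = 1000  # assumption: 6,000 evaluations / 6 models

for model, p in reported_reliability.items():
    half_width_pct = 100 * wald_half_width(p, N_PER_MODEL)
    print(f"{model}: {100 * p:.1f}% ± {half_width_pct:.2f}%")
```

With these inputs the half-widths come out as 3.06, 2.97, 2.86, and 2.98 percentage points, matching the table. The same formula with n = 6,000 gives roughly ±1.22% for the aggregate row rather than the reported ±1.49%, so the aggregate interval may rest on a different sample size or method.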

5.1.1. Intentional Recovery Score Results
5.2. Methodology B: Three-Layer Constraint Verification
| Model | Params | Binary Rel. | 95% CI | Composite | Avg. Violations |
|---|---|---|---|---|---|
| Llama-4-Maverick-17B | 17B | 73.0% | ±2.77% | 0.89 | 0.15 |
| GPT-OSS-120B | 120B | 70.9% | ±2.83% | 0.81 | 0.57 |
| Llama-4-Scout-17B | 17B | 61.4% | ±3.03% | 0.82 | 0.19 |
| Qwen3-32B | 32B | 44.2% | ±3.10% | 0.73 | 0.99 |
| Llama-3.3-70B | 70B | 32.1% | ±2.92% | 0.53 | 1.60 |


| Domain | | | Difference | p-value | Sig. |
|---|---|---|---|---|---|
| Cybersecurity | 99.0% | 86.7% | −12.3 pp | | *** |
| Emergency Response | 67.1% | 45.2% | −22.0 pp | | *** |
| Medical | 78.5% | 65.2% | −13.3 pp | | *** |
| Logistics | 30.5% | 19.9% | −10.6 pp | | ** |
| Robotics | 55.9% | 51.9% | −4.0 pp | | n.s. |
| Wtd. avg. | — | — | −12.5 pp | — | — |

Significance markers: ***; **; n.s. (not significant).
5.2.1. Cascade Fault Analysis

5.3. Methodology C: Closed-Weight Validation
| Domain | GPT-4o | Claude 3.5 | 95% CI | Sig. | Gemini |
|---|---|---|---|---|---|
| Cybersecurity | pp | pp | pp | *** | pp |
| Emergency Response | pp | pp | pp | *** | † |
| Robotics | pp | pp | pp | *** | † |
| Medical | pp | pp | pp | *** | † |
| Logistics | pp | pp | pp | n.s. | † |
| Wtd. avg. | pp | pp | — | — | pp* |
| Binary Rel. | 45.9% | 79.5% | — | — | 6.9% |
| Cascade penalty | pp | pp | — | — | pp |
5.3.1. Causal Ablation Study
6. Convergent Evidence
7. Case Study: Live Evaluation of a Production Gemini API Agent
8. Evaluation-Driven Memory and Interpretability
8.1. EDM Formal Specification
8.2. HCI-EDM: Performance-Grounded Interpretability
9. Certification Framework
| Model | Best reliability | Max Tier | | Gap to Tier 3 |
|---|---|---|---|---|
| Claude 3.5 Sonnet | 79.5% | Tier 1 | 0.91 | −15.5 pp |
| Llama-4-Maverick | 73.0% | Tier 1 | 0.89 | −22.0 pp |
| GPT-OSS-120B | 70.9% | Tier 1 | 0.85 | −24.1 pp |
| GPT-4o | 45.9% | Tier 1 | 0.72 | −49.1 pp |
| Llama-3.3-70B | 42.2% | Tier 1 | 0.99 | −52.8 pp |
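
Every entry in the Gap to Tier 3 column equals the model's best reliability minus 95 percentage points, which suggests a 95% reliability threshold for Tier 3. That threshold is inferred from the rows above rather than stated in the extracted text, so the constant `TIER3_THRESHOLD_PCT` and the helper `gap_to_tier3` below are assumptions made for illustration.

```python
TIER3_THRESHOLD_PCT = 95.0  # assumption: inferred from the table, not stated in the text

# Best reliability values (percent) copied from the table above.
best_reliability_pct = {
    "Claude 3.5 Sonnet": 79.5,
    "Llama-4-Maverick": 73.0,
    "GPT-OSS-120B": 70.9,
    "GPT-4o": 45.9,
    "Llama-3.3-70B": 42.2,
}


def gap_to_tier3(best_pct: float, threshold_pct: float = TIER3_THRESHOLD_PCT) -> float:
    """Signed distance to the assumed Tier 3 threshold, in percentage points."""
    # Round to one decimal to match the table and hide float noise.
    return round(best_pct - threshold_pct, 1)


for model, best in best_reliability_pct.items():
    print(f"{model}: {gap_to_tier3(best):+.1f} pp")
```

A negative value means the model falls short of the assumed threshold; all five models do.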
10. HB-Eval OS Engineering Architecture
10.1. Evaluation Gateway

10.2. EDM Store
10.3. Production SDK
Listing 1. SDK installation.
Listing 2. Core SDK interface.
10.4. LangChain Integration
Listing 3. Zero-instrumentation LangChain integration.
10.5. Security Architecture
| Property | Mechanism | Threat Addressed |
|---|---|---|
| Payload confidentiality | AES-256-GCM | Eavesdropping |
| Request authenticity | HMAC-SHA256 | Tampering |
| Replay prevention | Nonce + 300s | Replay attacks |
| Transport security | TLS 1.3 | MITM |
| Safe Halt | Callback | Unsafe output |
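
Two rows of the table, request authenticity (HMAC-SHA256) and replay prevention (nonce plus a 300-second window), can be sketched with the Python standard library alone. The message layout, the names `sign_request` and `ReplayGuard`, and the in-memory nonce cache are hypothetical choices made for this example; the actual HB-Eval wire format is not given in the text. Payload encryption (AES-256-GCM) and TLS are left out because they need a third-party library or live at the transport layer.

```python
import hashlib
import hmac
import secrets
import time
from typing import Dict, Optional

REPLAY_WINDOW_S = 300  # from the table: nonce accepted within 300 s


def sign_request(key: bytes, body: bytes, nonce: str, timestamp: int) -> str:
    """HMAC-SHA256 over timestamp, nonce and body (hypothetical message layout)."""
    message = f"{timestamp}.{nonce}.".encode() + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    """Rejects tampered, stale, or repeated requests (in-memory sketch)."""

    def __init__(self, window_s: int = REPLAY_WINDOW_S) -> None:
        self.window_s = window_s
        self._seen: Dict[str, int] = {}  # nonce -> request timestamp

    def verify(self, key: bytes, body: bytes, nonce: str, timestamp: int,
               signature: str, now: Optional[int] = None) -> bool:
        now = int(time.time()) if now is None else now
        # Forget nonces whose timestamps can no longer pass the freshness check.
        self._seen = {n: t for n, t in self._seen.items()
                      if now - t <= self.window_s}
        if abs(now - timestamp) > self.window_s:
            return False  # outside the freshness window
        expected = sign_request(key, body, nonce, timestamp)
        if not hmac.compare_digest(expected, signature):
            return False  # body, nonce, or timestamp was altered
        if nonce in self._seen:
            return False  # replay of an already accepted request
        self._seen[nonce] = timestamp
        return True


key = secrets.token_bytes(32)
body = b'{"project_id": "demo"}'
nonce = secrets.token_hex(16)
ts = int(time.time())
signature = sign_request(key, body, nonce, ts)

guard = ReplayGuard()
print(guard.verify(key, body, nonce, ts, signature))  # first delivery
print(guard.verify(key, body, nonce, ts, signature))  # same request replayed
```

The signature is compared with `hmac.compare_digest` to avoid leaking information through timing. A production deployment would need a shared nonce store rather than a per-process dictionary.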
11. Discussion
11.1. The Constraint Satisfaction Bottleneck
11.2. Convergence with Independent Reliability Research
11.3. The Integrated Architecture as a Closed Loop
11.4. Research Agenda: Toward Tier 2 Qualification
11.5. Three Deployment Principles
12. Threats to Validity
12.1. Scope and Methodological Limitations
12.2. Construct Validity
12.3. Internal Validity
12.4. External Validity
12.5. Memory Security in EDM
12.6. Limitations of CSI
13. Future Research Directions
13.1. Near-Term: Strengthening the Current Framework (6–18 Months)
13.2. Medium-Term: Agent Identity and Behavioural Certification (1–3 Years)
13.3. Long-Term: Trust Infrastructure for Agentic AI Ecosystems (3–5 Years)
14. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- IEC 61508; Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. Technical report; International Electrotechnical Commission, 2010.
- Kaijser, H.; Lonn, H. Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry. arXiv 2019, arXiv:1812.05389.
- Brookings Institution and Carnegie Mellon University; UC Berkeley. How Can We Best Evaluate Agentic AI? Workshop Report. Brookings Institution, Washington D.C., 2026; Available online: https://www.brookings.edu/articles/how-can-we-best-evaluate-agentic-ai/.
- Yao, S.; Shinn, N.; Razavi, P.; Narasimhan, K. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv 2024, arXiv:2406.12045.
- Roig, J. Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 Billion Tokens of Agentic AI Evaluations. arXiv 2025, arXiv:2511.08042.
- Adam, A.M.I. HB-Eval: Distinguishing Capability from Reliability in Safety-Critical Agentic AI Through Convergent Triple-Methodology Validation. In Proceedings of the 45th International Conference on Computer Safety, Reliability, and Security (SafeComp 2026), 2026. Under review.
- Adam, A.M.I. Adapt-Plan: A Hybrid Control Architecture for PEI-Guided Adaptive Planning in Dynamic Agentic Environments. Preprints.org 2026.
- Adam, A.M.I. Eval-Driven Memory (EDM): A Persistence Governance Layer for Reliable Agentic AI via Metric-Guided Selective Consolidation. Preprints.org 2025.
- Adam, A.M.I. HCI-EDM: Performance-Grounded Interpretability: Exposing Evaluation-Certified Agent Behavior through Evaluation-Driven Memory. Preprints.org 2026.
- Liu, X.; et al. AgentBench: Evaluating LLMs as Agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.
- Mialon, G.; et al. GAIA: A Benchmark for General AI Assistants. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.
- Zhou, S.; et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.
- Qin, Y.; et al. ToolLLM: Facilitating Large Language Models to Master 16,000+ Real-World APIs. In Proceedings of the International Conference on Learning Representations (ICLR), 2024.
- Shinn, N.; et al. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Madaan, A.; et al. Self-Refine: Iterative Refinement with Self-Feedback. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Barres, V.; Dong, H.; Ray, S.; Si, X.; Narasimhan, K. τ2-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv 2025, arXiv:2506.07982.
- Heyman, G.; et al. Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation. arXiv 2025, arXiv:2604.28031.
- Liu, X.; et al. Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents. arXiv 2026, arXiv:2601.22311.
- Chen, W.; et al. Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search. arXiv 2025, arXiv:2510.08992.
- Carlini, N.; et al. Are Aligned Neural Networks Adversarially Aligned? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Wang, B.; et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Zhu, K.; et al. PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts. In Proceedings of the Findings of the Association for Computational Linguistics, 2023.
- Xu, Z.; et al. Noise Injection Systemically Degrades Large Language Model Safety Guardrails. arXiv 2025, arXiv:2505.13500.
- Avizienis, A.; Laprie, J.C.; Randell, B.; Landwehr, C. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 2004, 1, 11–33.
- Turpin, M.; Michael, J.; Bowman, S.R. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. arXiv 2023, arXiv:2305.04388.
- Lanham, T.; et al. Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv 2023, arXiv:2307.13702.
- ISO 26262; Road Vehicles—Functional Safety. Technical report; International Organization for Standardization, 2018.
- Software Considerations in Airborne Systems and Equipment Certification. RTCA DO-178C; Technical report, RTCA. 2011.
- Laprie, J.C. Dependable Computing: Concepts, Limits, Challenges. In Proceedings of the FTCS-25 Supplemental Volume, 1995.
- Hernández-Orallo, J.; et al. Safety Integrity Levels for Artificial Intelligence. Technical report, Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, 2023. Available: ResearchGate.
- Kwiatkowska, M.; Zhang, X. When to Trust AI: Advances and Challenges for Certification of Neural Networks. arXiv 2023, arXiv:2309.11196.
- Kurd, Z.; Kelly, T. Establishing Safety Criteria for Artificial Neural Networks. In Proceedings of the International Conference on Knowledge-Based Intelligent Information and Engineering Systems, 2003.
- LangChain. LangSmith: Observability and Evaluation Platform for LLM Applications. 2025. Available online: https://smith.langchain.com.
- Langfuse. Langfuse: Open-Source LLM Engineering Platform. 2025. Available online: https://langfuse.com.
- Arize AI. Phoenix: Open-Source AI Observability Platform. 2025. Available online: https://phoenix.arize.com.
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
- Montgomery, D.C. Introduction to Statistical Quality Control, 8th ed.; John Wiley & Sons: Hoboken, NJ, 2020.
- Srivastava, A.; He, J. A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty. arXiv 2025, arXiv:2604.16548.

| Metric / Criterion | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| | Supervised | Prod. + Oversight | Autonomous |
| Aggregate | | | |
| PEI | | | |
| IRS | | | |
| FRR | | | |
| TI | | | |
| CSI† | | | |
| Domain min. | No dom. | All | All |
| Avg. violations/scenario | | | |
| Adversarial resistance | Unspec. | | |
| Cascade penalty | pp | pp | pp |
| Bayesian | | | |
| SIL (IEC 61508) | Uncert.–SIL 1 | SIL 1–2 | SIL 2–3 |
| ASIL (ISO 26262) | QM–ASIL A | ASIL A–C | ASIL B–D |

†Provisional; see Section 12.6.

| Parameter | Meth. A | Meth. B | Meth. C |
|---|---|---|---|
| Total evaluations | 6,000 | 4,998 | 3,002 |
| Models evaluated | 6 | 5 | 3 |
| Domains | 6 | 5 | 5 |
| API access | Groq | Groq | OpenRouter + Google AI |
| Layer 3 judge | Self | Self | Independent (Maverick) |
| Primary metrics | FRR, IRS, PEI, TI | Composite, violations | Binary, cascade |
| Domain | | | Difference |
|---|---|---|---|
| Mathematics | 78.3% | 98.5% | −20.2 pp |
| Cybersecurity | 52.1% | 93.4% | −41.3 pp |
| Robotics | 48.7% | 89.1% | −40.4 pp |
| Medical | 31.2% | 76.3% | −45.1 pp |
| Logistics | 19.6% | 71.8% | −52.2 pp |
| Emergency Response | 16.4% | 68.2% | −51.8 pp |
| Aggregate | 36.2% | 82.9% | −46.7 pp |
| Dir. | Field | Type | Description |
|---|---|---|---|
| Req. | project_id | string | Project identifier |
| | run_id | string | UUID for this run |
| | events | array | Ordered event log |
| | output | string | Agent final response |
| | success | boolean | Agent-reported outcome |
| Resp. | safe | boolean | SAFE / UNSAFE verdict |
| | pei | float | PEI |
| | irs | float | IRS |
| | frr | float | FRR |
| | ti | float | TI |
| | csi | float | CSI |
| | attribution | string | Failure code (opt.) |
| | edm_stored | boolean | EDM admission result |
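
The request and response fields in the table above map naturally onto typed records. The following is a minimal sketch using Python dataclasses: the field names and types come from the table, while the class names, the helper `parse_response`, the structure of individual events, and all sample values are hypothetical and do not describe the actual SDK.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EvaluationRequest:
    project_id: str
    events: List[Dict[str, Any]]  # ordered event log; event shape is an assumption
    output: str                   # agent final response
    success: bool                 # agent-reported outcome
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class EvaluationResponse:
    safe: bool                    # SAFE / UNSAFE verdict
    pei: float
    irs: float
    frr: float
    ti: float
    csi: float
    edm_stored: bool              # EDM admission result
    attribution: Optional[str] = None  # failure code, optional per the table


def parse_response(raw: str) -> EvaluationResponse:
    """Build a response record from JSON; unknown keys raise TypeError."""
    return EvaluationResponse(**json.loads(raw))


request = EvaluationRequest(
    project_id="demo-project",
    events=[{"step": 1, "action": "plan"}],
    output="done",
    success=True,
)
print(json.dumps(asdict(request), indent=2))

# Placeholder metric values, for illustration only.
raw_response = json.dumps({
    "safe": True, "pei": 0.5, "irs": 0.5, "frr": 0.5,
    "ti": 0.5, "csi": 0.5, "edm_stored": False,
})
print(parse_response(raw_response))
```

Giving `run_id` a default factory keeps each run uniquely identified without the caller having to generate a UUID, and defaulting `attribution` to `None` reflects its optional status in the table.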
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


