Preprint
Article

This version is not peer-reviewed.

HB-Eval: A System-Level Reliability Evaluation and Certification Framework for Agentic AI

Submitted: 22 December 2025

Posted: 24 December 2025


Abstract
Current evaluation paradigms for agentic AI focus predominantly on task success rates under nominal conditions. This creates a critical blind spot: agents may succeed under ideal circumstances while failing catastrophically under stress. We propose HB-Eval, a methodology for measuring behavioral reliability through three complementary metrics: Failure Resilience Rate (FRR), which quantifies recovery from systematic fault injection; Planning Efficiency Index (PEI), which measures trajectory optimality against oracle-verified paths; and Traceability Index (TI), which evaluates reasoning transparency via a calibrated LLM-as-a-Judge (κ = 0.82 with human consensus). Through systematic evaluation across 500 episodes spanning three strategically selected domains (logistics, healthcare, coding), we demonstrate a 42.9 percentage point reliability gap between nominal success rates and stressed performance for baseline architectures. We introduce an integrated resilience architecture combining Eval-Driven Memory (EDM) for selective experience consolidation, Adaptive Planning for PEI-guided recovery, and Human-Centered Explainability (HCI-EDM) for trust calibration. This closed-loop system achieves 94.2% ± 2.1% FRR, a statistically significant improvement over baselines (Cohen’s d = 3.28, p < 0.001), and provides a methodology for moving agentic AI from capability demonstrations to reliability-certified deployment. We conclude by proposing a three-tier certification framework and identifying critical research directions for community validation.
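To make the metrics above concrete, the following is a minimal Python sketch of how FRR, PEI, and the reliability gap might be computed. The abstract does not give formal definitions, so everything here is an assumption for illustration: the `Episode` record layout, FRR as recovered faults divided by injected faults, PEI as the ratio of oracle path length to agent path length (capped at 1), and the gap as a simple difference of success percentages. The paper's actual formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluation episode (hypothetical record layout, not from the paper)."""
    faults_injected: int   # faults injected during the stressed run
    faults_recovered: int  # injected faults the agent recovered from
    oracle_steps: int      # length of the oracle-verified path
    agent_steps: int       # length of the agent's actual trajectory


def failure_resilience_rate(episodes):
    """Assumed FRR: recovered faults / injected faults, as a percentage."""
    injected = sum(e.faults_injected for e in episodes)
    recovered = sum(e.faults_recovered for e in episodes)
    if injected == 0:
        return float("nan")  # undefined when no faults were injected
    return 100.0 * recovered / injected


def planning_efficiency_index(episode):
    """Assumed PEI: oracle path length / agent path length, capped at 1.0."""
    if episode.agent_steps <= 0:
        return 0.0
    return min(1.0, episode.oracle_steps / episode.agent_steps)


def reliability_gap(nominal_success_pct, stressed_success_pct):
    """Assumed gap: nominal minus stressed success, in percentage points."""
    return nominal_success_pct - stressed_success_pct


# Illustrative values only; these are not results from the paper.
episodes = [
    Episode(faults_injected=4, faults_recovered=3, oracle_steps=5, agent_steps=10),
    Episode(faults_injected=4, faults_recovered=1, oracle_steps=6, agent_steps=6),
]
frr = failure_resilience_rate(episodes)
pei_values = [planning_efficiency_index(e) for e in episodes]
gap = reliability_gap(80.0, 50.0)
```

Under these assumptions, an agent that takes twice as many steps as the oracle path scores a PEI of 0.5, and the gap is reported in percentage points rather than as a relative change, matching the way the abstract phrases it.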
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.


© 2025 MDPI (Basel, Switzerland) unless otherwise stated