Submitted: 22 December 2025
Posted: 24 December 2025
Abstract
Keywords:
1. Introduction
1.1. The Evaluation Gap in Agentic AI
1.2. Positioning: From Benchmarking to Behavioral Diagnostics
1.3. The Integrated Resilience Architecture
- HB-Eval (Diagnostic Core): Quantifies reliability through FRR (resilience), PEI (efficiency), and TI (transparency). Functions as the system’s sensory mechanism for detecting behavioral degradation.
- Eval-Driven Memory (EDM): Implements selective consolidation that stores only high-quality experiences, achieving 88% Memory Precision versus 45% for unfiltered storage.
- Adaptive Planning: Uses PEI as a control signal, triggering strategic replanning when efficiency degrades below a threshold.
- Human-Centered Interface (HCI-EDM): Grounds explanations in quantitative performance evidence from certified episodes, achieving 91% transparency index in human evaluation.
1.4. Research Contributions
- Evaluation Methodology: Establish rigorous protocols for measuring behavioral reliability through systematic fault injection, with calibrated metrics validated against human expert judgment (oracle agreement, judge calibration).
- Empirical Characterization: Quantify the reliability gap across three strategically selected domains, demonstrating that success rates overestimate deployment readiness by 43 percentage points for baseline architectures.
- Architectural Validation: Introduce an integrated resilience architecture achieving 94.2% FRR, with comprehensive ablation studies isolating component contributions (memory: +31%, quality filtering: +15%, confidence bounds: +7%, safety protocols: +5%).
- Failure Taxonomy: Characterize 24 systematic failure modes, establishing empirical bounds on achievable reliability for single-agent systems and identifying conditions requiring human oversight.
- Certification Framework: Propose three-tier deployment standards with explicit thresholds, providing a roadmap for community consensus on acceptable reliability levels per application domain.
1.5. Scope and Limitations
- Scale: Current dataset is intentionally constrained to validate methodology before industrial scaling to 1,500+ tasks (Phase 2).
- Domain Coverage: Three domains selected for strategic complementarity; extension to multi-modal reasoning and multi-agent coordination remains future work.
- Infrastructure Bias: Integrated architecture designed with knowledge of evaluation metrics; independent community validation is essential.
2. Problem Formulation and Research Questions
2.1. Formal Definition of Agentic Systems
- Planning and reasoning mechanisms
- Internal memory structures
- Available tools and actions
- Adaptation and learning policies
2.2. The Reliability Gap
2.3. Behavioral Reliability Properties
- Failure Detection: Recognize when plans deviate from expected trajectories
- Strategic Adaptation: Modify plans efficiently without cascading errors
- Bounded Recovery: Return to goal-directed behavior within acceptable timeframes
- Reasoning Transparency: Maintain interpretable decision processes under stress
- Safe Escalation: Recognize unrecoverable conditions and defer to human oversight
2.4. The Intentionality Hypothesis
3. Related Work and Critical Positioning
3.1. Evolution of Agent Architectures
3.2. Evaluation Landscape: Capabilities vs. Reliability
3.3. Trustworthiness and Safety
3.4. Fault Injection and Robustness
3.5. LLM-As-a-Judge and Calibration
| Framework | Task-Level | Process-Level | Fault Injection | Memory Quality | Trust Calibration |
|---|---|---|---|---|---|
| AgentBench | ✓ | × | × | × | × |
| GAIA | ✓ | × | × | × | × |
| WebArena | ✓ | Limited | × | × | × |
| DecodingTrust | × | ✓ | Limited | × | × |
| HELM | ✓ | ✓ | × | × | × |
| HB-Eval | ✓ | ✓ | ✓ | ✓ | ✓ |
4. Methodology: The HB System Resilience Loop
4.1. Architectural Overview: Closed-Loop Certification
- Failure Detection: Environmental faults trigger behavioral anomalies
- Transparency Verification: HCI-EDM validates explainability criteria
- Memory Retrieval: EDM provides quality-constrained past experiences
- Adaptive Recovery: Planning module generates corrective strategies
- Certification Audit: HB-Eval measures outcome and updates memory
4.2. Core Evaluation Metrics
4.2.1. Failure Resilience Rate (FRR)
4.2.2. Planning Efficiency Index (PEI)
- Automated Generation: GPT-4o with full environment access generates 3 candidate plans per task
- Expert Validation: Two independent domain experts (a PhD in AI with 8 years of experience; a Senior Systems Engineer with 10 years of experience) select the minimal valid plan
- Inter-Annotator Reliability: Cohen’s κ (substantial agreement)
- Conflict Resolution: A third senior expert adjudicates disagreements (4.6% of cases, 23 out of 500)
- Feasibility Validation: Executed paths on random 50-task sample, achieving 100% success (no spurious optima)
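The inter-annotator agreement step above can be illustrated with a minimal sketch of Cohen's κ for two annotators. The function name and the toy labels are assumptions made for illustration; they are not the study's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical example: each expert picks candidate plan 0, 1, or 2 for ten tasks.
expert_1 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
expert_2 = [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]
kappa = cohens_kappa(expert_1, expert_2)
```

In this toy case the experts agree on 8 of 10 tasks, and chance agreement is 0.34, so κ is about 0.70, which falls in the range conventionally described as substantial agreement.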
4.2.3. Traceability Index (TI)
- Model: GPT-4o (gpt-4o-2024-08-06), Temperature: 0.0
- 5-point scale: 1 (unrelated) to 5 (comprehensive justification)
- Three human experts independently scored 100 episodes (20% of dataset)
- Inter-rater reliability: Fleiss’ κ (substantial agreement)
- Judge calibrated to match majority vote of experts
- Validation on a held-out set: Pearson correlation and Cohen’s κ between judge and expert scores
- Prompt engineered to avoid length/position bias
- Random order presentation
- Blind evaluation (no architecture information)
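For the three-expert scoring described above, agreement among more than two raters is usually summarized with Fleiss' κ. The following is a minimal sketch under the assumption that every episode receives the same number of ratings; the input layout (`rating_counts`) and the toy data are illustrative, not taken from the study.

```python
def fleiss_kappa(rating_counts):
    """rating_counts[i][j] = number of raters who gave item i the score j."""
    n_items = len(rating_counts)
    n_raters = sum(rating_counts[0])
    if any(sum(row) != n_raters for row in rating_counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(rating_counts[0])
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_counts
    ]
    p_bar = sum(p_items) / n_items
    # Overall proportion of ratings falling in each category.
    p_cat = [
        sum(row[j] for row in rating_counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical example: 3 raters, 5-point scale (columns = scores 1..5), 4 episodes.
toy_counts = [
    [0, 0, 0, 1, 2],
    [0, 0, 0, 3, 0],
    [0, 0, 2, 1, 0],
    [0, 3, 0, 0, 0],
]
kappa = fleiss_kappa(toy_counts)
```

Fleiss' κ treats the score categories as nominal; a weighted statistic may be more appropriate for an ordinal 5-point scale, which is a design choice the source does not specify.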
4.3. Extended Fault Injection Testbed (FIT v2)
| Fault Type | Mechanism | Duration | Real-World Analog |
|---|---|---|---|
| | Tool returns HTTP 500 / timeout | 1-2 steps | API service outage |
| | Inject contradictory data points | Persistent | Database corruption, conflicting sources |
| | Random action blocking (50% probability) | 2-3 steps | Network jitter, intermittent failures |
| | Simultaneous | 1-2 steps | Cascading system degradation |
| | Malicious tool response with plausible data | 1 step | Security attack, data poisoning |
| | Sequential tool failures (3+ tools) | 3-4 steps | Infrastructure collapse |
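The first and third mechanisms in the table (a hard tool outage and random action blocking) can be sketched as a thin wrapper around a tool callable. This is a minimal illustration, not the FIT v2 codebase: `FaultyTool`, `ToolUnavailable`, and all parameter names are hypothetical.

```python
import random


class ToolUnavailable(Exception):
    """Raised in place of a real tool call while an injected fault is active."""


class FaultyTool:
    """Wraps a tool callable and injects failures during a window of calls."""

    def __init__(self, tool, start_step, duration, block_probability=1.0, rng=None):
        self.tool = tool
        self.start_step = start_step          # first call index affected
        self.duration = duration              # number of consecutive calls affected
        self.block_probability = block_probability
        self.rng = rng or random.Random()
        self.step = 0

    def __call__(self, *args, **kwargs):
        step = self.step
        self.step += 1
        in_window = self.start_step <= step < self.start_step + self.duration
        # block_probability=1.0 models a hard outage; 0.5 models intermittent blocking.
        if in_window and self.rng.random() < self.block_probability:
            raise ToolUnavailable(f"injected fault at step {step}")
        return self.tool(*args, **kwargs)


# Hard outage lasting two calls, starting at the second call.
outage = FaultyTool(lambda x: x * 2, start_step=1, duration=2)
# Intermittent blocking with 50% probability over three calls.
jitter = FaultyTool(lambda x: x * 2, start_step=0, duration=3,
                    block_probability=0.5, rng=random.Random(0))
```

Wrapping at the call boundary keeps the agent code unchanged, so the same agent can be run with and without faults.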
4.4. Eval-Driven Memory (EDM) Architecture
Algorithm 1: EDM Selective Consolidation Protocol
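Since the algorithm listing did not survive extraction, the following is only a sketch of what selective consolidation could look like, assuming episodes are gated on PEI and TI scores. The gating metrics, the threshold values in the example, and the names `Episode` and `SelectiveMemory` are assumptions, not the authors' protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    episode_id: int
    pei: float          # planning efficiency score for the episode
    ti: float           # traceability score, assumed normalized to [0, 1]
    trajectory: list


@dataclass
class SelectiveMemory:
    min_pei: float      # hypothetical threshold, supplied by the caller
    min_ti: float       # hypothetical threshold, supplied by the caller
    episodes: list = field(default_factory=list)

    def consolidate(self, episode):
        """Store the episode only if it clears both quality thresholds."""
        keep = episode.pei >= self.min_pei and episode.ti >= self.min_ti
        if keep:
            self.episodes.append(episode)
        return keep


memory = SelectiveMemory(min_pei=0.8, min_ti=0.8)   # illustrative values only
stored = memory.consolidate(Episode(1, pei=0.95, ti=0.90, trajectory=["plan", "act"]))
rejected = memory.consolidate(Episode(2, pei=0.50, ti=0.90, trajectory=["retry"]))
```

Gating at write time, rather than filtering at retrieval time, is what keeps low-quality trajectories out of the store entirely.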
- Memory Precision (MP): Ratio of retrieved experiences that meet the quality criteria used for consolidation
- Memory Retention Stability (MRS): Standard deviation of PEI across repeated cycles
- Cognitive Efficiency Ratio (CER): Reasoning step reduction
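The three memory metrics above can be computed along the following lines. This is a sketch under stated assumptions: MP is taken as the fraction of retrieved experiences flagged as meeting the quality criteria, MRS as the population standard deviation of PEI across cycles, and CER as steps with memory divided by steps without (so that 0.75 corresponds to a 25% step reduction). The exact definitions in the source were lost in extraction.

```python
from statistics import pstdev


def memory_precision(quality_flags):
    """Fraction of retrieved experiences that meet the quality criteria."""
    return sum(quality_flags) / len(quality_flags)


def retention_stability(pei_per_cycle):
    """Standard deviation of PEI across repeated cycles (lower is more stable)."""
    return pstdev(pei_per_cycle)


def cognitive_efficiency_ratio(steps_with_memory, steps_without_memory):
    """Ratio of reasoning steps with memory to steps without it."""
    return steps_with_memory / steps_without_memory


# Hypothetical values for illustration only.
mp = memory_precision([True, True, False, True])      # 0.75
mrs = retention_stability([0.90, 0.92, 0.88, 0.90])
cer = cognitive_efficiency_ratio(6, 8)                # 0.75, i.e. 25% fewer steps
```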
4.5. Adaptive Planning Control Policy
Algorithm 2: PEI-Guided Adaptive Control Loop
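The algorithm listing is not recoverable, so the loop below is a minimal sketch of the idea stated elsewhere in the paper: PEI acts as a control signal, and replanning is triggered when it drops below a threshold. The callables `execute_step`, `estimate_pei`, and `replan`, and the threshold value, are hypothetical placeholders.

```python
def run_with_replanning(plan, execute_step, estimate_pei, replan,
                        pei_threshold, max_steps):
    """Execute plan steps and replan whenever estimated PEI falls below threshold."""
    remaining = list(plan)          # copy so the caller's plan is not mutated
    history = []
    replans = 0
    while remaining and len(history) < max_steps:
        action = remaining.pop(0)
        history.append(execute_step(action))
        if estimate_pei(history) < pei_threshold:
            # Efficiency degraded: discard the rest of the plan and build a new one.
            remaining = list(replan(history))
            replans += 1
    return history, replans


# Toy run: efficiency dips after the first step, which triggers one replan.
history, replans = run_with_replanning(
    plan=["a", "b", "c"],
    execute_step=lambda action: action,
    estimate_pei=lambda h: 0.5 if len(h) == 1 else 0.9,
    replan=lambda h: ["x", "y"],
    pei_threshold=0.7,              # illustrative value, not from the source
    max_steps=10,
)
```

The `max_steps` bound guarantees termination even if replanning keeps producing new steps.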
4.6. HCI-EDM: Trust Calibration Through Evidence
- Trigger Detection: Identify PEI drop or recovery event
- Evidence Retrieval: Query the top-3 most relevant episodes by cosine similarity
- Template Filling: "Because in episode #X (PEI=Y, similar context Z), recovery succeeded via strategy W"
- Natural Language Rendering: Generate a concise explanation within a fixed word limit
- Success Confirmation: "Reusing proven plan #204 (PEI=0.98, completed in 4 steps)"
- Drift Correction: "Detected PEI drop 0.91 → 0.63, switched to recovery strategy from episode #89"
- Recovery Narrative: "Tool failed (as in episode #156 FRR context). Applied stored recovery sequence"
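The retrieval and template-filling steps can be sketched in a few lines. The template string follows the wording given above; the episode fields, the toy embeddings, and the function names are assumptions for illustration, and the similarity and quality cut-offs from the source are omitted because their values were lost.

```python
from math import sqrt

TEMPLATE = ("Because in episode #{episode_id} (PEI={pei:.2f}, similar context "
            "{context}), recovery succeeded via strategy {strategy}")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_episodes(query_vec, episodes, k=3):
    """Return the k stored episodes most similar to the query embedding."""
    ranked = sorted(episodes,
                    key=lambda ep: cosine(query_vec, ep["embedding"]),
                    reverse=True)
    return ranked[:k]


def render_explanation(episode):
    """Fill the explanation template from one retrieved episode."""
    return TEMPLATE.format(episode_id=episode["episode_id"], pei=episode["pei"],
                           context=episode["context"], strategy=episode["strategy"])


# Hypothetical memory contents with 2-D toy embeddings.
episodes = [
    {"episode_id": 204, "pei": 0.98, "context": "route optimization",
     "strategy": "reuse stored plan", "embedding": [1.0, 0.0]},
    {"episode_id": 89, "pei": 0.91, "context": "routing failure",
     "strategy": "geocoding fallback", "embedding": [1.0, 1.0]},
    {"episode_id": 156, "pei": 0.85, "context": "tool timeout",
     "strategy": "stored recovery sequence", "embedding": [0.0, 1.0]},
    {"episode_id": 12, "pei": 0.80, "context": "unrelated task",
     "strategy": "none", "embedding": [-1.0, 0.0]},
]
evidence = top_k_episodes([1.0, 0.0], episodes)
explanation = render_explanation(evidence[0])
```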
| Metric | CoT Baseline | HCI-EDM | Improvement | p-value |
|---|---|---|---|---|
| Trust Score (1-5) | | | | |
| Transparency Index | | | | |
| Cognitive Load (seconds) | | | | |
| Error Detection Ratio | | | | |
5. Experimental Design: Foundational Validation
5.1. Phase 1 Scope and Rationale
- Metrics are computationally tractable and exhibit desired psychometric properties
- Fault injection protocols produce interpretable behavioral signals
- Integrated architecture achieves statistically significant improvements
- Failure modes can be systematically characterized
5.2. Domain Selection: Triangulated Stress Testing
5.2.1. Healthcare (150 Episodes): Safety-Critical Constraints
- Detecting contradictory medical information
- Conservative decision-making under uncertainty
- Mandatory safe escalation when confidence is low
"Recommend treatment protocol for patient with hypertension and diabetes. Check drug interactions and dosages. Available tools: drug_database, interaction_checker, dosage_calculator."
5.2.2. Logistics (200 Episodes): Dynamic State Complexity
- Multi-constraint optimization (time, cost, resources)
- Adaptive replanning when routes become unavailable
- Efficient recovery from tool failures
- Simple: 3-4 steps, single tool, no dependencies
- Medium: 5-7 steps, multiple tools, sequential dependencies
- Complex: 8-10 steps, multiple tools, parallel dependencies with timing constraints
"Optimize delivery route for 3-city network (Cairo → Alexandria → Giza) minimizing fuel cost within 4-hour window. Available: routing_api, traffic_api, fuel_calculator."
5.2.3. Coding (150 Episodes): Strict Logic Constraints
- Precise reasoning with no ambiguity tolerance
- Security-critical decision making
- Validation before deployment (test-driven recovery)
"Patch SQL injection vulnerability in authentication module. Add input validation and create regression tests. Available: code_interpreter, linter, test_runner, security_analyzer."
- Risk Axis: Healthcare (high) → Logistics (medium) → Coding (deterministic)
- Complexity Axis: Logistics (dynamic state) → Healthcare (multi-constraint) → Coding (strict logic)
- Verification Axis: Coding (objective) → Logistics (measurable) → Healthcare (expert judgment)
5.3. Agent Architectures
5.3.1. ReAct Baseline
- Single-pass reasoning-action loop
- No persistent memory between episodes
- Fixed prompt template: "Thought: [reasoning] Action: [action]"
- Max retries: 3 with exponential backoff
- Rationale: Represents minimal viable agent; establishes baseline fragility
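The retry policy in the baseline ("Max retries: 3 with exponential backoff") can be sketched as follows. The base delay of one second and the injectable `sleep` argument are assumptions added so the example is testable; the source does not state the delay schedule.

```python
import time


def call_with_retries(tool, request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call a tool, retrying on failure with exponentially growing delays."""
    for attempt in range(max_retries + 1):      # one initial try plus max_retries
        try:
            return tool(request)
        except Exception:
            if attempt == max_retries:
                raise                           # retries exhausted: surface the error
            sleep(base_delay * 2 ** attempt)    # 1x, 2x, 4x the base delay


# Toy tool that fails twice, then succeeds.
_calls = {"n": 0}


def flaky_tool(request):
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise RuntimeError("timeout")
    return request.upper()


delays = []
result = call_with_retries(flaky_tool, "ok", sleep=delays.append)
```

Retries of this kind help with short transient outages but do nothing for persistent or cascading faults, which is consistent with the baseline's role as a reference point for fragility.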
5.3.2. Reflexion Baseline
- Episodic memory within 8K context window
- Self-reflection after failures: "Reflection: [analysis] Revised Plan: [approach]"
- Maximum 3 reflection cycles per episode
- No external memory persistence
- Rationale: State-of-the-art self-correction; tests whether iterative refinement improves resilience
5.3.3. AP-EDM (Integrated Architecture)
- External memory: FAISS vector store + SQL for structured metadata
- Adaptive planning: PEI-triggered replanning when PEI falls below a threshold
- Selective consolidation: Store only trajectories that meet the quality thresholds
- Confidence-bounded retrieval: Safe halt when retrieval confidence falls below a threshold
- Safety guardrails: Pre-execution validation for healthcare/financial tasks
- Rationale: Operationalizes HB-Eval metrics as control signals
- Comprehensive ablation studies (Section 6.3) isolating each component’s contribution
- Reporting all computational costs (latency, memory, API calls)
- Detailed failure analysis revealing systematic limitations
- Testing on domains not seen during architecture development
5.4. Experimental Procedure
- Pre-Episode Setup:
  - Load task specification and domain constraints
  - Retrieve the oracle-verified optimal path and safety requirements
  - Initialize agent architecture with clean state
- Fault Injection:
  - Randomly select a fault type from the FIT v2 fault set (uniform distribution)
  - Randomly select injection timestep (weighted: early 36%, mid 44%, late 20%)
  - Configure FIT v2 wrapper with fault parameters
- Execution:
  - Run the agent until completion or the maximum step limit
  - Log all interactions: states, reasoning traces, actions, observations, tool calls
  - Capture timing information and error states
- Metrics Computation:
  - Compute SR (task success)
  - Compute FRR (graded recovery score)
  - Compute PEI using the oracle path and quality factors
  - Compute TI via calibrated GPT-4o judge
- Human Validation:
  - Random 20% subsample (100 episodes)
  - Three independent experts score TI and verify safety compliance
  - Adjudicate any conflicts between automated and human assessments
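The fault-sampling step of the procedure (uniform over fault types, weighted over injection phase) can be sketched with the standard library. The fault labels below are hypothetical names derived from the mechanism column of the fault table, since the original identifiers were lost; the phase weights are the ones stated in the procedure.

```python
import random
from collections import Counter

# Hypothetical labels for the six fault mechanisms in the FIT v2 table.
FAULT_TYPES = [
    "tool_failure", "contradictory_data", "intermittent_blocking",
    "compound_fault", "adversarial_response", "sequential_failures",
]
PHASES = ["early", "mid", "late"]
PHASE_WEIGHTS = [0.36, 0.44, 0.20]      # as stated in the procedure


def sample_fault(rng):
    """Draw one (fault type, injection phase) pair for an episode."""
    fault = rng.choice(FAULT_TYPES)                              # uniform
    phase = rng.choices(PHASES, weights=PHASE_WEIGHTS, k=1)[0]   # weighted
    return fault, phase


rng = random.Random(0)                  # fixed seed for reproducible configurations
draws = [sample_fault(rng) for _ in range(2000)]
phase_counts = Counter(phase for _, phase in draws)
```

Seeding the generator per run makes the fault schedule reproducible, which matters when the same episodes are replayed across architectures.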
- Hardware: 4× NVIDIA A100 80GB GPUs
- Cloud provider: Google Cloud Platform (us-central1)
- Total execution time: 68 hours
- Estimated cost: $2,180 (68 hours at roughly $32/hour for the 4-GPU node)
| Architecture | SR (%) | FRR (%) | PEI | TI | Latency (s) | Memory (MB) |
|---|---|---|---|---|---|---|
| ReAct | | | | | | |
| Reflexion | | | | | | |
| AP-EDM | | | | | | |
6. Results: Empirical Validation
6.1. Aggregate Performance Comparison
6.2. Statistical Significance Analysis
| Comparison | t-statistic | p-value | Cohen’s d |
|---|---|---|---|
| AP-EDM vs ReAct | 54.32 | | 3.28 (very large) |
| AP-EDM vs Reflexion | 29.18 | | 1.94 (large) |
| Reflexion vs ReAct | 31.47 | | 1.98 (large) |
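For readers who want to reproduce the effect-size column, Cohen's d for two independent samples is the difference in means divided by the pooled standard deviation. The sketch below assumes independent samples and uses toy numbers; whether the reported comparisons were paired or independent is not recoverable from the source.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = ((n_a - 1) * variance(sample_a) +
                  (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


# Toy per-episode scores, for illustration only.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

Here both groups have variance 4 and the means differ by 1, so d is 0.5.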
6.3. Cross-Domain Consistency
| Domain | ReAct | Reflexion | AP-EDM |
|---|---|---|---|
| Logistics (n = 200) | | | |
| Healthcare (n = 150) | | | |
| Coding (n = 150) | | | |
| Cross-Domain Variance | 1.2 | 1.4 | 0.9 |
6.4. Ablation Studies: Isolating Component Contributions
- Memory Alone: +31% FRR. Demonstrates the value of experience reuse, but is insufficient alone
- Quality Filtering: Additional +15%. Selective consolidation is critical for avoiding "memory pollution"
- Confidence Bounds: Additional +7%. Prevents overconfident failures through Safe Halt
- Safety Guardrails: Additional +5%. Essential for safety-critical healthcare/financial domains
| Configuration | FRR (%) | PEI | TI | FRR Gain from Baseline |
|---|---|---|---|---|
| ReAct Baseline | | | | n/a |
| + Memory (unfiltered) | | | | |
| + Memory + Quality Filter | | | | |
| + Memory + Confidence Bounds | | | | |
| + Memory + Quality + Confidence | | | | |
| Full AP-EDM (+ Safety) | | | | |
6.5. Hierarchical Failure Dynamics
6.5.1. Cascade Failures (9 Cases, 37.5%)
Task: Optimize 3-city delivery route
Fault: routing_api fails → geocoding fallback times out
Agent Behavior: AP-EDM queried memory for "routing failure" → retrieved strategy "use geocoding fallback." When geocoding also failed, no tertiary strategies matched.
Outcome: Deadlock; agent entered loop requesting unavailable tools.
Root Cause: Memory lacks "graceful degradation when all tools unavailable" contingencies.
- Perception Layer: Initial tool failure detected correctly
- Planning Layer: Memory retrieval successful, fallback strategy identified
- Execution Layer: Fallback also fails, no tertiary contingency exists
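One way to avoid the deadlock described in this case is to make the fallback chain finite and end it with an explicit Safe Halt instead of another retrieval attempt. The sketch below illustrates that idea only; it is not the authors' implementation, and `SafeHalt`, `call_with_fallbacks`, and the toy tools are hypothetical.

```python
class SafeHalt(Exception):
    """Signals that no tool in the chain is available and a human should take over."""


def call_with_fallbacks(tools, request):
    """Try each (name, tool) pair in order; halt safely if all of them fail."""
    errors = []
    for name, tool in tools:
        try:
            return name, tool(request)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    # Chain exhausted: escalate instead of looping on unavailable tools.
    raise SafeHalt("all tools unavailable (" + "; ".join(errors) + ")")


def routing_api(request):
    raise RuntimeError("HTTP 500")


def geocoding_fallback(request):
    raise TimeoutError("timed out")


def cached_routes(request):
    return f"cached route for {request}"


used, route = call_with_fallbacks(
    [("routing_api", routing_api), ("geocoding", geocoding_fallback),
     ("cache", cached_routes)],
    "Cairo-Alexandria-Giza",
)
```

With only the first two tools in the chain, the same call raises `SafeHalt`, which corresponds to the escalation behavior the failure analysis finds missing.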
6.5.2. Out-of-Distribution Task Shifts (11 Cases, 45.8%)
Task: Recommend treatment for rare genetic disorder (prevalence 1:100,000)
Fault: Drug database contains conflicting efficacy data
Agent Behavior: EDM retrieved high-confidence (0.91) prior from common disease domain: "when data conflicts, use most recent publication." Applied to rare disease without considering sample size limitations.
Outcome: Recommended treatment based on an underpowered study.
Root Cause: Embedding similarity fails to distinguish between common vs. rare disease epistemology.
- Retrieval Layer: Semantic match successful (both involve "conflicting medical data")
- Contextualization Layer: Failed to recognize domain-specific constraints (statistical power requirements for rare diseases)
- Application Layer: Applied strategy without epistemological validation
6.5.3. Ambiguous Prior Selection (4 Cases, 16.7%)
Task: Refactor authentication module for microservices architecture
Fault: Random API documentation access failures
Agent Behavior: Memory returned two high-confidence priors (0.88, 0.86): (1) "Use JWT for stateless auth" (2) "Use session cookies for compatibility." Agent oscillated between strategies across retries.
Outcome: Mixed implementation (JWT in some services, cookies in others), breaking authentication flow.
Root Cause: Tie-breaking mechanism defaults to most recent retrieval, not most contextually appropriate.
6.6. EDM Memory Quality Validation
| Metric | Flat Memory | EDM (Selective) | Improvement |
|---|---|---|---|
| Memory Precision (MP) | 45% | 88% | |
| Retention Stability (MRS) | | 0.08 | (lower is better) |
| Cognitive Efficiency Ratio (CER) | | 0.75 | (25% step reduction) |
- MP = 88%: Demonstrates selective storage successfully eliminates noise
- MRS = 0.08: Low deviation confirms stable long-term learning without drift
- CER = 0.75: Reliable retrieval reduces reasoning burden by 25%
6.7. Cost-Reliability Tradeoff
| Architecture | FRR | Latency (s) | API Calls | Cost/Episode | Cost per % FRR |
|---|---|---|---|---|---|
| ReAct | | | | | |
| Reflexion | | | | | |
| AP-EDM | | | | | |
7. Discussion: Implications and Future Directions
7.1. Addressing the Reliability-Capability Disconnect
- Success Rate (SR): Capability under nominal conditions
- Failure Resilience Rate (FRR): Reliability under stressed conditions
7.2. The Intentionality Principle
7.3. Safe Halt as Positive Safety Outcome
7.4. Trust Calibration and Ethical Accountability
7.4.1. Algorithmic Liability
- Developer Liability: If failure mode was present in certification testing but not disclosed
- Deployer Liability: If agent used outside certified tier or domain
- Shared Liability: If failure represents genuinely novel OOD condition
7.4.2. Trust Over-Calibration Risk
- Display episode age and context similarity scores
- Explicitly flag when retrieved strategies are cross-domain
- Require human confirmation for high-stakes decisions even when confidence is high
7.4.3. Fairness and Bias in Memory
- Periodic audits of memory diversity (domain coverage, demographic representation)
- Explicit diversity thresholds in consolidation: ensure representation per category
- Adversarial testing for bias amplification
7.5. Empirical Bounds on Achievable Reliability
- Irreducible OOD failures: 4.6%
- Cascade failures exceeding system capacity: 1.0%
- Tie-break ambiguities: 0.2%
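Adding the three residual categories above gives the reliability ceiling implied by this taxonomy, and the result matches the 94.2% FRR reported for the integrated architecture. A worked version of the arithmetic:

```latex
\begin{align*}
\text{residual failure rate} &= 4.6\% + 1.0\% + 0.2\% = 5.8\% \\
\text{implied FRR ceiling}   &= 100\% - 5.8\% = 94.2\%
\end{align*}
```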
7.6. Proposed Certification Framework
7.6.1. Tier-1: Operational (Basic Autonomy)
- No transparency requirement
7.6.2. Tier-2: Transparent (Supervised Autonomy)
- via HCI-EDM
- Documented recovery patterns in memory
7.6.3. Tier-3: Frontier (Safety-Critical Autonomy)
- with documented Safe Halt protocols
- with human trust score
- EDM auditability (all decisions traceable to certified episodes)
- Mandatory Safe Halt when confidence OR safety risk
- (demonstrating intentional, not stochastic, recovery)
8. Limitations and Scope Boundaries
8.1. Dataset Scale: Phase 1 Constraints
- Spatial/Geometric Reasoning: No navigation or map-based planning beyond API calls
- Long-Horizon Planning: Maximum observed episode length was 10 steps; lacks 50+ step scenarios
- Ethical Dilemmas: No value alignment or moral reasoning tasks
- Multi-Agent Coordination: All tasks are single-agent; lacks collaborative or competitive scenarios
- Multi-Modal Reasoning: Text-only; no image/video processing or embodied tasks
8.2. Architectural Bias and Independent Validation
- Ablation studies isolate component contributions, demonstrating algorithmic logic validity
- Reported all computational costs (38% latency overhead, 479% memory overhead)
- Comprehensive failure analysis (24 cases) revealing systematic limitations
- Testing on domains not seen during architecture development
8.3. Fault Injection Coverage
- Byzantine Faults: Arbitrary/malicious tool responses beyond simple errors
- Model Degradation: Concept drift or mid-deployment model updates
- Resource Exhaustion: Memory/compute limits, quota violations
- Adversarial Prompts: Jailbreaking, prompt injection attacks
- Temporal Faults: Time-sensitive failures (deadlines, race conditions)
8.4. Human Evaluation Scale
8.5. Computational Resource Inequality
8.6. Long-Term Deployment Stability
- Memory drift over months/years of operation
- Adaptation to evolving task distributions
- Cumulative effects of memory consolidation errors
- Human trust calibration over extended interaction periods
9. Future Work and Research Roadmap
9.1. Phase 2: Industrial-Scale Validation
9.1.1. Dataset Expansion
- Spatial Navigation: Embodied tasks requiring map reasoning, path planning in physical environments
- Multi-Modal Reasoning: Image-based diagnostics, video analysis, cross-modal retrieval
- Long-Horizon Planning: 50+ step scenarios (project management, complex troubleshooting)
- Value Alignment: Ethical dilemmas, fairness-constrained decisions, human preference modeling
- Simple: 3-5 steps (40% of dataset)
- Medium: 6-15 steps (40% of dataset)
- Complex: 16-50 steps (15% of dataset)
- Ultra-Complex: 50+ steps (5% of dataset)
9.1.2. Multi-Agent Certification
- How does individual FRR predict system-level CRR?
- Can high-FRR agents compensate for low-FRR teammates?
- What is optimal team composition (specialized vs. generalist)?
- How does shared memory (federated EDM) affect collective resilience?
9.2. Reinforcement Learning from Reliability Signals
- The PEI term rewards efficient planning
- The TI term rewards transparent reasoning
- The FRR term rewards resilience
- The term weights are domain-specific
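The reward formula itself did not survive extraction. As a rough sketch of how three reliability signals with domain-specific weights might be combined, the example below assumes a simple weighted sum of an efficiency term, a transparency term, and a resilience term. The linear form, the function name, and the weight values are all assumptions.

```python
def reliability_reward(pei, ti, frr, weights):
    """Weighted sum of efficiency, transparency, and resilience signals."""
    w_pei, w_ti, w_frr = weights
    return w_pei * pei + w_ti * ti + w_frr * frr


# Hypothetical domain weightings: a safety-critical domain might weight
# transparency and resilience more heavily than raw efficiency.
healthcare_weights = (0.25, 0.25, 0.5)
reward = reliability_reward(pei=1.0, ti=0.5, frr=1.0, weights=healthcare_weights)
```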
9.3. Adversarial Robustness Integration
9.4. Standardization and Community Adoption
9.4.1. Open Benchmark Release
- 1,000+ tasks (current 500 + 500 Phase 2 tasks)
- Pre-computed oracle-optimal paths for all tasks with expert verification logs
- Standardized FIT v2 codebase with fault injection library
- Baseline results (ReAct, Reflexion, AP-EDM) with full experimental logs
- Docker containers for reproducibility
- Human annotation interface for TI scoring
9.4.2. Certification Authority Formation
- Industry representatives (AI labs, deployment companies)
- Academic researchers (AI safety, HCI, reliability engineering)
- Regulatory advisors (FDA, NHTSA, financial regulators)
- Ethics experts (fairness, accountability, transparency)
9.5. Human-AI Collaboration Optimization
- What are optimal threshold values per domain?
- How do humans calibrate trust in FRR-rated agents over time?
- Can HB-Eval metrics predict when human intervention is necessary vs. beneficial?
- How does explanation quality (HCI-EDM) affect human oversight effectiveness?
9.6. Longitudinal Deployment Studies
- Partner with 3 organizations across logistics, healthcare, software development
- Deploy Tier-2 certified agents in production environments
- Track reliability degradation, intervention frequency, user satisfaction
- Measure adaptation to evolving task distributions
- Analyze memory drift and consolidation errors
- Maintain throughout deployment period
- Human intervention rate of episodes
- User trust score remains
- Zero safety-critical failures requiring emergency shutdown
10. Conclusion
10.1. Key Contributions
- Evaluation Methodology: Established rigorous protocols for behavioral stress testing with calibrated metrics validated against human expert judgment (oracle agreement, judge calibration, Pearson correlation).
- Empirical Characterization: Quantified the reliability-capability disconnect across three strategically selected domains, demonstrating large to very large effect sizes (Cohen’s d from 1.94 to 3.28) for memory-augmented architectures versus baselines.
- Integrated Architecture: Introduced a closed-loop resilience system combining Eval-Driven Memory (88% precision), Adaptive Planning (PEI-guided control), and Human-Centered Explainability (91% transparency, 4.62/5.0 trust), achieving 94.2% FRR, with comprehensive ablation validating component contributions.
- Failure Taxonomy: Characterized 24 systematic failure modes across hierarchical propagation patterns (cascade 37.5%, OOD 45.8%, ambiguous 16.7%), establishing empirical bounds on achievable reliability and identifying conditions requiring human oversight.
- Certification Framework: Proposed three-tier deployment standards (Operational, Transparent, Frontier) with explicit threshold requirements, providing foundation for community consensus on acceptable reliability levels per application domain.
10.2. Theoretical Insights
10.3. Broader Impact
- Risk-Informed Deployment: Evidence-based decisions about when agents are "safe enough" for production
- Regulatory Compliance: Audit trails for safety-critical applications supporting emerging AI governance frameworks
- Research Prioritization: Ablation studies reveal memory governance yields greater gains than planning algorithms alone
- Human-AI Collaboration: FRR bounds establish where human oversight transitions from beneficial to mandatory [23]
10.4. Call for Community Validation
- Independent Architecture Testing: Apply HB-Eval to systems designed without knowledge of metrics
- Scale Expansion: Phase 2 validation on 1,500+ tasks across spatial reasoning, long-horizon planning, multi-agent coordination
- Adversarial Robustness: Extend fault taxonomy to Byzantine faults, prompt injection, resource exhaustion
- Longitudinal Deployment: 6-month field trials tracking reliability degradation in production environments
- Certification Standards: Multi-stakeholder consortium establishing field-wide threshold consensus
10.5. Final Reflection
Reproducibility Statement

Datasets
- 500 episodes with full task specifications
- Oracle-verified optimal paths with expert agreement logs
- Fault injection configurations and timing distributions
- Human annotation data (TI scores, safety assessments)
- Execution traces for all 1,500 trials (500 episodes × 3 architectures)
Pre-Trained Models
- Calibrated GPT-4o judge with prompt templates
- EDM memory stores (FAISS indices + SQL metadata)
- Baseline checkpoints (ReAct, Reflexion configurations)
Computational Requirements
- Minimum: 1× NVIDIA RTX 3090 (24GB VRAM)
- Recommended: 4× NVIDIA A100 (80GB VRAM each)
- Estimated runtime: 18-20 hours (minimum) / 68 hours (recommended)
- Cloud cost estimate: $600-$2,200 depending on hardware
Acknowledgments
References
- S. Bubeck et al., "Sparks of Artificial General Intelligence: Early experiments with GPT-4," arXiv preprint arXiv:2303.12712, 2023.
- S. Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models," arXiv preprint arXiv:2210.03629, 2022.
- N. Shinn et al., "Reflexion: an autonomous agent with dynamic memory and self-reflection," arXiv preprint arXiv:2303.11366, 2023.
- X. Liu et al., "AgentBench: Evaluating LLMs as Agents," arXiv preprint arXiv:2308.03688, 2023.
- G. Mialon et al., "GAIA: A Benchmark for General AI Assistants," arXiv preprint arXiv:2311.12983, 2023.
- S. Zhou et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents," arXiv preprint arXiv:2307.13854, 2023.
- D. Amodei et al., "Concrete Problems in AI Safety," arXiv preprint arXiv:1606.06565, 2016.
- S. Gupta and R. Sharma, "Ethical Concerns in Metric-Driven Autonomous Agents: Bias, Drift, and Control," IEEE Transactions on AI Ethics, vol. 3, no. 4, pp. 301–315, 2025.
- J. Arlat et al., "Fault Injection for Dependability Validation: A Methodology and Some Applications," IEEE Transactions on Software Engineering, vol. 16, no. 2, pp. 166–182, 1990.
- W. Zhong et al., "MemoryBank: Enhancing Large Language Models with Long-Term Memory," AAAI, 2024.
- J. S. Park et al., "Generative Agents: Interactive Simulacra of Human Behavior," arXiv preprint arXiv:2304.03442, 2023.
- B. Wang et al., "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models," NeurIPS, 2023.
- Z. Durumeric et al., "The Matter of Heartbleed," IMC, 2014.
- M. Shukla, "Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems," arXiv preprint arXiv:2509.00115, 2025.
- M. Turpin et al., "Language Models Don’t Always Say What They Think," NeurIPS, 2023.
- Z. C. Lipton, "The Mythos of Model Interpretability," Queue, vol. 16, no. 3, pp. 31–57, 2018.
- J. Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," NeurIPS, 2022.
- T. Schaul et al., "Prioritized Experience Replay," ICLR, 2016.
- D. Hendrycks and T. Dietterich, "Benchmarking Neural Network Robustness to Common Corruptions and Perturbations," ICLR, 2019.
- D. Hendrycks et al., "Natural Adversarial Examples," CVPR, 2021.
- L. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena," NeurIPS, 2023.
- S. Wiegreffe and Y. Pinter, "Attention is not not Explanation," EMNLP, 2019.
- M. F. H. Schöller and S. J. Russell, "Human-Artificial Interaction in the Age of Agentic AI: A System-Theoretical Approach," arXiv preprint arXiv:2502.14000, 2025.
- A. Shukla and R. Patel, "Evaluating Trust in Autonomous Medical Diagnostic Systems," Journal of Medical AI, 2025.
- E. Karatas, "Privacy by Design in AI Agent Systems," Medium article, 2025.
- P. Liang et al., "Holistic Evaluation of Language Models," arXiv preprint arXiv:2211.09110, 2023.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

