Submitted: 09 January 2026
Posted: 13 January 2026
Abstract
Keywords:
1. Introduction
1.1. Motivation: The Oversight Challenge in Autonomous Agents
1.2. Limitations of Current Interpretability Approaches
1.3. Research Question and Proposed Approach
1.4. Contributions
1. Conceptual framework: We introduce Performance-Grounded Interpretability as a design principle that separates evidence exposure from justification generation, positioning interpretability as a mechanism for surfacing documented performance rather than generating narratives. We provide formal definitions distinguishing this approach from traditional explainable AI.
2. System implementation: We present HCI-EDM, a four-stage pipeline that generates explanations by querying evaluation-certified episodes from memory, with architectural constraints that prevent explanation generation when no qualified precedent exists. We describe the integration with evaluation-driven memory architectures.
3. Controlled evaluation: We provide empirical evidence from simulated oversight scenarios ( episodes) indicating that performance-grounded explanations may improve trust calibration metrics and reduce decision time compared to chain-of-thought baselines under controlled conditions. We report transparency metrics quantifying the proportion of verifiable explanation claims.
1.5. Scope and Limitations
2. Related Work
2.1. Explainability in Language Models
2.2. Trajectory-Based and Example-Based Explanations
2.3. Model-Centric XAI
2.4. Episodic Memory in Agents
2.5. Positioning This Work
3. Background: The Evaluation-Driven Architecture Stack
- Planning Efficiency Index (PEI): computed as the ratio of executed actions to optimal plan length, providing a normalized measure of plan quality
- Failure Recovery Rate (FRR): success rate in recovering from tool failures or constraint violations during execution
- Transparency Index (TI): inspectability of execution traces on a scale from 1 to 5, measuring structural completeness of stored traces
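The metric definitions above can be made concrete with a minimal sketch. It is illustrative only and is not the HB-Eval implementation: the class and function names are hypothetical, and the orientation of PEI (optimal plan length divided by executed actions, capped at 1.0 so that higher is better) is an assumption chosen to be consistent with the later description of PEI dropping below a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeMetrics:
    """Per-episode scores; field names are illustrative, not from HB-Eval."""
    pei: float  # Planning Efficiency Index
    frr: float  # Failure Recovery Rate
    ti: int     # Transparency Index on the 1 to 5 scale


def planning_efficiency_index(executed_actions: int, optimal_plan_length: int) -> float:
    """Assumed orientation: optimal / executed, capped at 1.0 (higher is better)."""
    if executed_actions <= 0 or optimal_plan_length <= 0:
        raise ValueError("action counts must be positive")
    return min(1.0, optimal_plan_length / executed_actions)


def failure_recovery_rate(recovered: int, failures: int) -> float:
    """Fraction of tool failures or constraint violations the agent recovered from."""
    if failures < 0 or recovered < 0 or recovered > failures:
        raise ValueError("need 0 <= recovered <= failures")
    if failures == 0:
        return 1.0  # assumption: an episode with no failures counts as fully recovered
    return recovered / failures


metrics = EpisodeMetrics(
    pei=planning_efficiency_index(executed_actions=10, optimal_plan_length=8),
    frr=failure_recovery_rate(recovered=3, failures=4),
    ti=4,
)
```

Under these assumptions an agent that needs 10 actions for an 8-step optimal plan scores a PEI of 0.8, and recovering from 3 of 4 failures gives an FRR of 0.75.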
4. Performance-Grounded Interpretability: Principles and Design
4.1. Core Principle
1. Explanations reference specific execution traces with documented performance metrics computed independently of the explanation generation process
2. Humans can independently verify explanation claims by inspecting cited episodes and validating that stated metrics match stored episode data
3. When no qualified precedent exists in memory (no episodes meeting both similarity and quality thresholds), the system signals uncertainty rather than generating speculative justification
- Completeness of state-action sequences in stored traces
- Availability of intermediate artifacts and observations
- Reproducibility of performance metric computation from stored data
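The verification property stated in this section (an overseer checks that a cited metric matches the stored episode data) can be sketched as follows. This is a hedged illustration, not the system's verification interface: `StoredEpisode`, `ExplanationClaim`, and `verify_claim` are hypothetical names, and memory is modeled as a plain dictionary keyed by episode ID.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StoredEpisode:
    """Hypothetical record of an evaluated episode held in memory."""
    episode_id: str
    pei: float
    frr: float


@dataclass(frozen=True)
class ExplanationClaim:
    """A single quantitative claim made by an explanation (illustrative)."""
    episode_id: str
    metric: str   # "pei" or "frr"
    value: float


def verify_claim(claim: ExplanationClaim,
                 memory: Dict[str, StoredEpisode],
                 tolerance: float = 1e-9) -> bool:
    """Return True only if the cited episode exists and its stored metric matches."""
    episode = memory.get(claim.episode_id)
    if episode is None:
        return False  # the cited episode cannot be inspected
    if claim.metric not in ("pei", "frr"):
        return False  # unknown metric name, nothing to compare against
    stored_value = getattr(episode, claim.metric)
    return math.isclose(stored_value, claim.value, abs_tol=tolerance)


memory = {"ep-001": StoredEpisode("ep-001", pei=0.9, frr=1.0)}
ok = verify_claim(ExplanationClaim("ep-001", "pei", 0.9), memory)
```

A claim citing an episode that is absent from memory, or stating a value that differs from the stored one, fails verification rather than being accepted on the strength of the narrative.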
4.2. Distinction from Traditional XAI
- Traditional XAI (SHAP, attention visualization, LIME) provides model introspection—insight into features, decision boundaries, or internal representations that produce specific outputs. Primary use case: ML practitioners debugging models or validating feature usage. Question answered: “How does the model compute this output?”
- Performance-grounded interpretability provides execution evidence—documentation that strategies have succeeded in similar contexts with quantified metrics from evaluated episodes. Primary use case: overseers making trust decisions about behavioral reliability. Question answered: “Has demonstrated competence been documented for this type of decision?”
4.3. Architectural Constraints for Evidence-Based Explanations
5. HCI-EDM System Architecture
5.1. Overview


5.2. Stage 1: Trigger Detection
- PEI degradation: Planning efficiency drops below threshold by more than 0.2 units (PEI ), indicating potential performance issues
- Recovery activation: Agent switches from strategic planning mode to tactical recovery mode as defined in Adapt-Plan
- Unexpected action: Agent selects action outside predicted probability distribution from planning policy
- Human query: Overseer explicitly requests explanation through interface
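The four trigger conditions can be expressed as a single check per agent step. The sketch below is an assumption-laden illustration: the `StepState` fields, the mode labels `"strategic"` and `"tactical"`, and the probability cutoff used to decide that an action is unexpected are all hypothetical; only the 0.2 PEI margin comes from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Trigger(Enum):
    PEI_DEGRADATION = "pei_degradation"
    RECOVERY_ACTIVATION = "recovery_activation"
    UNEXPECTED_ACTION = "unexpected_action"
    HUMAN_QUERY = "human_query"


@dataclass(frozen=True)
class StepState:
    """Hypothetical snapshot of one agent step."""
    pei: float
    pei_threshold: float
    previous_mode: str
    current_mode: str
    action_probability: float  # probability the planning policy gave the chosen action
    human_requested: bool


PEI_MARGIN = 0.2               # margin stated in the text
MIN_ACTION_PROBABILITY = 0.05  # hypothetical cutoff for an "unexpected" action


def detect_triggers(state: StepState) -> List[Trigger]:
    """Return every trigger condition that holds for this step."""
    triggers: List[Trigger] = []
    if state.pei_threshold - state.pei > PEI_MARGIN:
        triggers.append(Trigger.PEI_DEGRADATION)
    if state.previous_mode == "strategic" and state.current_mode == "tactical":
        triggers.append(Trigger.RECOVERY_ACTIVATION)
    if state.action_probability < MIN_ACTION_PROBABILITY:
        triggers.append(Trigger.UNEXPECTED_ACTION)
    if state.human_requested:
        triggers.append(Trigger.HUMAN_QUERY)
    return triggers


degraded = StepState(0.5, 0.8, "strategic", "tactical", 0.5, False)
fired = detect_triggers(degraded)
```

Returning a list rather than the first match lets a downstream stage see that several conditions fired in the same step, for example PEI degradation together with recovery activation.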
5.3. Stage 2: Evidence Retrieval
Algorithm 1: Evidence Retrieval Protocol
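As an illustration of the retrieval step, the sketch below filters stored episodes by a similarity threshold and a quality threshold and returns an empty result when no qualified precedent exists. It is not a reconstruction of Algorithm 1: the cosine similarity over context vectors, the use of PEI as the quality criterion, the threshold values, and all names are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Episode:
    """Hypothetical stored episode with a context vector and a quality score."""
    episode_id: str
    context: Sequence[float]
    pei: float


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_evidence(query: Sequence[float],
                      memory: Sequence[Episode],
                      min_similarity: float = 0.8,  # hypothetical threshold
                      min_pei: float = 0.7,         # hypothetical threshold
                      k: int = 3) -> List[Episode]:
    """Return up to k episodes meeting both thresholds, most similar first.

    An empty list means no qualified precedent exists; the caller should
    signal uncertainty instead of generating a justification.
    """
    scored = [(cosine_similarity(query, ep.context), ep) for ep in memory]
    qualified = [(s, ep) for s, ep in scored if s >= min_similarity and ep.pei >= min_pei]
    qualified.sort(key=lambda pair: pair[0], reverse=True)
    return [ep for _, ep in qualified[:k]]


memory = [
    Episode("ep-a", [1.0, 0.0], pei=0.9),
    Episode("ep-b", [0.0, 1.0], pei=0.9),   # high quality but dissimilar
    Episode("ep-c", [1.0, 0.0], pei=0.3),   # similar but low quality
]
precedents = retrieve_evidence([1.0, 0.0], memory)
```

Only `ep-a` passes both filters here; a query with no similar, high-quality episode yields an empty list, which corresponds to the uncertainty signal described in Section 4.1.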
5.4. Stage 3: Template Instantiation
5.5. Stage 4: Surface Realization
- Preserve all quantitative values exactly as provided in template
- Do not add speculative claims beyond template content
- Include episode IDs for verifiability
- Avoid hedging language when evidence is certain (for Templates 1–3)
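The first three constraints lend themselves to an automatic check on the realized text. The following sketch compares the numbers and episode IDs in a template with those in the realized explanation; it assumes a hypothetical `ep-<digits>` episode ID format and does not attempt to check the hedging rule, which would need a linguistic criterion the text does not specify.

```python
import re
from typing import List

NUMBER = re.compile(r"\d+(?:\.\d+)?")
EPISODE_ID = re.compile(r"\bep-\d+\b")  # hypothetical episode ID format


def check_realization(template_text: str, realized_text: str) -> List[str]:
    """Return a list of constraint violations; an empty list means the text passes."""
    problems: List[str] = []
    template_numbers = set(NUMBER.findall(template_text))
    realized_numbers = set(NUMBER.findall(realized_text))
    # Every quantitative value in the template must survive unchanged.
    for missing in sorted(template_numbers - realized_numbers):
        problems.append(f"missing value {missing}")
    # A number absent from the template is treated as an added claim.
    for extra in sorted(realized_numbers - template_numbers):
        problems.append(f"added value {extra}")
    # Episode IDs must be carried over so the claim stays verifiable.
    template_ids = set(EPISODE_ID.findall(template_text))
    realized_ids = set(EPISODE_ID.findall(realized_text))
    for episode_id in sorted(template_ids - realized_ids):
        problems.append(f"missing episode ID {episode_id}")
    return problems


template = "Strategy succeeded in ep-0042 with PEI 0.91 and FRR 1.00."
good = "In episode ep-0042 this strategy reached a PEI of 0.91 and an FRR of 1.00."
bad = "This strategy reached a PEI of 0.9 in a past episode."
good_report = check_realization(template, good)
bad_report = check_realization(template, bad)
```

Comparing values as strings rather than floats is deliberate: rounding 0.91 to 0.9 changes the reported figure and is flagged, in line with the requirement to preserve values exactly as provided.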
6. Evaluation Methodology
6.1. Research Questions
6.2. Simulation Protocol
- Tool failure: routing API timeout requiring recovery strategy
- Data corruption: conflicting delivery time windows requiring constraint relaxation
- Resource constraint violation: vehicle capacity exceeded requiring reallocation
- Cascade failure: multiple tools unavailable simultaneously requiring alternative approach
- Baseline (CoT): Agent with chain-of-thought reasoning traces providing standard ReAct-style step-by-step explanations
- HCI-EDM: Agent with performance-grounded explanations referencing evaluated episodes with PEI/FRR metrics from memory
6.3. Oversight Simulation
6.4. Metrics
- Trust score: mean rating on 1–5 scale across all fault scenarios
- Decision comprehension time: time from explanation presentation to trust rating submission (proxy for cognitive load)
- Transparency index: proportion of explanations enabling independent verification through cited episode inspection
- Trust calibration: correlation between trust score and actual agent success rate
- Intervention rate: proportion of episodes where proxy model recommended manual oversight based on explanation
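Two of these metrics reduce to short computations. The sketch below computes trust calibration as a Pearson correlation between trust scores and success rates, and the intervention rate as a simple proportion. The pairing of one trust score with one success rate per scenario is an assumption about how the data are aggregated, and the function names are illustrative.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def intervention_rate(recommended_manual_oversight: Sequence[bool]) -> float:
    """Proportion of episodes in which manual oversight was recommended."""
    if not recommended_manual_oversight:
        raise ValueError("no episodes")
    return sum(recommended_manual_oversight) / len(recommended_manual_oversight)


# Toy values for illustration only; they are not results from the study.
trust_scores = [1.0, 2.0, 3.0]
success_rates = [0.2, 0.4, 0.6]
calibration = pearson_r(trust_scores, success_rates)
rate = intervention_rate([True, False, False, False])
```

A correlation near 1 indicates that trust rises and falls with actual success, which is the sense in which trust is described as calibrated; a high mean trust score alone does not imply this.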
7. Results
7.1. Primary Metrics
7.2. Trust Calibration
7.3. Explanation Type Distribution
8. Discussion
8.1. Interpretation of Results
8.2. Limitations
8.3. What This Work Does Not Claim
- Safety guarantees: HCI-EDM exposes certified behavior from evaluated episodes but cannot ensure that such behavior is safe, aligned with human values, or appropriate for current context. Past success does not guarantee future safety, particularly under distribution shift.
- Correctness proofs: Performance-grounded explanations document past success with quantified metrics but do not prove future reliability, correctness of decisions, or optimality of strategies. They provide evidence, not proofs.
- Universal superiority: Different interpretability needs may favor different approaches depending on context. PGI is designed for oversight scenarios where behavioral reliability assessment is primary, not all interpretability goals. Model debugging, fairness auditing, or pedagogical explanation may require different approaches.
- Production readiness: Controlled evaluation establishes proof-of-concept under constrained conditions; extensive deployment validation including human subject studies, domain adaptation, scalability testing, and failure mode analysis is essential before operational use.
- Human study validation: Results are based on simulated oversight with proxy models; real human subject studies with domain experts in operational settings are necessary to validate trust calibration, cognitive load effects, and intervention decision patterns.
- Deployment validation: The system has not been validated in operational environments with real stakes, time pressure, organizational constraints, or extended use periods that would reveal long-term trust dynamics and failure modes.
8.4. Relationship to Prior Work
- HB-Eval [10]: Provides the PEI, FRR, and TI metrics that HCI-EDM exposes in explanations. Without standardized evaluation producing these metrics, performance-grounded explanations would not be possible.
- Adapt-Plan [11]: Generates adaptive behaviors and recovery strategies that HCI-EDM explains by referencing precedents. The control layer creates the behavioral patterns that require explanation.
- EDM [9]: Creates the quality-filtered episode repository that HCI-EDM queries for explanation generation. Without performance-based memory consolidation, no certified precedents would exist to reference.
9. Future Work
9.1. Critical Next Steps
9.2. Broader Extensions
10. Conclusions
Acknowledgments
References
1. Wei, J.; Wang, X.; Schuurmans, D.; et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, 2022.
2. Shinn, N.; Labash, B.; Gopinath, A.; et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint arXiv:2303.11366, 2023.
3. Turpin, M.; Michael, J.; Bowman, S. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. arXiv preprint arXiv:2305.04388, 2023.
4. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, 2017.
5. Ribeiro, M.T.; Singh, S.; Guestrin, C. Why Should I Trust You?: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
6. Agarwal, S.; Niekum, S. T-REX: Trajectory-Based Explainable AI for Robot Learning. In Proceedings of the IEEE International Conference on Robotics and Automation, 2023.
7. Wang, Z.; et al. TRAIL: Transparent Reasoning through Agent Interpretable Logs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
8. Park, J.S.; et al. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the ACM Conference on User Interface Software and Technology, 2023.
9. Adam, A.M.I. Eval-Driven Memory (EDM): A Persistence Governance Layer for Reliable Agentic AI via Metric-Guided Selective Consolidation. Preprints.org, 2025.
10. Adam, A.M.I. HB-Eval: A System Level Reliability Evaluation and Certification Framework for Agentic AI. Preprints.org, 2025.
11. Adam, A.M.I. Adapt-Plan: A Hybrid Control Architecture for PEI-Guided Reliable Adaptive Planning in Dynamic Agentic Environments. Preprints.org, 2025.
| Metric | CoT | HCI-EDM | Δ (%) | p |
|---|---|---|---|---|
| Trust Score (1–5) | 3.87 ± 0.41 | 4.62 ± 0.28 | +19.4% | |
| Comprehension Time (s) | 42.3 ± 8.7 | 20.7 ± 5.2 | −51.1% | |
| Transparency Index | 0.43 | 0.91 | +111.6% | |
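
The relative-change column in the table above follows from the two condition means as (HCI-EDM − CoT) / CoT × 100. The short sketch below recomputes it from the reported means; the function name is illustrative, and the standard deviations and p-values are not used.

```python
def relative_change_percent(baseline: float, treatment: float) -> float:
    """Percentage change of the treatment mean relative to the baseline mean."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treatment - baseline) / baseline * 100.0


# (CoT mean, HCI-EDM mean) as reported in the table
reported_means = {
    "Trust Score": (3.87, 4.62),
    "Comprehension Time (s)": (42.3, 20.7),
    "Transparency Index": (0.43, 0.91),
}

changes = {
    name: relative_change_percent(cot, hci_edm)
    for name, (cot, hci_edm) in reported_means.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

This yields +19.4% for trust score, −51.1% for comprehension time, and +111.6% for the transparency index.
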

| Condition | Pearson r | Interpretation | Intervention Rate |
|---|---|---|---|
| CoT Baseline | 0.54 | Moderate | 23% |
| HCI-EDM | 0.82 | Strong | 8% |

| Type | Count | Trust Score | Verification Rate |
|---|---|---|---|
| Success Confirmation | 72 | 4.81 ± 0.19 | 94% |
| Drift Correction | 34 | 4.52 ± 0.31 | 91% |
| Uncertainty Signal | 14 | 3.21 ± 0.47 | 86% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).