Submitted:
28 August 2025
Posted:
03 September 2025
Abstract
Keywords:
1. Introduction
Contributions.
- We synthesise recent literature and industrial reports to demonstrate that technical metrics continue to dominate evaluations (83% of studies report capability metrics) while human-centred, safety and economic axes are often ignored [2].
- We propose AMDM, a practical algorithm that normalises heterogeneous metrics, applies per-axis adaptive thresholds via exponentially weighted moving averages (EWMA), and performs joint anomaly detection using the Mahalanobis distance. Concise pseudocode and a complexity analysis accompany the method.
- Through simulations and real-world logs we show that AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s and lowers the false-positive rate from 4.5% to 0.9% compared with static thresholds. We quantify sensitivity to the hyperparameters (the EWMA smoothing factor, the rolling window w and the joint threshold k) and release our code and data to support reproducibility.
- We reanalyse deployments in software modernisation, data quality and credit-risk memo drafting. Although reported productivity gains range from 20% to 60% and credit-turnaround times decrease by about 30% [4], we highlight missing metrics such as developer trust, fairness and energy consumption.
2. Related Work and Persistent Gaps
2.1. Existing Metrics and Frameworks
2.2. Measurement Imbalance in Practice
3. A Balanced Evaluation Framework
- Capability & Efficiency: measures of task completion, latency and resource utilisation.
- Robustness & Adaptability: resilience to noisy inputs, adversarial prompts and changing goals.
- Safety & Ethics: avoidance of toxic or biased outputs and adherence to legal and ethical norms.
- Human-Centred Interaction: user satisfaction, trust and transparency. Instruments such as TrAAIT evaluate perceived credibility, reliability and value [5].
- Economic & Sustainability Impact: productivity gains, cost per outcome and carbon footprint.
4. Adaptive Multi-Dimensional Monitoring
4.1. Metric Normalisation
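As a minimal sketch of the normalisation step, assume each metric is converted to a rolling z-score over the last w observations (w = 80 in the hyperparameter defaults). The z-score choice and the names `RollingZScore`, `window` and `eps` are illustrative assumptions, not the released code.

```python
from collections import deque
from statistics import fmean, pstdev


class RollingZScore:
    """Rolling z-score for one metric stream (illustrative sketch)."""

    def __init__(self, window: int = 80, eps: float = 1e-9):
        self.values = deque(maxlen=window)  # keeps only the last `window` samples
        self.eps = eps                      # guards against division by zero

    def update(self, x: float) -> float:
        """Score x against the samples seen before it, then store it."""
        if len(self.values) < 2:
            z = 0.0  # not enough history to estimate the spread
        else:
            mean = fmean(self.values)
            std = pstdev(self.values)
            z = (x - mean) / (std + self.eps)
        self.values.append(x)
        return z


# Two metrics on very different scales become comparable after normalisation.
latency_ms = RollingZScore()
success_rate = RollingZScore()
for lat, ok in [(200.0, 0.95), (210.0, 0.96), (190.0, 0.94), (205.0, 0.95)]:
    latency_ms.update(lat)
    success_rate.update(ok)

z_latency = latency_ms.update(400.0)   # large latency spike
z_success = success_rate.update(0.95)  # unchanged success rate
```

Scoring each sample against earlier history only keeps a spike from inflating its own baseline, at the cost of a short warm-up during which the score is fixed at zero.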
4.2. Adaptive Thresholding
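One possible reading of per-axis adaptive thresholding is sketched below: an EWMA tracks the level and variance of each normalised metric, and a sample is flagged when it leaves a band around the tracked level. The smoothing factor 0.25 comes from the hyperparameter defaults. The band width `band`, the `warmup` length and the class name `EwmaThreshold` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class EwmaThreshold:
    """Per-axis adaptive band driven by exponentially weighted moving averages."""

    lam: float = 0.25  # EWMA smoothing factor (default from the hyperparameter table)
    band: float = 3.0  # hypothetical band width, in standard deviations
    warmup: int = 10   # hypothetical number of samples before flagging starts
    mean: float = 0.0
    var: float = 0.0
    n: int = 0

    def update(self, x: float) -> bool:
        """Return True if x falls outside the current band, then adapt the band."""
        if self.n == 0:
            self.mean = x  # initialise the level on the first sample
        deviation = x - self.mean
        breached = self.n >= self.warmup and abs(deviation) > self.band * sqrt(self.var)
        # EWMA updates: recent samples weigh more, so the band follows slow drift.
        self.mean += self.lam * deviation
        self.var = (1.0 - self.lam) * (self.var + self.lam * deviation ** 2)
        self.n += 1
        return breached


detector = EwmaThreshold()
stream = [0.1, -0.2, 0.0, 0.15, -0.1] * 4 + [5.0]  # quiet signal, then a shock
flags = [detector.update(x) for x in stream]
```

In this toy stream only the final shock is flagged. A larger smoothing factor makes the band react faster but also lets it absorb genuine anomalies sooner, which is the reactivity and stability trade-off noted for this parameter.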
4.3. Joint Anomaly Detection
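The joint step can be illustrated with a squared Mahalanobis distance computed against a rolling reference window. The sketch assumes NumPy and a simple linear shrinkage of the sample covariance towards a scaled identity. The shrinkage weight 0.1 and the function names are hypothetical, and the source does not state which shrinkage estimator it uses.

```python
import numpy as np


def shrunk_covariance(window: np.ndarray, shrinkage: float = 0.1) -> np.ndarray:
    """Blend the sample covariance with a scaled identity (linear shrinkage)."""
    sample_cov = np.cov(window, rowvar=False)
    d = sample_cov.shape[0]
    target = (np.trace(sample_cov) / d) * np.eye(d)
    return (1.0 - shrinkage) * sample_cov + shrinkage * target


def mahalanobis_sq(x: np.ndarray, window: np.ndarray, shrinkage: float = 0.1) -> float:
    """Squared Mahalanobis distance of x from the centre of the reference window."""
    mean = window.mean(axis=0)
    cov = shrunk_covariance(window, shrinkage)
    diff = x - mean
    # Solve cov @ y = diff instead of forming an explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))


rng = np.random.default_rng(1337)
base = rng.normal(size=(80, 1))
# Two strongly correlated axes, e.g. two metrics that normally move together.
window = np.hstack([base, base + 0.1 * rng.normal(size=(80, 1))])

consistent = np.array([1.5, 1.5])     # follows the usual correlation
inconsistent = np.array([1.5, -1.5])  # each value is ordinary, the combination is not

d_ok = mahalanobis_sq(consistent, window)
d_bad = mahalanobis_sq(inconsistent, window)
```

The second point would pass any per-axis check, yet its joint distance is far larger because it breaks the correlation between the axes. A data-driven alternative to the fixed shrinkage weight is scikit-learn's `LedoitWolf` estimator.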
4.4. Algorithm Summary
4.5. Calibration and Operational Guidance
Algorithm 1. Adaptive Multi-Dimensional Monitoring (AMDM)
5. Experiments and Evaluation
5.1. Experimental Setup
Simulated workflows.
Real-world logs.
5.2. Detection Latency and Early Warning
| Method | Latency (s) | FPR (%) |
|---|---|---|
| Static thresholds | 12.3 | 4.5 |
| EWMA-only | | |
| Mahalanobis-only | | |
| AMDM (ours) | 5.6 | 0.9 |
5.3. Ablations and Sensitivity
5.4. Concept Drift vs. Sudden Shocks
Axis attribution.
5.5. ROC and Precision–Recall Curves
5.6. Benchmarking Breadth and Baseline Comparison
5.7. Hyperparameter Defaults
6. Case Studies and Real-World Reanalysis
6.1. Legacy Modernisation
6.2. Data Quality and Insight Generation
6.3. Credit-Risk Memo Generation
7. Discussion and Implications
7.1. Balanced Benchmarks and Leaderboards
7.2. Reproducibility and Openness
7.3. Human–Agent Collaboration and Trust
7.4. Policy and Governance
7.5. Ethics and Limitations
8. Conclusions
Acknowledgments
Appendix A. Reproducibility Checklist
- Random seeds: 1337 for simulations; fixed seeds per fold for real-world logs.
- Hardware: x86_64 CPU with 16 GB RAM; no GPU required.
- Software: Python 3.11 with NumPy, Matplotlib and Scikit-learn.
- Logging schema: timestamp, metric vector, anomaly label and method decision.
- Scripts: run_simulation.py, eval_deployment.py and plot_figures.py reproduce all figures.
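Given the logging schema above, the two headline quantities (detection latency and false-positive rate) could be recomputed from a log roughly as follows. This is a sketch under assumptions: the field names in `LogRecord` are hypothetical, the log is assumed to hold a single anomaly episode, and latency is taken as the time from the first labelled anomaly to the first flag at or after it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LogRecord:
    timestamp: float          # seconds
    metrics: Sequence[float]  # metric vector (unused here, kept to mirror the schema)
    is_anomaly: bool          # ground-truth anomaly label
    flagged: bool             # decision made by the monitoring method


def false_positive_rate(records: Sequence[LogRecord]) -> float:
    """Share of non-anomalous records that the method flagged."""
    normal = [r for r in records if not r.is_anomaly]
    if not normal:
        return 0.0
    return sum(r.flagged for r in normal) / len(normal)


def detection_latency(records: Sequence[LogRecord]) -> Optional[float]:
    """Seconds from the first labelled anomaly to the first flag at or after it."""
    onset = next((r.timestamp for r in records if r.is_anomaly), None)
    if onset is None:
        return None  # no anomaly in this log
    hit = next((r.timestamp for r in records if r.flagged and r.timestamp >= onset), None)
    return None if hit is None else hit - onset


log = [
    LogRecord(0.0, [0.1], False, False),
    LogRecord(1.0, [0.2], False, True),  # false alarm
    LogRecord(2.0, [3.0], True, False),  # anomaly starts, missed at first
    LogRecord(3.0, [3.2], True, False),
    LogRecord(4.0, [3.5], True, True),   # detected two seconds after onset
]

fpr = false_positive_rate(log)
latency = detection_latency(log)
```

Logs with several anomaly episodes would need to be split per episode before averaging latencies.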
References
[1] R. Sapkota, K. I. Roumeliotis, and M. Karkee, “AI agents vs. agentic AI: A conceptual taxonomy, applications and challenges,” Information Fusion, 2025. arXiv:2505.10468.
[2] K. J. Meimandi, G. Aránguiz-Dias, G. R. Kim, L. Saadeddin, and M. J. Kochenderfer, “The measurement imbalance in agentic AI evaluation undermines industry productivity claims,” 2025. arXiv:2506.02064. This work documents that capability metrics dominate agentic AI evaluations (around 83% of surveyed studies) while human-centred and economic metrics are each considered in roughly 30% of studies.
[3] R. Arike, E. Donoway, H. Bartsch, and M. Hobbhahn, “Technical report: Evaluating goal drift in language model agents,” 2025. arXiv:2505.02709. The authors demonstrate that agents given a goal and then exposed to competing objectives exhibit gradual drift.
[4] B. Heger, “Seizing the agentic AI advantage,” 2025. Blog post summarising the McKinsey report. The article reports productivity gains of 20–60% and approximately 30% faster credit-turnaround times for agentic AI deployments.
[5] A. F. Stevens, P. Stetson, et al., “Theory of trust and acceptance of artificial intelligence technology (TrAAIT): An instrument to assess clinician trust and acceptance of artificial intelligence,” Journal of Biomedical Informatics, vol. 148, 2023. The TrAAIT model measures trust through perceived information credibility, system reliability and application value.
[6] C. Dilmegani, “Large language model evaluation in 2025: 10+ metrics and methods,” 2025. AIMultiple article advocating multidimensional evaluation integrating automated scores with human assessments and tests for bias, fairness and energy consumption.
[7] M. A. Shukla, “Evaluating agentic AI systems: A balanced framework for performance, robustness, safety and beyond,” 2025. Preprint available at Preprints.org.



| Parameter | Value | Rationale |
|---|---|---|
| EWMA smoothing factor | 0.25 | Balances reactivity and stability |
| w (rolling window) | 80 | Covers typical cycle length |
| k (joint threshold) | | ~1% joint false-alarm rate |
| Covariance update | Shrinkage | Robust to small-sample noise |
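The table ties k to a joint false-alarm rate of about 1%. One common way to obtain such a threshold, assuming the normalised metric vector is approximately Gaussian so that the squared Mahalanobis distance follows a chi-square distribution, is to take the 99th-percentile chi-square quantile. The sketch below does this with the Wilson-Hilferty approximation from the standard library. Whether k is calibrated this way in the paper is not stated here. The helper `chi2_quantile_approx` and the choice of five dimensions (one per evaluation axis) are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_approx(p: float, dof: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * sqrt(c)) ** 3


n_axes = 5                 # assumed: one dimension per evaluation axis
target_false_alarm = 0.01  # about 1% joint false-alarm rate
k_squared = chi2_quantile_approx(1.0 - target_false_alarm, n_axes)
k = sqrt(k_squared)        # threshold on the (unsquared) Mahalanobis distance
print(f"k^2 ~ {k_squared:.2f}, k ~ {k:.2f}")
```

For five dimensions the approximation lands within about 0.05 of the exact chi-square quantile (roughly 15.09). If the metrics are far from Gaussian, an empirical quantile of distances observed during normal operation would be a safer calibration.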
| Case | Agentic approach | Reported impact | Missing metrics |
|---|---|---|---|
| Legacy modernisation | Humans supervise squads of agents to document code, write new modules, review and integrate features | reduction in time/effort | Trust scores, bias, goal drift |
| Data quality & insights | Agents detect anomalies, analyse internal/external signals and synthesise drivers | potential productivity gain; annual savings | Fairness, user satisfaction, safety |
| Credit-risk memos | Agents extract data, draft sections, generate confidence scores; humans supervise | 20–60% productivity gain; about 30% faster credit decisions | Fairness, transparency, energy consumption |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).