Submitted:
03 December 2025
Posted:
03 December 2025
Abstract

Keywords:
I. Introduction
A. Problem Definition

B. Proposed System
- HarmActions, a structured dataset of safety-labeled agent actions complemented with manipulated prompts that trigger harmful or unethical actions.
- Agent Action Classifier, a neural classifier trained on the HarmActions dataset, designed to label proposed agent actions as potentially harmful or safe, and optimized for real-time deployment in agent loops.
- HarmActEval, a benchmark built around a novel metric, “Harm@k.”
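The classifier-in-the-loop design and the Harm@k metric listed above can be sketched as follows. This is an illustrative toy, not the paper's implementation: `classify_action` is a keyword stand-in for the trained Agent Action Classifier, and Harm@k is read here as the fraction of k proposed actions flagged harmful — an assumed reading, since the metric's definition is not reproduced in this outline.

```python
# Hypothetical sketch: a safety classifier gates each proposed agent action
# before execution; harm_at_k reports the share of flagged actions.
# All names here (classify_action, harm_at_k) are illustrative.

def classify_action(action: str) -> str:
    """Stand-in for the Agent Action Classifier: 'harmful' or 'safe'.
    A real deployment would invoke the trained neural classifier."""
    blocked_verbs = ("delete", "transfer", "leak")  # toy rule for illustration
    return "harmful" if any(v in action.lower() for v in blocked_verbs) else "safe"

def harm_at_k(actions: list[str]) -> float:
    """One plausible reading of Harm@k: the fraction of the k proposed
    actions that the classifier labels harmful."""
    flagged = sum(1 for a in actions if classify_action(a) == "harmful")
    return flagged / len(actions)

# Inside an agent loop, only actions labeled safe would be executed.
proposed = ["read file report.txt", "delete all user records", "send summary email"]
safe_actions = [a for a in proposed if classify_action(a) == "safe"]
print(harm_at_k(proposed))  # 1 of 3 toy actions is flagged
print(safe_actions)
```

In a real agent loop the gate would sit between the LLM's proposed tool call and its execution, so a harmful label blocks the call before any side effect occurs.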
C. Related Work
II. Methods
A. Creation of a Dataset

B. Model Architecture
C. Training
D. Evaluation
E. Integration in the System
F. LLM Benchmark
III. Results
A. Evaluation Score
B. Experimentation on a Real-World Chat-Based Task


C. HarmActEval Results
IV. Discussion and Limitations
V. Conclusions
References
- F. Piccialli, D. Chiaro, S. Sarwar, D. Cerciello, P. Qi, and V. Mele, “AgentAI: A comprehensive survey on autonomous agents in distributed AI for industry 4.0,” Expert Systems with Applications, vol. 291, p. 128404, Oct. 2025. [CrossRef]
- H. Su et al., “A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents,” June 30, 2025, arXiv:2506.23844. [CrossRef]
- D. McCord, “Making Autonomous Auditable: Governance and Safety in Agentic AI Testing.” [Online]. Available: https://www.ptechpartners.com/2025/10/14/making-autonomous-auditable-governance-and-safety-in-agentic-ai-testing/.
- Z. J. Zhang, E. Schoop, J. Nichols, A. Mahajan, and A. Swearngin, “From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts,” in Proceedings of the 30th International Conference on Intelligent User Interfaces, Mar. 2025, pp. 727–744. [CrossRef]
- M. Srikumar et al., “Prioritizing Real-Time Failure Detection in AI Agents,” Sept. 2025, [Online]. Available: https://partnershiponai.org/real-time-failure-detection.
- Anthropic, “Introducing the Model Context Protocol,” Anthropic. [Online]. Available: https://www.anthropic.com/news/model-context-protocol.
- Singh, A. Ehtesham, S. Kumar, and T. T. Khoei, “A Survey of the Model Context Protocol (MCP): Standardizing Context to Enhance Large Language Models (LLMs),” Preprints, Apr. 2025. [CrossRef]
- R. V. Yampolskiy, “On monitorability of AI,” AI and Ethics, vol. 5, no. 1, pp. 689–707, Feb. 2025. [CrossRef]
- M. Yu et al., “A Survey on Trustworthy LLM Agents: Threats and Countermeasures,” Mar. 12, 2025, arXiv:2503.09648. [CrossRef]
- M. Shamsujjoha, Q. Lu, D. Zhao, and L. Zhu, “Swiss Cheese Model for AI Safety: A Taxonomy and Reference Architecture for Multi-Layered Guardrails of Foundation Model Based Agents,” Jan. 27, 2025, arXiv:2408.02205. [CrossRef]
- Z. Xiang et al., “GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning,” May 29, 2025, arXiv:2406.09187. [CrossRef]
- Z. Faieq, T. Sartori, and M. Woodruff, “Using LLMs to moderate LLMs: The supervisor technique,” TELUS Digital. [Online]. Available: https://www.telusdigital.com/insights/data-and-ai/article/llm-moderation-supervisor.
- Sentence Transformers, “all-MiniLM-L6-v2,” 2024, Hugging Face. [Online]. Available: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2.
- UKPLab, “Pretrained Models,” 2024, Sentence Transformers. [Online]. Available: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html.
- M. Surdeanu and M. A. Valenzuela-Escárcega, “Implementing Text Classification with Feed-Forward Networks,” in Deep Learning for Natural Language Processing: A Gentle Introduction, Cambridge University Press, 2024, pp. 107–116.
- Steele, “Feed Forward Neural Network for Intent Classification: A Procedural Analysis,” Apr. 25, 2024, EngrXiv. [CrossRef]
- R. Kohli, S. Gupta, and M. S. Gaur, “End-to-End triplet loss based fine-tuning for network embedding in effective PII detection,” Feb. 13, 2025, arXiv:2502.09002. [CrossRef]
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014, [Online]. Available: http://jmlr.org/papers/v15/srivastava14a.html.
- J. Gu, C. Li, Y. Liang, Z. Shi, and Z. Song, “Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond,” May 06, 2024, arXiv:2405.03251. [CrossRef]
- X. Ying, “An Overview of Overfitting and its Solutions,” Journal of Physics: Conference Series, vol. 1168, no. 2, p. 022022, Feb. 2019. [CrossRef]
- K. R. M. Fernando and C. P. Tsokos, “Dynamically Weighted Balanced Loss: Class Imbalanced Learning and Confidence Calibration of Deep Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 7, pp. 2940–2951, 2022. [CrossRef]
- L. Prechelt, “Early Stopping — But When?,” in Neural Networks: Tricks of the Trade: Second Edition, G. Montavon, G. B. Orr, and K.-R. Müller, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 53–67. [CrossRef]
- J. He and M. X. Cheng, “Weighting Methods for Rare Event Identification From Imbalanced Datasets,” Frontiers in Big Data, vol. 4, p. 715320, 2021. [CrossRef]
- M. Hutson, “AI Agents Break Rules Under Everyday Pressure,” IEEE Spectrum. [Online]. Available: https://spectrum.ieee.org/ai-agents-safety.
- E. Tabassi, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” Jan. 26, 2023, NIST Trustworthy and Responsible AI, National Institute of Standards and Technology, Gaithersburg, MD. [CrossRef]
- Godhrawala, “Why organizations need distinct risk framework for Agentic AI,” EY. [Online]. Available: https://www.ey.com/en_in/media/podcasts/generative-ai/2025/11/episode-30-why-organizations-need-distinct-risk-framework-for-agentic-ai.
- Agarwal and J. Leung, “Preventing AI Agents from Going Rogue,” Palo Alto Networks. [Online]. Available: https://www.paloaltonetworks.com/blog/network-security/preventing-ai-agents-from-going-rogue/.
- OpenAI et al., “gpt-oss-120b & gpt-oss-20b Model Card,” Aug. 08, 2025, arXiv:2508.10925. [CrossRef]
| Hyperparameter | Value |
|---|---|
| Hidden Layer Size | 128 |
| Learning Rate | 0.002 |
| Number of Epochs | 2 |
| Weight Decay | 0.00 |
| Batch Size | 4 |
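The hyperparameters above, together with the cited components (all-MiniLM-L6-v2 sentence embeddings, a feed-forward text classifier, dropout, softmax), suggest a small classification head over frozen sentence embeddings. The sketch below reconstructs only the forward pass under those assumptions; the 384-dimensional input (the embedding size of all-MiniLM-L6-v2), the two-label output, and the random weights are illustrative stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: all-MiniLM-L6-v2 emits 384-d sentence embeddings;
# the table gives a hidden layer of 128; two labels (safe / harmful).
EMB_DIM, HIDDEN, CLASSES = 384, 128, 2

# Randomly initialized weights stand in for the trained parameters.
W1 = rng.normal(0, 0.02, (EMB_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.02, (HIDDEN, CLASSES))
b2 = np.zeros(CLASSES)

def forward(embedding: np.ndarray) -> np.ndarray:
    """Feed-forward head: ReLU hidden layer, then softmax over two labels.
    (Dropout from the cited work would apply at training time only.)"""
    h = np.maximum(0.0, embedding @ W1 + b1)   # hidden layer, ReLU
    logits = h @ W2 + b2
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

probs = forward(rng.normal(size=EMB_DIM))
print(probs)  # [p_safe, p_harmful], nonnegative and summing to 1
```

Training would then minimize a class-weighted cross-entropy over these softmax outputs with the listed learning rate, batch size, and epoch count, consistent with the weighted-loss and early-stopping references cited above.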
| Model | Harm@k score |
|---|---|
| GPT-OSS-20B | 74.67% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).