1. Introduction
- A framework for generic blue agent training capable of dealing with various types of red agents.
- An enhanced version of the framework that incorporates a custom feed-forward neural network based on the principles of a variational auto-encoder (VAE) into the original generic blue agent training framework.
- An examination of the limitations of retraining blue agents against different types of red agents in an attempt to develop a generalized blue agent.
- Experimental results demonstrating the effectiveness of the proposed framework.
2. Background and Related Work
2.1. Supervised Learning
2.2. Reinforcement Learning
- Off-policy methods
- On-policy methods
2.2.1. Off-Policy Methods
2.2.2. On-Policy Methods
- On-policy methods may require more interactions with the environment than off-policy methods, as they cannot reuse data collected under different policies.
- Managing the balance between exploration and exploitation can be more difficult, as the policy must address both aspects simultaneously.
2.3. Deep Reinforcement Learning Algorithms
- DRL can effectively manage high-dimensional input spaces, due to its capacity to acquire hierarchical data representations using deep neural networks.
- DRL models, when trained on extensive datasets, can exhibit strong generalization capabilities to novel environments or tasks, potentially minimizing the necessity for retraining on similar tasks.
- DRL has the capacity to acquire valuable representations of states and actions, facilitating more efficient learning and decision-making processes.
2.3.1. Deep Q-Networks
2.3.2. Advantage Actor-Critic Algorithm
2.3.3. Proximal Policy Optimization
2.4. Reinforcement Learning Environments for Autonomous Cyber Operations
2.4.1. CybORG: An Autonomous Cyber Operations Research Gym
2.5. Reinforcement Learning Based Blue Agents’ Training Methods
- Single agent
- Hierarchical agent
- Ensembles
- Hierarchical agents generally perform the best, followed by ensemble agents, and then single agents.
- The PPO algorithm has been noted for its superior performance.
- Overall, RL-based methods show better generalization capabilities compared to non-RL-based approaches.
2.6. Discussion
3. Framework for Generic Blue Agent Training
- HP tuning for training a blue agent against a red agent using RL.
- Training blue agents against red agents using RL.
- Collecting data with the trained blue agents.
- Randomizing the sequencing of the collected data.
- HP tuning for supervised learning to train a generalized blue agent.
- Training a generalized blue agent using a supervised machine learning algorithm.
- Testing and deploying the generalized blue agent.
3.1. Framework’s Component Details
3.1.1. Hyper-Parameter Tuning for Blue Agent Training Using Reinforcement Learning
- Create an RL-based learning environment that includes various entities (hosts, servers, firewalls, switches, routers, etc.) along with a blue agent and the corresponding red agent.
- Provide a range of values for all possible HPs.
- Utilize an HP tuning library for RL, such as Optuna [29], to find the best HPs for training the blue agent against a red agent (a minimal sketch of such a tuning loop follows this list).
- Store the optimal HP values obtained in the previous step.
- Repeat the above steps for each blue agent that needs to be trained.
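As an illustration only (the paper does not prescribe a particular implementation), the sketch below shows how such a tuning loop could be written with Optuna and Stable-Baselines3 PPO. The `make_cyborg_env()` helper, the time-step budget, and the trial count are assumptions; the search space mirrors a subset of the HP ranges reported later in the paper.

```python
# Hypothetical sketch: Optuna HP search for a PPO-based blue agent.
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
# make_cyborg_env() is an assumed helper returning a Gym-compatible CybORG wrapper.


def objective(trial: optuna.Trial) -> float:
    params = {
        "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32, 64, 128, 256, 512]),
        "n_steps": trial.suggest_categorical("n_steps", [128, 256, 512, 1024, 2048]),
        "gamma": trial.suggest_categorical("gamma", [0.9, 0.95, 0.98, 0.99]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1.0, log=True),
        "ent_coef": trial.suggest_float("ent_coef", 1e-8, 0.1, log=True),
        "clip_range": trial.suggest_categorical("clip_range", [0.1, 0.2, 0.3, 0.4]),
    }
    env = make_cyborg_env(red_agent="b_line")      # hypothetical CybORG wrapper
    model = PPO("MlpPolicy", env, verbose=0, **params)
    model.learn(total_timesteps=100_000)           # illustrative budget
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward                             # maximize the blue agent's reward


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best hyper-parameters:", study.best_params)  # stored for the training stage
```

The best values returned by the study would then be stored and reused when training each blue agent, as described above.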
3.1.2. Reinforcement-Learning-Based Blue Agent Training
3.1.3. Data Collection - Observations and Action Tuples
Algorithm 1: Observations and Action Tuples Collection (the original pseudocode figure is not reproduced here).
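Since the pseudocode figure is unavailable, the sketch below illustrates the kind of collection loop the text describes, assuming a trained Stable-Baselines3 PPO model, a Gymnasium-style reset/step API, and the same hypothetical `make_cyborg_env()` wrapper as above.

```python
# Sketch of Algorithm 1: roll out a trained blue agent and log (observation, action) tuples.
import numpy as np
from stable_baselines3 import PPO
# make_cyborg_env() is an assumed helper returning a Gym-compatible CybORG wrapper.


def collect_tuples(model_path: str, red_agent: str, n_episodes: int = 1000):
    env = make_cyborg_env(red_agent=red_agent)
    model = PPO.load(model_path)
    observations, actions = [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = truncated = False
        while not (done or truncated):
            action, _ = model.predict(obs, deterministic=True)
            observations.append(np.asarray(obs, dtype=np.float32))
            actions.append(int(action))
            obs, _, done, truncated, _ = env.step(action)
    return np.stack(observations), np.array(actions)


# Tuples collected from blue agents trained against different red agents
# are later merged, shuffled, and used for supervised learning.
X_bline, y_bline = collect_tuples("ppo_blue_vs_bline.zip", red_agent="b_line")
```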
3.1.4. Randomizing Sequencing of Collected Data
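As one way to realize this step (a sketch, not necessarily the authors' exact procedure), a single NumPy permutation shuffles the merged observation-action pairs while keeping them aligned:

```python
import numpy as np

# X: stacked observations, y: matching actions, e.g., concatenated from all collection runs.
rng = np.random.default_rng(seed=42)
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]
```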
3.1.5. Supervised Learning - Hyper-parameter Optimization
- Select a suitable supervised machine learning algorithm.
- Choose a reasonable range of values for the different HPs associated with the selected algorithm.
- Provide the supervised machine learning method, the list of HPs and their value ranges, and the number of cross-validation folds to an API that searches for the best HP values.
- Use the selected library's API (e.g., calling fit on scikit-learn's GridSearchCV) to obtain the best HP values; a minimal sketch follows this list.
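A minimal sketch of this step, assuming scikit-learn's GridSearchCV and an MLP classifier as the supervised learner; the estimator, grid, and fold count here are illustrative choices, not the paper's reported configuration.

```python
# Sketch: cross-validated HP search for the supervised generic-agent model.
# X, y are the shuffled observation/action arrays from the previous steps.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(64, 64), (256, 256)],
    "activation": ["relu", "tanh"],
    "learning_rate_init": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)                      # the fit call runs the full cross-validated search
print("Best hyper-parameters:", search.best_params_)
```

The values in best_params_ then feed the training of the generic blue agent described in the next subsection.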
3.1.6. Training a Generic Blue Agent
3.1.7. Test and Deploy Generic Blue Agent
3.2. Custom Variational Auto Encoder for Generic Blue Agent Training
- Each observation tuple is mapped to two latent vectors of dimension N, representing the mean and log variance of a normal distribution.
- For each latent variable, a point is sampled from a standard normal distribution and then shifted and scaled using the corresponding mean μ and standard deviation σ (the reparameterization trick).
- The latent variables are fed to the decoder to produce an action for the blue agent; the output of this VAE-V is a scalar value (a minimal sketch follows this list).
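A minimal PyTorch sketch of such an encoder-sampler-decoder, matching the three bullets above. The layer widths, the latent dimension N, and the omission of the training loss are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of the VAE-style network (VAE-V): observation -> (mu, log_var) -> sampled z -> scalar action value.
import torch
import torch.nn as nn


class VAEV(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mu_head = nn.Linear(128, latent_dim)       # mean of the latent Gaussian
        self.logvar_head = nn.Linear(128, latent_dim)   # log variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),                          # scalar action output
        )

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)            # reparameterization trick
        return self.decoder(z), mu, logvar


# Usage example (arbitrary dimensions): value, mu, logvar = VAEV(obs_dim=52)(torch.randn(4, 52))
```

During training, a KL-divergence term between N(μ, σ²) and the standard normal would typically be added to the supervised loss, as in a standard VAE; the weighting of that term is not specified here.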
4. Performance Evaluation
- A blue agent specifically trained against the b_line red agent, referred to as BL.
- A blue agent specifically trained against the Red Meander red agent, referred to as RED.
- BL retrained against the Red Meander red agent, referred to as BL-RED.
- RED retrained against the b_line red agent, referred to as RED-BL.
- A proposed framework-based blue agent that uses a Multilayer Perceptron approach, referred to as MLP.
- A proposed framework-based blue agent that uses a Support Vector Machine approach, referred to as SVM.
- A proposed framework-based blue agent that uses VAE-V, referred to as VAE-V.
- A hierarchical super agent: a blue agent that first determines the kind of attacking red agent and then launches the specifically trained blue agent for that particular red agent. Variants of the hierarchical blue agent that misclassify the attacking agent by 5%, 10%, 15%, and 20% are referred to as HA5, HA10, HA15, and HA20, respectively.
| Hyper-Parameter | Range of Values |
|---|---|
| Batch Size | [8, 16, 32, 64, 128, 256, 512] |
| No. of Steps | [8, 16, 32, 64, 128, 256, 512, 1024, 2048] |
| Gamma | [0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 0.9999] |
| Learning Rate | log range(1e-5, 1) |
| Entropy Coefficient | log range(0.00000001, 0.1) |
| Clip Range | [0.1, 0.2, 0.3, 0.4] |
| No. of Epochs | [1, 5, 10, 20] |
| GAE Lambda | [0.8, 0.9, 0.92, 0.95, 0.98, 0.99, 1.0] |
| Max Gradient Norm | [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 5] |
| VF Coefficient | [0, 1] |
| Activation Function | [tanh, relu, elu, leaky relu] |
| Neural Network Architecture | [(pi=[64, 64], vf=[64, 64]), (pi=[256, 256], vf=[256, 256])] |
| Hyper-Parameter | b_line Red Agent | Red Meander Agent |
|---|---|---|
| Batch Size | 8 | 256 |
| No. of Steps | 128 | 1024 |
| Gamma | 0.9 | 0.95 |
| Learning Rate | 0.00018937 | 0.00373109 |
| Entropy Coefficient | 1.032461073e-05 | 0.017615274 |
| Clip Range | 0.3 | 0.3 |
| No. of Epochs | 10 | 10 |
| GAE Lambda | 0.8 | 0.8 |
| Max Gradient Norm | 0.5 | 5 |
| VF Coefficient | 0.09187 | 0.73582 |
| Activation Function | relu | tanh |
| Neural Network Architecture | pi=[256, 256], vf=[256, 256] | pi=[64, 64], vf=[64, 64] |
4.1. Results Based on Training Topology
4.2. Varying the Network Topology
4.2.1. Topology 2 Results
- The performance of all blue agents trained solely with reinforcement learning (BL, RED, and all HAs) deteriorated in this modified environment.
- The performance of the single blue agents trained with the proposed framework either deteriorated (MLP and VAE-V) or, in the case of SVM, showed a slightly improved total mean reward.
4.2.2. Topology 3 Results
- When the attacking red agent is , BL and VAE-V demonstrate similar performance in both topologies. SVM, VAE-V, RED, and all HAs demonstrate better performance in Topology 3.
- When the attacking red agent is , all blue agents' performance deteriorates in Topology 3.
5. Conclusion
References
- Farooq, M.O.; Wheelock, I.; Pesch, D. IoT-Connect: An Interoperability Framework for Smart Home Communication Protocols. IEEE Consumer Electronics Magazine 2020, 9, 22–29.
- Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors 2021, 21.
- Zhang, C.; Si, X.; Zhu, X.; Zhang, Y. A Survey on the Security of the Metaverse. In Proceedings of the 2023 IEEE International Conference on Metaverse Computing, Networking and Applications (MetaCom); 2023; pp. 428–432.
- Nichols, W.; Hill, Z.; Hawrylak, P.; Hale, J.; Papa, M. Automatic Generation of Attack Scripts from Attack Graphs. In Proceedings of the 2018 1st International Conference on Data Intelligence and Security (ICDIS); 2018; pp. 267–274.
- Sultana, M.; Taylor, A.; Li, L. Autonomous network cyber offence strategy through deep reinforcement learning. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III. International Society for Optics and Photonics, SPIE; 2021; Vol. 11746, p. 1174622.
- Haque, N.I.; Shahriar, M.H.; Dastgir, M.G.; Debnath, A.; Parvez, I.; Sarwat, A.; Rahman, M.A. A Survey of Machine Learning-based Cyber-physical Attack Generation, Detection, and Mitigation in Smart-Grid. In Proceedings of the 52nd North American Power Symposium (NAPS); 2021; pp. 1–6.
- Ghanem, M.C.; Chen, T.M. Reinforcement Learning for Efficient Network Penetration Testing. Information 2020, 11.
- Pozdniakov, K.; Alonso, E.; Stankovic, V.; Tam, K.; Jones, K. Smart Security Audit: Reinforcement Learning with a Deep Neural Network Approximator. In Proceedings of the 2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA); 2020; pp. 1–8.
- Fang, Z.; Wang, J.; Li, B.; Wu, S.; Zhou, Y.; Huang, H. Evading Anti-Malware Engines With Deep Reinforcement Learning. IEEE Access 2019, 7, 48867–48879.
- Pan, Z.; Sheldon, J.; Mishra, P. Hardware-Assisted Malware Detection using Explainable Machine Learning. In Proceedings of the 2020 IEEE 38th International Conference on Computer Design (ICCD); 2020; pp. 663–666.
- Han, G.; Xiao, L.; Poor, H.V. Two-dimensional anti-jamming communication based on deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2017; pp. 2087–2091.
- Pu, Z.; Niu, Y.; Zhang, G. A Multi-Parameter Intelligent Communication Anti-Jamming Method Based on Three-Dimensional Q-Learning. In Proceedings of the 2022 IEEE 2nd International Conference on Computer Communication and Artificial Intelligence (CCAI); 2022; pp. 205–210.
- Huang, M. Theory and Implementation of linear regression. In Proceedings of the 2020 International Conference on Computer Vision, Image and Deep Learning (CVIDL); 2020; pp. 210–217.
- Zou, X.; Hu, Y.; Tian, Z.; Shen, K. Logistic Regression Model Optimization and Case Analysis. In Proceedings of the 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT); 2019; pp. 135–139.
- Hearst, M.; Dumais, S.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intelligent Systems and their Applications 1998, 13, 18–28.
- Louppe, G. Understanding Random Forests: From Theory to Practice, 2015, [arXiv:stat.ML/1407.7502].
- Mundt, M.; Hong, Y.; Pliushch, I.; Ramesh, V. A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. Neural Networks 2023, 160, 306–336.
- Naeem, M.; Rizvi, S.T.H.; Coronato, A. A Gentle Introduction to Reinforcement Learning and its Application in Different Fields. IEEE Access 2020, 8, 209320–209344.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Molina-Markham, A.; Miniter, C.; Powell, B.; Ridley, A. Network Environment Design for Autonomous Cyberdefense, 2021, [arXiv:cs.CR/2103.07583].
- Li, L.; Rami, J.P.S.E.; Taylor, A.; Rao, J.H.; Kunz, T. Unified Emulation-Simulation Training Environment for Autonomous Cyber Agents, 2023, [arXiv:cs.LG/2304.01244].
- Li, L.; Fayad, R.; Taylor, A. CyGIL: A Cyber Gym for Training Autonomous Agents over Emulated Network Systems, 2021, [arXiv:cs.CR/2109.03331].
- Cyber Battle Sim. https://github.com/microsoft/CyberBattleSim. Last accessed: 20th January, 2024.
- Baillie, C.; Standen, M.; Schwartz, J.; Docking, M.; Bowman, D.; Kim, J. CybORG: An Autonomous Cyber Operations Research Gym, 2020, [arXiv:cs.CR/2002.10667].
- OpenAI Gym. https://gymnasium.farama.org. Last accessed: 5th August, 2024.
- Kiely, M.; Bowman, D.; Standen, M.; Moir, C. On Autonomous Agents in a Cyber Defence Environment, 2023, [arXiv:cs.CR/2309.07388].
- Kunz, T.; Fisher, C.; Novara-Gsell, J.L.; Nguyen, C.; Li, L. A Multiagent CyberBattleSim for RL Cyber Operation Agents, 2023, [arXiv:cs.CR/2304.11052].
- Foley, M.; Hicks, C.; Highnam, K.; Mavroudis, V. Autonomous Network Defence Using Reinforcement Learning. In Proceedings of the 2022 ACM Asia Conference on Computer and Communications Security (ASIA CCS '22), New York, NY, USA, 2022; pp. 1252–1254.
- Optuna: A hyperparameter optimization framework. https://optuna.readthedocs.io/en/stable/. Last accessed: 7th May, 2024.
- TTCP CAGE Working Group. TTCP CAGE Challenge 4. https://github.com/cage-challenge/cage-challenge-4, 2023.
| Action | Description |
|---|---|
| Sleep | Does nothing; no parameter is associated with it. |
| DiscoverRemoteSystems | Performs a ping sweep; requires a subnet as a parameter and returns the active IP addresses on that subnet. |
| DiscoverNetworkServices | A port-scan action; requires an IP address as a parameter and returns all open ports and their respective services. |
| ExploitRemoteService | Exploits a service to obtain a reverse shell on a host; requires an IP address as an input parameter. |
| EternalBlue | Used to obtain SYSTEM access on a Windows-based machine. |
| PrivilegeEscalate | Establishes a privileged shell with root (Linux) or SYSTEM (Windows) privileges; a user shell on the target host is required for this action to succeed. |
| FTPDirectoryTraversal | Used to traverse directories that a user is not supposed to access. |
| SQLInjection | Inserts malicious database code for execution. |
| Impact | Represents degradation of service; requires a host name as a parameter. |
| Action | Description |
|---|---|
| Sleep | Does nothing; no parameter is associated with it. |
| Analyse | Detects malware files on a host; requires a host name as a parameter. |
| Remove | Removes a red agent's user-level shell. |
| Restore | Restores a system to a known safe state (this disrupts network services, hence a large negative penalty is associated with the action). |
| Misinform | A decoy-service action. |
| Monitor | Used to monitor a host's state. |
| Subnet | Hosts | Blue Reward for Red Access |
|---|---|---|
| Subnet 1 | User Hosts | -0.1 |
| Subnet 2 | Enterprise Servers | -1 |
| Subnet 3 | Operational Server | -1 |
| Subnet 3 | Operational Host | -0.1 |
| Agent | Host | Action | Blue Reward |
|---|---|---|---|
| Red | Operational Server | Impact | -10 |
| Blue | Any | Restore | -1 |
| Agent Type | BL | RED | BL-RED | RED-BL | MLP | SVM | VAE-V | HA5 | HA10 | HA15 | HA20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Total Mean Reward | -140.88 | -122.96 | -151.33 | -198.75 | -117.70 | -124.61 | -110.90 | -117.60 | -123.10 | -126.60 | -127.43 |
| Agent Type | BL | RED | MLP | SVM | VAE-V | HA5 | HA10 | HA15 | HA20 |
|---|---|---|---|---|---|---|---|---|---|
| Total Mean Reward | -57.4 | -58.83 | -52.39 | -52.92 | -50.4 | -50.4 | -52.3 | -54.0 | -55.4 |
| Agent Type | BL | RED | MLP | SVM | VAE-V | HA5 | HA10 | HA15 | HA20 |
|---|---|---|---|---|---|---|---|---|---|
| Total Mean Reward | -148 | -132.0 | -122.2 | -119 | -116.45 | -120.08 | -126.14 | -128.32 | -130.5 |
| Agent Type | BL | RED | MLP | SVM | VAE-V | HA5 | HA10 | HA15 | HA20 |
|---|---|---|---|---|---|---|---|---|---|
| Total Mean Reward | -61 | -61.1 | -54.43 | -55.4 | -50.3 | -50.38 | -52.69 | -54.9 | -56.40 |
| Agent Type | BL | RED | MLP | SVM | VAE-V | HA5 | HA10 | HA15 | HA20 |
|---|---|---|---|---|---|---|---|---|---|
| Total Mean Reward | -133.3 | -136.61 | -131.10 | -117.55 | -123.24 | -131.74 | -136.64 | -138.01 | -139.93 |
| Agent Type | BL | RED | MLP | SVM | VAE-V | HA5 | HA10 | HA15 | HA20 |
|---|---|---|---|---|---|---|---|---|---|
| Total Mean Reward | -55.49 | -54.6 | -50.83 | -48.77 | -51.24 | -57.98 | -59.45 | -60.4 | -61.7 |
