Submitted:
15 May 2025
Posted:
15 May 2025
Abstract
Keywords:
1. Introduction
- Study the methodology of two adversarial algorithms: one employing a relatively simple perturbation approach, and the other using a more sophisticated evolutionary method;
- Present the perturbations that the adversarial algorithms introduce into the dataset;
- Evaluate the impact of these algorithms on the performance of an IDS classifier.
2. Conceptual Background
2.1. Network Intrusion Detection Systems
2.2. Machine Learning-Powered Network Intrusion Detection Systems
2.3. Adversarial algorithms
2.3.1. Gaussian perturbation method
2.3.2. Genetic algorithm
- mutations: random alterations
- crossover: the process of combining features of different inputs
- selection: choosing the most promising candidates
2.4. Dataset
2.5. Classification algorithm
3. Architecture
4. Methodology
4.1. Data preprocessing
- Cleaning: Records with missing values were removed, along with redundant records, redundant columns, and irrelevant fields, to improve data quality and model performance [21].
- Label transformation: The KDD Cup ’99 dataset comprises a large collection of network connection records, each labeled either as normal or as belonging to one of several attack categories, including DoS, R2L, U2R, and Probe attacks [22]. To simplify the classification task and align it with a binary intrusion detection scenario, we performed a label transformation: all records labeled as attacks (i.e., any class other than normal) were grouped under a single label, "anomaly".
- Encoding of categorical variables: Non-numeric fields such as protocol_type, service, and flag were converted into numerical representations using label encoding [23], making them suitable for the model.
- Normalization: Numerical features were rescaled to a common scale (e.g., min-max scaling to [0, 1] or z-score standardization) so that all features contribute on a comparable scale during model training.
- Dataset partitioning: The processed data were split into training and test subsets in a 70:30 ratio.
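The preprocessing steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the toy records, the helper names (`binarize_label`, `label_encode`, `min_max_scale`, `split_train_test`, `preprocess`), the choice of min-max scaling, and the random seed are all assumptions made for illustration.

```python
import random

# Toy records loosely modelled on KDD Cup '99 fields; all values are made up.
FIELDS = ["duration", "protocol_type", "service", "flag", "src_bytes", "label"]
RAW = [
    (0, "tcp", "http", "SF", 181, "normal."),
    (0, "tcp", "http", "SF", 239, "normal."),
    (0, "icmp", "ecr_i", "SF", 1032, "smurf."),
    (0, "tcp", "private", "S0", 0, "neptune."),
    (2, "tcp", "ftp", "SF", None, "normal."),  # missing value, dropped in cleaning
    (5, "udp", "domain_u", "SF", 45, "normal."),
    (0, "tcp", "private", "REJ", 0, "portsweep."),
    (1, "tcp", "smtp", "SF", 1022, "normal."),
    (0, "icmp", "ecr_i", "SF", 520, "smurf."),
    (3, "tcp", "telnet", "SF", 126, "buffer_overflow."),
    (0, "tcp", "http", "SF", 300, "normal."),
]
records = [dict(zip(FIELDS, row)) for row in RAW]
CATEGORICAL = ["protocol_type", "service", "flag"]
NUMERIC = ["duration", "src_bytes"]


def binarize_label(label):
    """Group every non-normal class under 'anomaly' (tolerates a trailing dot)."""
    return "normal" if label.rstrip(".") == "normal" else "anomaly"


def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]


def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(recs, categorical, numeric):
    """Clean, encode, normalize; returns a list of (features, label) pairs."""
    needed = categorical + numeric + ["label"]
    clean = [r for r in recs if all(r[k] is not None for k in needed)]
    cols = {name: label_encode([r[name] for r in clean]) for name in categorical}
    for name in numeric:
        cols[name] = min_max_scale([float(r[name]) for r in clean])
    names = categorical + numeric
    return [
        ([cols[n][i] for n in names], binarize_label(r["label"]))
        for i, r in enumerate(clean)
    ]


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle with a fixed seed, then hold out test_fraction of the rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = preprocess(records, CATEGORICAL, NUMERIC)
train, test = split_train_test(rows)
print(len(train), "training rows,", len(test), "test rows")
```

The sketch follows the order of the list above (normalize, then split). A common alternative is to compute the scaling statistics on the training subset only and reuse them for the test subset, which avoids leaking test information into training.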
4.2. RF classifier
4.3. Adversarial attacks
4.3.1. Gaussian method
4.3.2. Genetic algorithm
Algorithm 1: Adversarial data generation using Genetic Algorithm.
- Population size: The algorithm starts by initializing a population of adversarial candidates from the input data. The population size parameter controls the diversity of candidate solutions. It was set to 20, meaning that 20 individuals are evaluated at each step. A larger population can enhance diversity but requires more computational resources.
- Number of generations: This parameter controls how many times the population evolves. A value of 30 was used in our experiments to allow enough iterations to refine the population and produce potent adversarial examples.
- Mutation rate: This parameter determines the probability of randomly altering an offspring. For instance, a value of 0.1 (i.e., 10%) means that roughly one in ten new individuals will undergo mutation. This helps maintain diversity and avoid local minima.
- Fitness evaluation: Each individual in the population is assessed using a fitness function, which measures how effectively the perturbed input causes misclassification. In our implementation, the individuals with the lowest fitness scores are considered the most adversarial.
- Selection strategy: The algorithm uses fitness-based ranking to select the top individuals from the current generation to serve as parents for the next. This ensures that the strongest traits are propagated forward.
- Crossover mechanism: New offspring are generated by combining features from two randomly selected parents using a crossover function. This simulates genetic recombination, enabling the emergence of new patterns.
- Mutation operation: After crossover, offspring may be modified by a mutation function with the probability defined by the mutation rate. This step introduces novel variations and prevents premature convergence.
- Population update: The next generation is formed by stacking the selected parents with the new offspring.
- Logging: At each generation, the best fitness scores are logged to monitor the optimization progress over time.
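The loop described by the items above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' Algorithm 1: the classifier is replaced by a hypothetical `toy_anomaly_score` function, the parameter names (`population_size`, `generations`, `mutation_rate`) are placeholders, and the uniform crossover, single-feature Gaussian mutation, and "keep the top half as parents" rule are choices made here because the text does not specify them. Only the values 20, 30, and 0.1 and the "lower fitness is more adversarial" convention come from the description.

```python
import math
import random


def toy_anomaly_score(x):
    """Hypothetical stand-in for the IDS classifier: score in (0, 1),
    where higher means 'more likely anomaly'."""
    weights = [2.0, -1.0, 1.5, 0.5]
    z = sum(w * v for w, v in zip(weights, x)) - 1.0
    return 1.0 / (1.0 + math.exp(-z))


def fitness(x):
    """Lower is better: a low anomaly score means the sample evades detection."""
    return toy_anomaly_score(x)


def crossover(a, b, rng):
    """Uniform crossover (assumed): each feature is taken from either parent."""
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]


def mutate(x, rng, sigma=0.1):
    """Perturb one randomly chosen feature with Gaussian noise, clipped to [0, 1]."""
    y = x[:]
    i = rng.randrange(len(y))
    y[i] = min(1.0, max(0.0, y[i] + rng.gauss(0.0, sigma)))
    return y


def genetic_attack(x0, population_size=20, generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize the population from the input sample plus perturbed copies.
    population = [x0[:]] + [mutate(x0, rng) for _ in range(population_size - 1)]
    n_parents = population_size // 2  # assumed: the top half become parents
    history = []
    for _ in range(generations):
        # Selection: rank by fitness and keep the lowest-scoring individuals.
        parents = sorted(population, key=fitness)[:n_parents]
        history.append(fitness(parents[0]))  # logging: best score this generation
        offspring = []
        while len(offspring) < population_size - n_parents:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            offspring.append(child)
        # Population update: stack the parents with the new offspring.
        population = parents + offspring
    return min(population, key=fitness), history


x0 = [0.9, 0.1, 0.8, 0.5]  # made-up, already normalized sample
best, history = genetic_attack(x0)
print("initial score:", round(fitness(x0), 3), "final score:", round(fitness(best), 3))
```

Because the selected parents are carried into the next generation unchanged, the logged best score can never get worse from one generation to the next, which makes the log a convenient convergence check.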
5. Results
5.1. Gaussian perturbation
5.2. Genetic algorithm
6. Discussion
7. Conclusion and Future Work
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| NIDS | Network Intrusion Detection System |
| ML | Machine Learning |
| AML | Adversarial Machine Learning |
| RF | Random Forest |
| GA | Genetic Algorithm |
| KDD | Knowledge Discovery and Data Mining |
| DARPA | Defense Advanced Research Project Agency |
| LAN | Local Area Network |
| U2R | User to Root |
| R2L | Remote to Local |
| AI | Artificial Intelligence |
| MIT | Massachusetts Institute of Technology |
| DoS | Denial of Service |
References
- Tabassi, E.; Burns, K.J.; Hadjimichael, M.; Molina-Markham, A.D.; Sexton, J.T. A Taxonomy and Terminology of Adversarial Machine Learning. Technical report, National Institute of Standards and Technology, October 2019.
- Alatwi, H.A.; Morisset, C. Adversarial machine learning in network intrusion detection domain: A systematic review. arXiv preprint arXiv:2112.03315, 2021.
- Biggio, B.; Roli, F. Wild patterns: Ten years after the rise of adversarial machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp.
- Rosenberg, I.; Shabtai, A.; Elovici, Y.; Rokach, L. Adversarial machine learning attacks and defense methods in the cyber security domain. ACM Computing Surveys (CSUR) 2021, 54, 1–36.
- Alhajjar, E.; Maxwell, P.; Bastian, N. Adversarial machine learning in network intrusion detection systems. Expert Systems with Applications 2021, 186, 115782.
- Lin, Z.; Shi, Y.; Xue, Z. IDSGAN: Generative adversarial networks for attack generation against intrusion detection. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2022, pp. 79–91.
- Apruzzese, G.; Andreolini, M.; Ferretti, L.; Marchetti, M.; Colajanni, M. Modeling realistic adversarial attacks against network intrusion detection systems. Digital Threats: Research and Practice (DTRAP) 2022, 3, 1–19.
- Liao, H.J.; Lin, C.H.R.; Lin, Y.C.; Tung, K.Y. Intrusion detection system: A comprehensive review. Journal of Network and Computer Applications 2013, 36, 16–24.
- Tabassi, E.; Burns, K.J.; Hadjimichael, M.; Molina-Markham, A.D.; Sexton, J.T. A taxonomy and terminology of adversarial machine learning. NIST IR 2019, 2019, 1–29.
- Draper-Gil, G.; Lashkari, A.H.; Mamun, M.S.I.; Ghorbani, A.A. Characterization of encrypted and VPN traffic using time-related. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP), 2016, pp.
- Sommer, R.; Paxson, V. Outside the closed world: On using machine learning for network intrusion detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy. IEEE, 2010, pp. 305–316.
- Pawlicki, M.; Choraś, M.; Kozik, R. Defending network intrusion detection systems against adversarial evasion attacks. Future Generation Computer Systems 2020, 110, 148–154.
- Pujari, M.; Sun, W. Fortifying Machine Learning-Powered Intrusion Detection: A Defense Strategy Against Adversarial Black-Box Attacks. In Proceedings of the International Congress on Information and Communication Technology. Springer, 2024, pp. 655–671.
- Martins, N.; Cruz, J.M.; Cruz, T.; Abreu, P.H. Adversarial machine learning applied to intrusion and malware scenarios: a systematic review. IEEE Access 2020, 8, 35403–35419.
- Bartz-Beielstein, T.; Branke, J.; Mehnen, J.; Mersmann, O. Evolutionary algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2014, 4, 178–195.
- Protić, D.D. Review of KDD Cup ’99, NSL-KDD and Kyoto 2006+ datasets. Vojnotehnički glasnik/Military Technical Courier 2018, 66, 580–596.
- Siddiqui, M.K.; Naahid, S. Analysis of KDD CUP 99 dataset using clustering based data mining. International Journal of Database Theory and Application 2013, 6, 23–34.
- Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing 2016, 114, 24–31.
- Pujari, M.; Cherukuri, B.P.; Javaid, A.Y.; Sun, W. An approach to improve the robustness of machine learning based intrusion detection system models against the Carlini-Wagner attack. In Proceedings of the 2022 IEEE International Conference on Cyber Security and Resilience (CSR). IEEE, 2022, pp. 62–67.
- Chaudhary, A.; Kolhe, S.; Kamal, R. An improved random forest classifier for multi-class classification. Information Processing in Agriculture 2016, 3, 215–222.
- Amato, A.; Di Lecce, V. Data preprocessing impact on machine learning algorithm performance. Open Computer Science 2023, 13, 20220278.
- Siddique, K.; Akhtar, Z.; Khan, F.A.; Kim, Y. KDD cup 99 data sets: A perspective on the role of data sets in network intrusion detection research. Computer 2019, 52, 41–51.
- Shah, D.; Xue, Z.Y.; Aamodt, T.M. Label encoding for regression networks. arXiv preprint arXiv:2212.01927, 2022.





| Index | Feature | Description |
|---|---|---|
| 1 | duration | duration of the connection (in seconds) |
| 2 | protocol_type | type of protocol (TCP, UDP, etc.) |
| 3 | service | destination service (HTTP, FTP, Telnet, etc.) |
| 4 | flag | connection status (SF for successful, REJ for rejected, etc.) |
| 5 | src_bytes | number of bytes sent from source to destination |
| 6 | dst_bytes | number of bytes sent back from destination to source |
| 7 | land | determines whether source and destination IP addresses are the same (if yes, 1; if no, 0) |
| 8 | wrong_fragment | number of wrong or out-of-order packet fragments |
| 9 | urgent | number of urgent packets (packets with URG flag set) in a TCP connection |
| 10 | hot | number of hot indicators in a connection (a "hot" indicator refers to suspicious/unauthorized content in payload) |
| 11 | num_failed_logins | number of failed login attempts |
| 12 | logged_in | login successful = 1; login failed = 0 |
| 13 | num_compromised | number of compromised conditions in a connection |
| 14 | root_shell | 1, if root access is obtained in shell; 0, otherwise |
| 15 | su_attempted | if su command was attempted = 1; 0, otherwise |
| 16 | num_root | number of root accesses in the connection |
| 17 | num_file_creations | number of commands in the connection that create new files |
| 18 | num_shells | number of active shells (command interpreters) |
| 19 | num_access_files | number of times files were accessed |
| 20 | num_outbound_cmds | number of outbound commands in an FTP session |
| 21 | is_host_login | if host login = 1; 0, otherwise |
| 22 | is_guest_login | if guest login = 1; 0, otherwise |
| 23 | count | number of connections to the same host as the current connection in a set interval (typically, 2 seconds) |
| 24 | srv_count | number of connections to the same destination service as the current connection in a set interval |
| 25 | serror_rate | percentage of connections, to the same host, with SYN errors, regardless of their service/port information |
| 26 | srv_serror_rate | percentage of connections, to the same host and same service, with SYN errors |
| 27 | rerror_rate | percentage of connections, to same host, with REJ errors (regardless of their service/port) |
| 28 | srv_rerror_rate | percentage of connections, to same host, with REJ errors (using the same service as the current) |
| 29 | same_srv_rate | percentage of connections to the same service |
| 30 | diff_srv_rate | percentage of connections to different services |
| 31 | srv_diff_host_rate | percentage of connections to the same service, but to different hosts, as a fraction of all connections |
| 32 | dst_host_count | number of connections to the same host in the past 100 connections |
| 33 | dst_host_srv_count | number of connections to the same host and same service in the past 100 connections |
| 34 | dst_host_same_srv_rate | percentage of connections to the same host and same service in the past 100 connections |
| 35 | dst_host_diff_srv_rate | percentage of connections to the same host and different services in the past 100 connections |
| 36 | dst_host_same_src_port_rate | percentage of connections to the same destination host that were made from the same source port |
| 37 | dst_host_srv_diff_host_rate | percentage of connections to the same service and the same destination host, that were made from different source hosts |
| 38 | dst_host_serror_rate | percentage of connections, to the same host, that have SYN errors |
| 39 | dst_host_srv_serror_rate | percentage of connections, to the same host and same service, that have SYN errors |
| 40 | dst_host_rerror_rate | percentage of connections, to the same host, that have REJ errors |
| 41 | dst_host_srv_rerror_rate | percentage of connections, to the same host and service, that have REJ errors |

| Hyperparameter | Value |
|---|---|
| n_estimators | 100 |
| random_state | 42 |
| class_weight | 'balanced' |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).