Genetic Algorithm Based Feature Selection Technique for Optimal Intrusion Detection

In recent years, several industries have registered impressive technological advances, such as the Internet of Things (IoT), e-commerce, vehicular networks, etc. These advances have sparked an increase in the volume of information transmitted between the nodes of a computer network (CN). As a result, it is crucial to safeguard CNs against security threats and intrusions that can compromise the integrity of those systems. In this paper, we propose a machine learning (ML) intrusion detection system (IDS) in conjunction with the Genetic Algorithm (GA) for feature selection. To assess the effectiveness of the proposed framework, we use the NSL-KDD dataset. Furthermore, we consider the following ML methods in the modelling process: decision tree (DT), support vector machine (SVM), random forest (RF), extra-trees (ET), extreme gradient boosting (XGB), and naïve Bayes (NB). The results demonstrated that using the GA has a positive impact on the performance of the selected classifiers. Moreover, the results obtained by the proposed ML methods were superior to those of existing methodologies.


Introduction
The recent advances of technologies such as wireless networks, the Internet of Things (IoT), and the Industrial Internet of Things (IIoT) have sparked an increase in various threats that can compromise the security, integrity, and reliability of those systems [1]. There exist several mechanisms that organizations can use to guard themselves against system intrusions, including firewalls (considered the first line of defense), anti-malware and anti-virus software systems, intrusion detection and prevention systems, etc. [2]. Among all these safety measures, an intrusion detection system (IDS) is considered one of the most crucial mechanisms capable of defending computer networks against potential intrusions [3].

At the top level of classification, IDSs are categorized as network-based and host-based IDSs [4]. These systems can further be classified into knowledge-based IDS, anomaly-based IDS, and hybrid-based IDS [5]. A network-based IDS is deployed throughout a computer network's system, whereas a host-based IDS is deployed on a node within a computer network. A knowledge-based IDS, sometimes called a signature-based IDS, uses existing patterns of attacks to detect current intrusions. One major disadvantage of a knowledge-based IDS is that it has to be constantly updated with new attack signatures. Anomaly-based IDSs are more dynamic in comparison to knowledge-based IDSs because they do not depend on existing attack patterns to flag intrusions.

Using feature selection (FS), the most optimal (best) features are selected for the modelling procedure [15]. There exist two categories of FS methods, namely filter-based FS and wrapper-based FS. A filter-based FS method selects attributes based on the intrinsic nature of the data [16]. On the other hand, a wrapper-based FS method uses a model to select the most optimal attribute subset; in this instance, the best feature subset is selected using the performance of a model [17].
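The filter/wrapper distinction above can be made concrete with a small sketch. This is an illustrative toy, not the paper's method: the filter ranks features by a purely data-intrinsic statistic (absolute correlation with the label), while the wrapper greedily grows a subset judged by an arbitrary model-score callback (`fit_score` is a hypothetical stand-in for a trained classifier's accuracy).

```python
# Toy contrast between filter-based and wrapper-based feature selection.
# All data and scoring rules here are illustrative, not from the paper.

def filter_select(X, y, k):
    """Filter method: rank each feature by |correlation with the label|,
    using only intrinsic statistics of the data (no model involved)."""
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx, my = sum(col) / n, sum(y) / n
        cov = sum((c - mx) * (t - my) for c, t in zip(col, y))
        vx = sum((c - mx) ** 2 for c in col) ** 0.5
        vy = sum((t - my) ** 2 for t in y) ** 0.5
        scores.append(abs(cov / (vx * vy)) if vx and vy else 0.0)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def wrapper_select(X, y, k, fit_score):
    """Wrapper method: greedily add the feature whose inclusion gives
    the best score of a model (fit_score is a model-score callback)."""
    chosen = []
    while len(chosen) < k:
        best = max((j for j in range(len(X[0])) if j not in chosen),
                   key=lambda j: fit_score(chosen + [j]))
        chosen.append(best)
    return chosen
```

Note that the wrapper never looks at feature statistics directly: a feature is only as good as the model it produces, which is exactly why wrapper methods (such as the GA used later in this paper) tend to be costlier but better matched to the final classifier.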
In this research, we propose a wrapper-based FS method that is based on the Genetic Algorithm (GA) [18]. Moreover, the fitness function implemented in the GA uses the Extra-Trees (ET) classifier [19]. In the experiments, the GA generated seven feature vectors for the binary and multiclass classification procedures. These attribute vectors were then used in the modelling process. We evaluated the performance of the models using the selected features and compared the results with existing ML methods. The results demonstrated that using GA for FS leads to an increase in performance for the classifiers considered in this research.

The remainder of the paper is organized as follows. Section 2 presents an account of related work. Section 3 provides an overview of the NSL-KDD dataset. Section 4 presents a background on the ML algorithms used in this research. In Section 5, the proposed intrusion detection framework is presented. Section 6 provides the experimental setup and a discussion of the results. Finally, Section 7 concludes this paper.

In [20], the authors implemented a Deep Learning (DL) method for intrusion detection and prevention in software-defined networking (SDN) systems. The researchers utilized the NSL-KDD dataset to assess the effectiveness of their proposed architecture. The classifier used to model the IDS is the deep neural network (DNN) algorithm. The DNN used in this research is of the feed-forward type, whereby the information flows in one direction. The experiments were carried out using the binary classification configuration, and accuracy was the main performance metric considered. The results showed that the DNN achieved an accuracy of 75.75%. Although these results are significant, this research did not implement a feature selection method that could potentially increase the performance of the proposed DNN.
Su et al. [22] implemented a bidirectional long short-term memory (BLSTM) method in conjunction with an attention mechanism (BAT-MC) for feature extraction using the NSL-KDD dataset. The attention algorithm was used to capture the most important attributes required for an optimal classification procedure. The BAT-MC approach achieved an accuracy score of 85.2% for the binary classification task.

Although these results were superior in comparison to the baseline models, the authors conceded that more research needed to be conducted to implement a feature selection or extraction method that could increase the performance of the MultiTree method.

Zhang et al. [24] presented a deep learning-based IDS using the NSL-KDD dataset.

In this research, the authors utilized an autoencoder (AE) to extract the most important attributes, which were then used by the DNN for classification. To gauge the performance of the AE-DNN, the authors considered the accuracy, precision, recall, and F1-score.

In [25], an IDS approach using an adaptive synthetic sampling (ADASYN) technique was presented.

Tama et al. [26] presented the TSE-IDS, a two-step model for intrusion detection. In the first phase, the TSE-IDS used several feature selection algorithms, including particle swarm optimization (PSO), GA, and the ant colony optimization (ACO) algorithm. The fitness criterion used to select a feature set is the performance obtained with the reduced error pruning tree (REPT) algorithm. In the second phase, the TSE-IDS uses an ensemble of classifiers.

Almasoudy et al. [28] implemented a wrapper-based attribute selection method.

In this research, the NSL-KDD dataset [14] is utilized to assess the performance of the proposed framework, with a separate data subset reserved for the testing procedure. The validation phase guarantees that the ML algorithms used in this research are not prone to overfitting [30]. Table 1 provides the details of these data subsets.

In this study, the following ensemble tree-based classifiers are implemented: Random Forest (RF), Extra-Trees (ET), and Extreme Gradient Boosting (XGBoost).

Let $G = \{(x_n, y_n) : n = 1, \dots, q,\ x_n \in \mathbb{R}^p,\ y_n \in \mathbb{R}\}$ represent a dataset that contains $q$ records with $p$ attributes, where the labels are denoted by $y$. Let $\hat{y}_n$ be the prediction of the XGBoost classifier, expressed as follows:

$$\hat{y}_n = \sum_{i=1}^{K} f_i(x_n) \quad (1)$$

where $f_i$ represents a regression tree and $f_i(x_n)$ is the score assigned by the $i$-th tree to the $n$-th record in the dataset. The ultimate aim is the minimization of Equation (2):

$$L = \sum_{n=1}^{q} E(\hat{y}_n, y_n) + \sum_{i=1}^{K} \Omega(f_i) \quad (2)$$

where $L$ denotes the loss over the entire dataset (or data subset) and $E$ is the loss function.
$\Omega$ is the regularization term:

$$\Omega(f) = \sigma T + \frac{1}{2}\,\upsilon \lVert w \rVert^2 \quad (3)$$

where $\sigma$ and $\upsilon$ are the regularization parameters that penalize the number of leaves $T$ and the leaf weights $w$ of each tree.

The NB classifier is defined as follows [39]. Let $G = (g_1, \dots, g_n)$ denote an instance with $n$ attributes. To predict the label $C_r$ of a given instance in $G$, the NB algorithm computes $p$:

$$p(C_r \mid g_1, \dots, g_n) \propto p(C_r) \prod_{i=1}^{n} p(g_i \mid C_r) \quad (4)$$

The label $y$ of the instance is therefore computed as follows:

$$y = \underset{r}{\arg\max}\; p(C_r) \prod_{i=1}^{n} p(g_i \mid C_r) \quad (5)$$

where $y$ denotes the predicted class. In this research, we implemented the Bernoulli NB (BernoulliNB) [39], as expressed in Equation (6):

$$p(x \mid y) = p(i \mid y)\,x + \bigl(1 - p(i \mid y)\bigr)(1 - x) \quad (6)$$
where $x$ denotes a given binary feature and $y$ is the class.

An SVM seeks the optimal separating hyperplane in a given data space [40]. Let us consider a case whereby an SVM is tasked with classifying two sets of data points, as shown in Figure 2. In Figure 2a, there are several hyperplanes that successfully split the data. However, the aim of the SVM algorithm is to maximise the margin $m$ in Figure 2b in order to find the optimal hyperplane $h$. In real-world applications, data points are not always separable using a line; in other words, some tasks require a non-linear solution. To overcome this issue, the SVM algorithm uses kernels [41]. A kernel function converts a low-dimensional input space (the input data) into a higher-dimensional space in order for it to be separable. Some of the most popular kernels include the linear kernel, the polynomial kernel, and the radial basis function (RBF) kernel [42].
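The kernel idea can be illustrated with the RBF kernel mentioned above: it measures similarity as if the points had been mapped to a higher-dimensional space, without ever computing that mapping. The formula below is the standard RBF kernel; the sample points and `gamma` value are illustrative.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2).
    Equal points score 1.0; the score decays toward 0 with distance."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)
```

For example, `rbf_kernel([1.0, 2.0], [1.0, 2.0])` returns 1.0, and the score shrinks monotonically as the two points move apart, which is what lets an SVM draw non-linear boundaries in the original space.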

• GA should conduct a Selection process.

• GA should conduct a Crossover process.

• GA should perform a Mutation procedure.

In this work, the fitness function used in the GA was implemented using the ET method, as described in Algorithm 1. The full implementation of the GA utilized in this research is outlined in Algorithm 2. Moreover, the GA was implemented in two phases. In the first phase, the GA was applied to the NSL-KDD dataset to generate the list of features that will be used in the binary classification process. Table 3 provides the details of each feature vector that was selected, including the feature vector's length as well as the list of features that were produced. In the second phase, the GA was applied to the NSL-KDD dataset to compute the list of attributes ($G = g_1, g_2, g_3, g_4, g_5, g_6, g_7$) that will be utilized in the multiclass classification process.

Algorithm 2 Implementation of the GA on NSL-KDD Dataset
Require: S, the NSL-KDD dataframe
Require: G, a list (or array) containing the feature names of S
Require: T, the label
Require: F, an empty list that will temporarily store the feature subset
Require: ni, the maximum number of iterations
START
1. Initialize the population Q using G
2. Define the fitness measure using the ET classifier
3. Compute the fitness using S, G, T and Q
4. Calculate the optimal fitness score, b
5. Update F
for k in range(ni)
    6. Implement crossover
    7. Compute mutations
    8. Calculate the fitness
    9. Generate the optimal fitness score, b
    10. Update F
end for
11. The GA has converged; return F and b
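The loop of Algorithm 2 can be sketched compactly in Python. This is a minimal illustration, not the paper's implementation: feature subsets are encoded as bit vectors, and a made-up toy fitness function (rewarding a hypothetical "useful" feature set while penalizing subset size) stands in for the Extra-Trees score the paper uses.

```python
import random

def genetic_feature_search(n_features, fitness, ni=30, pop_size=12, seed=0):
    """GA over feature subsets encoded as 0/1 bit vectors."""
    rng = random.Random(seed)
    # Step 1: initialize the population Q of candidate feature subsets.
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)                      # steps 2-5
    for _ in range(ni):                               # steps 6-10
        # Selection: keep the fitter half of the population.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)        # one-point crossover
            child = a[:cut] + b[cut:]
            j = rng.randrange(n_features)             # point mutation
            child[j] ^= 1
            children.append(child)
        pop = parents + children
        best = max(pop + [best], key=fitness)         # keep the elite
    return best                                       # step 11: return F

# Toy fitness: reward including features 0-2, penalize subset size.
# In the paper, this score would come from the ET classifier instead.
target = {0, 1, 2}
def fitness(mask):
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & target) - 0.1 * len(chosen)
```

Running `genetic_feature_search(8, fitness)` converges toward a mask that keeps the rewarded features while dropping the penalized extras, mirroring how the paper's GA trades classifier performance against feature-vector length.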

The architecture of the proposed IDS is depicted in Figure 3. At the top level, there are four major components, namely data preparation, feature selection, modelling, and evaluation. In the data preparation (pre-processing) phase, the numerical inputs of the NSL-KDD dataset are scaled (normalized) using Min-Max scaling [46,47]:

$$v_{scaled} = \frac{v - v_{min}}{v_{max} - v_{min}}$$

where $v$ is the original value of a given attribute and $v_{scaled}$ is the scaled value.

Furthermore, the categorical attributes of the NSL-KDD dataset (service, protocol_type and flag) are transformed (encoded) into numerical features using the LabelEncoder functionality found in Scikit-learn [48].

The second phase of the proposed IDS architecture involves the implementation of the GA for feature selection (as outlined in Section 5.1). This phase generates the sets of features used in the modelling process [52]. Moreover, in this study the test accuracy (TAC) is considered the most important metric because it is a score computed on previously unseen data (data that is independent from the training and validation subsets). Additionally, we plot the confusion matrices (CMs) [53] for the multiclass classification process. A CM allows us to assess how a specific model performed for each class in the dataset. We also assess the quality of classification of each model using the AUC measure.
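The two evaluation tools named above are straightforward to compute by hand. The sketch below shows minimal versions of test accuracy and a per-class confusion matrix; the class labels and predictions are illustrative examples, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, classes):
    """cm[i][j] counts instances of true class i predicted as class j,
    so correct predictions accumulate on the diagonal."""
    idx = {c: i for i, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[idx[t]][idx[p]] += 1
    return cm
```

Reading the off-diagonal cells of such a matrix is exactly how one spots the minority-class weakness discussed later: a row whose mass sits off the diagonal marks a class the model systematically misclassifies.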

Focusing on the tree-based models, the classifiers struggled to detect the minority types of intrusions. This trend could also be observed in Figure 9, which depicts the CM generated by the DT algorithm using $g_2$. The detection accuracy of minority classes (R2L and U2R) is one of the major areas that we intend to investigate in our future work.

Increasing the detection accuracy of minority classes will improve the overall performance of the proposed models.

The results discussed in Sections 6.3.1 and 6.3.2 demonstrate that the use of GA for feature selection on the NSL-KDD dataset has the potential to increase the performance of several classifiers. Moreover, Table 19 provides a comparison with existing methods such as the GA-ANN in [21]. In contrast to BAT [22], the GA-DT ($v_3$) obtained an increase of 6.70%.

In this research, we implemented ML-based algorithms to develop efficient IDSs.

Further, the NSL-KDD dataset is an excellent benchmark dataset for developing ML-based IDSs; however, it has some shortfalls regarding novel attacks. In our future work, we intend to apply our proposed framework to the following datasets: UNSW-NB15 and TON_IoT. The UNSW-NB15 is more advanced and more complex than the NSL-KDD. This will allow us to assess the effectiveness of our method on more complex data.