A Decision Tree Based Intrusion Detection System for Identification of Malicious Web Attacks

In today’s world, cyber attack is one of the major issues concerning the organizations that deal with technologies like cloud computing, big data, IoT etc. In the area of cyber security, intrusion detection system (IDS) plays a crucial role to identify suspicious activities in the network traffic. Over the past few years, a lot of research has been done in this area but in the current scenario, network attacks are diversifying in both volume and variety. In this regard, this research article proposes a novel IDS where a combination of information gain and decision tree algorithm has been used for the purpose of dimension reduction and classification. For experimental purpose the NSL-KDD dataset has been used. Initially out of 41 features present in the dataset only 5 high information gain valued features are selected for classification purpose. The applicability of the selected features are evaluated through various machine learning based algorithms. The experimental result shows that the decision tree based algorithm records highest recognition accuracy among all the classifiers. Based on the initial classification result a novel methodology based on decision tree has been further developed which is capable of identifying multiple attacks by analyzing the packets of various transactions in real time.


Introduction
Internet Information resources are actively growing, penetrating many spheres of social life. With the increasing dependence of human over the Internet technology and World Wide Web, it becomes challenging to secure the information and confidential data flowing over the network. Valuable information attracts more attackers and is always prone to a huge number of attacks over the network [7]. The rapid use of internet creates some serious issues: -Man in the Middle Attacks: This type of attack occurs when an attacker gets access over a twoparty transaction. This type of cyber security threats are made by cyber-criminals who set up fake public Wi-Fi networks or install malware on a victim's network.
-Internet of Things Insecurity: Internet of Things connects network devices across the world. Weak or guessable password, insecure network services and ecosystem interfaces, insecure data transfer and storage are the major threats for IoT.
-Changing the Signature of Attacks: In today's world, the nature of the attacks are changing day by day with the increase in data and these new signature based attacks are really difficult to detect [13].
In the area of cyber security, the role of IDSs are to analyze all the upcoming and outgoing packets of a system or a network and restrict unauthorized access. In the past few decades, researchers have adopted various machine learning and deep learning platforms for the construction of intelligent IDSs but due to the divergence nature of various attacks it is always challenging to select appropriate features with minimum detection time. The learning techniques of IDSs are broadly classified into three categories: supervised, unsupervised and semi supervised learning [1]. In supervised learning all the instances in the dataset have a specific label, in unsupervised learning the data has no label, where as semi supervised is a combination of both supervised and unsupervised learning. Based on the nature of the attacks the IDSs are further classified into three types [6] namely signature based, anomaly based and hybrid. Signature based or misuse based IDSs are used to identify known attacks, that means detection of attacks by looking for specific patterns, such as byte sequences in the network traffic. As a result it is unable to detect zero day attacks .Anomaly based IDSs are used to detect various unknown attacks. It creates the model according to the system behavior and identifies anomalies which deviates from normal behavior. Increasing false alarm rate is the major issue of anomaly based IDS. Hybrid IDS is a combination of both signature and anomaly based IDS. Snort is one of the popular and freely available IDS used in UNIX or Linux operating system. So the main contribution of this research work is: -Selection of a more upright feature set containing only 5 high gain valued features using information gain.
-Exhaustive investigation on the applicability of the derived features is carried out through various machine learning frameworks. This establishes the feasibility and part superiority of the proposed methodology.
-In tune with the previous contributions, a novel methodology based on decision tree is further proposed to identify various attacks in real time. So, the rest of the paper is structured as follows, section 2 reflects the literature survey portion, section 3 describes the proposed methodology, section 4 analyzes the results along with various comparisons and the conclusion is presented in section 5.

Literature Survey
Malicious web attack poses various significant threats for our network and becomes a challenging issue. As discussed earlier, current protection techniques are potentially designed based on three types of mechanism: signature based, anomaly based and hybrid. Ioulianou et al [10]. proposed a signature based IDS for IOT networks which involved both centralized and distributed IDS modules using Cooja simulator. A Denial of Service (DOS) attack scenario was created on IOT devices. They concluded that it supports application development for Contiki OS, but did not import the IDS modules to Contiki OS to test its performance in real world IOT environment. In another work Almutairi et al. [3] introduced a solution in parallel processing environment with the help of most frequent features and an updating agent. They proved that this module can be used for both host and network based IDS. It is a challenging task to identify new signature based attacks which are not frequent enough. Shah et al [19]. implemented a signature based IDS using SNORT and WTNPCAP in windows environment. Using this powerful software SNORT, real time traffic analysis has been performed and packet logging being carried out. In this work, there is a scope of improvement in processing time. In [16], the authors proposed an IDS using several Decisions trees and decision Rules. The prediction accuracy of classifiers had been evaluated using 10-fold cross validation. They have extracted 65534 instances from NSL-KDD dataset for their experiment and all the 41 features of the dataset were used for classification which are very much expensive to calculate. A combined Multi-Layer Perceptron (MLP) and Decision tree based technique were introduced by Esmaily et al. [9]. All the 41 features of KDDCup99 dataset were used for experimental purpose that not only generated a significant low false alarm value but were also difficult to implement in real time traffic analysis. In another interesting work Rai et al. [18] proposed a new algorithm based on Decision Tree Split method and made a comparative study with some existing classifiers like C4.5, CART etc. Depending on information gain values, the top 16 features were selected out of 41 features of NSL-KDD dataset that estimated a low yielding accuracy of 79%. Bajaj et al. [5] showed a completely different approach where several feature reduction techniques like Information Gain Attribute Evaluation, Gain Ratio Attribute, and Correlation Attribute Evaluation were used. They showed the applicability of various feature reduction algorithms with the help of different classifiers like Na¨ıve Bayes, SVM, MLP etc , but the accuracy level was not up to the mark. In [17] the authors used Principal Component Analysis (PCA) algorithm for dimension reduction. After dimension reduction the 16 relevant features of KDD99 dataset were further passed into Na¨ıve Bayes classifier for prediction but the experiment was conducted with a few number of instances. Lakhina et al. [14] used PCA along with various machine learning tools for Effective Anomaly-Based Intrusion Detection. After applying PCA the 8 foremost features of NSL-KDD dataset were fed through various machine learning frameworks and the model outperforms. In [11] the authors have built a model with the help of Support Vector Machine (SVM). The top 30 features of NSL-KDD dataset were selected using information gain technique which were further fed through SVM for classification. Their proposed model gives more than 90% accuracy. In [4] the authors have used a probabilistic approach for analyzing intrusion in the network traffic .This approach deploys first order Markov chain process to predict the anomalous activities in the network. The total procedure was carried out in three steps such as defining the states, building the state transition matrix and probability distribution of event occurrence. The FOMC model performance was evaluated through KDDCup99 dataset and their accuracy was up to the mark. Shanmugavadivu et al. [20] introduced a system based on fuzzy logic to detect an intrusion detection behavior within a network. They have used automated strategy for generation of fuzzy rules, which are obtained from the definite rules using frequent items. KDDCup99 dataset was taken for the experiment and after dimension reduction the model gives approximately 90% detection rate.
On surveying and analyzing the above works, it is concluded that though various research attempts have been done in this area, there is still a scope for improvement. Since dimension reduction and proper feature selection place a pivotal role for real time traffic analysis, so improvement can be done in feature reduction and enhancing recognition accuracy.

Proposed Methodology
In today's world safety and privacy of data is a major issue and IDSs provide a first level defense for our system and network .Various rule based procedure or machine learning frameworks are used for the construction of intelligent IDS . This work uses a decision tree based classifier for detecting various malicious attacks. As illustrated in Figure 1 the proposed methodology consists of 4 primary sections namely i) Dataset Selection ii) Dimension Reduction iii) Preprocessing and iv) Classification. .

Fig.1. Detailed Decision tree based model
This paper uses the NSL-KDD dataset which not only contains 125973 number of records but also contains 22 different types of modern attack. In the next phase, out of 41 features only top 5 relevant features have been selected using information gain technique. Finally after preprocessing, the selected features are evaluated through various machine learning frameworks.The individual segments of the proposed methodology are briefly explained in the following subsections

Dataset Description
The NSL-KDD dataset [2][21] [15] [8] was generated in 2009 that has significant advantage over KDD'99 dataset . It reduces the size of the KDD99 dataset by eliminating the redundant and duplicate records from the dataset and hence allowing the complete dataset to be used for experimental purpose. The data files containing in NSL-KDD dataset cover KDDTrain+ for training and KDDTest+ and KDDTest-21 for testing. The number of records in each dataset is shown in table 1.

Dimension Reduction
Feature ranking and dimension reduction play a pivotal role in any machine learning framework [12]. Information Gain Attribute Evaluation is one type of filter method for feature reduction. It Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 9 July 2020 doi:10.20944/preprints202007.0191.v1 evaluates the worth of an attribute by measuring the information gain with respect to the target attribute and ranks the attributes according to their gained values in descending order. It basically measures how much feature contributes in decreasing the overall entropy .Entropy is defined as: where Pi is the probability of the i th class in the dataset, and Entropy shows the degree of impurity . The closer to 0 it is, the lesser is the impurity in the dataset.
Information gain attribute evaluation algorithm has been applied on NSL-KDD dataset, which estimated the gain values of all the 41 features in descending order. We have selected the top ranked 5 features to build our model for better accuracy. The feature selection algorithm is described below: i.

Preprocessing of The Dataset
Among the top 5 selected features mentioned above, service and flag are categorical in nature where as the others are numeric. As the various machine learning models taks only numeric value, we have to convert the non-numeric features into numeric form. The one hot encoding technique has been used to convert the non numeric features into numeric pattern. Categorical variable Flag contains 11 different types of flag values: 'OTH', 'S1', 'S2', 'RSTO', 'RSTRs', 'RSTOS0', 'SF', 'SH', 'REJ', 'S0', 'S3' where the flag value SF can be converted to binary vector (0,0,0,0,0,0,0,0,0,1,0) using one hot encoding method. Similarly the service field which contains 70 different types of service values are also encoded into binary form.

Methodology For Classification
The top 5 features of NSL-KDD dataset are further fed through decision tree based model for classification purpose. Decision tree [22] is a supervised learning approach where each internal node represents a test on an attribute; each edge denotes an outcome of the test and each terminal nodes holds a class level. It is a simple decision making algorithm does not requires any normalization and scaling procedure. The working principal of decision tree construction is shown in algorithm 2.

Algorithm 2 Decision Tree Construction
Input: Reduced NSL-KDD dataset Output: A Decision tree using the dataset 1: Start 2: Calculate the Gini index for each attribute using Gini index ← 1 − (Pi) 2 , where Where Pi is being the probability of i th class in a given branch 3: Split the dataset using a: Select the root node based on the attributes that has the lowest Gini index b: Determine the splitting criteria c: Partition the dataset into two subsets based on the splitting criteria 4: Repeat step 1 and 2 on each subset until leaf nodes can be found in all the branches of the tree 5: Stop

Experimental Setup and Evaluation Matrices
All the experiments have been conducted on Anaconda jupyter IDE platform in python 3.6. 4 environment with Intel(R) core (TM) i3-7020U CPU @ 2.00 GHz configuration and 4GB memory. The result analysis section is divided into two subsections namely binary and multiclass classification. In binary classification, dataset contains two classes: normal and attack. On the other hand in multi class classification, the attack type is further divided into 4 subclasses: DOS, U2L, R2L and Probing. For comparison purpose, the features have been evaluated through various machine learning frameworks. Finally, for real time classification a transaction based procedure along with some result outcome is shown. We have used 5 important performance indicators to evaluate the performance of our model.

Accuracy (AC):
It is the percentage ratio of correctly classified instances by the total number of instances.

Precision:
It is the ratio of relevant instances by the retrieved instances.

Recall:
It is the ratio of relevant instances that have been retrieved by the total amount of relevant instances.
4. F1 measure: It is the weighted average of precision and recall.
5. Support: Support is the true instances that lie in that class where, -TP (True positive): is the number of positive instances that are corrected classified.
-TN(True negative):is the number of negative instances that are correctly classified.
-FP(False positive):is the number positive instances that are wrongly classified.
-FN(False negative):is the number of negative instances that are wrongly classified.
Hence a good IDS always have a good detection rate and low false alarm rate.

Result Analysis
In order to perform the result analysis the 5 relevant features of KDD Train+dataset have been used for training purpose and the testing for both binary and multiclass categories were performed through KDDTest+ and KDDTest-21 dataset respectively. The detection outcome of various classifiers along with its precision, recall, F1 measure and support values are shown in the following subsections.

Binary Classification
In binary classification, all the 4 type of attacks have been categorized into one group called attack and the rest are defined as normal type. The model has been trained Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 9 July 2020 doi:10.20944/preprints202007.0191.v1 through KDDTrain+ dataset with 4 different machine learning framework Na¨ıve bayes, Logistic regression, Random forest and decision. Table 3 summarizes the corresponding accuracy, precision , recall, F1-measure and support values of KDDTest+ dataset for the above mentioned classifiers. The experimental result shows that the decision tree model records 90.9% accuracy with high precision , recall, F1 measure and support values among all other classifiers.     Table 7 shows the model construction time for all the mentioned classifiers. According to the table, it is shown that a proper feature reduction minimizes the model buildup time to a particular level. Though the Logistic Regression and Na¨ıve Bayes classifier take less time but there is a drastic reduction when the accuracy of the model is concerned. The result also implies that there is an optimal trade off between model build up time and detection rate which is achieved by decision tree method. In the next part, the detection rates for individual attack types have been estimated.   According to the result analysis section the proposed model creates a complete tradeoff between the model construction time and detection rate. In order to justify the aforementioned statement we have prepared 10 transactions by taking random data from both of the test dataset which contains 100 packets each. Random selection helps to maintain the generic nature of our proposed technique. Instead of building a model at run time, few coefficient values have been obtained by using algorithm 2 in model buildup time that are further incorporated to estimate the detection rate of randomly selected sample. Figure 3 depicts the transaction time verses accuracy graph where the X-axis defines the various transactions and the red and green line along with Y-axis shows the corresponding transaction time and accuracy. According to the graph all the transactions have been performed within a time period between 0 to 18 milliseconds with a high detection rate.

Conclusion
In this work, we have proposed a complete IDS based on the combination of information gain and decision tree based algorithm. The proposed method not only reduces the features up to a particular significant level, but also produces a high detection rate in both binary and multiclass categories. The main advantages of the aforementioned technique are, it creates a complete tradeoff between the detection rate and model construction time. It uses only 5 relevant features instead of 41 features which not only reduces the complexity of the model but also reduces the detection time This work also tried to incorporate most recent works that have been done in the current scenario and shows the efficiency of our model with respect to detection rate, precision, recall, F-measure and support. In future, we can effectively improve the detection rate and model construction time by introducing the concept of various deep learning and ensemble learning method.