Content-based Spam Email Detection Using N-gram Machine Learning Approach

Recently, spam emails have become a significant problem with the expanding usage of the Internet. Filtering emails is therefore, to some extent, a necessity. A spam filter is a system that detects undesired and malicious emails and blocks them from reaching users' inboxes. Spam filters check emails for anything "suspicious" in terms of text, email address, header, attachments, and language. In our proposed approach, we use different features such as word2vec, word n-grams, character n-grams, and a combination of variable-length n-grams for comparative analysis. Different machine learning models such as support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naïve bayes (MNB) are trained on the extracted features. We use different evaluation metrics such as precision, recall, f1-score, and accuracy to evaluate the experimental results. Among them, SVM provides 97.6% accuracy, 98.8% precision, and a 94.9% f1-score using a combination of n-gram features.


Introduction
In recent years, web security has become one of the most critical issues, as most of our daily services rely on the internet, mobile computing, and electronic media. Email is one of the main mediums of communication, and its volume increases with the growing use of the internet. Spamming is one of the most straightforward attacks in email messaging. Users frequently receive annoying spam messages and malicious phishing messages after subscribing to different websites, products, services, catalogs, newsletters, and other types of electronic communications (1; 8). In some cases, spam email is produced by mass-mailing viruses or Trojan horses. According to a recent survey by the China Anti-Spam Alliance, a typical Internet user receives 35 emails per week on average, of which 41% are spam. Such spam messages waste time as well as bandwidth on internet connections. Furthermore, they are often associated with offensive content and spread computer viruses. For these reasons, cyber specialists are devoted to developing accurate spam detection for digital communication.
Moreover, there are many solutions to filter spam, e.g., blacklist and whitelist filtering techniques, decision tree based approaches, email address based approaches, and machine learning based methods. The majority of them rely heavily on text analysis of the content of an email. As a result, there is a growing demand for effective anti-spam filters that automatically identify and remove spam messages or alert users to possible spam messages. However, spammers always investigate the loopholes of existing spam filtering techniques and introduce new designs for spreading spam emails on a wide scale; consequently, existing systems fail against them. Tokenization attacks sometimes mislead spam filtering by adding extra spaces, so email contents need to be structured (14). Moreover, in spite of the high accuracy of machine learning based spam email detection (3; 17), false positives (precision of 92.9%) remain an issue due to one-shot detection of email threats. To address the false positive issue and changes in attack design, our proposed approach removes stop words and other unwanted information from the texts before further analysis. After preprocessing, these texts go through numerous feature extraction methods, such as word2vec, word n-grams, character n-grams, and a combination of variable-length n-grams. Different machine learning techniques such as support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naïve bayes (MNB) (15) are applied to these feature matrices to classify the emails. The primary contributions of this paper are the following: (i) to create a content-based spam filter that can classify spam and ham e-mails; (ii) to analyze numerous feature extraction methods, such as word2vec, word n-grams, character n-grams, and a combination of variable-length n-grams; and (iii) to evaluate the performance of numerous experiments and achieve the best performance using support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naive bayes (MNB) with proper features.
The rest of the paper is organised as follows. Section 2 discusses the related works; the proposed approach for detecting e-mail spam is presented in Section 3; results and analysis are demonstrated in Section 4; and finally, the paper is concluded in Section 5.

Related works
The authors of (7) proposed a spam mail filtering system based on n-gram indexing with support vector machines. They experimented with emails obtained from various users and performed the filtering with an SVM classifier. Kaur et al. (8) proposed a spam detection technique using n-gram analysis and machine learning techniques; the constructed n-grams are used to predict unlabeled data. Ahmad et al. (9) proposed a method achieving 96% accuracy, in which an optimal subset of features is chosen for the learning process and a support vector classifier is used for classification. Sarker et al. (16) performed a broad effectiveness analysis of machine learning security modeling with optimal features. Nayak et al. (10) proposed a spam email detection method that employs a hybrid bagging approach for features and combines the Naive Bayes and Decision Tree machine learning algorithms as classifiers, achieving an overall accuracy of 88.12%. Sheu et al. (11) proposed a method concentrating on email header analysis, first using a decision tree classifier to search for spam association rules; an effective systematic filtering process is then generated from these rules. Chen et al. (12) proposed a systematized spam filtering method based on decision tree data mining, deriving spam association rules and applying them to build an effective spam filter; this method provides a precision of 96%. Kumar et al. (13) proposed an approach that verifies the email header and URL and analyzes the body text using different rules; they also employed a Bayesian classifier and the Apriori algorithm to classify files and attachments. Khamis et al. (17) proposed a framework that uses email header features for spam detection, analyzing two email datasets; a support vector machine classifier provides 88.80% accuracy.

Methodology
The proposed system for spam e-mail detection is depicted in Figure 1. It has four phases: preprocessing, feature extraction, training, and prediction. Several preprocessing steps are performed before extracting features from the text. Feature extraction techniques are then used to extract features from the preprocessed texts. In the training phase, these extracted features are used to train machine learning classifiers such as support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naïve bayes (MNB). Finally, in the prediction stage, the contents of the emails are predicted as spam or ham.
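As an illustration only (not the authors' exact configuration), the four phases can be sketched end-to-end with scikit-learn; the toy e-mails, labels, and parameters below are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data standing in for a preprocessed e-mail corpus.
emails = [
    "win free money now", "claim your free prize today", "cheap meds online",
    "meeting at noon tomorrow", "please review the attached report", "lunch with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Feature extraction (TF-IDF over uni-grams + bi-grams) and training in one pipeline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(emails, labels)

# Prediction phase: classify an unseen message.
print(model.predict(["claim your free money"]))
```

Bundling the vectorizer and classifier in one pipeline ensures the same n-gram vocabulary is applied at prediction time as at training time.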

Preprocessing
The preprocessing step involves eliminating inconsistencies and mistakes from raw data to make it more understandable. As a result, we must preprocess our data before feeding it into our model. Consider the following email: "hello! We want to make localized version of the software....". This email is preprocessed in the following manner:
• Special character removal: Each text is stripped of special characters such as (, =, >), numbers, and punctuation. After removing these characters, the text of the above email becomes "hello we want to make localized version of the software".
• Stop words removal: Words like "the", "an", "this", "a", etc. are not needed for recognizing spam or ham emails, so they are excluded. After removing the stop words, the text of the given e-mail becomes "want make localized version software".
• Tokenization: Tokenization is the process of breaking down a large text into smaller tokens. Tokenization yields a list of words such as ("want", "make", "localized", "version", "software") for the above e-mail.
• Lemmatization: Lemmatization aims to eliminate only inflectional endings and restore the lemma, which is the base or dictionary form of a word. After lemmatizing the text of the e-mail, we get words like ("want", "make", "localize", "version", "software").
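A minimal sketch of the first three steps in plain Python (the tiny stop-word list here is purely illustrative; a real system would use a full list such as NLTK's, and lemmatization is omitted):

```python
import re

# Illustrative stop-word list only; real pipelines use NLTK's or spaCy's.
STOP_WORDS = {"we", "to", "of", "the", "a", "an", "this"}

def preprocess(text):
    # Special character removal: keep only letters and spaces, lowercase the rest.
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    # Tokenization: split on whitespace.
    tokens = text.split()
    # Stop word removal.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("hello! We want to make localized version of the software...."))
```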

Feature extraction
The feature extraction phase converts raw data into useful knowledge by reformatting, merging, and converting primary features into new ones. We have used the following feature extraction techniques:

N-gram:
An n-gram is an n-tuple or set of n words or characters that follow one another. The number of consecutive terms treated as one gram is indicated by the letter 'n'. Since machine learning algorithms cannot operate on raw, unstructured text, we apply word n-grams, character n-grams, and a combination of variable-length n-grams to gain a better understanding of the sentences, and we convert the text into numerical vectors using TF-IDF 1.
Word n-gram: Word n-grams deal with tuples or groups of words. The word n-gram representation of an e-mail is shown in Table 1.

Table 1. Word n-gram representation of an e-mail.

Sentence: "We want to make localized version of the software"
Uni-gram: 'we', 'want', 'to', 'make', 'localized', 'version', 'of', 'the', 'software'
Bi-gram: 'we want', 'want to', 'to make', 'make localized', 'localized version', 'version of', 'of the', 'the software'
Tri-gram: 'we want to', 'want to make', 'to make localized', 'make localized version', 'localized version of', 'version of the', 'of the software'

Character n-gram: A text represented as a sequence of characters is known as a character n-gram. Unlike word n-grams, character n-grams can capture a word's identity, its possible neighbors, and its morphological makeup. Table 2 shows the character n-gram representation of an email.
Combination of variable length n-gram: A combination of variable-length n-grams does not fix the length in advance; it can merge, for example, uni-grams and bi-grams, or tri-grams and five-grams. Table 3 shows the combination of variable-length n-gram representation of an e-mail.
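The word and character n-grams of Tables 1 and 2 can be generated with a few lines of plain Python (a sketch; in practice a vectorizer such as scikit-learn's TfidfVectorizer with `analyzer='word'` or `'char'` and an `ngram_range` like (1, 2) produces the TF-IDF-weighted variable-length combinations directly):

```python
def word_ngrams(text, n):
    # Consecutive groups of n words.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    # Consecutive substrings of n characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "we want to make localized version of the software"
print(word_ngrams(sentence, 2)[:3])   # first three bi-grams
print(char_ngrams("want", 3))
```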

Word2vec:
Word2vec uses a neural network model to learn word associations from a large corpus of text. Such a model can identify synonyms and suggest additional terms for a sentence. As the name indicates, word2vec associates each distinct word with a particular vector of numbers. We train the model over the entire corpus, representing each word as a vector.
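One common way to feed word2vec features to a classifier is to average the per-word vectors of an e-mail into a single fixed-length vector. A sketch with invented 3-dimensional vectors (real embeddings would come from a trained model such as gensim's `Word2Vec`):

```python
# Toy 3-dimensional "embeddings" for illustration only; in practice these
# come from a trained word2vec model.
EMBED = {
    "want":     [0.1, 0.3, 0.2],
    "make":     [0.0, 0.2, 0.4],
    "software": [0.5, 0.1, 0.0],
}

def email_vector(tokens, embed, dim=3):
    # Average the vectors of known words; unknown words are skipped.
    vecs = [embed[t] for t in tokens if t in embed]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(email_vector(["want", "make", "unknown"], EMBED))
```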

Training
The features gathered in the previous phase are used to train machine learning models. In our proposed approach, support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naïve bayes (MNB) classifiers are trained on the extracted features. In the following subsections, we discuss these algorithms.

Support Vector Machine:
A support vector machine is a supervised machine learning model that generalizes between two classes. The primary goal of the SVM is to find a hyperplane that distinguishes between the two classes. The equation of the hyperplane is shown in Eq. 1:

w · x_i + b = 0    (1)

where w is the weight vector, b is the bias, and x_i is the feature vector of sample i.
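The decision rule implied by Eq. 1 classifies a sample by the side of the hyperplane it falls on; a sketch (the weights and bias here are invented, since in practice they come from training):

```python
def svm_decision(w, x, b):
    # Sign of w.x + b determines which side of the hyperplane x lies on.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(svm_decision([1.0, -2.0], [3.0, 1.0], -0.5))
```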

Logistic Regression:
Logistic regression works effectively for binary classification problems. The sigmoid activation function that yields the binary classification is shown in Eq. 2:

σ(z) = 1 / (1 + e^(−z)),  where z = w_0 x_0 + w_1 x_1 + ... + w_n x_n    (2)

Here, the model coefficients w_0, w_1, ..., w_n are produced using Maximum Likelihood Estimation, and x_0, x_1, ..., x_n are the features or independent variables. Finally, the binary outcome likelihood is calculated from z, and the result is separated into two categories based on the given information x.
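Eq. 2 can be computed directly (the coefficients below are placeholders; the MLE fitting itself is not shown):

```python
import math

def sigmoid(z):
    # Maps any real z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, features):
    # z = w0*x0 + w1*x1 + ... + wn*xn
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

print(predict_proba([0.4, -0.3], [1.0, 2.0]))
```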

Decision Tree:
The decision tree has two types of nodes: internal and external. External (leaf) nodes represent the decision class, while internal nodes hold the features required for categorization. A top-down strategy is used to build the decision tree, which splits the data into increasingly homogeneous subsets. Homogeneity is measured by the entropy defined in Eq. 4:

E(S) = −Σ_i p_i log2(p_i)    (4)

where E(S) is the entropy of sample set S in the training class, and p_i is the probability of class i in the sample. Entropy is used to determine the splitting consistency. During a split, all of the features are considered to identify the best split for each node. A fixed random state (0) keeps the feature shuffling reproducible.
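Eq. 4 for a list of class labels, sketched in plain Python:

```python
import math

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the classes present in S.
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )

print(entropy(["spam", "ham", "spam", "ham"]))
```

A 50/50 split gives the maximum entropy of 1.0, while a pure subset gives 0.0, which is why splits that lower entropy are preferred.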

Multinomial naïve bayes:
In Natural Language Processing (NLP), the multinomial naive bayes algorithm is a common probabilistic learning method. It assesses the probability of each tag for each sample and returns the tag with the highest probability. Bayes' theorem calculates the likelihood of an event occurring based on prior knowledge of the conditions involved. The formula is shown in Eq. 5:

P(A|B) = P(B|A) P(A) / P(B)    (5)

Given a predictor B, we evaluate the likelihood of class A. P(B) denotes the prior probability of B, P(A) denotes the prior probability of class A, and P(B|A) denotes the likelihood of predictor B given class A.
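Eq. 5 in code, with made-up probabilities purely for illustration:

```python
def posterior(p_b_given_a, p_a, p_b):
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(contains "free" | spam) = 0.8,
# P(spam) = 0.25, P(contains "free") = 0.4.
print(posterior(0.8, 0.25, 0.4))
```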
The trained classifier models are then used for predicting the text contents as spam or ham.

Evaluation Results and Analysis
We implemented all experiments on an Intel Core i5 processor with a GPU and 8 GB RAM, using Python 3.7 in a Jupyter notebook.
The 'spam or ham e-mail' dataset is collected from Kaggle 2, an online data publishing source. The dataset contains 5731 e-mails, of which 1369 are spam and 4362 are ham. The training set contains 80% of the total data, while the testing set contains the remaining 20%.
We use various machine learning algorithms to assess the system's performance, including SVM, MNB, DT, and LR, and test each model with the various feature extraction approaches. Table 4 shows the performance evaluation of the different feature extraction methods. For word n-grams, the SVM classifier provides the highest accuracy of 95.4% and precision of 98.2% using bi-grams, while the logistic regression classifier achieves the best performance of 95.7% accuracy and 98.2% precision using tri-grams. For character n-grams, the naive bayes classifier provides the highest accuracy of 93.8% with 100% precision using bi-grams, and SVM achieves the highest accuracy of 95.9% with 97.5% precision using tri-grams. We also combine variable-length n-gram features and find that the combination of uni-grams and bi-grams provides the highest accuracy of 97.6% and precision of 98.8% using the SVM classifier. For the word2vec method, the logistic regression classifier provides the highest accuracy of 83.1% and precision of 83.6%.
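The reported metrics follow the standard definitions; a sketch that computes them from true and predicted labels:

```python
def metrics(y_true, y_pred, positive="spam"):
    # Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    # F1 = harmonic mean of precision and recall.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

print(metrics(["spam", "spam", "ham", "ham"], ["spam", "ham", "ham", "ham"]))
```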
We consider uni-grams, bi-grams, and five-grams to investigate the features of email content. Using uni-grams, we alleviate the unwanted tokenization of white spaces, which decreases the chance of spam being classified as ham (a false negative) or vice versa (a false positive), and provides the highest accuracy, as shown in Figure 2. In conclusion, SVM has proven to be the best classifier for our dataset due to its robustness in high-dimensional feature spaces. With the combination of uni-grams and bi-grams, SVM achieves the highest accuracy of 97.6%, precision of 98.8%, and f1-score of 94.9%. A receiver operating characteristic (ROC) curve shows how well a classification model performs over various classification thresholds. Figure 3 shows the ROC curves for detecting spam e-mails with the combination of uni-grams and bi-grams using the different classifiers. The logistic regression and SVM classifiers perform best in terms of the area under the ROC curve, which is 0.91 in both cases. Further, we compare our proposed method with the benchmark works of (3) and (17) in Table 5. Our proposed method outperforms both benchmarks with an accuracy of 97.6% and precision of 98.8% using the SVM classifier.

Conclusion
Accurate spam detection is an integral part of email communication. Despite the accurate detection of spam (3; 17), the false positive rate remains an issue. To address this, we present a content-based spam email detection approach. We use multinomial naïve bayes, logistic regression, support vector machine, and decision tree classifiers to learn the various features extracted from the contents of emails. For comparative analysis, we use word n-grams (bi-gram, tri-gram), character n-grams (bi-gram, tri-gram), combinations of variable-length n-grams (uni-gram and bi-gram, bi-gram and five-gram), and word2vec features. Among them, SVM achieves the best performance of 97.6% accuracy, 98.8% precision, and 94.9% f1-score for the combination of variable-length n-grams (uni-gram and bi-gram). In the future, we plan to extend our work by analyzing features using context-based machine learning.