Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Content-based Spam Email Detection Using N-gram Machine Learning Approach

Version 1 : Received: 13 September 2021 / Approved: 14 September 2021 / Online: 14 September 2021 (11:36:36 CEST)

How to cite: Hossain, S.M.M.; Sarker, I.H. Content-based Spam Email Detection Using N-gram Machine Learning Approach. Preprints 2021, 2021090236 (doi: 10.20944/preprints202109.0236.v1).

Abstract

Spam emails have recently become a significant problem with the expanding usage of the Internet, so the need to filter emails is, to some extent, obvious. A spam filter is a system that detects undesired and malicious emails and blocks them from reaching users' inboxes. Spam filters check emails for anything "suspicious" in terms of text, email address, header, attachments, and language. In our proposed approach, however, we use features such as word2vec, word n-grams, character n-grams, and a combination of variable-length n-grams for comparative analysis. Different machine learning models, namely support vector machine (SVM), decision tree (DT), logistic regression (LR), and multinomial naïve Bayes (MNB), are trained on the extracted features. We evaluate the experimental results using precision, recall, F1-score, and accuracy. Among the models, SVM achieves 97.6% accuracy, 98.8% precision, and a 94.9% F1-score using a combination of n-gram features.
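As an illustration of the feature types and metrics named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, and the default n-gram ranges are all assumptions made for this example, and no classifier is trained here.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams of text as space-joined strings.

    Lowercasing and whitespace tokenization are assumptions of this sketch.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams of the lowercased text (spaces included)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def combined_features(text, word_range=(1, 2), char_range=(2, 3)):
    """Count variable-length word and character n-grams in one feature bag.

    The default ranges are illustrative, not values taken from the paper.
    Keys are prefixed (e.g. "w2:" or "c3:") so the two families never collide.
    """
    feats = Counter()
    for n in range(word_range[0], word_range[1] + 1):
        feats.update(f"w{n}:{g}" for g in word_ngrams(text, n))
    for n in range(char_range[0], char_range[1] + 1):
        feats.update(f"c{n}:{g}" for g in char_ngrams(text, n))
    return feats


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1-score, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical message and labels, used only to exercise the functions.
    print(combined_features("Win a free prize").most_common(5))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a full pipeline, a feature bag like the one returned by `combined_features` would be turned into a vector and passed to a classifier such as an SVM. The prefixed keys keep word-level and character-level n-grams in one vocabulary without collisions, which is one simple way to realize a "combination of variable-length n-grams".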

Keywords

Spam Detection; Feature extraction; N-grams; Machine Learning
