Submitted: 28 November 2024
Posted: 28 November 2024
Abstract
Keywords:
1. Introduction
2. Objectives
- Compare the performance of traditional models and transformer models in the email classification task for phishing and non-phishing labels, evaluating their efficacy using quantitative metrics such as precision, recall, F1-score, and accuracy.
- Assess the improvements that transformer models bring to text classification by analyzing their classification accuracy and their ability to process complex and diverse content.
- Analyze the emails misclassified by both traditional and transformer models. By identifying recurring patterns and root causes of errors, this objective aims to propose actionable refinements for future phishing detection methods, improving their effectiveness and reliability.
3. Related Work
3.1. Analysis of Phishing Websites
3.2. Analysis of Phishing URLs
3.3. Analysis of The Content of Phishing Emails
3.4. Deep Learning for Phishing Detection
3.5. Transformer Models for Phishing Detection
3.6. Datasets Used in Previous Investigations
| Year | Dataset Name | Linear Regression | Sequential | Decision Trees | Random Forest | Naive Bayes | CNN | roBERTa |
|---|---|---|---|---|---|---|---|---|
| 2020 | Email Classification | 94.08 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2023 | Email Spam Classification | 0 | 86.2 | 0 | 0 | 79.87 | 0 | 78.57 |
| 2001 | Enron Spam Data (No Code) | 0 | 0 | 0 | 0 | 95 | 0 | 0 |
| 2018 | Fraud Email Dataset | 92 | 0 | 0 | 0 | 97 | 0 | 0 |
| 2023 | Phishing Email Detection | 0 | 0 | 93.1 | 0 | 0 | 97 | 99.36 |
| 2023 | Phishing-Mail | 0 | 0 | 92.82 | 0 | 0 | 99.03 | 96.81 |
| 2023 | Phishing Email Detection | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2018 | Phishing-2018 Monkey | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2019 | Phishing-2019 Monkey | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | Phishing-2020 Monkey | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2021 | Phishing-2021 Monkey | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2022 | Phishing-2022 Monkey | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2018 | private-phishing4.mbox | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2023 | Spam (or) Ham | 0 | 99.72 | 0 | 0 | 96.9 | 0 | 0 |
| 2020 | Spam Classification for Basic NLP | 0 | 81 | 96.77 | 0 | 98.49 | 0 | 98.33 |
| 2021 | Spam Email | 0 | 96.67 | 97.21 | 0 | 99.13 | 0 | 0 |
| 2021 | Spam_assasin | 0 | 0 | 98.6 | 98.87 | 0 | 0 | 0 |
| 2024 | Phishing Validation Emails Dataset | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2022 | NLP Spam Ham Email Classification | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2022 | Phishing Email Data by Type | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2023 | Phishing-Mail | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4. Data and Methods
4.1. Data Collection
- Phishing Email Detection
- Phishing Email Data by Type
- Email Spam Detection Dataset (classification)
- Enron Spam Data
- private-phishing4.mbox
- phishing-2018 monkey
- phishing-2019 monkey
- phishing-2020 monkey
- phishing-2021 monkey
- phishing-2022 monkey
- Phishing Email Data
- self-promoted phishing data
- phishing validation dataset
4.2. Applied Models
- Logistic Regression: A linear model used for binary classification. It estimates the probability that an input belongs to a class and assigns the label by thresholding that probability.
- Random Forest: An ensemble model that trains many decision trees on random subsets of the data and features, then combines their individual predictions (by majority vote in classification) to obtain a more robust result than any single tree.
- Support Vector Machine (SVM): A model that classifies data by finding the hyperplane that best separates the classes in the feature space.
- Naive Bayes: A statistical model that uses Bayes’ theorem with the assumption of independence between features. It works well with large datasets.
- BERT (bert-base-uncased): BERT (Bidirectional Encoder Representations from Transformers) is a pretrained transformer encoder that builds the representation of each token from both its left and right context, rather than reading the text in a single direction.
- distilBERT (distilBERT-base-uncased): A compact, distilled version of BERT that retains about 97% of BERT's performance on language-understanding benchmarks while consuming fewer resources.
- XLNet (xlnet-base-cased): A generalized autoregressive model that uses permutation-based prediction to capture bidirectional context without BERT's assumption that masked tokens are independent of one another.
- roBERTa (roBERTa-base): roBERTa (Robustly optimized BERT approach) is a variant of BERT that improves the pretraining procedure, using more data and more training steps to enhance the robustness and precision of the model.
- ALBERT (A Lite BERT): A lightweight version of BERT that reduces the model size through parameter sharing and embedding matrix factorization, maintaining high performance with fewer parameters.
4.3. Evaluation Metrics
- Precision: Precision is the proportion of true positives among all the positive predictions. It measures the accuracy of the positive predictions made by the model.
- Recall: Recall is the proportion of true positives among all the actual positive data. It measures the model’s ability to capture all the positive samples.
- F1-Score: The F1-Score is the harmonic mean of precision and recall. It offers a single metric that balances both values and is more informative than accuracy when the classes are imbalanced.
- Accuracy: Accuracy is the proportion of correct predictions among all the predictions made by the model.
- True Positive Rate (TPR): TPR is another term for recall. It represents the proportion of actual positives correctly identified by the model.
- False Positive Rate (FPR): FPR represents the proportion of actual negatives incorrectly classified as positives by the model.
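The definitions above can be sketched directly from the four cells of a binary confusion matrix. The `Confusion` class, the helper names, and the example counts below are illustrative assumptions, not values from the experiments; "positive" is taken to mean the phishing class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary confusion matrix (positive = phishing)."""
    tp: int  # phishing emails predicted as phishing
    fp: int  # legitimate emails predicted as phishing
    fn: int  # phishing emails predicted as legitimate
    tn: int  # legitimate emails predicted as legitimate


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a class is absent.
    return numerator / denominator if denominator else 0.0


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    # Identical to the True Positive Rate (TPR).
    return _ratio(c.tp, c.tp + c.fn)


def f1_score(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


def accuracy(c: Confusion) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


def false_positive_rate(c: Confusion) -> float:
    return _ratio(c.fp, c.fp + c.tn)


# Hypothetical counts, chosen only to exercise the formulas.
example = Confusion(tp=90, fp=10, fn=5, tn=95)
for metric in (precision, recall, f1_score, accuracy, false_positive_rate):
    print(f"{metric.__name__}: {metric(example):.4f}")
```

In a phishing filter, a high FPR means legitimate mail is being blocked, while low recall means phishing mail is getting through, so the two are usually read together.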
4.4. Experiment Setup
4.4.1. Data Pre-Processing
- Text Filtering: Special characters, non-alphabetic values, and unnecessary symbols were removed. Additionally, the text was normalized by converting all characters to lowercase.
- Tokenization and Vectorization: The text was transformed in a two-step process. First, a Bag of Words (BoW) representation was created, in which each email is converted into a fixed-length vector of term frequencies over the vocabulary. The term frequencies were then weighted using TF-IDF (Term Frequency-Inverse Document Frequency). This technique weights each term by its frequency within an email and its rarity across the dataset, so that terms distinctive to an email count for more than terms common to the entire corpus. The implementation uses scikit-learn’s TfidfVectorizer, which internally combines both the BoW and TF-IDF transformations [31].
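The two pre-processing steps can be illustrated with a small, dependency-free sketch. It is not the pipeline used in the experiments, which relies on TfidfVectorizer. The function names `clean` and `tfidf`, the sample emails, and the choice to replace removed characters with a space are assumptions. The smoothed IDF and L2 normalization follow TfidfVectorizer's documented defaults (`smooth_idf=True`, `norm='l2'`), but tokenization here is a plain whitespace split, so the weights will not match scikit-learn's output exactly.

```python
import math
import re
from collections import Counter


def clean(text: str) -> str:
    """Lowercase the text and keep only letters and single spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [clean(doc).split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of emails containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag-of-words term frequencies
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


# Hypothetical sample emails.
DOCS = ["Verify your ACCOUNT now!!!", "Meeting notes: account review @ 3pm"]
VOCAB, WEIGHTS = tfidf(DOCS)
print(VOCAB)
print([round(w, 3) for w in WEIGHTS[0]])
```

Because "account" occurs in both sample emails, it receives a lower weight than "verify", which occurs in only one. This is the effect the TF-IDF weighting is meant to produce.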
4.4.2. Traditional Machine Learning Model Parameters
- Logistic Regression
  - Model: Logistic Regression (max_iter=1000)
  - Hyperparameters:
    - Regularization Parameter (C): [0.1, 1, 10]
- Random Forest
  - Model: Random Forest Classifier
  - Hyperparameters:
    - Number of Estimators (n_estimators): [50, 100, 200]
    - Maximum Depth (max_depth): [None, 10, 20]
- Support Vector Machine (SVM)
  - Model: SVC
  - Hyperparameters:
    - Regularization Parameter (C): [0.1, 1, 10]
    - Kernel: ['linear', 'rbf']
- Naive Bayes
  - Model: Multinomial Naive Bayes
  - Hyperparameters:
    - Alpha (Smoothing Parameter): [0.5, 1.0, 1.5]
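The candidate values listed above can be written as parameter grids and expanded into individual configurations. The text does not state how the values were searched, so the exhaustive enumeration below is an assumption, as are the names `PARAM_GRIDS` and `expand`. The dictionary keys follow scikit-learn's estimator and parameter names.

```python
from itertools import product

# Candidate values copied from the lists above.
# Logistic Regression additionally uses the fixed setting max_iter=1000.
PARAM_GRIDS = {
    "LogisticRegression": {"C": [0.1, 1, 10]},
    "RandomForestClassifier": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20],
    },
    "SVC": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    "MultinomialNB": {"alpha": [0.5, 1.0, 1.5]},
}


def expand(grid):
    """Return every combination of the candidate values in one grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


COUNTS = {name: len(expand(grid)) for name, grid in PARAM_GRIDS.items()}
for name, count in COUNTS.items():
    print(f"{name}: {count} configurations")
print("total:", sum(COUNTS.values()))
```

With scikit-learn, each inner dictionary has the shape that `GridSearchCV` accepts as `param_grid`. The number of cross-validation folds and the scoring metric would still have to be chosen, since the text does not specify them.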
4.4.3. Transformer Model Parameters
- Model Names:
  - distilBERT-base-uncased
  - bert-base-uncased
  - xlnet-base-cased
  - roBERTa-base
  - alBERT-base-v2
- Training Settings:
  - Tokenizer: AutoTokenizer from Hugging Face’s transformers library.
  - Dataset: EmailDataset class defined with:
    - Texts and labels from the dataset.
    - Tokenizer for encoding texts with special tokens, padding, and truncation.
  - Optimizer: AdamW optimizer with a learning rate of 2e-5.
  - Device: Utilizes CUDA if available, otherwise CPU.
  - Epochs: 3
- Model Evaluation:
  - Batch Size: 16 for both training and testing DataLoader.
  - Loss Function: Cross-entropy loss.
  - Metrics: Classification report with precision, recall, F1-score, and support.
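The settings above could be wired together roughly as follows. This is a sketch under stated assumptions, not the authors' training script. Only the `EmailDataset` name, the learning rate, the epoch count, the batch size, and the CUDA fallback come from the text. The `TrainConfig` and `fine_tune` names, the `max_length` of 256, the shuffling, and the lowercase model identifier (the spelling used on the Hugging Face Hub) are illustrative choices. The heavy imports are placed inside `fine_tune` so that the dataset class can be inspected without PyTorch installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    model_name: str = "distilbert-base-uncased"  # assumption: Hub spelling
    learning_rate: float = 2e-5
    epochs: int = 3
    batch_size: int = 16
    max_length: int = 256  # assumption: not stated in the text


class EmailDataset:
    """Map-style dataset: encodes one email per item with the given tokenizer."""

    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = list(texts)
        self.labels = list(labels)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            add_special_tokens=True,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
        )
        return {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": int(self.labels[idx]),
        }


def fine_tune(texts, labels, cfg=TrainConfig()):
    """Fine-tune a sequence classifier; requires torch and transformers."""
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(cfg.model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.model_name, num_labels=2
    ).to(device)

    def collate(batch):
        # Stack the per-item lists into one tensor per field.
        return {key: torch.tensor([row[key] for row in batch]) for key in batch[0]}

    dataset = EmailDataset(texts, labels, tokenizer, cfg.max_length)
    loader = DataLoader(dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)

    model.train()
    for _ in range(cfg.epochs):
        for batch in loader:
            batch = {key: value.to(device) for key, value in batch.items()}
            # Passing labels makes the model return the cross-entropy loss.
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer
```

Swapping `model_name` for one of the other four checkpoints should be the only change needed to repeat the run, since `AutoTokenizer` and `AutoModelForSequenceClassification` resolve the matching classes from the identifier.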
4.5. Results and Discussion
4.5.1. Traditional Machine Learning for Phishing Email Detection
| Model | Class | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| Logistic Regression | Phishing | 0.9788 | 0.9850 | 0.9819 | 0.9845 |
| | Not Phishing | 0.9888 | 0.9841 | 0.9864 | |
| Random Forest | Phishing | 0.9752 | 0.9787 | 0.9769 | 0.9802 |
| | Not Phishing | 0.9840 | 0.9813 | 0.9827 | |
| Support Vector Machine | Phishing | 0.9820 | 0.9892 | 0.9856 | 0.9876 |
| | Not Phishing | 0.9919 | 0.9864 | 0.9891 | |
| Naive Bayes | Phishing | 0.9831 | 0.9329 | 0.9573 | 0.9644 |
| | Not Phishing | 0.9516 | 0.9880 | 0.9695 | |
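As a consistency check on the table above, the F1-Score of each row can be recomputed from its precision and recall. The sketch below does this for the Phishing class. The values are copied from the table, and the names `REPORTED` and `max_gap` are illustrative. Small differences are expected because the published precision and recall are themselves rounded to four decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the Phishing class.
REPORTED = {
    "Logistic Regression": (0.9788, 0.9850, 0.9819),
    "Random Forest": (0.9752, 0.9787, 0.9769),
    "Support Vector Machine": (0.9820, 0.9892, 0.9856),
    "Naive Bayes": (0.9831, 0.9329, 0.9573),
}


def max_gap(rows) -> float:
    """Largest absolute difference between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - reported) for p, r, reported in rows.values())


for name, (p, r, reported) in REPORTED.items():
    print(f"{name}: recomputed {f1_score(p, r):.4f}, reported {reported:.4f}")
print(f"largest gap: {max_gap(REPORTED):.5f}")
```

The Naive Bayes row shows why F1 is reported alongside precision: its precision on phishing is the highest of the four models, but its much lower recall pulls the F1-Score below the others.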
4.5.2. Transformer-Based Machine Learning for Phishing Email Detection
| Model | Class | Precision | Recall | F1-Score | Accuracy |
|---|---|---|---|---|---|
| distilBERT-base-uncased | Phishing | 0.9933 | 0.9890 | 0.9911 | 0.9899 |
| | Not Phishing | 0.9853 | 0.9911 | 0.9882 | |
| bert-base-uncased | Phishing | 0.9947 | 0.9897 | 0.9922 | 0.9911 |
| | Not Phishing | 0.9863 | 0.9929 | 0.9896 | |
| xlnet-base-cased | Phishing | 0.9828 | 0.9971 | 0.9899 | 0.9884 |
| | Not Phishing | 0.9961 | 0.9768 | 0.9863 | |
| roBERTa-base | Phishing | 0.9928 | 0.9974 | 0.9951 | 0.9943 |
| | Not Phishing | 0.9964 | 0.9903 | 0.9934 | |
| alBERT-base-v2 | Phishing | 0.9939 | 0.9853 | 0.9896 | 0.9881 |
| | Not Phishing | 0.9806 | 0.9919 | 0.9862 | |
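When accuracies are this close to 1, the gap between model families is easier to read as a change in error rate. The sketch below compares the most accurate traditional model (Support Vector Machine) with the most accurate transformer (roBERTa-base), using the accuracies reported in the two results tables. It assumes both were measured on the same test split, which the text does not state. The function names are illustrative.

```python
# Accuracies copied from the two results tables.
ACCURACY = {
    "Support Vector Machine": 0.9876,  # best traditional model
    "roBERTa-base": 0.9943,            # best transformer model
}


def error_rate(accuracy: float) -> float:
    """Fraction of emails that were misclassified."""
    return 1.0 - accuracy


def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Share of the baseline's errors removed by the improved model."""
    return 1.0 - error_rate(improved_acc) / error_rate(baseline_acc)


svm_err = error_rate(ACCURACY["Support Vector Machine"])
roberta_err = error_rate(ACCURACY["roBERTa-base"])
reduction = relative_error_reduction(
    ACCURACY["Support Vector Machine"], ACCURACY["roBERTa-base"]
)
print(f"SVM error rate: {svm_err:.2%}")
print(f"roBERTa error rate: {roberta_err:.2%}")
print(f"relative error reduction: {reduction:.1%}")
```

Under that assumption, an accuracy gain of 0.67 percentage points corresponds to the error rate falling from about 1.24% to about 0.57%, which removes roughly half of the misclassifications.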
4.6. Analysis of Failed Predictions
4.6.1. Error Analysis for Traditional Model
4.6.2. Error Analysis for Transformer Models
5. Conclusion and Future Work
5.1. Conclusion
5.2. Future Work
Data Availability Statement
Appendix A. Datasets Analyzed in This Paper
- Enron Spam Data (2001). Marcel Wiechmann. Accessed on February 1, 2024. https://www.kaggle.com/datasets/marcelwiechmann/enron-spam-data
- Email-trainingdata-20k (2018). IBM. Accessed on February 1, 2024. https://github.com/IBM/nlc-email-phishing/blob/master/data/Email-trainingdata-20k.csv
- Phishing-2018 Monkey (2018). Jose. Accessed on February 1, 2024. https://monkey.org/~jose/phishing/
- Fraud Email Dataset (2018). Labhishek LL. Accessed on February 1, 2024. https://www.kaggle.com/datasets/llabhishekll/fraud-email-dataset
- Phishing-2019 Monkey (2019). Jose. Accessed on February 1, 2024. https://monkey.org/~jose/phishing/
- Phishing-2020 Monkey (2020). Jose. Accessed on February 1, 2024. https://monkey.org/~jose/phishing/
- Email Classification (2020). Taiwo Awe. Accessed on February 1, 2024. https://www.kaggle.com/datasets/taiwoawe/email-classification
- Spam Classification for Basic NLP (2020). Chandramouli Naidu. Accessed on February 1, 2024. https://www.kaggle.com/datasets/chandramoulinaidu/spam-classification-for-basic-nlp
- Phishing-2021 Monkey (2021). Jose. Accessed on February 1, 2024. https://monkey.org/~jose/phishing/
- Spam Email (2021). Rhitazajana. Accessed on September 15, 2024. https://www.kaggle.com/datasets/rhitazajana/spam-email
- Spam_assasin (2021). Ganiyu Olalekan. Accessed on February 1, 2024. https://www.kaggle.com/datasets/ganiyuolalekan/spam-assassin-email-classification-dataset
- Phishing-2022 Monkey (2022). Jose. Accessed on February 1, 2024. https://monkey.org/~jose/phishing/
- NLP Spam Ham Email Classification (2022). Yashpal Oswal. Accessed on February 1, 2024. https://www.kaggle.com/datasets/yashpaloswal/spamham-email-classification-nlp
- Phishing Email Data by Type (2022). Charlotte Hall. Accessed on February 1, 2024. https://www.kaggle.com/datasets/charlottehall/phishing-email-data-by-type
- Phishing-Mail (2023). Somu Mourya. Accessed on February 1, 2024. https://www.kaggle.com/datasets/somumourya/fishing-mail
- Phishing Email Detection (2023). Subha Journal. Accessed on February 1, 2024. https://www.kaggle.com/datasets/subhajournal/phishingemails
- Email Spam Classification (2023). Tapakah68. Accessed on February 1, 2024. https://www.kaggle.com/datasets/tapakah68/email-spam-classification
- Phishing validation emails dataset (2024). R. Miltchev, D. Rangelov, G. Evgeni. Accessed in August 2024. https://doi.org/10.5281/zenodo.13474745
- Email Spam Detection Dataset (classification) (2023). Shantanudhakadd. Accessed on February 1, 2024. https://www.kaggle.com/datasets/shantanudhakadd/email-spam-detection-dataset-classification
- private-phishing4.mbox (2023). Jose. Accessed on February 1, 2024. https://monkey.org/~jose/phishing/
- Phishing Email Data (2023). Tanusree Sharma. Accessed on February 1, 2024. https://github.com/TanusreeSharma/phishingdata-Analysis/blob/master/1st%20data/PhishingEmailData.csv
- selfpromoted phishing data (2023). David Svy. Accessed on February 1, 2024. https://github.com/davidsvy/Neural-Scam-Artist?tab=readme-ov-file
References
- Laudon, K.; Traver, C. E-commerce 2023: Business, Technology, Society; Pearson, 2023.
- Cellucci, N.; Moore, T.; Salaky, K. How Much Does Internet Cost Per Month?, 2024. Accessed: 2024-9-15.
- Al-Mashhadi, H.M.; Alabiech, M.H. A survey of email service; attacks, security methods and protocols. International Journal of Computer Applications 2017, 162.
- Tariq, U.; Ahmed, I.; Bashir, A.K.; Shaukat, K. A Critical Cybersecurity Analysis and Future Research Directions for the Internet of Things: A Comprehensive Review. Sensors 2023, 23. [CrossRef]
- Internet Crime Complaint Center (IC3). 2023 Internet Crime Report. https://www.ic3.gov/AnnualReport/Reports/2023_IC3Report.pdf, 2023. Accessed: 2024-08-25.
- Brody, R.G.; Mulig, E.; Kimball, V. PHISHING, PHARMING AND IDENTITY THEFT. Academy of Accounting & Financial Studies Journal 2007, 11.
- Verizon. 2023 Data Breach Investigations Report. https://www.verizon.com/about/news/2023-data-breach-investigations-report, 2023. Accessed: 2024-08-25.
- Altulaihan, E.; Alismail, A.; Hafizur Rahman, M.M.; Ibrahim, A.A. Email Security Issues, Tools, and Techniques Used in Investigation. Sustainability 2023, 15. [CrossRef]
- Google Developers. Authentication overview. https://developers.google.com/workspace/guides/auth-overview, n.d. Accessed: 2024-08-25.
- Anti-Phishing Working Group (APWG). Phishing Activity Trends Report, Q1 2022. https://docs.apwg.org/reports/apwg_trends_report_q1_2022.pdf, 2022. Accessed: 2024-08-25.
- Naqvi, B.; Perova, K.; Farooq, A.; Makhdoom, I.; Oyedeji, S.; Porras, J. Mitigation strategies against the phishing attacks: A systematic literature review. Computers & Security 2023, 132, 103387. [CrossRef]
- Patel, N. Social engineering as an evolutionary threat to information security in healthcare organizations. Jurnal Administrasi Kesehatan Indonesia Volume 2020, 8.
- Chanti, S.; Chithralekha, T. A literature review on classification of phishing attacks. International Journal of Advanced Technology and Engineering Exploration 2022, 9, 446–476. [CrossRef]
- Kumar, N.S. Phishing Email Detection Using CNN.
- Alam, M.N.; Sarma, D.; Lima, F.F.; Saha, I.; Ulfath, R.E.; Hossain, S. Phishing Attacks Detection using Machine Learning Approach. 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020, pp. 1173–1179. [CrossRef]
- Milletary, J.; Center, C.C. Technical trends in phishing attacks. Retrieved December 2005, 1, 3–3.
- Hera, J. Phishing Defense Mechanisms: Strategies for Effective Measurement and Cyber Threat Mitigation 2024.
- Tang, L.; Mahmoud, Q.H. A survey of machine learning-based solutions for phishing website detection. Machine Learning and Knowledge Extraction 2021, 3, 672–694. [CrossRef]
- Samad, A.S.; Balasubaramanian, S.; Al-Kaabi, A.S.; Sharma, B.; Chowdhury, S.; Mehbodniya, A.; Webber, J.L.; Bostani, A. Analysis of the performance impact of fine-tuned machine learning model for phishing URL detection. Electronics 2023, 12. [CrossRef]
- Agrawal, G.; Kaur, A.; Myneni, S. A review of generative models in generating synthetic attack data for cybersecurity. Electronics 2024, 13. [CrossRef]
- Roumeliotis, K.I.; Tselikas, N.D.; Nasiopoulos, D.K. Next-generation spam filtering: Comparative fine-tuning of LLMs, NLPs, and CNN models for email spam classification. Electronics 2024, 13. [CrossRef]
- Salloum, S.; Gaber, T.; Vadera, S.; Shaalan, K. A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access 2022, 10, 65703–65727. [CrossRef]
- Atawneh, S.; Aljehani, H. Phishing Email Detection Model Using Deep Learning. Electronics 2023, 12. [CrossRef]
- Newaz, I.; Jamal, M.K.; Hasan Juhas, F.; Patwary, M.J.A. A Hybrid Classification Technique using Belief Rule Based Semi-Supervised Learning. 2022 25th International Conference on Computer and Information Technology (ICCIT), 2022, pp. 466–471. [CrossRef]
- Jamal, K.; Hossain, M.A.; Mamun, N.A. Improving Phishing and Spam Detection with DistilBERT and RoBERTa. arXiv preprint 2023, [2311.04913]. [CrossRef]
- Lee, Y.; Saxe, J.; Harang, R. CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails, 2020, [arXiv:cs.CR/2010.03484].
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, 2006.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2 ed.; Springer: New York, 2009.
- Batutin, A. Choose Your AI Weapon: Deep Learning or Traditional Machine Learning? https://shelf.io/blog/choose-your-ai-weapon-deep-learning-or-traditional-machine learning, 2023. Accessed: 2024-09-19.
- Amatriain, X.; Sankar, A.; Bing, J.; Bodigutla, P.K.; Hazen, T.J.; Kazi, M. Transformer models: an introduction and catalog, 2024, [arXiv:cs.CL/2302.07730].
- scikit-learn Developers. TfidfVectorizer, 2024. Accessed: 2024-11-21.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; Rush, A. Transformers: State-of-the-Art Natural Language Processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; Liu, Q.; Schlangen, D., Eds.; Association for Computational Linguistics: Online, 2020; pp. 38–45. [CrossRef]





| Specifications | Statistics |
|---|---|
| Number of Emails | 119,148 |
| Number of Words | 30,478,934 |
| Average Words per Email | 255.81 |
| Average Words per Sentence | 15.53 |
| Highest Word Count in an Email | 15,828 |
| Lowest Word Count in an Email | 10 |
| Number of Phishing Emails | 68,912 |
| Number of Non-Phishing Emails | 50,236 |
| Text | Actual Label | Predicted Label | Model |
|---|---|---|---|
| K do I need a login or anything | not phishing | phishing | Naive Bayes |
| But when I try to reply, the reply mail is: laura_samuel@aol.com I tried checking on google "terrativa.com.br", it gives a result a webpage with cover only: http://www.terrativa.com.br Sure, it’s scam | phishing | not phishing | Random Forest |
| You are a winner of N450,000 ur phone no is among the<20> lucky winners of (LACASERA drink promo) code no call Pastor JOHN ON: 08167059152 for claims. | phishing | not phishing | Support Vector Machine |
| Text | Actual Label | Predicted Label | Model |
|---|---|---|---|
| get a university diploma in just 7 days webmaster so you have piles of degrees but you ’ re missing just one and it just so happens to be the one you really need badly in order to get the better job . if this sounds like you the read this : http : / / www . hovad . info / 4398 . html the other link http : / / hovad . info / toloshoka . html | phishing | not phishing | roBERTa |
| Anyone can cook - with fresh news from Kitchen Stories. \| Doesn’t look right? Just click here! Kitchen Stories Recipes Stories Categories How-Tos Keema Curry Hello there! Ever since I started working with her, I have always been in awe of Ruby’s curry recipes. They’re fun, super tasty, and easy | not phishing | phishing | distilBERT |
| this is a generated email - do not reply ! if you need further assistance , contact the isc help desk at : 713 - 345 - 4727 the password for your account : po 0507544 has been reset to : 14031399 | not phishing | phishing | distilBERT |
| i have gone through your advertisement with the pics and i am satisfied with it.. I will be glad if you can mail me the present condition with the full price as well. As for the payment..i will be paying you via the fastest and secure way to pay online (PayPal). I have a private courier agent that will come for the pick up after the payment has been made ... so no shipping included. My private courier agent will come for the pick up and sign all necessary documents on my behalf after the payment has been made, as they will also be coming with all the information needed, my details and transferring the name of ownership to me will be done by the pick-up agent so you don’t have to worry about that. You can now send me your PayPal email so I can pay in right away and also include your address in your reply. If you don’t have a PayPal account, you can easily set up one... log on to www.paypal.com and sign up. It’s very easy. I await your reply asap. Thank you, Steve | phishing | not phishing | roBERTa |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).