Submitted: 13 June 2026
Posted: 17 June 2026
Abstract
Keywords:
1. Introduction
2. Related Work
3. Dataset

4. Methods
4.1. Preprocessing Techniques
4.2. Machine Learning Models
4.3. Hyperparameter Selection
4.4. Baseline Classifiers
4.5. From-Scratch Implementations
4.6. Experimental Pipeline
5. Results
5.1. Hyperparameter Tuning
5.2. Cross-Validation Performance

5.3. Statistical Significance
5.4. Scale-Distortion Experiment

5.5. Feature Importance Analysis

6. Discussion
7. Conclusions
Appendix A. Full Cross-Validation Results
| Prep. | Model | ROC-AUC | Bal. Acc. | F1 |
|---|---|---|---|---|
| std | LogReg | 1.0000 | .9999 | .9999 |
| minmax | SVM-Lin | 1.0000 | .9999 | .9999 |
| std | SVM-RBF | 1.0000 | .9999 | .9999 |
| minmax | SVM-RBF | 1.0000 | .9998 | .9998 |
| minmax | LogReg | 1.0000 | .9998 | .9998 |
| pca | SVM-RBF | .9999 | .9995 | .9996 |
| whiten | LogReg | .9999 | .9995 | .9996 |
| pca | LogReg | .9999 | .9995 | .9996 |
| none | NB | .9999 | .9997 | .9998 |
| pca | SVM-Lin | .9999 | .9998 | .9998 |
| none | LogReg | .9999 | .9996 | .9997 |
| whiten | SVM-RBF | .9999 | .9994 | .9995 |
| whiten | SVM-Lin | .9999 | .9998 | .9998 |
| std | NB | .9999 | .9998 | .9998 |
| minmax | NB | .9999 | .9997 | .9997 |
| std | SVM-Lin | .9999 | .9999 | .9999 |
| std | k-NN | .9999 | .9985 | .9988 |
| none | SVM-Lin | .9999 | .9959 | .9963 |
| pca | k-NN | .9997 | .9957 | .9966 |
| none | k-NN | .9995 | .9946 | .9956 |
| whiten | k-NN | .9994 | .9928 | .9943 |
| minmax | k-NN | .9994 | .9949 | .9959 |
| none | SVM-RBF | .9970 | .9653 | .9748 |
| whiten | NB | .9887 | .9696 | .9728 |
| pca | NB | .9887 | .9696 | .9728 |
Appendix B. Additional Figures






Appendix C. Implementation Notes
- Min-max scaler: max absolute difference .
- Standard scaler: max absolute difference .
- k-NN (): 100% prediction agreement on 500 test instances.
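
The checks listed above compare from-scratch implementations against a reference, using the maximum absolute difference for the scalers and prediction agreement for k-NN. The notes do not name the reference implementation or give the code, so the following is only a minimal sketch of how such checks could be written. All function names are hypothetical, the toy matrix `X` is illustrative rather than taken from the dataset, and the reference values are worked by hand rather than produced by a library.

```python
from statistics import fmean, pstdev


def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # Guard against division by zero for constant columns.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]


def standard_scale(rows):
    """Center each column and divide by its population standard deviation."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-shape matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def agreement(pred_a, pred_b):
    """Fraction of instances on which two prediction lists agree."""
    return sum(p == q for p, q in zip(pred_a, pred_b)) / len(pred_a)


# Illustrative toy data (assumption, not from the paper).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
reference_minmax = [[0.0, 0.0], [0.5, 0.2], [1.0, 1.0]]  # worked by hand

print("min-max max abs diff:", max_abs_diff(min_max_scale(X), reference_minmax))
print("prediction agreement:", agreement([0, 1, 1, 0], [0, 1, 1, 0]))
```

In practice the hand-worked reference would be replaced by the output of a library scaler fitted on the same data, and the two prediction lists by the labels from the from-scratch and library k-NN classifiers on the same test instances.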

| Prep. | Model | ROC-AUC | Bal. Acc. | F1 |
|---|---|---|---|---|
| none | LogReg | .9999 | .9996 | .9997 |
| std | LogReg | 1.0000 | .9999 | .9999 |
| none | k-NN | .9995 | .9946 | .9956 |
| std | k-NN | .9999 | .9985 | .9988 |
| none | SVM-RBF | .9970 | .9653 | .9748 |
| std | SVM-RBF | 1.0000 | .9999 | .9999 |
| none | NB | .9999 | .9997 | .9998 |
| pca | NB | .9887 | .9696 | .9728 |

| Model | Comparison | ΔAUC | p-value |
|---|---|---|---|
| LogReg | none vs std | +0.000 | 0.119 |
| k-NN | none vs std | +0.0005 | 0.003* |
| SVM-RBF | none vs std | +0.003 | <0.001* |
| SVM-RBF | none vs minmax | +0.003 | <0.001* |
| NB | none vs pca | −0.011 | <0.001* |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).