Submitted: 13 May 2025
Posted: 14 May 2025
Abstract
Keywords:
I. Introduction
- Comparative Evaluation: We benchmark supervised and unsupervised models using precision, recall, F1-score, and AUC-ROC.
- Insight into Trade-offs: We analyze model performance, scalability, and interpretability, identifying when each method is most appropriate.
- Toward Hybrid Methods: We propose a hybrid perspective that leverages the strengths of both paradigms to address real-world constraints such as limited labels and evolving fraud tactics.
II. Methods
A. Dataset and Preprocessing
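The results table contrasts "Raw" and "Standardized" feature handling. The preprocessing code itself is not reproduced in this section, so the following is only a minimal sketch of one common reading of "standardized": z-score scaling whose statistics are fitted on the training split and then reused for the test split. The toy column meanings (amount, account age) are hypothetical and not taken from the dataset.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_rows):
    """Learn per-column mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero spread; fall back to a scale of 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical toy data: columns are (amount, account_age_days).
train = [[10.0, 100.0], [20.0, 200.0], [30.0, 300.0]]
test = [[40.0, 400.0]]

means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
test_z = standardize(test, means, stds)  # reuses the training statistics
```

Fitting the statistics on the training split alone keeps information from the test split out of the model inputs.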
B. Supervised Learning Models
C. Unsupervised Learning Models
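The unsupervised baseline named in this paper is K-Means. How cluster assignments were turned into fraud predictions is not stated in this section, so the sketch below assumes one common approach: fit centroids on historical transactions and score a new transaction by its distance to the nearest centroid. The deterministic farthest-point seeding and the toy points are assumptions made for reproducibility, not the authors' configuration.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumption, not from the paper)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns the final centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(math.dist(point, c) for c in centroids)


# Hypothetical historical transactions forming two tight groups.
history = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
]
centroids = kmeans(history, k=2)

typical = anomaly_score([0.05, 0.05], centroids)
unusual = anomaly_score([10.0, -10.0], centroids)
```

A transaction would then be flagged when its score exceeds a chosen cutoff; that cutoff is a design choice this sketch leaves open.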
D. Evaluation Metrics
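The paper reports macro-averaged precision, recall, and F1-score together with AUC-ROC. As a minimal sketch of how these quantities can be computed, assuming binary labels (0 = legitimate, 1 = fraud) and assuming macro F1 is the unweighted mean of per-class F1 scores, the functions below use only the standard library. The function names and toy labels are illustrative.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return (precision, recall, F1), each averaged equally over the classes."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def roc_auc(y_true, scores):
    """AUC as the chance a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four legitimate and two fraudulent transactions.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]

precision, recall, f1 = macro_scores(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Macro averaging weights the fraud class as heavily as the legitimate class, which matters when fraud cases are rare.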
III. Results
A. Performance of Supervised Learning
C. Model Interpretation and Feature Importance
IV. Discussion
A. Interpretation of Findings
B. Practical Implications
- Logistic Regression remains useful in regulated environments where model transparency is paramount. However, its performance degrades significantly when fraud patterns are complex or when feature scaling is not carefully managed.
- Tree-based models, particularly LightGBM, offer a strong balance of accuracy, scalability, and interpretability. LightGBM’s fast training and robust performance under class imbalance make it well suited to integration into real-time fraud detection systems.
C. Limitations and Future Work
- The dataset used is synthetic and publicly available, which may limit how well the results generalize to real-world transaction environments.
- Although we included an unsupervised baseline (K-Means), the evaluation was limited to one clustering algorithm. Future work should explore advanced unsupervised or hybrid models, such as Isolation Forest, autoencoders, or semi-supervised anomaly detection.
- Our model evaluation did not include cost-sensitive metrics or business-driven thresholds, which are often crucial in real-world fraud detection. Incorporating them is a direction for future work.
- Lastly, the models evaluated were static. In practice, fraud patterns evolve rapidly. Online learning or continual learning frameworks should be investigated to maintain long-term model effectiveness.
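To illustrate the cost-sensitive, business-driven thresholds mentioned among the limitations above, the following is a minimal sketch that picks the score threshold minimizing total misclassification cost. It is not part of the reported evaluation; the cost values and toy scores are hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Sum the cost of missed fraud (FN) and false alarms (FP) at a threshold."""
    cost = 0.0
    for t, s in zip(y_true, scores):
        flagged = s >= threshold
        if t == 1 and not flagged:
            cost += cost_fn  # fraud that slipped through
        elif t == 0 and flagged:
            cost += cost_fp  # legitimate transaction blocked or reviewed
    return cost


def best_threshold(y_true, scores, cost_fn, cost_fp):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda th: total_cost(y_true, scores, th, cost_fn, cost_fp),
    )


# Toy validation set: 1 = fraud, 0 = legitimate (values are illustrative).
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]

# Hypothetical costs: a missed fraud is 20x as costly as a false alarm.
threshold = best_threshold(y_true, scores, cost_fn=100.0, cost_fp=5.0)
cost = total_cost(y_true, scores, threshold, cost_fn=100.0, cost_fp=5.0)
```

When missed fraud is expensive, the chosen threshold moves lower and accepts more false alarms; reversing the cost ratio pushes it higher.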
| Model | Feature Handling | Precision (macro avg.) | Recall (macro avg.) | F1-Score (macro avg.) | AUC |
|---|---|---|---|---|---|
| Logistic Regression | Raw | 1.00 | 0.96 | 0.84 | 0.76 |
| Logistic Regression | Standardized | 1.00 | 0.52 | 0.55 | 0.52 |
| Random Forest | Raw | 0.97 | 0.83 | 0.89 | 0.83 |
| Random Forest | Standardized | 0.94 | 0.86 | 0.89 | 0.86 |
| LightGBM | Raw | 0.94 | 0.86 | 0.89 | 0.86 |
| LightGBM | Standardized | 1.00 | 0.83 | 0.90 | 0.83 |
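
In five of the six rows above, the macro-average Recall and the AUC are identical. How AUC was computed is not stated in this section. One possible explanation, offered only as an assumption, is that AUC was evaluated on hard 0/1 predictions rather than on predicted probabilities. In that case the ROC curve has a single interior point (FPR, TPR), and the area under it reduces to the macro-average recall of a binary problem:

```latex
\begin{aligned}
\mathrm{AUC}
  &= \tfrac{1}{2}\,\mathrm{FPR}\cdot\mathrm{TPR}
   + \tfrac{1}{2}\,(1-\mathrm{FPR})\,(\mathrm{TPR}+1) \\
  &= \tfrac{1}{2}\,\bigl(\mathrm{TPR} + 1 - \mathrm{FPR}\bigr) \\
  &= \tfrac{1}{2}\,\bigl(\mathrm{TPR} + \mathrm{TNR}\bigr)
\end{aligned}
```

Here TPR is the recall of the fraud class and TNR = 1 − FPR is the recall of the legitimate class, so their mean is the macro-average recall.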
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).