Submitted: 13 October 2025
Posted: 15 October 2025
Abstract
Keywords:
1. Introduction
2. Datasets
2.1. Credit Card Fraud Detection Dataset (Kaggle, 2023)
2.2. Bank Account Fraud Dataset (NeurIPS, 2022)
3. Methodology
3.1. Preprocessing
3.2. Model Development
- Logistic Regression – interpretable linear baseline model.
- Random Forest – an ensemble of decision trees capturing nonlinear relationships.
- XGBoost – gradient-boosted decision trees optimized for imbalanced data.
3.3. Evaluation Metrics
- F1-score: Harmonic mean of precision and recall.
- AUC-ROC: Area under the receiver operating characteristic curve.
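As a concrete illustration of the two metrics above, the following minimal sketch computes them from first principles in plain Python. The function names and the toy labels and scores are illustrative assumptions, not the evaluation code or data used in this study. AUC-ROC is computed here through its rank interpretation: the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # without true positives, F1 is taken as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half. This is quadratic in the sample size, which is
    acceptable for an illustration but not for large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = fraud, 0 = legitimate.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.35, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # fixed 0.5 threshold

print(f"F1 = {f1_score(y_true, y_pred):.3f}")
print(f"AUC-ROC = {auc_roc(y_true, scores):.3f}")
```

On this toy sample the 0.5 threshold gives precision and recall of 2/3 each, so F1 is 2/3, while AUC-ROC is 0.8. F1 depends on the chosen threshold, whereas AUC-ROC depends only on how the scores rank the two classes.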
4. Results
4.1. Credit Card Dataset (Balanced 50:50)
| Model | F1 | AUC-ROC | CV | Stability |
| Logistic Regression | 99.84% | 99.98% | 0.018% | 99.98% |
| Random Forest | 99.97% | 99.99% | 0.003% | 99.99% |
| XGBoost | 99.98% | 99.99% | 0.005% | 99.99% |
4.2. Bank Account Dataset (Imbalanced ≈1.1% Fraud)
| Model | F1 | AUC-ROC | CV | Stability |
| Logistic Regression | 20.36% | 86.95% | 1.76% | 98.9% |
| Random Forest | 18.21% | 83.02% | 11.82% | 89.5% |
| XGBoost | 23.41% | 89.34% | 3.01% | 98.3% |
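The CV column in the tables above is not defined in this section. Assuming it denotes the coefficient of variation of a score across cross-validation folds (an interpretation, not a confirmed definition), it could be computed as in the sketch below. The per-fold scores are hypothetical, and the use of the sample standard deviation is likewise an assumption. The Stability column is not reproduced because its definition cannot be recovered from the tables.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(fold_scores: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    if len(fold_scores) < 2:
        raise ValueError("need at least two fold scores")
    avg = mean(fold_scores)
    if avg == 0:
        raise ValueError("mean score is zero; the ratio is undefined")
    return 100.0 * stdev(fold_scores) / avg


# Hypothetical per-fold F1 scores, not taken from the paper.
stable_folds = [0.230, 0.234, 0.238, 0.232, 0.236]
unstable_folds = [0.150, 0.210, 0.180, 0.230, 0.140]

print(f"stable:   {coefficient_of_variation(stable_folds):.2f}%")
print(f"unstable: {coefficient_of_variation(unstable_folds):.2f}%")
```

Under this reading, a smaller value means that the score varies less from fold to fold relative to its mean.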
4.3. Comparative Summary
| Metric | Credit Card | Bank Fraud | Difference (percentage points) |
| Mean F1 (XGBoost) | 99.97% | 23.41% | ↓76.56 |
| Mean AUC (XGBoost) | 99.99% | 89.34% | ↓10.65 |
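The Difference column is an absolute gap (Credit Card minus Bank Fraud), that is, a difference in percentage points rather than a relative change. The short sketch below reproduces the arithmetic from the table values; the function names are illustrative, and the relative drop is shown only for contrast.

```python
# Values copied from the comparative summary table above (XGBoost, in percent).
SUMMARY = {
    "Mean F1": (99.97, 23.41),
    "Mean AUC": (99.99, 89.34),
}


def point_gap(credit_card: float, bank_fraud: float) -> float:
    """Absolute gap between the two scores, in percentage points."""
    return round(credit_card - bank_fraud, 2)


def relative_drop(credit_card: float, bank_fraud: float) -> float:
    """Gap expressed as a percentage of the credit card score."""
    return round(100.0 * (credit_card - bank_fraud) / credit_card, 2)


for metric, (credit_card, bank_fraud) in SUMMARY.items():
    gap = point_gap(credit_card, bank_fraud)
    rel = relative_drop(credit_card, bank_fraud)
    print(f"{metric}: gap = {gap} points, relative drop = {rel}%")
```

The point gaps match the table (76.56 and 10.65). The two measures are numerically close here only because the credit card scores are near 100%.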
5. Discussion
6. Conclusions and Future Work
- Adaptive resampling and dynamic thresholds: adjust sampling fractions and decision cut-offs as the distribution of fraud evolves.
- Cost-sensitive and focal-loss training: improve recall on the rare fraud class.
- Temporal and sequential modelling (LSTM, Transformers): capture the dynamic structure of transaction sequences.
- Explainable AI methods (SHAP, LIME): make model decisions transparent so that the sectors deploying them can act on the outputs with confidence.
- Semi-supervised and online learning: allow the model to be updated and extended as data and deployment conditions change.
- Two-tier benchmarks: assess models on both idealized and realistic data so that reported results are meaningful in practice.
Pursuing these directions should help narrow the gap between experimental results and real-world deployment, improving both the technical robustness and the business value of fraud prevention systems.
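To make the focal-loss direction listed above more concrete, the following is a minimal sketch of the binary focal loss, FL = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. It was not part of the experiments reported here. The function name and the values gamma = 2.0 and alpha = 0.25 are illustrative assumptions that are not tuned for either dataset.

```python
import math
from typing import Sequence


def binary_focal_loss(
    y_true: Sequence[int],
    p_pred: Sequence[float],
    gamma: float = 2.0,   # illustrative focusing parameter
    alpha: float = 0.25,  # illustrative weight for the positive (fraud) class
    eps: float = 1e-12,
) -> float:
    """Mean binary focal loss over a batch of predicted fraud probabilities."""
    if len(y_true) != len(p_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)        # keep log() finite
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)**gamma shrinks the loss of well-classified examples
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


# Hypothetical cases: a missed fraud versus an easily recognised legitimate case.
missed_fraud = binary_focal_loss([1], [0.1])
easy_legit = binary_focal_loss([0], [0.1])
print(f"missed fraud: {missed_fraud:.4f}, easy legitimate: {easy_legit:.6f}")
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the ordinary binary cross-entropy. Larger gamma values shift the training signal toward hard, misclassified cases, which in an imbalanced setting are often the rare fraud examples.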
Use of AI Tools
Code Availability
Data Availability
Conflicts of Interest
References
- Breiman, L. Random forests. Machine Learning 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD Conference 2016, 785–794.
- Gupta, A.; Kumar, N.; Yadav, S. Evaluating machine-learning models for financial fraud detection under severe class imbalance. Expert Systems with Applications 2023, 228, 120579.
- Ryman, A.; Lee, D. Real-world fraud detection benchmarks with extreme imbalance: A comparative study. IEEE Access 2022, 10, 62493–62508.
- Singh, P.; Raj, R.; Chatterjee, S. Understanding performance limits of supervised fraud detection under class imbalance. Information Sciences 2024, 658, 120992.
- Kulatilleke, S.; Samarakoon, M. Empirical study of machine learning classifier evaluation metrics behavior in massively imbalanced and noisy data. arXiv 2022, arXiv:2208.11904.
- Popova, E.; Dubrova, A.; Thalheim, L. Credit card fraud detection: Model evaluation under class imbalance. arXiv 2025, arXiv:2509.15044.
- Credit Card Fraud Detection Dataset (Kaggle, 2023). 2023. Available online: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023.
- Bank Account Fraud Dataset (NeurIPS 2022). 2022. Available online: https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).