1. Introduction
Fraud detection is a crucial area of financial analytics in which machine learning algorithms are used to identify suspicious activity and prevent losses. Despite advances in predictive modelling, significant challenges remain, chiefly severe class imbalance and the diversity of fraud patterns. This study investigates how three machine learning algorithms, Logistic Regression, Random Forest, and XGBoost, behave on two datasets with markedly different statistical distributions. Preprocessing, sampling techniques, and evaluation procedures are held constant so that the effect of dataset structure on model behavior and stability can be isolated.
2. Datasets
2.1. Credit Card Fraud Detection Dataset (Kaggle, 2023)
This dataset consists of credit card transactions made by European cardholders in 2023 and contains over 550,000 records. Each transaction record comprises a unique identifier (id), twenty-eight anonymized numerical variables (V1–V28) representing characteristics of the transaction (e.g., time, location) and features derived from them, an Amount variable giving the value of the transaction, and a binary target variable Class indicating whether the transaction was fraudulent (1) or legitimate (0). In this investigation, a synthetic, balanced version of the dataset with equal numbers of fraudulent and non-fraudulent records was used to assess model performance under idealized experimental conditions.
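To make the described layout concrete, the following minimal sketch loads the Kaggle CSV and verifies the 50:50 class balance; the file name creditcard_2023.csv is an assumption based on the Kaggle listing and may differ locally.

```python
import pandas as pd

# File name is an assumption based on the Kaggle listing; adjust locally.
df = pd.read_csv("creditcard_2023.csv")

# Expected layout: id, V1..V28 (anonymized features), Amount, Class.
print(df.columns.tolist())

# Verify the 50:50 class balance used in this study.
print(df["Class"].value_counts(normalize=True))
```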
2.2. Bank Account Fraud Dataset (NeurIPS, 2022)
The NeurIPS 2022 dataset simulates fraudulent bank account openings using demographic and behavioral variables (e.g., age, income, failed_logins). Fraudulent accounts make up about 1.1% of all records, reflecting a highly imbalanced real-world scenario.
3. Methodology
3.1. Preprocessing
All features were numeric, so one-hot encoding was unnecessary. Data were standardized and split into training (60%), validation (20%), and test (20%) subsets using stratified sampling. SMOTE was applied only to the training portion of the Bank Fraud dataset, addressing the imbalance while avoiding data leakage (How SMOTE Oversampling Can Cause Data Leakage in Cross-Validation, n.d.).
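A minimal sketch of this pipeline is given below, assuming scikit-learn and imbalanced-learn; the random_state values are arbitrary, as the exact parameters used in the study are not reported in the text.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

X, y = df.drop(columns=["Class"]), df["Class"]

# Stratified 60/20/20 split: first carve out the 20% test set,
# then split the remainder 75/25 to obtain 60% train / 20% validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

# Standardize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# SMOTE on the training portion alone (Bank Fraud dataset), so no
# synthetic samples leak into the validation or test data.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
```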
3.2. Model Development
Three models were trained under identical conditions:
Logistic Regression – interpretable linear baseline model.
Random Forest – an ensemble of decision trees capturing nonlinear relationships.
XGBoost – gradient-boosted decision trees optimized for imbalanced data.
Hyperparameters were tuned using the Optuna package, with the validation F1-score as the optimization target. Experiments were repeated across six random seeds to assess variance and model stability.
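The sketch below illustrates the tuning-and-repetition procedure for XGBoost, assuming Optuna and the split produced above; the search space and trial count are illustrative, as the paper does not report them.

```python
import numpy as np
import optuna
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

def objective(trial):
    # Search space is illustrative; the paper does not list the exact ranges.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss", random_state=0)
    model.fit(X_train, y_train)
    # Validation F1 is the optimization target.
    return f1_score(y_val, model.predict(X_val))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Retrain the tuned model under six seeds to gauge variance and stability.
seed_f1 = []
for seed in range(6):
    model = XGBClassifier(**study.best_params, eval_metric="logloss",
                          random_state=seed)
    model.fit(X_train, y_train)
    seed_f1.append(f1_score(y_val, model.predict(X_val)))
print(np.mean(seed_f1), np.std(seed_f1))
```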
3.3. Evaluation Metrics
Performance was assessed using the F1-score, AUC-ROC, the coefficient of variation (CV) of scores across the six random seeds, and a derived stability score.
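The text does not give a formula for the stability score; the sketch below computes F1 and AUC-ROC per run and, as one plausible reading that is broadly consistent with the reported tables, takes stability as 1 − CV.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(model, X_test, y_test):
    """F1 and AUC-ROC for one trained model on the held-out test set."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    return f1_score(y_test, y_pred), roc_auc_score(y_test, y_prob)

def cv_and_stability(seed_scores):
    """Coefficient of variation across seeds and a derived stability score.

    Stability = 1 - CV is an assumption: the paper does not define it,
    but this reading is broadly consistent with the reported tables.
    """
    scores = np.asarray(seed_scores)
    cv = scores.std() / scores.mean()
    return cv, 1.0 - cv
```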
4. Results
4.1. Credit Card Dataset (Balanced 50:50)
| Model | F1 | AUC-ROC | CV | Stability |
|---|---|---|---|---|
| Logistic Regression | 99.84% | 99.98% | 0.018% | 99.98% |
| Random Forest | 99.97% | 99.99% | 0.003% | 99.99% |
| XGBoost | 99.98% | 99.99% | 0.005% | 99.99% |
All models performed nearly perfectly, demonstrating that when data are balanced and well-structured, even simpler algorithms can achieve excellent performance.
4.2. Bank Account Dataset (Imbalanced ≈1.1% Fraud)
| Model | F1 | AUC-ROC | CV | Stability |
|---|---|---|---|---|
| Logistic Regression | 20.36% | 86.95% | 1.76% | 98.9% |
| Random Forest | 18.21% | 83.02% | 11.82% | 89.5% |
| XGBoost | 23.41% | 89.34% | 3.01% | 98.3% |
The sharp performance drop illustrates the difficulty of detecting fraud in imbalanced data. Nevertheless, F1-scores between 15% and 25% align with real-world benchmarks for operational fraud detection systems (Gupta et al., 2023; Ryman & Lee, 2022; Singh et al., 2024).
4.3. Comparative Summary
| Metric | Credit Card | Bank Fraud | Difference |
|---|---|---|---|
| Mean F1 (XGBoost) | 99.97% | 23.41% | ↓76.56 pp |
| Mean AUC (XGBoost) | 99.99% | 89.34% | ↓10.65 pp |
These results confirm that class imbalance and weak signal-to-noise ratios in behavioral data are the primary drivers of degraded model accuracy and recall.
5. Discussion
This study shows that the nature of the data, rather than the complexity of the models, is the most critical factor in detecting financial fraud. The balanced dataset produced near-perfect separation for all algorithms, whereas the heavily imbalanced dataset produced realistic but much lower F1-scores. Random Forest showed large variance on the imbalanced data, while Logistic Regression was consistent and stable but limited in recall. XGBoost offered the best balance of high precision and good recall, supporting the claim that gradient-boosted models are comparatively resistant to noise.

These "low" scores on the imbalanced data should not be taken to indicate poor model performance. In practice, fraudulent transactions invariably compose less than 1% of total records, and lower F1-scores are an inevitable consequence. Kulatilleke and Samarakoon (2022) show that F1 and precision deteriorate rapidly under massive class imbalance and noisy data even where large-scale processing remains effective, and Popova et al. (2025) likewise report that F1-scores between 15% and 25% constitute high-grade real-world operational results for such models, provided they improve overall recall and precision over random or rule-based systems. The operational goal of fraud detection is not full accuracy but satisfactory detectability: a high final recall achieved without constant false alarms and without wasting machine time on processing millions of records. The present models meet these obligations, and both academic and operational research accepts that, under such data conditions, interpretability, generalizability, and stability across runs matter more than absolute F1-scores.
6. Conclusions and Future Work
This research illustrates that data structure, rather than algorithm complexity, is the decisive factor for success in financial fraud detection. Although idealized balanced synthetic datasets produce nearly perfect results, such conditions are rare in practice. The dramatic fall in F1 from nearly 100% on the balanced Credit Card dataset to roughly 23% on the imbalanced Bank Fraud dataset clearly illustrates how class imbalance, feature realism, and class noise restrict the performance limits of models.
Based on these empirical findings, the paper recommends a two-tier benchmark framework for future fraud work: (1) idealized datasets for tuning algorithms, and (2) realistic datasets for evaluating operation in real systems. This framework would also reframe what success means in analytics: moderate F1 performance can still carry high business value once realistic class imbalance is assumed.
To bring laboratory performance closer to results in live deployment, future work should pursue the following directions:
Adaptive resampling and dynamic thresholds, whereby sampling fractions and decision cut-offs are adjusted as the distribution of fraud evolves (a minimal threshold-tuning sketch follows this list).
Cost-sensitive and focal-loss training to improve recall on rare fraud classes.
Temporal and sequential modelling (LSTMs, Transformers) to capture the dynamic structure of transaction streams.
Explainable AI methods (SHAP, LIME) to give regulated sectors transparency and confidence in model decisions.
Semi-supervised and online learning, so that models can adapt as fraud tactics and technology change.
Two-tier benchmarks that evaluate models on both idealized and realistic data to ensure meaningful application.
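As referenced in the first item above, the sketch below shows one simple form of dynamic thresholding: re-selecting the decision cut-off that maximizes validation F1 as new data arrive. The grid resolution and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_val, val_probs, grid=np.linspace(0.01, 0.99, 99)):
    """Select the decision threshold that maximizes validation F1.

    Re-running this periodically as the fraud distribution drifts gives a
    simple dynamic-threshold scheme; the grid resolution is arbitrary.
    """
    f1s = [f1_score(y_val, val_probs >= t) for t in grid]
    return grid[int(np.argmax(f1s))]

# Usage (names from the earlier sketches):
# t = best_threshold(y_val, model.predict_proba(X_val)[:, 1])
# y_pred = model.predict_proba(X_test)[:, 1] >= t
```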
Pursuing these avenues will help to narrow the existing gulf between experimental results and operational deployment, strengthening both the technical robustness and the business value of fraud-prevention systems.
Code Availability
The Python scripts implementing data preprocessing, model training, and evaluation (including SMOTE balancing and F1 optimization pipeline) are available as supplementary material. The code can be freely reused under an open academic license for educational and research purposes.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Breiman, L. Random forests. Machine Learning 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD Conference 2016, 785–794.
- Gupta, A.; Kumar, N.; Yadav, S. Evaluating machine-learning models for financial fraud detection under severe class imbalance. Expert Systems with Applications 2023, 228, 120579.
- Ryman, A.; Lee, D. Real-world fraud detection benchmarks with extreme imbalance: A comparative study. IEEE Access 2022, 10, 62493–62508.
- Singh, P.; Raj, R.; Chatterjee, S. Understanding performance limits of supervised fraud detection under class imbalance. Information Sciences 2024, 658, 120992.
- Kulatilleke, S.; Samarakoon, M. Empirical study of machine learning classifier evaluation metrics behavior in massively imbalanced and noisy data. arXiv 2022, arXiv:2208.11904.
- Popova, E.; Dubrova, A.; Thalheim, L. Credit card fraud detection: Model evaluation under class imbalance. arXiv 2025, arXiv:2509.15044.
- Credit Card Fraud Detection Dataset (Kaggle, 2023). 2023. Available online: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023.
- Bank Account Fraud Dataset (NeurIPS 2022). 2022. Available online: https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).