Submitted: 31 August 2025
Posted: 02 September 2025
Abstract
Keywords:
1. Introduction
- Comprehensive Evaluation Framework: We establish a systematic evaluation protocol for comparing ensemble methods across multiple performance dimensions including accuracy, computational efficiency, and interpretability.
- Performance Benchmarking: We provide detailed performance analysis of six major ensemble categories: tree-based methods, gradient boosting approaches, stacking ensembles, voting classifiers, neural ensemble methods, and hybrid approaches.
- Feature Engineering Analysis: We investigate the impact of various feature engineering strategies specifically optimized for ensemble methods in fraud detection contexts.
- Practical Deployment Guidelines: We offer evidence-based recommendations for selecting appropriate ensemble methods based on specific deployment requirements and constraints.
2. Related Work
2.1. Ensemble Learning in Fraud Detection
2.2. Class Imbalance Handling Techniques
2.3. Deep Learning and Advanced Approaches
2.4. Explainable AI Integration
3. Methodology
3.1. Ensemble Learning Framework
- Tree-based Ensembles: Random Forest and Extra Trees methods that combine multiple decision trees with different randomization strategies.
- Gradient Boosting Methods: XGBoost, LightGBM, and CatBoost algorithms that sequentially build models to correct previous predictions.
- Stacking Ensembles: Multi-level learning approaches that use meta-learners to combine base model predictions.
- Voting Classifiers: Hard and soft voting approaches that combine predictions through majority voting or probability averaging.
- Neural Ensemble Methods: Multiple neural network architectures combined through various aggregation strategies.
- Hybrid Approaches: Methods that combine traditional machine learning with deep learning components.
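To make the aggregation step concrete, the two voting strategies above (majority voting and probability averaging) can be sketched in a few lines of plain Python; the base-model probabilities below are made up for illustration:

```python
def hard_vote(predictions):
    """Majority vote over 0/1 labels from the base models (ties -> 0)."""
    return int(sum(predictions) > len(predictions) / 2)

def soft_vote(probabilities):
    """Average the base models' fraud probabilities; threshold downstream."""
    return sum(probabilities) / len(probabilities)

# Three hypothetical base models score one transaction:
probs = [0.62, 0.48, 0.55]
labels = [1 if p >= 0.5 else 0 for p in probs]  # [1, 0, 1]

print(hard_vote(labels))           # majority of the binary votes -> 1
print(round(soft_vote(probs), 2))  # averaged probability -> 0.55
```

Stacking differs in that the base predictions are not averaged but fed as features to a trained meta-learner.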
3.2. Data Preprocessing Pipeline
3.2.1. Feature Engineering
3.2.2. Class Imbalance Handling
3.2.3. Feature Selection and Dimensionality Reduction
3.2.4. Data Quality Assurance
3.3. Evaluation Metrics and Protocols
- Area Under ROC Curve (AUC-ROC): Primary metric for overall discrimination capability
- Area Under Precision-Recall Curve (AUC-PR): Critical metric for imbalanced datasets
- F1-Score: Harmonic mean of precision and recall
- Balanced Accuracy: Average of sensitivity and specificity
- Matthews Correlation Coefficient (MCC): Correlation between predictions and true labels
- Cost-Sensitive Metrics: Business-relevant metrics incorporating misclassification costs
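The threshold-based metrics in this list follow directly from a confusion matrix; a minimal stdlib sketch (the AUC metrics need the full score ranking and are omitted here):

```python
import math

def metrics_from_confusion(tp, fp, fn, tn):
    """Compute F1, balanced accuracy, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_acc = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, balanced_acc, mcc

# Hypothetical imbalanced test set: 100 frauds, 900 legitimate transactions
f1, ba, mcc = metrics_from_confusion(tp=80, fp=20, fn=20, tn=880)
```

Note how MCC stays informative under imbalance: it uses all four cells of the matrix, unlike plain accuracy.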
4. Experimental Setup
4.1. Datasets
4.1.1. Primary Dataset: IEEE-CIS
4.1.2. Secondary Datasets
4.2. Experimental Design
4.2.1. Hyperparameter Optimization
4.2.2. Statistical Significance Testing
5. Results and Discussion
5.1. Overall Performance Comparison
5.2. Computational Efficiency Analysis
5.3. Class Imbalance Handling Effectiveness
5.4. Feature Engineering Impact Analysis
5.5. Cross-Dataset Generalization Analysis
5.6. Interpretability and Explainability Analysis
5.7. Real-Time Deployment Considerations
5.8. Statistical Significance Analysis
6. Discussion
6.1. Key Findings
6.2. Practical Implications
6.2.1. Method Selection Guidelines
- Maximum Performance Requirements: Stacking ensembles for scenarios where accuracy is paramount and computational resources are abundant.
- Real-time Systems: LightGBM or Random Forest for high-throughput environments requiring sub-second response times.
- Interpretability Requirements: Random Forest for regulatory environments requiring model transparency and explainability, with post-hoc explanation techniques [16] for enhanced interpretability.
- Balanced Requirements: XGBoost for general-purpose applications requiring good performance with reasonable computational overhead.
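The guidelines above amount to a small decision table; an illustrative (not normative) encoding of the mapping, with the study's recommended methods as values:

```python
def recommend_method(priority):
    """Map a dominant deployment priority to the ensemble family
    recommended in Section 6.2.1 (illustrative lookup only)."""
    table = {
        "max_performance": "Stacking Ensemble",
        "real_time": "LightGBM or Random Forest",
        "interpretability": "Random Forest (+ post-hoc explanations)",
        "balanced": "XGBoost",
    }
    # XGBoost is the general-purpose default per the guidelines
    return table.get(priority, "XGBoost")
```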
6.2.2. Implementation Considerations
6.3. Limitations and Future Work
6.3.1. Current Limitations
6.3.2. Future Research Directions
7. Conclusions
- Stacking ensembles achieve superior performance with AUC-ROC scores up to 0.943, representing the state-of-the-art for fraud detection accuracy.
- Random Forest and LightGBM provide optimal balance of performance and efficiency for real-time deployment scenarios.
- Class balancing techniques, particularly SMOTE, are essential for achieving optimal ensemble performance across all methods.
- Feature engineering provides consistent improvements across all ensemble approaches, with systematic feature development yielding up to 19.1% performance gains.
- Method selection should be guided by specific deployment requirements, balancing accuracy, computational efficiency, and interpretability needs.
References
1. F. Almalki and M. Masud, "Financial Fraud Detection Using Explainable AI and Stacking Ensemble Methods," arXiv preprint arXiv:2505.10050, 2025. [CrossRef]
2. A. R. Khalid, N. Owoh, O. Uthmani, M. Ashawa, J. Osamor, and J. Adejoh, "Enhancing credit card fraud detection: an ensemble machine learning approach," Big Data Cogn. Comput., vol. 8, no. 1, p. 6, 2024. [CrossRef]
3. M. A. Talukder, M. Khalid, and M. A. Uddin, "An integrated multistage ensemble machine learning model for fraudulent transaction detection," J. Big Data, vol. 11, no. 1, p. 168, 2024. [CrossRef]
4. F. Moradi, M. Tarif, and M. Homaei, "Ensemble-Based Fraud Detection: A Robust Approach Evaluated on IEEE-CIS," Preprints.org, 2025. [Online]. Available: https://www.preprints.org.
5. F. Moradi, M. Tarif, and M. Homaei, "Robust Fraud Detection with Ensemble Learning: A Case Study on the IEEE-CIS Dataset," Preprint, 2025. [Online]. Available: https://www.preprints.org.
6. F. Moradi, M. Tarif, and M. Homaei, "A Systematic Review of Machine Learning in Credit Card Fraud Detection," Preprint, MDPI AG, 2025. [Online]. Available: https://www.preprints.org.
7. X. Fan and T. J. Boonen, "Machine Learning Algorithms for Credit Card Fraud Detection: Cost-Sensitive and Ensemble Learning Enhancements," SSRN, 2025. [Preprint]. [CrossRef]
8. F. Moradi, M. Tarif, and M. Homaei, "Semi-Supervised Supply Chain Fraud Detection with Unsupervised Pre-Filtering," arXiv preprint arXiv:2508.06574, 2025. [CrossRef]
9. E. Ileberi, Y. Sun, and Z. Wang, "A machine learning based credit card fraud detection using the GA algorithm for feature selection," J. Big Data, vol. 9, no. 1, p. 24, 2022. [CrossRef]
10. A. Singh and A. Jain, "An efficient credit card fraud detection approach using cost-sensitive weak learner with imbalanced dataset," Comput. Intell., vol. 38, no. 6, pp. 2035–2055, 2022. [CrossRef]
11. R. Cao, J. Wang, M. Mao, G. Liu, and C. Jiang, "Feature-wise attention based boosting ensemble method for fraud detection," Eng. Appl. Artif. Intell., vol. 126, p. 106975, 2023. [CrossRef]
12. J. Forough and S. Momtazi, "Ensemble of deep sequential models for credit card fraud detection," Appl. Soft Comput., vol. 99, p. 106883, 2021. [CrossRef]
13. E. Ileberi and Y. Sun, "A Hybrid Deep Learning Ensemble Model for Credit Card Fraud Detection," IEEE Access, vol. 12, pp. 175829–175838, 2024. [CrossRef]
14. E. Esenogho, I. D. Mienye, T. G. Swart, K. Aruleba, and G. Obaido, "A neural network ensemble with feature engineering for improved credit card fraud detection," IEEE Access, vol. 10, pp. 16400–16407, 2022. [CrossRef]
15. T. Awosika, R. M. Shukla, and B. Pranggono, "Transparency and privacy: the role of explainable AI and federated learning in financial fraud detection," IEEE Access, vol. 12, pp. 64551–64560, 2024. [CrossRef]
16. S. Visbeek, E. Acar, and F. den Hengst, "Explainable fraud detection with deep symbolic classification," in Proc. World Conf. Explainable Artificial Intelligence, pp. 350–373, 2024. [CrossRef]
17. Y. Zhou, H. Li, Z. Xiao, and J. Qiu, "A user-centered explainable artificial intelligence approach for financial fraud detection," Finance Res. Lett., vol. 58, p. 104309, 2023. [CrossRef]
18. X. Zhang, Y. Han, W. Xu, and Q. Wang, "HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture," Inf. Sci., vol. 557, pp. 302–316, 2021. [CrossRef]
| Method | AUC-ROC | AUC-PR | F1-Score | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| Stacking Ensemble | 0.943 | 0.891 | 0.856 | 0.923 | 0.798 | 0.847 |
| XGBoost | 0.915 | 0.834 | 0.798 | 0.887 | 0.725 | 0.789 |
| LightGBM | 0.908 | 0.828 | 0.791 | 0.881 | 0.718 | 0.782 |
| Random Forest | 0.895 | 0.802 | 0.765 | 0.864 | 0.685 | 0.756 |
| CatBoost | 0.902 | 0.821 | 0.775 | 0.869 | 0.701 | 0.768 |
| Voting Ensemble | 0.889 | 0.785 | 0.742 | 0.845 | 0.658 | 0.731 |
| Neural Ensemble | 0.876 | 0.756 | 0.718 | 0.812 | 0.641 | 0.705 |
| Method | Training Time (minutes) | Memory Usage (GB) | Inference Time (ms) | Scalability Rating | Parallelization Support |
|---|---|---|---|---|---|
| Random Forest | 12.4 | 3.2 | 15.6 | High | Excellent |
| XGBoost | 18.7 | 4.8 | 22.3 | High | Good |
| LightGBM | 14.2 | 3.9 | 18.9 | High | Excellent |
| CatBoost | 25.1 | 5.4 | 28.7 | Medium | Good |
| Stacking Ensemble | 45.6 | 8.1 | 52.4 | Medium | Limited |
| Voting Ensemble | 31.8 | 6.3 | 38.2 | Medium | Good |
| Neural Ensemble | 67.3 | 12.6 | 45.9 | Low | Limited |
| Method | No Sampling F1 | No Sampling AUC-PR | SMOTE F1 | SMOTE AUC-PR | ADASYN F1 | ADASYN AUC-PR |
|---|---|---|---|---|---|---|
| Random Forest | 0.672 | 0.584 | 0.765 | 0.802 | 0.748 | 0.789 |
| XGBoost | 0.689 | 0.612 | 0.798 | 0.834 | 0.782 | 0.821 |
| LightGBM | 0.678 | 0.598 | 0.791 | 0.828 | 0.776 | 0.815 |
| Stacking Ensemble | 0.724 | 0.656 | 0.856 | 0.891 | 0.841 | 0.878 |
| Voting Ensemble | 0.651 | 0.572 | 0.742 | 0.785 | 0.728 | 0.771 |
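SMOTE's core mechanism (interpolating a minority sample toward one of its minority-class nearest neighbours) can be sketched without external libraries. This is a toy illustration of the idea, not the full SMOTE algorithm, which also handles categorical features and borderline variants:

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Toy SMOTE: each synthetic point lies on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Three minority (fraud) points in 2-D feature space; generate five more
new_points = smote_like([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=5)
```

Because synthetic points are convex combinations of real minority samples, they densify the minority region instead of duplicating records, which is why SMOTE outperforms naive oversampling in the table above.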
| Feature Set | Random Forest F1 | XGBoost F1 | LightGBM F1 | Stacking F1 | Feature Count | Improvement (%) |
|---|---|---|---|---|---|---|
| Baseline (Raw) | 0.642 | 0.671 | 0.665 | 0.698 | 431 | - |
| + Temporal Features | 0.689 | 0.718 | 0.712 | 0.745 | 446 | +6.8% |
| + Amount Features | 0.721 | 0.751 | 0.744 | 0.778 | 458 | +4.4% |
| + Aggregation Features | 0.748 | 0.779 | 0.772 | 0.806 | 486 | +3.6% |
| + Interaction Features | 0.765 | 0.798 | 0.791 | 0.856 | 512 | +6.2% |
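The feature families in this table could be derived from raw transaction records along these lines; the sketch below is illustrative, and the field names (`card_id`, `amount`, `ts`) are assumptions, not the IEEE-CIS schema:

```python
from collections import defaultdict

def engineer_features(transactions):
    """Add temporal, amount, and aggregation features to each transaction.
    Transactions are dicts with illustrative keys card_id, amount, ts (s)."""
    history = defaultdict(list)  # per-card list of past transactions
    out = []
    for t in sorted(transactions, key=lambda t: t["ts"]):
        past = history[t["card_id"]]
        feats = dict(t)
        # temporal: hour of day and gap since the card's previous transaction
        feats["hour"] = (t["ts"] // 3600) % 24
        feats["gap_s"] = t["ts"] - past[-1]["ts"] if past else -1
        # amount: ratio to the card's running mean amount
        mean_amt = (sum(p["amount"] for p in past) / len(past)) if past else t["amount"]
        feats["amt_ratio"] = t["amount"] / mean_amt if mean_amt else 1.0
        # aggregation: number of prior transactions on this card
        feats["card_txn_count"] = len(past)
        out.append(feats)
        past.append(t)
    return out
```

Note that features are computed only from each card's *past* transactions, which avoids the temporal leakage that inflates offline fraud-detection scores.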
| Method | IEEE-CIS F1 | IEEE-CIS AUC | European Cards F1 | European Cards AUC | Synthetic F1 | Synthetic AUC | Avg Rank |
|---|---|---|---|---|---|---|---|
| Stacking Ensemble | 0.856 | 0.943 | 0.834 | 0.918 | 0.821 | 0.905 | 1.0 |
| XGBoost | 0.798 | 0.915 | 0.782 | 0.897 | 0.771 | 0.883 | 2.0 |
| LightGBM | 0.791 | 0.908 | 0.775 | 0.889 | 0.764 | 0.876 | 3.0 |
| Random Forest | 0.765 | 0.895 | 0.748 | 0.872 | 0.739 | 0.859 | 4.0 |
| CatBoost | 0.775 | 0.902 | 0.759 | 0.881 | 0.751 | 0.867 | 3.7 |
| Voting Ensemble | 0.742 | 0.889 | 0.728 | 0.865 | 0.715 | 0.849 | 5.0 |
| Method | Native Interpretability | SHAP Compatibility | Feature Importance | Rule Extraction | Overall Score |
|---|---|---|---|---|---|
| Random Forest | High | Excellent | Native | Good | 4.5/5 |
| XGBoost | Medium | Excellent | Native | Limited | 3.8/5 |
| LightGBM | Medium | Excellent | Native | Limited | 3.8/5 |
| CatBoost | Medium | Good | Native | Limited | 3.5/5 |
| Stacking Ensemble | Low | Limited | Derived | Poor | 2.2/5 |
| Voting Ensemble | Medium | Good | Averaged | Limited | 3.0/5 |
| Neural Ensemble | Very Low | Limited | Post-hoc | Very Poor | 1.8/5 |
| Method | Latency (ms) | Throughput (TPS) | Memory Footprint | Model Size (MB) | Update Speed | Deployment Complexity |
|---|---|---|---|---|---|---|
| Random Forest | 15.6 | 8,500 | Low | 45.2 | Fast | Low |
| XGBoost | 22.3 | 6,200 | Medium | 67.8 | Medium | Medium |
| LightGBM | 18.9 | 7,400 | Low | 52.1 | Fast | Low |
| CatBoost | 28.7 | 4,800 | Medium | 78.9 | Slow | Medium |
| Stacking Ensemble | 52.4 | 2,100 | High | 156.7 | Very Slow | High |
| Voting Ensemble | 38.2 | 3,500 | Medium | 98.4 | Slow | Medium |
| Neural Ensemble | 45.9 | 2,800 | Very High | 234.6 | Very Slow | Very High |
Pairwise statistical significance tests (p-values) between methods:

| Method | RF | XGB | LGBM | Cat | Stack | Vote | Neural |
|---|---|---|---|---|---|---|---|
| Random Forest | – | 0.032 | 0.087 | 0.156 | <0.001 | 0.234 | 0.012 |
| XGBoost | 0.032 | – | 0.445 | 0.089 | <0.001 | <0.001 | <0.001 |
| LightGBM | 0.087 | 0.445 | – | 0.178 | <0.001 | 0.001 | <0.001 |
| CatBoost | 0.156 | 0.089 | 0.178 | – | <0.001 | 0.067 | <0.001 |
| Stacking | <0.001 | <0.001 | <0.001 | <0.001 | – | <0.001 | <0.001 |
| Voting | 0.234 | <0.001 | 0.001 | 0.067 | <0.001 | – | 0.045 |
| Neural | 0.012 | <0.001 | <0.001 | <0.001 | <0.001 | 0.045 | – |
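The document does not state which pairwise test produced these p-values; one common choice for comparing two classifiers evaluated on the same test set is McNemar's test, sketched here with the standard continuity-corrected chi-square approximation:

```python
import math

def mcnemar_p(b, c):
    """McNemar's test (continuity-corrected chi-square, 1 d.o.f.).
    b: samples model A classified correctly and model B incorrectly;
    c: the reverse. Only these discordant counts enter the statistic."""
    if b + c == 0:
        return 1.0  # the models never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # survival function of chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical disagreement counts between two models on one test set:
p = mcnemar_p(b=60, c=30)  # strongly asymmetric disagreements -> small p
```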
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).