1. Introduction
The rapid expansion of digital payment systems has led to a corresponding increase in credit card fraud, making fraud detection one of the most critical challenges in modern financial technology [1]. According to industry reports, financial losses from credit card fraud continue to escalate, with global losses reaching unprecedented levels as fraudulent schemes become increasingly sophisticated [2].
Traditional rule-based fraud detection systems have proven inadequate for addressing the complexity and adaptability of modern fraud patterns. Machine learning approaches, particularly ensemble methods, have emerged as the predominant solution for effective fraud detection [3]. Ensemble learning combines multiple base learners to create more robust and accurate models, offering significant advantages in handling the inherent challenges of fraud detection, including extreme class imbalance, concept drift, and the need for real-time processing.
The IEEE-CIS (Institute of Electrical and Electronics Engineers - Computational Intelligence Society) fraud detection dataset has become the de facto standard benchmark for evaluating fraud detection algorithms [4,5]. This dataset provides a realistic testing environment with 590,540 transactions containing a 3.5% fraud rate, closely mimicking real-world scenarios where fraudulent transactions represent a small minority of all transactions. Recent comprehensive studies have demonstrated the effectiveness of ensemble approaches on this benchmark [6].
Several factors motivate this comprehensive comparative study. First, while numerous ensemble methods have been proposed for fraud detection, few studies provide systematic comparisons across multiple algorithms using consistent evaluation protocols. Second, the integration of advanced data balancing techniques with ensemble methods requires thorough investigation to understand their combined effectiveness. Third, the emerging emphasis on explainable AI in financial applications necessitates analysis of ensemble methods from interpretability perspectives.
This paper makes several key contributions to credit card fraud detection research:
Comprehensive Evaluation Framework: We establish a systematic evaluation protocol for comparing ensemble methods across multiple performance dimensions including accuracy, computational efficiency, and interpretability.
Performance Benchmarking: We provide detailed performance analysis of six major ensemble categories: tree-based methods, gradient boosting approaches, stacking ensembles, voting classifiers, neural ensemble methods, and hybrid approaches.
Feature Engineering Analysis: We investigate the impact of various feature engineering strategies specifically optimized for ensemble methods in fraud detection contexts.
Practical Deployment Guidelines: We offer evidence-based recommendations for selecting appropriate ensemble methods based on specific deployment requirements and constraints.
The remainder of this paper is organized as follows.
Section 2 provides a comprehensive review of related work, covering ensemble learning applications in fraud detection, class imbalance handling techniques, deep learning approaches, and explainable AI integration.
Section 3 details our methodology, including the ensemble learning framework and data preprocessing pipeline.
Section 4 describes the experimental setup, including dataset specifications and evaluation protocols.
Section 5 presents our results and discussion, providing detailed performance comparisons across computational efficiency, class imbalance handling effectiveness, feature engineering impact, cross-dataset generalization, interpretability analysis, real-time deployment considerations, and statistical significance testing.
Section 6 discusses key findings, practical implications for system deployment, current limitations, and future research directions. Finally,
Section 7 concludes with a summary of contributions and evidence-based recommendations for ensemble method selection in credit card fraud detection systems.
2. Related Work
2.1. Ensemble Learning in Fraud Detection
Ensemble learning has demonstrated significant effectiveness in credit card fraud detection applications. Recent comprehensive evaluations by Moradi et al. [4] demonstrated that ensemble-based approaches achieve superior performance on the IEEE-CIS dataset, establishing new benchmarks for fraud detection accuracy. Their systematic analysis [6] provides extensive comparison of machine learning techniques, highlighting the consistent advantages of ensemble methods over individual classifiers.
Khalid et al. [2] presented a comprehensive ensemble approach integrating multiple machine learning algorithms with data balancing techniques, achieving substantial improvements in detection accuracy. Fan and Boonen [7] contributed significant insights into cost-sensitive and ensemble learning enhancements, demonstrating that sophisticated cost-aware approaches can substantially improve practical fraud detection performance.
Moradi et al. [5] conducted extensive case studies on the IEEE-CIS dataset, showing that robust ensemble learning frameworks can achieve AUC-ROC scores exceeding 0.94. Their work highlighted the potential of ensemble methods to address multiple challenges simultaneously, including class imbalance and generalization requirements.
Talukder et al. [3] introduced an integrated multistage ensemble machine learning model that effectively combines diverse ensemble techniques. Their approach demonstrated that sophisticated ensemble architectures can significantly enhance fraud detection accuracy while maintaining computational efficiency suitable for real-time applications.
Recent advances in ensemble methodology have focused on stacking approaches and hybrid methods. Almalki and Masud [1] proposed a fraud detection framework combining stacking ensemble methods with explainable AI techniques. Their approach achieved 99% accuracy while incorporating model interpretability features, demonstrating the potential of combining high performance with transparency requirements. Moradi et al. [8] explored semi-supervised approaches for supply chain fraud detection, showing that hybrid methods combining unsupervised pre-filtering with supervised learning can effectively handle limited labeled data scenarios.
2.2. Class Imbalance Handling Techniques
Class imbalance represents one of the most significant challenges in fraud detection research. The Synthetic Minority Oversampling Technique (SMOTE) has become the standard approach for addressing class imbalance [9]. Recent research has explored various SMOTE variants and their integration with ensemble methods.
Ileberi et al. [9] demonstrated the effectiveness of combining genetic algorithm-based feature selection with Random Forest ensembles, achieving near-perfect accuracy through sophisticated preprocessing and ensemble design. Their work established important precedents for integrating optimization techniques with ensemble learning.
Singh and Jain [10] investigated cost-sensitive learning approaches combined with ensemble methods, showing that careful cost matrix design can significantly improve performance on extremely imbalanced datasets. Cao et al. [11] developed feature-wise attention mechanisms for ensemble methods, demonstrating that sophisticated feature weighting can enhance ensemble performance.
2.3. Deep Learning and Advanced Approaches
Deep learning approaches have shown increasing promise for fraud detection applications. Forough and Momtazi [12] developed ensemble deep sequential models specifically for credit card fraud detection, combining multiple deep learning architectures to achieve superior temporal pattern recognition. Ileberi and Sun [13] proposed hybrid deep learning ensemble models that integrate traditional machine learning with deep neural networks.
Esenogho et al. [14] demonstrated that neural network ensembles with sophisticated feature engineering can achieve excellent performance while maintaining reasonable computational requirements. Their work highlighted the importance of careful architecture design for ensemble neural networks.
2.4. Explainable AI Integration
The integration of explainable AI with ensemble methods has gained significant attention due to regulatory requirements in financial applications. Awosika et al. [15] investigated the role of explainable AI and federated learning in financial fraud detection, emphasizing the importance of model transparency for regulatory compliance.
Visbeek et al. [16] developed explainable fraud detection using deep symbolic classification, demonstrating innovative approaches for maintaining interpretability in complex ensemble systems. Their work on symbolic classification provides practical frameworks for extracting interpretable rules from ensemble predictions while preserving detection accuracy. Zhou et al. [17] proposed user-centered explainable AI approaches that balance performance with interpretability requirements, showing that careful design can achieve both high accuracy and meaningful explanations.
3. Methodology
3.1. Ensemble Learning Framework
Our evaluation framework encompasses six major categories of ensemble methods, each representing different approaches to combining base learners for fraud detection:
Tree-based Ensembles: Random Forest and Extra Trees methods that combine multiple decision trees with different randomization strategies.
Gradient Boosting Methods: XGBoost, LightGBM, and CatBoost algorithms that sequentially build models to correct previous predictions.
Stacking Ensembles: Multi-level learning approaches that use meta-learners to combine base model predictions.
Voting Classifiers: Hard and soft voting approaches that combine predictions through majority voting or probability averaging.
Neural Ensemble Methods: Multiple neural network architectures combined through various aggregation strategies.
Hybrid Approaches: Methods that combine traditional machine learning with deep learning components.
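To make these categories concrete, the following is a minimal scikit-learn sketch of a stacking and a soft-voting ensemble; the base learners and hyperparameters are illustrative placeholders, not the tuned configurations used in our experiments.

```python
# Sketch: representatives of the stacking and voting categories.
# Hyperparameters are illustrative, not our tuned experimental values.
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
]

# Stacking: a logistic-regression meta-learner combines out-of-fold
# predictions from the base models (cv=5 controls how those are produced).
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)

# Soft voting: average the base models' predicted class probabilities.
vote = VotingClassifier(estimators=base_learners, voting="soft")
```

Hard voting corresponds to `voting="hard"`, which takes a majority vote over predicted labels instead of averaging probabilities.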
3.2. Data Preprocessing Pipeline
Our preprocessing pipeline addresses the specific challenges of fraud detection datasets through a systematic four-stage approach:
3.2.1. Feature Engineering
We implement domain-specific feature engineering strategies optimized for ensemble methods. Zhang et al. [18] developed the HOBA (homogeneity-oriented behavior analysis) methodology specifically for credit card fraud detection, providing a framework for systematic feature construction. Following their approach, temporal features are decomposed into cyclical components, transaction amounts undergo log transformation to handle skewness, and aggregation features capture behavioral patterns across different time windows. Advanced feature engineering techniques include velocity features, frequency encoding, and interaction features that capture complex relationships between transaction attributes.
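A minimal pandas sketch of these transformations, assuming IEEE-CIS-style column names (`TransactionDT`, `TransactionAmt`, `card1`); the features used in our full pipeline are considerably more extensive.

```python
# Illustrative sketch of the engineered-feature families described above.
# Column names follow the IEEE-CIS schema; thresholds are examples only.
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cyclical decomposition: hour-of-day from the transaction timestamp.
    hour = (out["TransactionDT"] // 3600) % 24
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    # Log transform of the heavily skewed transaction amount.
    out["amt_log"] = np.log1p(out["TransactionAmt"])
    # Aggregation feature: mean amount per card, capturing behaviour.
    out["card_amt_mean"] = (
        out.groupby("card1")["TransactionAmt"].transform("mean"))
    # Frequency encoding of the card identifier.
    out["card_freq"] = out["card1"].map(out["card1"].value_counts())
    return out
```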
3.2.2. Class Imbalance Handling
We evaluate multiple class balancing techniques including SMOTE [9], ADASYN, and Borderline-SMOTE. Each technique is systematically compared across different ensemble methods to understand their interaction effects.
3.2.3. Feature Selection and Dimensionality Reduction
We apply mutual information-based feature selection and correlation analysis to reduce dimensionality while preserving ensemble performance. Principal Component Analysis is evaluated for its impact on ensemble interpretability.
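The mutual-information step can be sketched with scikit-learn's `SelectKBest`; `k = 10` below is an arbitrary illustrative value, not the cutoff used in our pipeline.

```python
# Mutual-information feature selection: keep the k highest-scoring
# features with respect to the fraud label. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)
selector = SelectKBest(mutual_info_classif, k=10)
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # reduced from 30 features to 10
```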
3.2.4. Data Quality Assurance
Missing value imputation strategies are customized for financial data, with domain-specific approaches for different feature types. Outlier detection and treatment are implemented to improve ensemble robustness.
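An illustrative sketch of type-aware imputation combined with IQR-based outlier clipping; the `clean` helper and its 1.5-IQR threshold are our own simplification, not the full production pipeline.

```python
# Sketch: median imputation for numeric columns, a "missing" category
# for categoricals, and IQR clipping to limit outlier influence.
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include=np.number):
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Clip extreme values to the conventional 1.5*IQR fences.
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        out[col] = out[col].fillna(out[col].median())
    for col in out.select_dtypes(exclude=np.number):
        out[col] = out[col].fillna("missing")
    return out
```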
3.3. Evaluation Metrics and Protocols
Given the imbalanced nature of fraud detection, we employ comprehensive evaluation metrics specifically designed for imbalanced classification problems. Recent advances in cost-sensitive learning [7] emphasize the importance of incorporating business-relevant costs into evaluation frameworks:
Area Under ROC Curve (AUC-ROC): Primary metric for overall discrimination capability
Area Under Precision-Recall Curve (AUC-PR): Critical metric for imbalanced datasets
F1-Score: Harmonic mean of precision and recall
Balanced Accuracy: Average of sensitivity and specificity
Matthews Correlation Coefficient (MCC): Correlation between predictions and true labels
Cost-Sensitive Metrics: Business-relevant metrics incorporating misclassification costs
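All of these metrics except the business-specific cost-sensitive ones are available directly in scikit-learn; a toy example on illustrative arrays:

```python
# Computing the imbalance-aware metrics listed above.
# y_true / y_score are tiny illustrative arrays, not experiment data.
import numpy as np
from sklearn.metrics import (average_precision_score,
                             balanced_accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.6, 0.7, 0.9])
y_pred = (y_score >= 0.5).astype(int)  # hard labels at a 0.5 threshold

print("AUC-ROC:", roc_auc_score(y_true, y_score))
print("AUC-PR :", average_precision_score(y_true, y_score))
print("F1     :", f1_score(y_true, y_pred))
print("BalAcc :", balanced_accuracy_score(y_true, y_pred))
print("MCC    :", matthews_corrcoef(y_true, y_pred))
```

Note that AUC-ROC and AUC-PR are computed from scores, while F1, balanced accuracy, and MCC require thresholded labels; the operating threshold is itself a deployment decision.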
4. Experimental Setup
4.1. Datasets
Our experimental evaluation utilizes multiple datasets to ensure comprehensive assessment, building on recent systematic studies that have established standardized evaluation protocols [6].
4.1.1. Primary Dataset: IEEE-CIS
The IEEE-CIS fraud detection dataset serves as our primary benchmark, containing 590,540 transactions with a 3.5% fraud rate. The dataset includes 431 features across transaction and identity information, providing a realistic testing environment for ensemble methods.
4.1.2. Secondary Datasets
We include additional datasets for cross-validation of results, ensuring that findings generalize beyond a single data source. These datasets vary in size, feature complexity, and fraud rates to test ensemble robustness.
4.2. Experimental Design
We implement a rigorous experimental protocol using stratified 10-fold cross-validation to ensure reliable performance estimation. Each ensemble method is evaluated using identical data splits to enable fair comparison.
4.2.1. Hyperparameter Optimization
We employ Bayesian optimization with 5-fold cross-validation for hyperparameter tuning. The optimization process balances performance with computational efficiency, using early stopping to prevent overfitting.
4.2.2. Statistical Significance Testing
We apply paired t-tests with Bonferroni correction to assess statistical significance of performance differences. Effect sizes are calculated using Cohen’s d to determine practical significance.
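The procedure can be sketched as follows; the per-fold F1 scores are illustrative stand-ins, not the study's measured results.

```python
# Sketch: paired t-test on per-fold scores with a Bonferroni-adjusted
# threshold, plus a paired-samples Cohen's d. Fold scores are made up.
import numpy as np
from scipy import stats

stack_f1 = np.array([0.85, 0.86, 0.84, 0.87, 0.85,
                     0.86, 0.85, 0.84, 0.86, 0.85])
xgb_f1 = np.array([0.80, 0.79, 0.80, 0.81, 0.79,
                   0.80, 0.80, 0.79, 0.81, 0.80])

t_stat, p_value = stats.ttest_rel(stack_f1, xgb_f1)

n_comparisons = 21                 # 7 methods -> C(7, 2) pairwise tests
alpha_adj = 0.05 / n_comparisons   # Bonferroni-corrected threshold

diff = stack_f1 - xgb_f1
cohens_d = diff.mean() / diff.std(ddof=1)  # effect size on fold diffs

print(f"p = {p_value:.2e}, significant at {alpha_adj:.4f}: "
      f"{p_value < alpha_adj}")
print(f"Cohen's d = {cohens_d:.2f}")
```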
5. Results and Discussion
5.1. Overall Performance Comparison
Table 1 presents the comprehensive performance comparison of all evaluated ensemble methods on the IEEE-CIS dataset. The results demonstrate clear performance hierarchies across different ensemble approaches.
The stacking ensemble achieves superior performance across all key metrics, with AUC-ROC of 0.943 and F1-score of 0.856. This represents significant improvements of 3.1% in AUC-ROC and 7.3% in F1-score compared to the best individual gradient boosting method (XGBoost).
5.2. Computational Efficiency Analysis
Table 2 analyzes the computational characteristics of different ensemble methods, considering both training and inference requirements.
While stacking ensembles achieve the best performance, they require significantly more computational resources. Random Forest provides the best balance of performance and efficiency, making it suitable for resource-constrained environments.
5.3. Class Imbalance Handling Effectiveness
Table 3 evaluates the effectiveness of different class balancing techniques across ensemble methods.
SMOTE consistently provides the most significant improvements across all ensemble methods, with average F1-score improvements ranging from 13.9% (Random Forest) to 18.2% (Stacking Ensemble).
5.4. Feature Engineering Impact Analysis
Table 4 demonstrates the impact of various feature engineering strategies on ensemble performance.
The systematic addition of engineered features provides consistent improvements, with the complete feature engineering pipeline delivering 19.1% improvement in F1-score for stacking ensembles compared to baseline features. This validates the importance of domain-specific feature engineering methodologies such as HOBA [18] for maximizing ensemble performance.
5.5. Cross-Dataset Generalization Analysis
Table 5 evaluates the generalization capability of ensemble methods across different datasets.
Stacking ensembles demonstrate superior generalization across datasets, maintaining top performance rankings consistently. The performance degradation across datasets is minimal (2.6% average F1-score decline), indicating robust generalization capabilities.
5.6. Interpretability and Explainability Analysis
Table 6 evaluates ensemble methods from interpretability and explainability perspectives, crucial for financial applications.
Random Forest provides the best interpretability characteristics while maintaining competitive performance, making it suitable for regulatory environments requiring model transparency.
5.7. Real-Time Deployment Considerations
Table 7 analyzes practical deployment characteristics for real-time fraud detection systems.
LightGBM and Random Forest emerge as optimal choices for high-throughput real-time systems, offering excellent performance-efficiency trade-offs.
5.8. Statistical Significance Analysis
Table 8 presents statistical significance testing results comparing ensemble methods.
The stacking ensemble shows statistically significant improvements over all other methods, confirming its superior performance is not due to random variation.
6. Discussion
6.1. Key Findings
Our comprehensive evaluation reveals several important insights for ensemble method selection in credit card fraud detection:
Performance Hierarchy: Stacking ensembles consistently achieve superior performance across multiple metrics and datasets, demonstrating the effectiveness of meta-learning approaches for fraud detection. However, this performance comes at the cost of increased computational complexity and reduced interpretability.
Computational Trade-offs: While sophisticated ensemble methods like stacking achieve the best accuracy, simpler approaches like Random Forest and LightGBM provide excellent performance-efficiency trade-offs suitable for high-throughput production environments.
Class Imbalance Sensitivity: All ensemble methods benefit significantly from proper class balancing techniques, with SMOTE providing the most consistent improvements across different algorithms. The interaction between balancing techniques and ensemble methods requires careful consideration during system design.
Feature Engineering Impact: Systematic feature engineering provides substantial improvements across all ensemble methods, with the benefits being particularly pronounced for stacking approaches. The investment in domain-specific feature engineering yields consistent returns across different algorithmic approaches.
6.2. Practical Implications
6.2.1. Method Selection Guidelines
Based on our analysis, we provide evidence-based guidelines for ensemble method selection:
Maximum Performance Requirements: Stacking ensembles for scenarios where accuracy is paramount and computational resources are abundant.
Real-time Systems: LightGBM or Random Forest for high-throughput environments requiring sub-second response times.
Interpretability Requirements: Random Forest for regulatory environments requiring model transparency and explainability, with post-hoc explanation techniques [16] for enhanced interpretability.
Balanced Requirements: XGBoost for general-purpose applications requiring good performance with reasonable computational overhead.
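As a sketch of post-hoc explanation for a Random Forest, permutation importance is one model-agnostic option (SHAP, assessed in Table 6, is another) and is available directly in scikit-learn:

```python
# Sketch: post-hoc feature attribution via permutation importance.
# Data is synthetic; in practice this runs on a held-out fraud set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in model score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print("Most influential features:", ranking[:3])
```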
6.2.2. Implementation Considerations
Our results highlight several critical implementation considerations for production fraud detection systems:
Data Pipeline Design: The preprocessing pipeline significantly impacts ensemble performance. Careful feature engineering and class balancing strategies must be integrated into the data pipeline design from the beginning.
Model Update Strategies: Different ensemble methods have varying requirements for model updates and retraining. Simpler methods like Random Forest support more frequent updates, while complex stacking approaches may require batch retraining.
Monitoring and Maintenance: Ensemble methods require different monitoring approaches. Performance degradation may manifest differently across ensemble components, requiring sophisticated monitoring strategies.
6.3. Limitations and Future Work
6.3.1. Current Limitations
Our study has several limitations that should be acknowledged:
Dataset Specificity: While we evaluated multiple datasets, the findings may not generalize to all fraud detection scenarios, particularly those with different fraud patterns or data characteristics.
Concept Drift: Our evaluation assumes static fraud patterns. Real-world deployment requires handling concept drift and evolving fraud techniques, which may affect ensemble performance differently.
Cost Sensitivity: Our evaluation focuses primarily on statistical metrics. Real-world deployment requires incorporating business-specific costs for different types of errors, as demonstrated by recent advances in cost-sensitive learning approaches [7].
6.3.2. Future Research Directions
Several research directions emerge from our findings:
Adaptive Ensemble Methods: Development of ensemble approaches that can automatically adapt to changing fraud patterns and data distributions, building on semi-supervised learning frameworks [8].
Federated Ensemble Learning: Investigation of ensemble methods that can learn across multiple institutions while preserving privacy requirements.
Quantum-Enhanced Ensembles: Exploration of quantum computing approaches for ensemble learning in fraud detection applications.
Edge Computing Optimization: Development of ensemble methods optimized for edge computing environments with strict resource constraints.
7. Conclusions
This comprehensive comparative analysis of ensemble methods for credit card fraud detection provides evidence-based guidance for algorithm selection and implementation, building on recent systematic reviews and robust experimental studies. Our systematic evaluation across multiple datasets demonstrates that ensemble methods consistently outperform individual classifiers, with stacking approaches achieving the best overall performance.
Key conclusions from our research include:
Stacking ensembles achieve superior performance with AUC-ROC scores up to 0.943, representing the state-of-the-art for fraud detection accuracy.
Random Forest and LightGBM provide optimal balance of performance and efficiency for real-time deployment scenarios.
Class balancing techniques, particularly SMOTE, are essential for achieving optimal ensemble performance across all methods.
Feature engineering provides consistent improvements across all ensemble approaches, with systematic feature development yielding up to 19.1% performance gains.
Method selection should be guided by specific deployment requirements, balancing accuracy, computational efficiency, and interpretability needs.
The findings provide practical guidance for implementing ensemble-based fraud detection systems in production environments. As fraudulent schemes continue to evolve, ensemble methods offer the robustness and adaptability necessary for effective detection while maintaining the performance characteristics required for real-world deployment.
Future research should focus on developing adaptive ensemble approaches that can handle concept drift and evolving fraud patterns while maintaining the performance advantages demonstrated in this study. The integration of emerging technologies such as federated learning and quantum computing with ensemble methods represents promising directions for advancing the state-of-the-art in fraud detection.
References
- F. Almalki and M. Masud, "Financial Fraud Detection Using Explainable AI and Stacking Ensemble Methods," arXiv preprint arXiv:2505.10050, 2025. [CrossRef]
- A. R. Khalid, N. Owoh, O. Uthmani, M. Ashawa, J. Osamor, and J. Adejoh, "Enhancing credit card fraud detection: an ensemble machine learning approach," Big Data Cogn. Comput., vol. 8, no. 1, p. 6, 2024. [CrossRef]
- M. A. Talukder, M. Khalid, and M. A. Uddin, "An integrated multistage ensemble machine learning model for fraudulent transaction detection," J. Big Data, vol. 11, no. 1, p. 168, 2024. [CrossRef]
- F. Moradi, M. Tarif, and M. Homaei, "Ensemble-Based Fraud Detection: A Robust Approach Evaluated on IEEE-CIS," Preprints.org, 2025. [Online]. Available: https://www.preprints.org.
- F. Moradi, M. Tarif, and M. Homaei, "Robust Fraud Detection with Ensemble Learning: A Case Study on the IEEE-CIS Dataset," Preprint, 2025. [Online]. Available: https://www.preprints.org.
- F. Moradi, M. Tarif, and M. Homaei, "A Systematic Review of Machine Learning in Credit Card Fraud Detection," Preprint, MDPI AG, 2025. [Online]. Available: https://www.preprints.org.
- X. Fan and T. J. Boonen, "Machine Learning Algorithms for Credit Card Fraud Detection: Cost-Sensitive and Ensemble Learning Enhancements," SSRN preprint, 2025. [CrossRef]
- F. Moradi, M. Tarif, and M. Homaei, "Semi-Supervised Supply Chain Fraud Detection with Unsupervised Pre-Filtering," arXiv preprint arXiv:2508.06574, 2025. [CrossRef]
- E. Ileberi, Y. Sun, and Z. Wang, "A machine learning based credit card fraud detection using the GA algorithm for feature selection," J. Big Data, vol. 9, no. 1, p. 24, 2022. [CrossRef]
- A. Singh and A. Jain, "An efficient credit card fraud detection approach using cost-sensitive weak learner with imbalanced dataset," Comput. Intell., vol. 38, no. 6, pp. 2035–2055, 2022. [CrossRef]
- R. Cao, J. Wang, M. Mao, G. Liu, and C. Jiang, "Feature-wise attention based boosting ensemble method for fraud detection," Eng. Appl. Artif. Intell., vol. 126, p. 106975, 2023. [CrossRef]
- J. Forough and S. Momtazi, "Ensemble of deep sequential models for credit card fraud detection," Appl. Soft Comput., vol. 99, p. 106883, 2021. [CrossRef]
- E. Ileberi and Y. Sun, "A Hybrid Deep Learning Ensemble Model for Credit Card Fraud Detection," IEEE Access, vol. 12, pp. 175829–175838, 2024. [CrossRef]
- E. Esenogho, I. D. Mienye, T. G. Swart, K. Aruleba, and G. Obaido, "A neural network ensemble with feature engineering for improved credit card fraud detection," IEEE Access, vol. 10, pp. 16400–16407, 2022. [CrossRef]
- T. Awosika, R. M. Shukla, and B. Pranggono, "Transparency and privacy: the role of explainable AI and federated learning in financial fraud detection," IEEE Access, vol. 12, pp. 64551–64560, 2024. [CrossRef]
- S. Visbeek, E. Acar, and F. den Hengst, "Explainable fraud detection with deep symbolic classification," in *Proc. World Conf. Explainable Artificial Intelligence*, pp. 350–373, 2024. [CrossRef]
- Y. Zhou, H. Li, Z. Xiao, and J. Qiu, "A user-centered explainable artificial intelligence approach for financial fraud detection," Finance Res. Lett., vol. 58, p. 104309, 2023. [CrossRef]
- X. Zhang, Y. Han, W. Xu, and Q. Wang, "HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture," Inf. Sci., vol. 557, pp. 302–316, 2021. [CrossRef]
Table 1. Overall Performance Comparison on IEEE-CIS Dataset.

| Method | AUC-ROC | AUC-PR | F1-Score | Precision | Recall | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| Stacking Ensemble | 0.943 | 0.891 | 0.856 | 0.923 | 0.798 | 0.847 |
| XGBoost | 0.915 | 0.834 | 0.798 | 0.887 | 0.725 | 0.789 |
| LightGBM | 0.908 | 0.828 | 0.791 | 0.881 | 0.718 | 0.782 |
| Random Forest | 0.895 | 0.802 | 0.765 | 0.864 | 0.685 | 0.756 |
| CatBoost | 0.902 | 0.821 | 0.775 | 0.869 | 0.701 | 0.768 |
| Voting Ensemble | 0.889 | 0.785 | 0.742 | 0.845 | 0.658 | 0.731 |
| Neural Ensemble | 0.876 | 0.756 | 0.718 | 0.812 | 0.641 | 0.705 |
Table 2. Computational Efficiency Analysis.

| Method | Training Time (min) | Memory Usage (GB) | Inference Time (ms) | Scalability Rating | Parallelization Support |
| --- | --- | --- | --- | --- | --- |
| Random Forest | 12.4 | 3.2 | 15.6 | High | Excellent |
| XGBoost | 18.7 | 4.8 | 22.3 | High | Good |
| LightGBM | 14.2 | 3.9 | 18.9 | High | Excellent |
| CatBoost | 25.1 | 5.4 | 28.7 | Medium | Good |
| Stacking Ensemble | 45.6 | 8.1 | 52.4 | Medium | Limited |
| Voting Ensemble | 31.8 | 6.3 | 38.2 | Medium | Good |
| Neural Ensemble | 67.3 | 12.6 | 45.9 | Low | Limited |
Table 3. Impact of Class Imbalance Handling Techniques.

| Method | No Sampling F1 | No Sampling AUC-PR | SMOTE F1 | SMOTE AUC-PR | ADASYN F1 | ADASYN AUC-PR |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 0.672 | 0.584 | 0.765 | 0.802 | 0.748 | 0.789 |
| XGBoost | 0.689 | 0.612 | 0.798 | 0.834 | 0.782 | 0.821 |
| LightGBM | 0.678 | 0.598 | 0.791 | 0.828 | 0.776 | 0.815 |
| Stacking Ensemble | 0.724 | 0.656 | 0.856 | 0.891 | 0.841 | 0.878 |
| Voting Ensemble | 0.651 | 0.572 | 0.742 | 0.785 | 0.728 | 0.771 |
Table 4. Feature Engineering Impact on Ensemble Performance.

| Feature Set | Random Forest F1 | XGBoost F1 | LightGBM F1 | Stacking F1 | Feature Count | Improvement (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (Raw) | 0.642 | 0.671 | 0.665 | 0.698 | 431 | – |
| + Temporal Features | 0.689 | 0.718 | 0.712 | 0.745 | 446 | +6.8% |
| + Amount Features | 0.721 | 0.751 | 0.744 | 0.778 | 458 | +4.4% |
| + Aggregation Features | 0.748 | 0.779 | 0.772 | 0.806 | 486 | +3.6% |
| + Interaction Features | 0.765 | 0.798 | 0.791 | 0.856 | 512 | +6.2% |
Table 5. Cross-Dataset Performance Analysis.

| Method | IEEE-CIS F1 | IEEE-CIS AUC | European Cards F1 | European Cards AUC | Synthetic F1 | Synthetic AUC | Avg Rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Stacking Ensemble | 0.856 | 0.943 | 0.834 | 0.918 | 0.821 | 0.905 | 1.0 |
| XGBoost | 0.798 | 0.915 | 0.782 | 0.897 | 0.771 | 0.883 | 2.0 |
| LightGBM | 0.791 | 0.908 | 0.775 | 0.889 | 0.764 | 0.876 | 3.0 |
| Random Forest | 0.765 | 0.895 | 0.748 | 0.872 | 0.739 | 0.859 | 4.0 |
| CatBoost | 0.775 | 0.902 | 0.759 | 0.881 | 0.751 | 0.867 | 3.7 |
| Voting Ensemble | 0.742 | 0.889 | 0.728 | 0.865 | 0.715 | 0.849 | 5.0 |
Table 6. Interpretability and Explainability Assessment.

| Method | Native Interpretability | SHAP Compatibility | Feature Importance | Rule Extraction | Overall Score |
| --- | --- | --- | --- | --- | --- |
| Random Forest | High | Excellent | Native | Good | 4.5/5 |
| XGBoost | Medium | Excellent | Native | Limited | 3.8/5 |
| LightGBM | Medium | Excellent | Native | Limited | 3.8/5 |
| CatBoost | Medium | Good | Native | Limited | 3.5/5 |
| Stacking Ensemble | Low | Limited | Derived | Poor | 2.2/5 |
| Voting Ensemble | Medium | Good | Averaged | Limited | 3.0/5 |
| Neural Ensemble | Very Low | Limited | Post-hoc | Very Poor | 1.8/5 |
Table 7. Real-Time Deployment Analysis.

| Method | Latency (ms) | Throughput (TPS) | Memory Footprint | Model Size (MB) | Update Speed | Deployment Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| Random Forest | 15.6 | 8,500 | Low | 45.2 | Fast | Low |
| XGBoost | 22.3 | 6,200 | Medium | 67.8 | Medium | Medium |
| LightGBM | 18.9 | 7,400 | Low | 52.1 | Fast | Low |
| CatBoost | 28.7 | 4,800 | Medium | 78.9 | Slow | Medium |
| Stacking Ensemble | 52.4 | 2,100 | High | 156.7 | Very Slow | High |
| Voting Ensemble | 38.2 | 3,500 | Medium | 98.4 | Slow | Medium |
| Neural Ensemble | 45.9 | 2,800 | Very High | 234.6 | Very Slow | Very High |
Table 8. Pairwise Statistical Significance Comparison (p-values).

| Method | RF | XGB | LGBM | Cat | Stack | Vote | Neural |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random Forest | – | 0.032 | 0.087 | 0.156 | <0.001 | 0.234 | 0.012 |
| XGBoost | 0.032 | – | 0.445 | 0.089 | <0.001 | <0.001 | <0.001 |
| LightGBM | 0.087 | 0.445 | – | 0.178 | <0.001 | 0.001 | <0.001 |
| CatBoost | 0.156 | 0.089 | 0.178 | – | <0.001 | 0.067 | <0.001 |
| Stacking | <0.001 | <0.001 | <0.001 | <0.001 | – | <0.001 | <0.001 |
| Voting | 0.234 | <0.001 | 0.001 | 0.067 | <0.001 | – | 0.045 |
| Neural | 0.012 | <0.001 | <0.001 | <0.001 | <0.001 | 0.045 | – |