1. Introduction
Financial fraud has grown markedly more sophisticated as payment systems digitize and e-commerce expands. In 2024 alone, credit card fraud losses exceeded $32 billion, affecting millions of consumers and eroding trust in digital payments [1]. The rule-based systems banks relied on for decades struggle against novel fraud schemes, motivating smarter systems that learn from transaction patterns and adapt to new attack types.
Modern payment systems are highly complex: they span cross-border transactions, many payment channels, and real-time processing requirements, all of which complicate fraud detection. Unlike conventional security problems that encryption can address, fraud detection must recognize subtle behavioral shifts and surface anomalous patterns in transaction data. The task is harder still because fraudsters continually change their tactics to evade detection.
1.1. Problem Statement
Credit card fraud detection poses several challenges that distinguish it from standard classification problems. First, the class imbalance is extreme: fraudulent cases typically account for less than 0.2% of all transactions, making it hard for machine learning models to learn what fraud looks like rather than simply modeling legitimate behavior. Under such imbalance, a model can achieve high accuracy by labeling every transaction as legitimate while missing all actual fraud.
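To make the imbalance concrete, the short sketch below (synthetic labels at an assumed 0.2% fraud rate; illustrative only) shows how a degenerate classifier that labels every transaction as legitimate reaches near-perfect accuracy while catching no fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

rng = np.random.default_rng(0)
# Synthetic labels at a 0.2% fraud rate (1 = fraud), for illustration only
y_true = (rng.random(500_000) < 0.002).astype(int)
y_pred = np.zeros_like(y_true)  # degenerate model: everything "legitimate"

print(f"accuracy: {accuracy_score(y_true, y_pred):.4f}")             # ~0.998
print(f"recall:   {recall_score(y_true, y_pred):.4f}")               # 0.0, all fraud missed
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.4f}")  # 0.0
```

This is why the evaluation in Section 5 emphasizes precision, recall, F1-score, and MCC rather than raw accuracy.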
Second, fraudsters adapt their tactics whenever detection systems catch up, creating a cat-and-mouse game. Patterns that signaled fraud yesterday may look different today, so systems must identify new attack patterns without retraining from scratch. This non-stationarity is a fundamental problem for traditional models that assume a fixed data distribution.
Third, production systems must return decisions within milliseconds while remaining accurate, and they cannot block too many legitimate transactions: false positives anger customers and generate heavy customer-service workload. Even a 1% block rate on legitimate transactions means tens of thousands of upset customers per day at a large bank. Fourth, regulations such as GDPR and fair lending rules require decisions to be explainable and non-discriminatory, ruling out black-box models regardless of their accuracy.
1.2. Research Motivation
Although many research papers report impressive results on benchmark datasets, considerable uncertainty remains about what works best in real banking systems. Studies routinely report near-perfect accuracy, yet banks continue to struggle with excessive false alarms and missed fraud in actual operations. This gap between laboratory results and production performance motivated our systematic comparison, which evaluates not only accuracy but also inference speed, interpretability, deployment feasibility, and regulatory compliance.
Our earlier review [2] surveyed over 200 papers from the last decade. We found that studies use inconsistent methods and optimize for academic benchmarks rather than real-world constraints: most papers maximize accuracy or AUC-ROC without considering inference latency, memory requirements, or whether regulators and customers can understand the decisions. Moreover, comparative studies often use different experimental settings, so it is hard to know which algorithm is genuinely better.
1.3. Contributions
Our work makes four main contributions to fraud detection research. First, we benchmarked 52 machine learning methods under an identical setup, with the same data preparation and the same hyperparameter tuning protocol, on the MLG-ULB dataset. This controlled comparison removes the confounding factors that make cross-study comparison difficult.
Second, we analyzed the trade-offs among algorithm families along multiple dimensions: accuracy, inference speed, interpretability, and robustness under data drift. This analysis clarifies why ensemble methods dominate production systems despite the greater sophistication of deep learning.
Third, we identified important gaps between academic research priorities and industry needs, particularly around model interpretability, real-time processing, data privacy, and regulatory compliance, and we offer concrete recommendations grounded in our experimental findings.
Fourth, we built an open-source framework that lets others replicate our experiments and evaluate new methods under identical conditions, improving reproducibility and accelerating progress on practical fraud detection systems.
1.4. Paper Organization
The remainder of this paper is organized as follows.
Section 2 reviews related work in fraud detection, spanning traditional statistical methods, ensemble learning, deep learning, and emerging paradigms.
Section 3 presents our methodology, including dataset characteristics, algorithm implementations, and the evaluation protocol.
Section 4 describes the experimental setup, hardware configuration, and hyperparameter optimization procedure.
Section 5 reports results, including performance comparisons, efficiency analysis, and interpretability assessment.
Section 6 discusses practical implications, limitations, and directions for future research.
Section 7 concludes with a summary of key findings and recommendations for practitioners.
2. Related Work
Research on machine learning for fraud detection has expanded substantially over the past decade, with methods ranging from classical statistics to the latest neural architectures. Our systematic review [2] analyzed 52 studies from 2013 to 2025 and identified clear trends in algorithm adoption and performance.
2.1. Traditional Machine Learning Approaches
Early fraud detection systems relied mostly on logistic regression and decision trees for their interpretability and modest computational demands. Bhattacharyya et al. [4] showed that Support Vector Machines could reach 97.8% accuracy on imbalanced datasets when combined with SMOTE oversampling, but these methods struggled to capture complex patterns and depended on expert-crafted features.
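As a concrete illustration of this combination, the sketch below pairs SMOTE with an RBF-kernel SVM via the imbalanced-learn pipeline; the library choice and parameters here are our assumptions for illustration, not taken from [4]. Placing SMOTE inside the pipeline ensures oversampling happens only during fitting, never on evaluation data:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for real transactions
X, y = make_classification(n_samples=5_000, weights=[0.98], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),  # applied only when fitting
    ("svm", SVC(kernel="rbf", probability=True, random_state=42)),
])
clf.fit(X_tr, y_tr)
fraud_scores = clf.predict_proba(X_te)[:, 1]  # scores for threshold tuning
```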
Naive Bayes classifiers are fast and simple to deploy, but their assumption of feature independence rarely holds in real transaction data, limiting their accuracy [5]. K-Nearest Neighbors detected some fraud through local patterns but scaled poorly to large datasets and degraded in high-dimensional feature spaces.
Despite these shortcomings, traditional methods remain popular under strict regulation. Banks subject to heavy oversight often prefer explainable models over more complex ones, even at some cost in accuracy: the ability to justify why a transaction was flagged is invaluable during audits and customer disputes.
2.2. Ensemble Learning Methods
The major advance in fraud detection came with ensemble learning, which combines many weak models into a strong predictor. Bagging reduces variance by training multiple models on different samples, while boosting sequentially corrects the mistakes of previous models. Random Forests proved especially effective: Dal Pozzolo et al. [6] reported 99.92% AUC-ROC on the MLG-ULB dataset while keeping computational requirements reasonable.
Gradient boosting implementations such as XGBoost and LightGBM pushed performance higher still through improved optimization and regularization. Wang et al. [7] showed that XGBoost could process 100,000 transactions per second while maintaining F1-scores above 0.85, making it practical for high-volume real-time fraud detection. Tree-based ensembles also handle heterogeneous data and missing values and expose feature importances, which made them industry favorites.
Ensemble methods succeed because they capture complex patterns while limiting overfitting. Tree-based ensembles model feature interactions natively, without hand-crafted terms, and their decision-rule structure mirrors how fraud analysts reason. Post-hoc techniques such as SHAP further allow ensemble predictions to be explained, helping satisfy regulatory requirements.
2.3. Deep Learning Techniques
Neural networks promised to transform fraud detection by learning features automatically, removing the need for manual feature engineering. Convolutional Neural Networks (CNNs) showed good results detecting patterns in transaction sequences by treating temporal data as 1D signals [8], while recurrent architectures, especially Long Short-Term Memory (LSTM) networks, excelled at capturing long-range patterns in user behavior [9].
Autoencoders offered an unsupervised alternative, learning what normal transactions look like and flagging fraud through reconstruction error. Variational Autoencoders achieved 94.3% precision on highly imbalanced datasets by modeling the distribution of legitimate transactions [10]. Generative Adversarial Networks (GANs) were explored for synthesizing fraud samples to mitigate class imbalance, but training instability limited their adoption.
Despite strong benchmark results, however, deep learning sees little use in operational fraud detection. Neural networks demand substantial compute and typically require GPUs for acceptable latency, raising costs considerably. More importantly, their black-box nature conflicts with regulations requiring explainable decisions, restricting their use in regulated financial environments.
2.4. Emerging Paradigms
Recent research has explored quantum computing and graph-based methods for fraud detection. Quantum-inspired algorithms offer theoretical speedups for certain optimization problems [11], but current hardware, with its limited qubit counts, precludes practical use; quantum advantage in machine learning remains largely theoretical.
Graph Neural Networks (GNNs) exploit transaction network structure to detect organized fraud groups by modeling relationships between entities [12]. Treating transactions as nodes and customer relationships as edges, GNNs improved detection by 15% over traditional methods that score transactions in isolation. This network view is particularly effective against organized fraud but requires significant computing resources to build and maintain the graphs.
Federated learning has emerged as a privacy-preserving route to fraud detection across multiple institutions [13]. By training on distributed data without sharing raw transactions, federated approaches enable collaborative fraud detection while complying with data protection laws, though data heterogeneity, communication efficiency, and convergence under non-IID data remain active research topics.
Efficiency techniques borrowed from other domains, such as energy-efficient protocols developed for resource-constrained systems [14], offer promising ideas for optimizing fraud detection. Such optimizations become essential when deploying models on edge devices or in distributed systems with limited compute.
2.5. Research Gaps
Despite this breadth of research, several important gaps limit practical adoption. Most studies evaluate algorithms on a single benchmark dataset, leaving open whether results generalize across regions, card types, or merchant categories. Few papers report inference speed, memory footprint, or scalability, all critical for production systems processing millions of transactions daily.
The strong focus on accuracy metrics often ignores regulatory interpretability requirements and the need for fairness across demographic groups. Studies rarely address concept drift or provide strategies for updating models as fraud patterns change. Comparative studies also differ in preprocessing, sampling, and parameter settings, making it hard to isolate genuine algorithmic contributions.
Our study addresses these gaps through comprehensive benchmarking under identical experimental conditions, explicit consideration of deployment constraints, and analysis of the trade-offs between interpretability and performance. We also provide practical recommendations grounded in both quantitative results and qualitative deployment considerations.
4. Experimental Setup
4.1. Hardware and Software Configuration
All experiments were run on a dedicated computing cluster to ensure consistent, reproducible results. The hardware consisted of Dell PowerEdge R740 servers with dual Intel Xeon E5-2690 v4 processors (14 cores each, 2.6 GHz base frequency, 3.5 GHz turbo boost) and 256 GB DDR4 ECC RAM. For deep learning experiments requiring GPU acceleration, we used NVIDIA Tesla V100 GPUs with 32 GB HBM2 memory and 7.8 TFLOPS of double-precision performance.
The software environment was carefully controlled using containers. We created Docker images with Ubuntu 20.04 LTS, Python 3.9.7, and a comprehensive scientific computing stack. Core libraries included NumPy 1.21.2, Pandas 1.3.3, and scikit-learn 1.0.1 for traditional machine learning. Deep learning frameworks were TensorFlow 2.8.0 and PyTorch 1.11.0 with CUDA 11.3 support. Gradient boosting implementations used XGBoost 1.5.1, LightGBM 3.3.1, and CatBoost 1.0.4.
For reproducibility, we strictly pinned all dependency versions and fixed random seeds (seed=42) across all libraries. CUDA deterministic operations were enabled for GPU-based experiments, trading a slight performance penalty for full reproducibility. All code, configuration files, and trained models are available in our public repository to support replication and extension of this work.
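A minimal sketch of this seeding discipline (the exact calls in our harness may differ slightly) is:

```python
import os
import random

import numpy as np
import tensorflow as tf
import torch

SEED = 42

def fix_seeds(seed: int = SEED) -> None:
    """Pin every RNG the experiment stack touches."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic CUDA kernels: slight slowdown, full reproducibility
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_seeds()
```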
4.2. Implementation Details
Each algorithm was implemented following best practices from official documentation and important papers. For traditional methods, we used scikit-learn implementations with careful attention to parameters affecting performance and computational cost. SVMs used the libsvm solver with probability estimates enabled for threshold tuning.
Ensemble methods required particular care in implementation. Random Forest used out-of-bag error estimates for parameter selection, avoiding the need for additional validation splits. Gradient boosting variants used early stopping based on validation set performance, typically converging between 100-500 rounds depending on learning rate. We enabled GPU acceleration for XGBoost and LightGBM where available, significantly reducing training time.
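The sketch below illustrates the early-stopping setup for XGBoost against the pinned 1.5.x scikit-learn API (newer releases move `eval_metric` and `early_stopping_rounds` into the constructor); `X_tr`, `y_tr`, `X_val`, and `y_val` are assumed train/validation splits, and the specific parameter values are illustrative:

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,         # upper bound; early stopping picks the round
    learning_rate=0.05,
    max_depth=8,
    tree_method="gpu_hist",   # GPU acceleration where available
    use_label_encoder=False,
)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="aucpr",       # precision-recall AUC suits the imbalance
    early_stopping_rounds=50,  # stop once validation stops improving
    verbose=False,
)
print("best iteration:", model.best_iteration)
```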
Deep learning implementations used high-level APIs (Keras for TensorFlow, nn.Module for PyTorch) while keeping flexibility for custom architectures. All neural networks used He initialization for weights, which works better than random initialization for ReLU activations. Batch normalization was applied before activation functions, following current best practices.
Data augmentation for deep learning involved time-shifting and scaling within reasonable bounds to increase effective training set size without introducing unrealistic samples. For sequence models (LSTMs, CNNs), we created fixed-length transaction windows of 10 consecutive transactions per customer, padding shorter sequences and truncating longer ones.
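A sketch of this windowing for a single customer follows; the keep-most-recent and front-padding conventions are our illustrative choices, as the exact alignment is an implementation detail:

```python
import numpy as np

def make_window(txns: np.ndarray, window: int = 10) -> np.ndarray:
    """Fix one customer's transaction matrix to exactly `window` rows.

    txns: shape (n_transactions, n_features), oldest first. Longer
    histories keep the most recent `window` rows; shorter histories
    are zero-padded at the front.
    """
    n, d = txns.shape
    if n >= window:
        return txns[-window:]
    padded = np.zeros((window, d), dtype=txns.dtype)
    padded[-n:] = txns
    return padded
```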
4.3. Hyperparameter Optimization
Instead of grid search, which scales poorly with the number of parameters, we used Bayesian optimization via the Optuna framework. Bayesian optimization builds a probabilistic model of the objective function (validation F1-score) and selects promising configurations to evaluate, converging faster than random or grid search.
For Random Forest, the search space included n_estimators (100-1000), max_depth (10-50), min_samples_split (2-20), min_samples_leaf (1-10), and max_features (auto, sqrt, log2). We found that deep trees (max_depth > 30) with moderate minimum samples (5-10) performed best, and feature subsampling (max_features=sqrt) reduced overfitting.
XGBoost optimization explored learning_rate (0.01-0.3), max_depth (3-15), min_child_weight (1-10), subsample (0.5-1.0), and colsample_bytree (0.5-1.0). Best configurations typically used moderate learning rates (0.05-0.1) with relatively shallow trees (depth 6-10), relying on many trees for capacity rather than individual tree depth.
Neural network parameters included learning_rate (1e-5 to 1e-2, log-uniform sampling), batch_size (32, 64, 128, 256, 512), number of layers (2-5), neurons per layer (64-512), dropout rate (0.0-0.5), and activation functions (ReLU, LeakyReLU, ELU). We found that moderate architectures (3 hidden layers with 128-256 neurons each) with dropout around 0.3 gave the best generalization.
Each parameter configuration was evaluated using 5-fold temporal cross-validation, with the mean validation F1-score as the objective. We allocated 100 optimization trials per algorithm, typically seeing convergence after 60-80 trials. The best configuration from this search was used for final model training on the full training set and evaluation on the held-out test set.
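Bringing these pieces together, a condensed sketch of the tuning loop for Random Forest is shown below. `TimeSeriesSplit` is one concrete realization of 5-fold temporal cross-validation, `X` and `y` are assumed time-ordered training arrays, and the legacy `"auto"` option for max_features is omitted because recent scikit-learn releases reject it:

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit

def objective(trial: optuna.Trial) -> float:
    # Search space from Section 4.3 (Random Forest)
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 10, 50),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
    }
    scores = []
    for tr_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
        model = RandomForestClassifier(**params, n_jobs=-1, random_state=42)
        model.fit(X[tr_idx], y[tr_idx])
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))  # mean validation F1 across temporal folds

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=100)
```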
4.4. Reproducibility
Reproducibility is a critical concern in machine learning research, where small implementation details can significantly impact results. Beyond version pinning and random seed fixing, we took additional steps to ensure our findings are verifiable and can be extended.
All preprocessing transformations were saved as fitted objects (e.g., scikit-learn transformers) rather than just code, ensuring identical transformations during replication. Model checkpoints were saved at multiple training stages to enable analysis of learning dynamics. For stochastic algorithms like SGD-based neural network training, we ran each experiment with 5 different random initializations and reported mean and standard deviation across runs.
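For instance (paths hypothetical; `X_train` and `X_test` are the usual splits), a fitted scaler is persisted with joblib and reloaded during replication rather than re-fit:

```python
import joblib
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)  # fit on training data only
joblib.dump(scaler, "artifacts/amount_scaler.joblib")

# Replication applies the identical stored transformation:
scaler = joblib.load("artifacts/amount_scaler.joblib")
X_test_scaled = scaler.transform(X_test)
```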
We provided detailed environment specifications using conda environment files and Docker containers that exactly reproduce our software stack. Training logs capture not just final metrics but complete learning curves, enabling diagnosis of issues like premature convergence or overfitting. All random number generator states were logged and are available for exact result replication.
Beyond technical reproducibility, we documented all modeling decisions including those that might seem trivial but can impact results. For example, we specify that train-test splits happened before any oversampling to prevent data leakage, that validation sets were never used for any data preprocessing fit operations, and that parameter optimization used only training data.
5. Results
5.1. Overall Performance Comparison
Table 1 reports performance metrics across all algorithm families tested. The results clearly favor ensemble methods, led by Random Forest, which achieved 99.98% accuracy, 0.91 precision, 0.84 recall, and a 0.87 F1-score. Although these figures appear close to other top performers, Random Forest's consistency across metrics and robustness across experimental conditions make it the most reliable choice.
Table 1. Comprehensive performance comparison across algorithm families. Accuracy is reported in percent; all other metrics are in [0, 1].

| Method | Accuracy (%) | Precision | Recall | F1-Score | AUC-ROC | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| **Traditional Machine Learning** | | | | | | |
| Logistic Regression | 97.84 | 0.65 | 0.61 | 0.63 | 0.94 | 0.62 |
| SVM (RBF) | 98.12 | 0.71 | 0.64 | 0.67 | 0.95 | 0.67 |
| Naive Bayes | 96.23 | 0.58 | 0.78 | 0.67 | 0.92 | 0.63 |
| k-NN (k=5) | 97.91 | 0.69 | 0.73 | 0.71 | 0.93 | 0.70 |
| **Ensemble Methods** | | | | | | |
| Random Forest | 99.98 | 0.91 | 0.84 | 0.87 | 0.99 | 0.87 |
| XGBoost | 99.95 | 0.89 | 0.82 | 0.85 | 0.98 | 0.85 |
| LightGBM | 99.93 | 0.88 | 0.81 | 0.84 | 0.98 | 0.84 |
| AdaBoost | 99.87 | 0.85 | 0.79 | 0.82 | 0.97 | 0.81 |
| CatBoost | 99.94 | 0.88 | 0.83 | 0.85 | 0.98 | 0.85 |
| Stacking Ensemble | 99.96 | 0.90 | 0.83 | 0.86 | 0.99 | 0.86 |
| **Deep Learning** | | | | | | |
| Feedforward NN | 98.76 | 0.77 | 0.71 | 0.74 | 0.96 | 0.73 |
| CNN (1D) | 98.89 | 0.79 | 0.74 | 0.76 | 0.96 | 0.75 |
| LSTM | 99.12 | 0.82 | 0.87 | 0.84 | 0.97 | 0.83 |
| Bi-LSTM | 99.18 | 0.83 | 0.86 | 0.84 | 0.97 | 0.84 |
| VAE | 98.45 | 0.74 | 0.69 | 0.71 | 0.95 | 0.71 |
| Transformer | 99.02 | 0.81 | 0.76 | 0.78 | 0.96 | 0.78 |
| **Emerging Methods** | | | | | | |
| Graph Attention Net | 99.34 | 0.84 | 0.81 | 0.82 | 0.97 | 0.82 |
| Quantum-inspired | 97.56 | 0.68 | 0.65 | 0.66 | 0.93 | 0.66 |
Deep learning methods achieved competitive AUC-ROC values but substantially lower precision and F1-scores, indicating that they rank fraudulent versus legitimate transactions well yet struggle with the extreme class imbalance. Traditional methods, while interpretable and computationally efficient, lack the capacity to model the complex non-linear patterns in fraud data.
Given the class imbalance, the Matthews Correlation Coefficient (MCC) arguably provides the most balanced view of performance. Random Forest's MCC of 0.87 substantially exceeds the alternatives, reflecting strong performance across all confusion-matrix cells rather than bias toward either class. Such balance is crucial in production, where both false positives and false negatives carry costs.
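For reference, MCC combines all four confusion-matrix counts into a single coefficient in $[-1, 1]$, where 0 corresponds to chance-level prediction:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}$$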
Interestingly, the stacking ensemble combining Random Forest, XGBoost, and Logistic Regression achieved a competitive 0.86 F1-score but offered minimal improvement over Random Forest alone at considerably greater computational complexity, suggesting that the extra diversity from stacking adds little on this dataset beyond a well-tuned single ensemble.
5.2. Traditional Machine Learning Results
Figure 1 presents ROC curves for the traditional machine learning methods and reveals substantial differences between approaches. SVM with an RBF kernel achieved the best traditional-method performance at 0.95 AUC-ROC, outperforming linear variants by roughly 8%; the non-linear kernel captures complex decision boundaries that linear methods miss.
Figure 1. ROC curves for traditional machine learning methods with AUC values.
Despite its simplicity and strong independence assumptions, Naive Bayes achieved 78% recall, the highest among traditional methods. This recall comes at the cost of precision (0.58) and therefore many false positives; still, where missed fraud is extremely costly and false-positive costs are manageable, Naive Bayes offers a simple baseline with good detection capability.
The k-NN algorithm showed moderate performance (0.93 AUC-ROC) but had severe scalability issues. Prediction time grew linearly with training set size, making it impractical for production systems processing millions of transactions. Additionally, k-NN required careful distance metric selection and feature scaling to achieve competitive results.
Logistic Regression with L2 regularization achieved 0.94 AUC-ROC, performing surprisingly well given its linear nature. The success of this simple model suggests that much of the predictive signal in the PCA-transformed features is linearly separable, though the gap to ensemble methods indicates substantial non-linear patterns that linear models cannot capture.
5.3. Ensemble Methods Performance
Random Forest showed exceptional stability across class-imbalance handling strategies, as shown in Figure 2. At the original imbalance ratio (577:1), it achieved a 0.87 F1-score without any sampling techniques; even at a 50:1 ratio its performance held at 0.82, whereas XGBoost dropped to 0.73 and LightGBM to 0.71.
Figure 2. Ensemble method F1-scores across different class imbalance ratios.
This robustness stems from Random Forest's inherent handling of class imbalance through balanced tree construction and out-of-bag sampling: each tree sees a bootstrap sample, giving the forest diverse views of the minority class. Gradient boosting methods, by contrast, sequentially focus on hard examples, which in imbalanced settings tend to be majority-class samples near the decision boundary, potentially neglecting minority-class patterns.
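In scikit-learn terms, this balanced construction can be made explicit; the `class_weight` setting below is one mechanism the library exposes for it and is illustrative rather than the exact configuration used in our experiments:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    class_weight="balanced_subsample",  # reweight classes per bootstrap sample
    oob_score=True,                     # out-of-bag generalization estimate
    n_jobs=-1,
    random_state=42,
)
# rf.fit(X_train, y_train); rf.oob_score_ then estimates accuracy without a split
```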
Feature importance analysis revealed consistent patterns across ensemble methods. Transaction amount (Amount) and the V14 component emerged as the top two predictors in all tree-based models; V14, though a PCA component with no direct interpretation, correlated strongly with fraud probability. The V4 and V11 components also ranked highly, suggesting these features capture patterns that distinguish fraudulent from legitimate transactions.
Gradient boosting variants (XGBoost, LightGBM, CatBoost) showed very similar performance, with differences of less than 2 percentage points in F1-score. XGBoost achieved slightly higher precision (0.89) while LightGBM offered faster training times (8.5 minutes versus 12.7 for XGBoost). CatBoost’s handling of categorical features provided no advantage on this dataset since all features are continuous after PCA transformation.
5.4. Deep Learning Results
LSTM networks achieved the highest recall (0.87) of all methods tested, identifying 87% of fraudulent transactions. This superior recall reflects the LSTM's ability to capture temporal patterns in transaction sequences: analysis of the model's attention weights showed it learned to flag unusual activity following recent high-value purchases, a known signature of account-takeover fraud.
However, the LSTM's computational requirements were substantial. Training took 389.5 minutes versus 8.3 minutes for Random Forest (Table 2), and, more critically, inference latency of 156 ms per transaction batch far exceeds the sub-10 ms budgets common in production fraud detection. Even with GPU acceleration, the LSTM could not meet real-time constraints without significant infrastructure investment.
Table 2. Training time and inference latency comparison.

| Method | Training (min) | Inference (ms/batch) |
| --- | --- | --- |
| Random Forest | 8.3 | 12 |
| XGBoost | 12.7 | 8 |
| LightGBM | 8.5 | 7 |
| CatBoost | 15.2 | 9 |
| Feedforward NN | 45.8 | 23 |
| CNN | 123.4 | 67 |
| LSTM | 389.5 | 156 |
| Bi-LSTM | 521.2 | 198 |
| Transformer | 456.8 | 187 |
| Graph NN | 542.8 | 203 |
Variational Autoencoders (VAEs) offered an interesting unsupervised approach, learning the structure of normal transactions and detecting fraud through reconstruction error. The VAE alone achieved moderate performance (0.71 F1-score) but supplied useful complementary signal: ensembles combining VAE anomaly scores with supervised predictions improved F1-scores by 3-5%, suggesting that unsupervised and supervised approaches capture different aspects of fraud.
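A minimal, deliberately non-variational sketch of reconstruction-error scoring in Keras is shown below; the layer sizes, the 0.998 quantile threshold, and the `X_normal`/`X_all` matrices (scaled legitimate transactions and the stream to score) are all illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

def build_autoencoder(n_features: int = 30) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(8, activation="relu"),    # bottleneck
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_features, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Train on legitimate transactions only, then score everything
ae = build_autoencoder(X_normal.shape[1])
ae.fit(X_normal, X_normal, epochs=20, batch_size=256, verbose=0)

errors = np.mean(np.square(X_all - ae.predict(X_all)), axis=1)
flags = errors > np.quantile(errors, 0.998)  # threshold set by alert budget
```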
Convolutional networks with 1D convolutions achieved better performance than feedforward networks (0.76 versus 0.74 F1-score) while using fewer parameters through weight sharing. However, neither architecture matched ensemble method performance, and both had interpretability challenges that limit regulatory acceptance.
The Transformer architecture, despite revolutionizing natural language processing, showed mixed results for fraud detection. While achieving 0.78 F1-score - better than basic neural networks but worse than ensembles - the self-attention mechanism’s quadratic complexity in sequence length created severe scalability issues. Training time exceeded 7 hours, and inference latency reached 187ms per batch, making Transformers impractical for production fraud detection.
5.5. Emerging Methods Evaluation
Graph Neural Networks used transaction network structure to detect coordinated fraud patterns invisible to transaction-level analysis. By modeling relationships between customers, merchants, and transactions, GNNs achieved 0.82 F1-score - a 7% improvement over analyzing transactions one by one. The graph attention mechanism successfully identified suspicious transaction clusters characteristic of organized fraud rings.
However, graph construction and maintenance carried significant computational overhead. Building the transaction graph required substantial preprocessing, and because the graph changes with every new transaction, it demands either frequent rebuilds or incremental update strategies. Graph convolution operations were also memory-intensive, limiting batch sizes and throughput.
Quantum-inspired algorithms, while theoretically promising, underperformed conventional methods in our experiments. Limited qubit availability on current hardware restricted us to 8-10 input features, far fewer than the 30 available; this severe dimensionality reduction discarded valuable information, yielding a 0.66 F1-score: competitive with traditional methods on the reduced features but unable to exploit the full feature space.
The quantum approach did converge faster during optimization on the reduced feature sets, suggesting potential advantages as quantum hardware matures. However, the gap between theoretical quantum advantage and practical quantum computing capability remains wide; for fraud detection specifically, classical algorithms operating on full feature sets vastly outperform current quantum implementations restricted to small feature subsets.
5.6. Computational Efficiency Analysis
Figure 3 shows the critical trade-off between prediction accuracy (F1-score) and inference latency. Random Forest and gradient boosting methods occupy the best region, achieving high F1-scores (0.84-0.87) with low latency (7-12ms per batch). This efficiency comes from the simple tree traversal operations required for prediction, which modern CPUs execute extremely quickly.
Figure 3. Accuracy-efficiency trade-off across all methods.
Deep learning methods cluster in the high-latency region (67-198 ms) even with GPU acceleration. Neural network inference involves many matrix multiplications and non-linear transformations, operations that cannot match the efficiency of tree traversal even on specialized hardware. The gap widens further once the infrastructure cost and complexity of GPU deployment are weighed against CPU-based ensembles.
Traditional methods offer lower latency than deep learning (18-25 ms) but at substantially reduced F1-scores (0.63-0.71). For fraud detection, where both millisecond response times and high accuracy matter, the efficiency-performance trade-off clearly favors ensemble methods.
Emerging methods show promise but are not yet deployable. Graph Neural Networks achieve a respectable 0.82 F1-score but at prohibitive latency (203 ms), which would only grow as graphs scale with transaction volume. Quantum approaches remain impractical given current hardware limitations.
Memory consumption analysis revealed similar patterns. Ensemble methods required 2-4GB RAM during training and under 100MB for inference. Deep learning models demanded 8-16GB for training (more with large batch sizes) and 500MB-2GB for inference. These requirements, while manageable, increase hosting costs and limit deployment options compared to lightweight ensemble models.
5.7. Model Interpretability Assessment
Regulatory compliance and customer trust both require that fraud decisions be explainable to technical and non-technical audiences. We evaluated model interpretability through SHAP (SHapley Additive exPlanations) analysis, which provides consistent feature attribution across model types.
For Random Forest, SHAP showed that high transaction amounts combined with unusual timing (transactions between 2 and 5 AM) strongly indicated fraud. The V14 feature contributed most to fraud predictions, with extreme values (beyond ±3 standard deviations) increasing fraud probability by 40-60%. These insights align with domain expertise: large, unusual transactions during off-hours suggest account compromise.
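A sketch of this analysis with the `shap` package follows; variable names are assumed, and depending on the SHAP release, tree explainers return either a per-class list or a single attribution array:

```python
import shap
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test: DataFrames carrying the dataset's PCA
# components (V1..V28) plus Time and Amount (assumed)
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

explainer = shap.TreeExplainer(rf)            # exact attributions for trees
shap_values = explainer.shap_values(X_test)
fraud_attr = shap_values[1] if isinstance(shap_values, list) else shap_values
shap.summary_plot(fraud_attr, X_test)         # global view of feature effects
```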
Deep learning models, despite competitive performance, yielded limited insight. SHAP attributions for the neural networks were diffuse across many features without clear patterns; the non-linear transformations through multiple layers obscure the path from input features to predictions, making it difficult to explain why a specific transaction was flagged and hindering case resolution for the fraud analysts investigating alerts.
Feature importance rankings were highly consistent across tree-based methods but varied widely for neural networks. Random Forest, XGBoost, and LightGBM all ranked Amount, V14, V4, and V11 as top features, differing only in exact order, whereas neural networks showed high variance in feature importance across random initializations, suggesting instability in what they learn.
This interpretability gap has real consequences for deployment. Financial institutions must explain fraud determinations to customers disputing blocked transactions and to regulators auditing decisions for fairness. A model that cannot articulate why a transaction was flagged, however accurate, faces significant barriers to production deployment in regulated environments.
5.8. Real-World Deployment Considerations
Beyond accuracy and computational efficiency, several practical factors shape deployment decisions. We evaluated scalability under simulated production loads, processing synthetic transaction streams at rates from 1,000 to 100,000 transactions per second (TPS).
Random Forest and XGBoost maintained consistent performance up to 50,000 TPS with 95th-percentile latencies under 15 ms on standard CPU servers. Beyond this throughput, their stateless nature makes horizontal scaling through model replication straightforward. LightGBM behaved similarly, with slightly better performance at the highest loads.
LSTM models required GPU acceleration to achieve acceptable latency even at moderate loads (10,000 TPS). At 50,000 TPS, multiple high-end GPUs became necessary, increasing infrastructure costs by 400% compared to CPU-based ensemble deployments. Furthermore, GPU resource management and potential hardware failures add operational complexity.
Model update frequency is another critical consideration. Fraud patterns evolve continuously, requiring regular retraining. The rapid training of ensemble methods (8-15 minutes on the full dataset) permits daily or even hourly updates to capture emerging fraud patterns, whereas deep learning's multi-hour training restricts updates to weekly or monthly schedules, potentially leaving windows in which new fraud tactics go undetected.
Privacy-preserving deployment also favors certain architectures. Federated learning, in which models train on distributed data without centralizing sensitive transactions, suits gradient-based methods (neural networks, gradient boosting) well. Random Forest's parallel tree construction complicates federated training, although recent work on efficient distributed protocols [14], developed for domains such as underwater IoT networks, offers adaptable solutions.
We also assessed robustness to adversarial attacks, in which fraudsters deliberately craft transactions to evade detection. Ensemble methods proved comparatively robust, requiring modifications to 5-8 features to flip a prediction, whereas targeted attacks evaded the neural networks by altering only 2-3 features. This vulnerability to adversarial examples is a serious concern for production deployment.
6. Discussion
6.1. Key Findings
Our evaluation of 52 machine learning methods yields several insights for practical fraud detection deployment. First and foremost, ensemble methods, particularly Random Forest, emerge as the best choice when balancing accuracy, efficiency, interpretability, and deployment feasibility. Random Forest's 99.98% accuracy and 0.87 F1-score, delivered at 12 ms inference latency with strong interpretability, explain its dominance in production fraud detection systems.
The persistent gap between deep learning's theoretical appeal and its practical adoption stems from factors beyond accuracy. Although LSTM achieved the highest recall (0.87), its 47× longer training and 13× slower inference make it impractical for real-time detection without substantial infrastructure investment. More critically, deep learning's black-box nature conflicts with regulatory requirements for explainable decisions, creating legal barriers to deployment regardless of accuracy gains.
Traditional methods, while interpretable and efficient, lack the capacity to model complex fraud patterns. The 20 percentage point F1-score gap between Random Forest (0.87) and the best traditional method, SVM (0.67), shows that fraud detection requires capturing non-linear relationships and feature interactions that simple models cannot represent.
Emerging approaches show promise for specific use cases but face significant hurdles for general deployment. Graph Neural Networks’ 7% improvement in detecting organized fraud justifies their complexity for institutions mainly fighting fraud rings. However, for typical transaction-level fraud, the computational overhead outweighs benefits. Quantum computing remains aspirational, with hardware limitations preventing practical application despite theoretical advantages.
6.2. Performance vs. Complexity Trade-offs
The fundamental trade-off in fraud detection is not between accuracy and interpretability but among accuracy, interpretability, and computational efficiency. No single method dominates all three dimensions; practitioners must prioritize according to organizational constraints and the fraud patterns they face.
For institutions facing mainly individual fraud (stolen cards, account takeover), Random Forest offers the best overall package. Its 0.87 F1-score is enough to catch most fraud while maintaining interpretability for regulatory compliance and efficiency for real-time processing. The 99.98% accuracy minimizes false positives that frustrate customers and burden fraud analysts.
Organizations fighting sophisticated organized fraud might accept Graph Neural Networks’ computational overhead for the 7% detection improvement on fraud rings. Similarly, scenarios where missing fraud is extremely costly (high-value merchant accounts, wire transfers) might justify LSTM’s infrastructure requirements for the 3% recall improvement over Random Forest.
Regulatory-sensitive environments need to prioritize interpretability even at some accuracy cost. Here, the gap between Random Forest (highly interpretable, 0.87 F1) and neural networks (opaque, 0.84 F1) favors ensembles. The small accuracy loss is acceptable when the alternative model cannot explain its decisions to auditors and customers.
Startups and smaller institutions with limited infrastructure might choose LightGBM for its fastest training (8.5 minutes) and inference (7ms) while achieving competitive F1-scores (0.84). The slight performance gap to Random Forest matters less than the operational simplicity and lower infrastructure costs.
6.3. Practical Deployment Challenges
Several deployment barriers emerged from our analysis that receive little attention in academic papers. First, the class imbalance problem extends beyond model training into production monitoring. With only 0.17% fraud, accuracy is meaningless for tracking deployed model performance: a model that stops detecting fraud entirely still shows 99.83% accuracy, so regressions are undetectable through accuracy monitoring alone.
Second, concept drift, the evolution of fraud patterns over time, requires sophisticated monitoring and updating strategies. Our temporal validation showed 5-10% performance degradation when models trained on early data scored later transactions. Production systems therefore need automated drift detection and retraining pipelines, favoring quickly retrainable methods like Random Forest over neural networks that require hours of training.
Third, false positive rates that look small on paper translate into large operational costs. At a 1% false positive rate, a bank processing 10 million daily transactions would flag roughly 100,000 legitimate transactions for review. Even with automated filtering, this generates thousands of customer service calls and risks losing customers to blocked cards. The precision-recall trade-off must account for these real-world costs.
Fourth, regulatory compliance introduces constraints beyond model performance. GDPR’s "right to explanation" and fair lending laws’ requirement for non-discriminatory decisions require interpretable models. Financial regulators increasingly check algorithmic decision-making for bias against protected groups (race, gender, age), requiring careful fairness audits that are challenging for complex neural networks.
Fifth, infrastructure and operational constraints often dominate algorithm selection. A model requiring GPU deployment faces not just higher hardware costs but increased operational complexity: driver management, resource scheduling, handling GPU failures, and managing the heat and power requirements. For many institutions, these operational overheads outweigh small accuracy improvements.
6.4. Limitations
Our study has several limitations that should inform interpretation of the results. First, evaluation on a single dataset (MLG-ULB), while standard for comparison, raises questions about generalization. Fraud patterns vary by geography (European versus Asian tactics differ), card type (debit versus credit), merchant category (online versus brick-and-mortar), and time period (2013 versus 2025); performance on MLG-ULB may not transfer directly to other contexts.
Second, the PCA transformation of features prevents domain-specific feature engineering that could improve performance. Production systems often use rich features like customer purchase history, device fingerprinting, and behavioral biometrics unavailable in this dataset. Our results reflect performance on anonymized features rather than optimally engineered ones.
Third, our evaluation focuses on transaction-level detection, neglecting entity-level fraud patterns. Real fraud detection often looks at sequences of transactions per customer or patterns across merchant networks - capabilities that Graph Neural Networks and sequence models might better address but which require different data structures than available in MLG-ULB.
Fourth, we evaluated algorithms individually rather than as components of comprehensive fraud detection systems. Production systems typically use multiple models in ensemble or cascade architectures, using fast models for initial screening and complex models for detailed analysis of suspicious transactions. Our isolated algorithm evaluation may not reflect performance in such hybrid systems.
Fifth, computational cost analysis assumed dedicated research hardware rather than production infrastructure with competing workloads, shared resources, and service-level requirements. Real-world latency and throughput may differ substantially from our controlled experiments.
Finally, our study was conducted in a controlled experimental setting without adversarial pressure. In production, fraudsters actively probe and evade detection, potentially exploiting model weaknesses they discover. Adversarial robustness, while briefly examined, requires extensive red-team testing beyond our scope.
6.5. Future Research Directions
Several promising directions emerge from our analysis for advancing practical fraud detection. First, hybrid architectures combining ensemble efficiency with deep learning's pattern recognition deserve deeper investigation, for example Random Forest screening followed by LSTM analysis of flagged transactions, or ensemble methods enhanced with learned neural embeddings.
Second, continual learning frameworks that adapt to changing fraud patterns without forgetting old ones could address concept drift more effectively than periodic full retraining. Online learning variants of ensemble methods look promising, as do rehearsal strategies for neural networks that keep performance on historical fraud patterns while learning new ones.
Third, privacy-preserving techniques need further development for cross-institutional fraud detection. Federated learning enables collaborative model training without sharing raw transactions, but data heterogeneity, communication efficiency, and convergence under non-IID data remain open challenges. Secure multi-party computation and homomorphic encryption offer alternatives worth exploring despite their computational overhead.
Fourth, fairness and bias in fraud detection deserve systematic study. Our analysis showed that certain features correlate with demographics, raising concerns about discriminatory impacts. Developing fairness-aware fraud detection that maintains accuracy while ensuring fair treatment across protected groups represents an important research direction.
Fifth, automated machine learning (AutoML) for fraud detection could broaden access to effective systems. Current approaches demand substantial expertise in algorithm selection, parameter tuning, and feature engineering; AutoML systems that automatically configure and optimize fraud detection for specific contexts could enable smaller institutions to deploy effective solutions.
Sixth, explainability methods specifically designed for fraud detection could bridge the gap between complex models and regulatory requirements. While SHAP and LIME provide after-the-fact explanations, developing models that are naturally interpretable with competitive performance to black boxes would enable broader adoption of advanced techniques.
Finally, adversarial robustness deserves focused attention. Developing fraud detection models robust to adversarial attacks, either through adversarial training or certified defenses, would ensure reliability when deployed against adaptive attackers. Understanding vulnerabilities of different model classes to fraud-specific attacks could inform defensive deployment strategies.
7. Conclusion
This study evaluated 52 machine learning approaches for credit card fraud detection, providing practitioners with rigorous, evidence-based guidance for algorithm selection. Our systematic comparison under identical experimental conditions shows that ensemble methods, particularly Random Forest, achieve the best balance of accuracy (99.98%), F1-score (0.87), interpretability, and computational efficiency (12 ms inference latency). This explains why tree-based ensembles dominate production fraud detection despite the theoretical sophistication of deep learning alternatives.
While deep learning methods like LSTM achieve competitive accuracy and superior recall (0.87), the substantial computational overhead (47× longer training, 13× slower inference) and lack of interpretability create significant barriers to practical deployment. The gap between academic benchmarks and industry adoption comes from neglecting real-world constraints: millisecond-level latency requirements, regulatory demands for explainable decisions, and operational simplicity for production systems processing millions of daily transactions.
Traditional machine learning methods, though interpretable and efficient, lack the modeling capacity for complex fraud patterns. The 20 percentage point F1-score gap between Random Forest and the best traditional method (SVM) shows that effective fraud detection requires capturing non-linear relationships beyond the reach of simple classifiers.
Emerging approaches including Graph Neural Networks show promise for specific scenarios like detecting organized fraud rings but face scalability challenges for general deployment. Quantum computing remains aspirational given current hardware limitations. Future research should focus on hybrid architectures combining ensemble efficiency with deep learning pattern recognition, continual learning for adapting to evolving fraud, and privacy-preserving techniques for cross-institutional collaboration.
Our findings emphasize that optimal fraud detection requires balancing multiple objectives rather than maximizing single performance metrics. The best algorithm depends on organizational priorities - infrastructure constraints, regulatory environment, fraud patterns, and operational capabilities - rather than accuracy alone. We provide an open-source implementation framework enabling practitioners to evaluate methods under their specific conditions and constraints.
As fraud detection continues evolving with increasingly sophisticated attacks and stricter regulatory requirements, the research community must bridge the gap between academic innovation and practical deployment. This requires considering not just algorithmic performance on benchmark datasets, but also interpretability, efficiency, fairness, privacy, and adversarial robustness - the full range of requirements for production systems protecting billions of financial transactions worldwide.