Submitted: 04 January 2026
Posted: 07 January 2026
Abstract
Keywords:
1. Introduction
1.1. An Overview of the Key Contributions
- Unified pipeline, combining XAI (LIME) and conformal prediction for joint interpretability and uncertainty quantification.
- Conditional explanation generation, which reduces computational overhead by focusing on flagged or uncertain transactions.
- Risk-stratified workflow routing, for efficient investigation management and compliance accountability.
- Handling extreme class imbalance, which is typical in fraud datasets, with empirical evaluation on stratified transaction types.
1.2. Relationship to the Existing Research
1.3. Advantages Over Prior Work, Limitations, and Trade-Offs
- ITCF requires a held-out calibration set, potentially reducing training data.
- The framework currently uses LIME for explainability, although LIME can at times be unstable. Integration of more stable XAI methods is envisaged in future work.
- Computational overhead of LIME perturbations may challenge strict real-time constraints.
- Currently, the ITCF's coverage guarantees are marginal (averaged over all transactions) rather than class-conditional. Class-conditional conformal methods will therefore be explored in future work.
- Two ML ensemble models, namely Random Forest and XGBoost, were selected over Deep Learning (DL) architectures both for their computational efficiency and because tree-based ensembles tend to outperform DL on tabular data [31,32].
1.4. Outline of the Paper
2. Materials and Methods
2.1. Integrated XAI + Uncertainty Quantification Pipeline
- Conformal prediction identifies when the model is uncertain.
- LIME explains why uncertainty exists.
- Cases with high entropy are flagged for review.
- Explanations guide targeted analyst interventions.
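As an illustration of this flow, the sketch below routes a single transaction: the conformal region and entropy decide whether a LIME explanation is generated at all (the conditional-explanation idea from Section 1.1). It is a minimal sketch, not the paper's implementation: it assumes a fitted scikit-learn-style classifier, a `lime.lime_tabular.LimeTabularExplainer`, and a calibrated threshold `q_hat`; the entropy cutoff is an assumed value, since the paper does not publish its dashboard cutoff.

```python
import numpy as np

ENTROPY_THRESHOLD = 0.5  # assumed value; not reported in the paper

def conformal_region(probs, q_hat):
    # A class enters the region when its non-conformity score (1 - p) is
    # at or below the calibrated quantile threshold; [] signals "no confident label".
    return [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]

def route_transaction(model, explainer, x, q_hat):
    """Conditional explanation generation: LIME runs only for flagged cases."""
    probs = model.predict_proba(x.reshape(1, -1))[0]
    entropy = -np.sum(probs * np.log(probs + 1e-12))   # H = -sum_c p_c ln p_c
    region = conformal_region(probs, q_hat)
    if len(region) == 1 and entropy <= ENTROPY_THRESHOLD:
        return "auto-clear", None                      # confident: skip the LIME cost
    exp = explainer.explain_instance(x, model.predict_proba, num_features=7)
    return "analyst-review", exp.as_list()             # rationale for the analyst
```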
2.2. Designing and Developing the Integrated Transparency and Confidence Framework (ITCF)
Identification of High Uncertainty Cases
- Empty Regions: These occur when neither class’s non-conformity score is below the calculated quantile threshold. In such instances, the prediction region is empty, meaning the model cannot assign any label with the specified confidence. For both Random Forest and XGBoost, approximately 10% of the test cases resulted in empty regions (Random Forest: 126,510 cases, 9.94%; XGBoost: 126,234 cases, 9.92%). These are the most uncertain cases, as the model explicitly signals its inability to make a confident prediction. In a fraud detection context, these would be flagged for manual review.
- High Entropy: Entropy, calculated from the predicted probabilities as $H = -\sum_{c} p_c \ln p_c$, is another measure of uncertainty. Higher entropy values indicate that the model’s probability distribution across classes is more uniform, signifying less confidence in a single prediction. Instances with an entropy greater than a chosen threshold (as used in the integrated XAI dashboard) are considered high-risk or highly uncertain. These cases often correspond to situations where the model is almost equally likely to predict fraud or non-fraud.
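A minimal sketch of both checks in vectorized form, assuming a fitted classifier `model`, test features `X_test`, and the calibrated quantile threshold `q_hat` from Section 3.5; the entropy cutoff is again illustrative:

```python
import numpy as np

ENTROPY_THRESHOLD = 0.5  # illustrative cutoff; the dashboard's exact value is not given

probs = model.predict_proba(X_test)                       # shape (n_samples, 2)
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # H = -sum_c p_c ln p_c

# Empty region: even the most probable class has a non-conformity score
# (1 - p) above the calibrated quantile threshold q_hat.
empty_region = (1.0 - probs.max(axis=1)) > q_hat

# Flag for manual review: empty region or near-uniform probabilities
# (maximum binary entropy is ln 2 ~ 0.693, e.g. 0.6924 for probs 0.519/0.481).
flagged = empty_region | (entropy > ENTROPY_THRESHOLD)
```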
2.3. Dataset and Preprocessing
2.4. Models and Training
2.5. Explainability - LIME
2.6. Uncertainty Quantification (Split Conformal Prediction)
2.7. Evaluation Metrics
3. Results and Discussion
3.1. Predictive Performance of Random Forest and XGBoost
3.2. Uncertainty Quantification and Calibration
3.3. Interpretability with LIME Explanations
3.4. Effectiveness of the Proposed Integrated Transparency and Confidence Framework
3.4.1. Rigorous Identification of High-Uncertainty Transactions
3.4.2. LIME Explanations on High-Uncertainty Cases
3.4.3. Targeted Human Review and Efficient Analyst Resource Allocation
3.4.4. Statistically Guaranteed Uncertainty and Regulatory Alignment
- Systematically identifying and escalating high-risk cases;
- Enhancing interpretability in challenging scenarios;
- Enabling efficient, targeted human review;
- Providing reliable, auditable uncertainty and explanation metrics.
3.5. Split Conformal Prediction Implementation
- Proper Training Set: Used to train the base models (Random Forest and XGBoost) to predict the target variable (isFraud). This ensures that the model’s parameters are learned independently of the calibration and test sets.
- Calibration Set: Used to compute non-conformity scores on a held-out dataset. These scores determine the quantile threshold that later defines the prediction regions for new, unseen data, allowing a valid quantile to be calculated without peeking at the test data.
- Test Set: The entirely unseen dataset used to evaluate the final model and conformal prediction framework, ensuring an unbiased assessment of coverage and prediction set sizes.
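A compact sketch of this split-conformal procedure, assuming a feature matrix `X` and labels `y` (isFraud) already encoded, and the non-conformity score 1 − P(true class) implied by the reported thresholds; the 60/20/20 split proportions are illustrative, since the paper's exact percentages are not shown above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

ALPHA = 0.10  # target 90% marginal coverage, matching the paper

# Three-way split: proper training / calibration / test (proportions illustrative).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

model = XGBClassifier(eval_metric="logloss").fit(X_train, y_train)

# Non-conformity score on D_calib: 1 - P(true class).
calib_probs = model.predict_proba(X_calib)
scores = 1.0 - calib_probs[np.arange(len(y_calib)), np.asarray(y_calib)]

# Finite-sample-corrected quantile threshold q_hat.
n = len(scores)
q_level = min(np.ceil((n + 1) * (1 - ALPHA)) / n, 1.0)
q_hat = np.quantile(scores, q_level, method="higher")

# Prediction regions on D_test and empirical coverage.
test_probs = model.predict_proba(X_test)
regions = [[c for c in (0, 1) if 1.0 - p[c] <= q_hat] for p in test_probs]
coverage = np.mean([label in r for label, r in zip(np.asarray(y_test), regions)])
```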
3.5.1. Non-Conformity Scores and Prediction Regions
3.5.2. Target Confidence Level
3.6. Dashboard Visualization and Operational Impact
3.6.1. Key Features of the ITCF Dashboard
- Unified View: Presents at-a-glance metrics such as accuracy, precision, recall, F1-score, and AUC for model performance tracking (see Table 2).
- Uncertainty Heatmaps: Visualize prediction set sizes, conformal coverage, and entropy distributions, enabling users to quickly identify clusters of high-uncertainty transactions (see Table 5). This supports early flagging of cases that require further attention.
- Feature Attribution Panels: Aggregated and instance-level LIME explanations are displayed, highlighting which features most influence the model’s predictions for both typical and ambiguous cases. Figure 1 shows an example of aggregated feature importance for fraud detection.
- Analyst Triage Workflow: The dashboard generates prioritized queues for analyst review, automatically escalating transactions with high uncertainty or ambiguous model explanations. Table 6 outlines the review pathways enabled by the ITCF.
- Regulatory and Audit Log: Every flagged transaction and its associated explanations and uncertainty scores are logged, ensuring full traceability and compliance with regulatory requirements.
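As an illustration of the audit-log feature, the sketch below assembles one traceability record; the schema and field names are assumptions for the sketch, not the paper's actual format, and the example LIME rule is hypothetical:

```python
import json
from datetime import datetime, timezone

def audit_record(tx_id, prediction, region, entropy, lime_features):
    """Assemble one traceability entry for a flagged transaction."""
    return json.dumps({
        "transaction_id": tx_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prediction": prediction,            # model label: 0 = no fraud, 1 = fraud
        "conformal_region": region,          # e.g. [] (empty) or [1] (singleton)
        "entropy": round(float(entropy), 4),
        "top_lime_features": lime_features,  # [(feature_rule, weight), ...]
        "routing": "analyst_queue" if len(region) != 1 else "auto_clear",
    })

# Example: an empty-region case escalated with its LIME rationale attached
# (the feature rule and weight here are invented for illustration).
print(audit_record("tx-17257", 0, [], 0.6924, [("amount > 2.1e5", 0.41)]))
```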
3.6.2. Operational Impact
- Proactive Risk Management: Identifying high-risk and uncertain transactions enables timely intervention and reduces the likelihood of undetected fraud.
- Explainability for Human Analysts: Alignment of model explanations with known fraud behavior patterns builds trust and helps analysts understand the rationale behind each flagged case.
- Efficient Resource Allocation: Calibrated uncertainty regions enable accurate prioritization, so analysts can focus on the most uncertain cases instead of reviewing every model alert. This improves efficiency and the quality of decisions.
- Regulatory Readiness: The dashboard’s combined interpretability and uncertainty features offer the transparency, traceability, and auditability needed in regulated financial settings, supporting compliance and external audits.
3.7. Comparison of the ITCF with Prior Research Work
3.8. Key Insights
- First integration of LIME + Conformal Prediction in fraud detection: The proposed framework is the first to combine LIME and Conformal Prediction for financial fraud detection, directly filling a gap in the current research.
- Extreme imbalance: The PaySim dataset is highly imbalanced, with a non-fraud to fraud ratio of 773.70:1, far more extreme than the datasets used in previous CP studies (e.g., 10:1).
- Fraud-specific uncertainty insights: Fraud cases showed much higher mean entropy (0.1415 for XGBoost, versus 0.0010 for non-fraud) and are a major source of empty prediction regions, indicating cases where the model is unsure and human review or intervention is needed.
- Accountability workflow demonstration: The framework shows how uncertainty quantification can be used in practice by sending high-uncertainty cases to human analysts, thereby improving oversight and responsibility.
Comparing Fraud vs. Non-Fraud Uncertainty for XGBoost
- Fraud Cases: Showed a higher mean entropy (0.1415) and a lower mean maximum probability (0.9374) compared to non-fraud cases. This indicates that fraud predictions are inherently more uncertain than non-fraud predictions, which is common in highly imbalanced datasets where the minority class is harder to predict with high confidence.
- Non-Fraud Cases: Exhibited very low mean entropy (0.0010) and very high mean maximum probability (0.9998), confirming the model’s high confidence in predicting the majority non-fraud class.
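These per-class summaries can be reproduced directly from the model's test-set probabilities; a minimal sketch, assuming a fitted XGBoost `model` and arrays `X_test`, `y_test` from the split above:

```python
import numpy as np
import pandas as pd

probs = model.predict_proba(X_test)               # shape (n_samples, 2)
summary = pd.DataFrame({
    "is_fraud": np.asarray(y_test),
    "entropy": -np.sum(probs * np.log(probs + 1e-12), axis=1),
    "max_prob": probs.max(axis=1),
}).groupby("is_fraud")[["entropy", "max_prob"]].mean()

# Expected pattern (cf. the table in Section 3.11): fraud rows show higher
# mean entropy (~0.1415) and lower mean max probability (~0.9374) than
# non-fraud rows (~0.0010 and ~0.9998).
print(summary)
```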
3.9. Quantile Thresholds for Non-Conformity Scores
3.10. Prediction Regions and Coverage
3.11. Uncertainty Measures by Class (XGBoost)
3.12. Examples of High and Low Uncertainty
3.12.1. High-Uncertainty Fraud Cases
3.12.2. Low-Uncertainty Fraud Cases
3.12.3. High-Uncertainty Non-Fraud Cases
3.13. Summary of Model Performance
- a classification prediction;
- a conformal prediction region quantifying uncertainty; and
- a LIME explanation for interpretability.
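A sketch of how these three outputs can be assembled for one transaction, assuming the fitted `model`, the calibrated `q_hat` from the split-conformal step, and the PaySim feature set from the feature-importance table; treating the categorical `type` column as numeric here is a simplification:

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

FEATURES = ["step", "type", "amount", "oldbalanceOrg",
            "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]

explainer = LimeTabularExplainer(
    X_train, feature_names=FEATURES,
    class_names=["no_fraud", "fraud"], mode="classification")

x = X_test[0]
probs = model.predict_proba(x.reshape(1, -1))[0]

prediction = int(np.argmax(probs))                         # 1) class label
region = [c for c in (0, 1) if 1.0 - probs[c] <= q_hat]    # 2) conformal region
explanation = explainer.explain_instance(                  # 3) LIME rationale
    x, model.predict_proba, num_features=len(FEATURES)).as_list()
```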
4. Conclusions
- Accurate fraud detection with high coverage using advanced ensemble models (XGBoost and Random Forest);
- Systematic identification and prioritization of transactions with high uncertainty, lowering the risk of overconfident mistakes;
- Clear, case-specific explanations that help analysts make better decisions;
- Efficient use of investigative resources by focusing expert attention on the most uncertain and high-risk cases;
- Complete transparency and auditability, supporting compliance with strict financial regulations.
Funding
Data Availability Statement
Abbreviations
| Abbreviation | Meaning |
|---|---|
| XAI | Explainable Artificial Intelligence |
| LIME | Local Interpretable Model-agnostic Explanations |
| CP | Conformal Prediction |
| UQ | Uncertainty Quantification |
| RF | Random Forest |
| XGB | XGBoost (Extreme Gradient Boosting) |
| AUC | Area Under the Curve |
| ROC | Receiver Operating Characteristic |
| NC | Non-Conformity |
| ML | Machine Learning |
| AI | Artificial Intelligence |
| TP | True Positive |
| TN | True Negative |
| FP | False Positive |
| FN | False Negative |
| AML | Anti-Money Laundering |
| Dtrain | Training Dataset |
| Dcalib | Calibration Dataset |
| Dtest | Test Dataset |
| FSCA | Financial Sector Conduct Authority |
| PA | Prudential Authority |
| ITCF | Integrated Transparency and Confidence Framework |
| SHAP | SHapley Additive ExPlanations |
References
- Goecks, L.S.; Korzenowski, A.L.; Gonçalves Terra Neto, P.; de Souza, D.L.; Mareth, T. Anti-money laundering and financial fraud detection: A systematic literature review. Intelligent Systems in Accounting, Finance and Management 2022, 29, 71–85.
- Levi, M. Money for Crime and Money from Crime: Financing Crime and Laundering Crime Proceeds. European Journal on Criminal Policy and Research 2017, 23, 339–350.
- Unger, B. The Scale and Impacts of Money Laundering; Edward Elgar Publishing, 2013.
- Ajagbe, S.A.; Majola, S.; Mudali, P. Comparative analysis of machine learning algorithms for money laundering detection. Discover Artificial Intelligence 2025, 5, 144.
- Ngai, E.W.T.; Hu, Y.; Wong, Y.H.; Chen, Y.; Sun, X. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems 2011, 50, 559–569.
- West, J.; Bhattacharya, M. Intelligent financial fraud detection: A comprehensive review. Computers & Security 2016, 57, 47–66.
- Breiman, L. Random forests. Machine Learning 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- Dal Pozzolo, A.; Boracchi, G.; Caelen, O.; Alippi, C.; Bontempi, G. Credit card fraud detection: A realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems 2018, 29, 3784–3797.
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
- Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 2017; Vol. 30.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- European Commission. Proposal for a Regulation on Artificial Intelligence (Artificial Intelligence Act). 2021.
- Samek, W.; Wiegand, T.; Müller, K.R. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. ITU Journal: ICT Discoveries 2017, 1, 39–48.
- Barber, R.F.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. Predictive inference with the jackknife+. Annals of Statistics 2021, 49, 486–507.
- Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer, 2005.
- Bastos, J.A. Conformal prediction of option prices. Expert Systems with Applications 2024, 245, 123087.
- Johansson, U.; Boström, H.; Löfström, T. Conformal prediction in financial applications. In Proceedings of Conformal and Probabilistic Prediction and Applications; PMLR, 2017; pp. 209–225.
- Papadopoulos, H.; Proedrou, K.; Vovk, V.; Gammerman, A. Inductive confidence machines for regression. In Machine Learning: ECML 2002; Springer, 2002; pp. 345–356.
- Shafer, G.; Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research 2008, 9, 371–421.
- Bhatt, U.; Xiang, A.; Sharma, S.; Weller, A.; Taly, A.; et al. Explainable machine learning in deployment. In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 2020.
- Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 2019, 1, 206–215.
- Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Computing Surveys 2018, 51, 1–42.
- Chaddad, A.; Peng, J.; Xu, J.; Bouridane, A. Survey of explainable AI techniques in healthcare. Sensors 2023, 23, 634.
- Dwivedi, R.; Dave, D.; Naik, H.; Singhal, S.; Omer, R.; Patel, P.; Qian, B.; Wen, Z.; Shah, T.; Morgan, G.; et al. Explainable AI (XAI): Core ideas, techniques, and solutions. ACM Computing Surveys 2023, 55, 1–33.
- Schwalbe, G.; Finzel, B. A comprehensive taxonomy for explainable artificial intelligence: A systematic survey of surveys on methods and concepts. Data Mining and Knowledge Discovery 2024, 38, 3043–3101.
- He, W.; Jiang, Z.; Xiao, T.; Xu, Z.; Li, Y. A survey on uncertainty quantification methods for deep learning. ACM Computing Surveys 2025.
- Kabir, H.D.; Khosravi, A.; Hosen, M.A.; Nahavandi, S. Neural network-based uncertainty quantification: A survey of methodologies and applications. IEEE Access 2018, 6, 36218–36234.
- Shi, Y.; Wei, P.; Feng, K.; Feng, D.C.; Beer, M. A survey on machine learning approaches for uncertainty quantification of engineering systems. Machine Learning for Computational Science and Engineering 2025, 1, 11.
- Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems 2022, 35, 507–520.
- Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Information Fusion 2022, 81, 84–90.
- Nouretdinov, I.; Gammerman, A.; Vovk, V. Machine learning classification with confidence: Application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. NeuroImage 2011, 56, 1508–1517.
- Bhattacharyya, S.; Jha, S.; Tharakunnel, K.; Westland, J.C. Data mining for credit card fraud: A comparative study. Decision Support Systems 2011, 50, 602–613.


| Aspect | ITCF | Prior Methods |
|---|---|---|
| Explainability | Feature-level, case-specific explanations combined with uncertainty | XAI without uncertainty; static explanations |
| Uncertainty | Formal, distribution-free coverage guarantees | Bayesian or ensemble methods without formal guarantees |
| Operationalization | Workflow routing based on confidence and explanations | Binary flags, no prioritization |
| Computational Efficiency | Conditional explanation generation | Explanations generated for all predictions |
| Metric | Random Forest | XGBoost |
|---|---|---|
| Accuracy | 0.999683 | 0.999705 |
| Precision | 0.980620 | 0.960756 |
| Recall | 0.769933 | 0.804626 |
| F1-Score | 0.862598 | 0.875787 |
| AUC-ROC | 0.999232 | 0.998545 |
| Model | Conformal Coverage | Avg Prediction Set Size | Uncertainty Quantification Score | Calibration Quality |
|---|---|---|---|---|
| XGBoost | 0.9 | 1.05 | 0.95 | 1.0 |
| Random Forest | 0.9 | 1.10 | 0.90 | 1.0 |
| Rank | Feature | Importance |
|---|---|---|
| 1 | amount | 1.352244 |
| 2 | oldbalanceOrg | 1.319357 |
| 3 | type | 0.390689 |
| 4 | newbalanceOrig | 0.196532 |
| 5 | newbalanceDest | 0.184403 |
| 6 | oldbalanceDest | 0.171549 |
| 7 | step | 0.115079 |
| Model | Empty Regions | Single-Class Regions | Ambiguous Regions | Coverage | Avg. Region Size |
|---|---|---|---|---|---|
| Random Forest | 126,510 (9.94%) | 1,146,014 (90.06%) | 0 (0.00%) | 0.9006 | 0.9006 |
| XGBoost | 126,234 (9.92%) | 1,146,290 (90.08%) | 0 (0.00%) | 0.9008 | 0.9008 |
| Pathway | Description |
|---|---|
| High Certainty, Clear Explanation | Transaction processed automatically; no review required |
| High Uncertainty, Clear Explanation | Analyst reviews transaction using actionable LIME rationale |
| High Uncertainty, Ambiguous Features | Analyst conducts detailed investigation with available insights |
| Study | Dataset | Method | Coverage | Avg Set Size | XAI+CP |
|---|---|---|---|---|---|
| Reference [16] | Balanced | CP | 90% | 1.10 | ✗ |
| Reference [12] | ImageNet | LIME | N/A | N/A | ✗ |
| Reference [33] | Credit (10:1) | CP | 91% | 1.15 | ✗ |
| Reference [34] | Network Intrusion | CP | 89% | 1.08 | ✗ |
| ITC Framework | PaySim (773.70:1) | CP + LIME | 90.08% | 0.9008 | ✔ |
| Model | Quantile Threshold |
|---|---|
| Random Forest | 0.000150 |
| XGBoost | 0.000034 |
| Class | Mean Entropy | Mean Max Probability | Ambiguous Regions | Count |
|---|---|---|---|---|
| Fraud | 0.1415 | 0.9374 | 0 / 1,643 | 1,643 |
| Non-Fraud | 0.0010 | 0.9998 | 0 / 1,270,881 | 1,270,881 |
| Index | Prob (No Fraud) | Prob (Fraud) | Entropy | Region |
|---|---|---|---|---|
| 17257 | 0.5190 | 0.4810 | 0.6924 | [] |
| 25075 | 0.7445 | 0.2555 | 0.5683 | [] |
| 54453 | 0.5481 | 0.4519 | 0.6885 | [] |
| 55703 | 0.3186 | 0.6814 | 0.6258 | [] |
| 57842 | 0.2722 | 0.7278 | 0.5854 | [] |
| Index | Prob (No Fraud) | Prob (Fraud) | Entropy | Region |
|---|---|---|---|---|
| 630 | 0.0164 | 0.9836 | 0.0839 | [] |
| 901 | 0.0000 | 1.0000 | 0.0000 | [1] |
| 1651 | 0.0000 | 1.0000 | 0.0000 | [1] |
| 2046 | 0.0000 | 1.0000 | 0.0000 | [1] |
| 2278 | 0.0000 | 1.0000 | 0.0000 | [1] |
| Index | Prob (No Fraud) | Prob (Fraud) | Entropy | Region |
|---|---|---|---|---|
| 21450 | 0.6783 | 0.3217 | 0.6281 | [] |
| 22423 | 0.7193 | 0.2807 | 0.5936 | [] |
| 37408 | 0.7574 | 0.2426 | 0.5540 | [] |
| 51254 | 0.6338 | 0.3662 | 0.6569 | [] |
| 52649 | 0.7451 | 0.2549 | 0.5677 | [] |
| Metric | Random Forest | XGBoost |
|---|---|---|
| Coverage (Validity) | 0.9006 | 0.9008 |
| Empty (Uncertain) Regions | 9.94% | 9.92% |
| Ambiguous Regions | 0 | 0 |
| Mean Entropy (Fraud) | – | 0.1415 |
| Mean Entropy (Non-Fraud) | – | 0.0010 |
| Mean Max Probability (Fraud) | – | 0.9374 |
| Mean Max Probability (Non-Fraud) | – | 0.9998 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).