Submitted: 10 June 2025
Posted: 11 June 2025
Abstract
Keywords:
Literature Review
Early Detection Strategies Using Supervised Learning
Prompt Engineering and Contextual Retrieval in RAG Systems
Bridging Structured ML and Language-Based Reasoning
Methods
Participants
Materials and Measures
Demographic Variables
Cash Game Measures
Tournament Measures
Deposit Measures
Withdrawal Measures
Feature Engineering
Procedure
Data Acquisition and Access
Data Preprocessing
Feature Engineering
- Total number of cash game sessions (‘engage_cash_Windows_count’)
- Maximum number of cash game sessions in a single month (‘engage_cash_Windows_max_month’)
- Total number of tournaments played (‘engage_tourn_Trnmnts_count’)
- Maximum number of tournaments in a single month (‘engage_tourn_Trnmnts_max_month’)
- Average amount of money staked per tournament (‘engage_tourn_StakesT_mean’)
- Total amount deposited across the study period (‘monetary_deposit_Amount_sum’)
- Total number of deposits made (‘monetary_deposit_Amount_count’)
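The aggregates listed above can be illustrated with a short sketch. This is not the authors' preprocessing code: the raw record layout (tuples of user id, date, and amount or stake) and the function names `deposit_features` and `tournament_features` are assumptions made for illustration, and only the output feature names come from the list above.

```python
from collections import Counter, defaultdict
from datetime import date


def deposit_features(deposits):
    """Aggregate deposits per user.

    deposits: iterable of (user_id, date, amount) tuples (hypothetical schema).
    """
    totals = defaultdict(float)
    counts = Counter()
    for user_id, _day, amount in deposits:
        totals[user_id] += amount
        counts[user_id] += 1
    return {
        user_id: {
            "monetary_deposit_Amount_sum": totals[user_id],
            "monetary_deposit_Amount_count": counts[user_id],
        }
        for user_id in totals
    }


def tournament_features(entries):
    """Aggregate tournament play per user.

    entries: iterable of (user_id, date, stake) tuples (hypothetical schema).
    """
    per_month = defaultdict(Counter)  # user -> {(year, month): tournaments played}
    stakes = defaultdict(list)        # user -> list of stakes, one per tournament
    for user_id, day, stake in entries:
        per_month[user_id][(day.year, day.month)] += 1
        stakes[user_id].append(stake)
    return {
        user_id: {
            "engage_tourn_Trnmnts_count": sum(months.values()),
            "engage_tourn_Trnmnts_max_month": max(months.values()),
            "engage_tourn_StakesT_mean": sum(stakes[user_id]) / len(stakes[user_id]),
        }
        for user_id, months in per_month.items()
    }


# Toy records, invented for this example only.
deposits = [
    (1, date(2020, 1, 5), 50.0),
    (1, date(2020, 1, 20), 25.0),
    (2, date(2020, 2, 1), 10.0),
]
entries = [
    (1, date(2020, 1, 3), 5.0),
    (1, date(2020, 1, 9), 15.0),
    (1, date(2020, 2, 2), 10.0),
]

dep = deposit_features(deposits)
tourn = tournament_features(entries)
print(dep[1])
print(tourn[1])
```

The cash game features (`engage_cash_Windows_count`, `engage_cash_Windows_max_month`) would follow the same count and per-month maximum pattern as the tournament features, applied to session records instead of tournament entries.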
Modeling Pipeline
XGBoost Classifier
SHAP (SHapley Additive exPlanations)
- Total amount deposited across the study period (‘monetary_deposit_Amount_sum’)
- Total number of cash game sessions (‘engage_cash_Windows_count’)
- Total number of tournaments played (‘engage_tourn_Trnmnts_count’)
Retrieval-Augmented Generation (RAG)
Labeling Strategy
Evaluation Process
Filter Features and Label Leakage
Random Seeds
Cross-Validation
Optuna Hyperparameter Tuning
Model Evaluation Metrics
SHAP Integration

Tools and Environment
Results
Performance
Impact of Labeling Logic
SHAP Feature Stability
- monetary_deposit_Amount_max_month: With a median SHAP of 0.9257 and variance of 0.1467, this feature consistently ranks at the top, demonstrating that high deposit volumes within a single month are a dominant signal of problematic gambling behavior.
- engage_tourn_Trnmnts_count: With a median SHAP of 0.7151 and variance of 0.1597, the total number of tournaments played is a stable and highly impactful feature.
- engage_cash_Windows_count: With a median SHAP of 0.5231 and variance of 0.1761, the total number of cash game sessions has slightly less impact than tournaments and deposits, but its clear behavioral interpretation justifies its inclusion in addiction identification.
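One way to obtain a median and variance per feature, as reported above, is to repeat training under several random seeds and summarize each feature's importance across the runs. The sketch below assumes each run has already been reduced to one importance score per feature (for example, the mean absolute SHAP value); that reduction, the use of the sample variance, the function name `shap_stability`, and the toy numbers are all assumptions rather than details confirmed by the paper.

```python
from statistics import median, variance


def shap_stability(importance_by_seed):
    """Summarize per-feature importance across repeated runs.

    importance_by_seed: {seed: {feature_name: importance score for that run}}
    Returns features ordered by descending median importance.
    """
    features = set().union(*(run.keys() for run in importance_by_seed.values()))
    summary = {}
    for feature in features:
        values = [run[feature] for run in importance_by_seed.values() if feature in run]
        summary[feature] = {
            "median": median(values),
            # Sample variance needs at least two runs; fall back to 0.0 otherwise.
            "variance": variance(values) if len(values) > 1 else 0.0,
        }
    return dict(sorted(summary.items(), key=lambda kv: kv[1]["median"], reverse=True))


# Toy scores for three seeds, invented for this example (not the paper's results).
runs = {
    0: {"monetary_deposit_Amount_max_month": 0.9, "engage_tourn_Trnmnts_count": 0.7},
    1: {"monetary_deposit_Amount_max_month": 1.1, "engage_tourn_Trnmnts_count": 0.6},
    2: {"monetary_deposit_Amount_max_month": 1.0, "engage_tourn_Trnmnts_count": 0.8},
}

stability = shap_stability(runs)
for name, stats in stability.items():
    print(name, stats)
```

A feature with a high median and a low variance keeps its rank regardless of the seed, which is the sense of "stable" used in the list above.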
- A positive logit leads to a probability above 0.5, suggesting a prediction in favor of addiction.
- A negative logit results in a probability below 0.5, indicating a prediction against addiction.
- The further from zero the logit is, the stronger the model’s confidence.
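The three rules above follow from the logistic (sigmoid) function, which maps a logit to a probability. The sketch below assumes the usual additive SHAP setup for a binary classifier explained in log-odds space, where the logit is the base value plus the sum of a user's SHAP values; the function names and the numbers are illustrative and do not come from the paper.

```python
import math


def logit_to_probability(logit):
    """Numerically stable sigmoid: maps a log-odds value to a probability."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def predict_from_shap(base_value, shap_values):
    """Rebuild the logit as base value plus SHAP contributions, then classify.

    Returns (logit, probability, predicted_positive).
    """
    logit = base_value + sum(shap_values)
    probability = logit_to_probability(logit)
    return logit, probability, probability > 0.5


# Illustrative values only: a base value of -0.5 and three feature contributions.
logit, probability, predicted_positive = predict_from_shap(-0.5, [0.9, 0.7, -0.2])
print(round(logit, 4), round(probability, 4), predicted_positive)
```

A logit of exactly zero gives a probability of 0.5, and the probability moves toward 0 or 1 as the logit moves further from zero in either direction.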
Demographic Patterns
RAG Classification Performance
- User 985 was classified as addicted based on high tournament activity, stakes, and deposit levels, despite SHAP indicating cash windows as the main contributor.
- User 22095 was predicted as not addicted due to minimal behavioral engagement and universally negative SHAP values.
- User 69225 was also labeled addicted, showing strong engagement across all domains: frequent deposits, high tournament activity, and consistent cash play, aligning with commonly observed addiction patterns.
Conclusions
Discussion
Limitations
Future Directions
Data Availability Statement
Acknowledgements
Appendix A



References
- Aalbers, T., McKenna, B., & Hing, N. (2022). Predicting gambling problems using player tracking data: A systematic review. Addiction, 117(3), 565–579.
- Auer, M., & Griffiths, M. D. (2015). Testing normative and self-appraisal feedback in an online slot-machine pop-up in a real-world setting. Frontiers in Psychology, 6, 339.
- Belle, V., & Papantonis, I. (2021). Principles and practice of explainable machine learning. Frontiers in Big Data, 4, 688969.
- Braverman, J., LaPlante, D. A., Nelson, S. E., & Shaffer, H. J. (2014). Using cross-game behavioral markers for early identification of high-risk Internet gamblers. Psychology of Addictive Behaviors, 28(2), 268–274.
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
- Chen, Y., Yan, L., Sun, W., Ma, X., Zhang, Y., Yin, B., & Yin, X. (2025). Improving retrieval-augmented generation through multi-agent reinforcement learning. arXiv. https://arxiv.org/abs/2501.15228.
- Division on Addiction. (2022). Second session at the virtual poker table: A contemporary study of actual online poker activity. The Transparency Project, Cambridge Health Alliance.
- Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
- Dreesen, J., Nuyens, F., Billieux, J., & Maurage, P. (2018). Impaired inhibitory control in problematic gamblers: Evidence from a systematic review and meta-analysis. Neuroscience & Biobehavioral Reviews, 84, 138–148.
- Gainsbury, S. M., Hing, N., Delfabbro, P. H., & King, D. L. (2015). A taxonomy of gambling risk factors: Theoretical, empirical, and methodological considerations. Addiction Research & Theory, 23(6), 457–472.
- Goyal, A., Friesen, A. L., Weber, T., Badia, A. P., & Blundell, C. (2022). Retrieval-augmented reinforcement learning. In Proceedings of the 39th International Conference on Machine Learning (pp. 5454–5464). PMLR. https://proceedings.mlr.press/v162/goyal22a/goyal22a.pdf.
- Hing, N., Russell, A. M. T., Browne, M., & Rockloff, M. (2019). Predicting gambling problems from electronic gaming machine play: Identifying behaviour markers using account-based data. Addiction, 114(5), 917–925.
- Huang, J., Madala, S., Sidhu, R., Niu, C., Hockenmaier, J., & Zhang, T. (2025). RAG-RL: Advancing retrieval-augmented generation via reinforcement learning and curriculum learning. arXiv. https://arxiv.org/abs/2503.12759.
- Izacard, G., & Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 874–880).
- King, D. L., Delfabbro, P. H., & Griffiths, M. D. (2020). Cognitive-behavioral approaches to behavioral addictions: A conceptual review. International Journal of Mental Health and Addiction, 18, 15–34.
- Legrand, J., & Zhang, L. (2024). Towards interpretable gambling addiction prediction: A SHAP-enhanced XGBoost framework. Journal of Behavioral Data Science, 3(2), 101–115.
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Riedel, S. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.
- Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (pp. 4765–4774).
- Lundberg, S. M., Erion, G., & Lee, S.-I. (2020). From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2, 252–259.
- Perrot, B., Phan, T. H., Aubin, H. J., & Simon, O. (2018). Predicting self-exclusion among online poker players: Developing a model to predict at-risk gamblers. International Gambling Studies, 18(3), 378–392.
- Philander, K. S. (2014). Consumer spending at gambling establishments: Evidence from Canada. International Gambling Studies, 14(3), 338–357.
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144).
- Yin, L., Pruksachatkun, Y., Wallace, B. C., & Bansal, M. (2023). Large language models are zero-shot clinical information extractors. Nature Communications, 14, 489.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).