Submitted: 20 February 2025
Posted: 24 February 2025
Abstract

Keywords:
1. Introduction
2. Methodology
2.1. Dataset and Pre-Processing
- First, categorical variables such as sex, smoker, and region were converted into numerical form using one-hot encoding.
- Next, numerical features such as age, BMI, and the number of children were standardized to ensure consistency across scales.
- Finally, the dataset was divided into training, validation, and test sets using five different random seeds, so that the reported results do not depend on a single split.
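The three steps above can be sketched as follows. This is a minimal illustration, not the authors' exact code: it assumes the column names of the Kaggle Medical Cost Personal dataset [10] (age, sex, bmi, children, smoker, region, charges) and uses a tiny synthetic sample in place of the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny illustrative sample standing in for the full dataset.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 60],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 25.3, 33.0, 29.5],
    "children": [0, 1, 2, 3, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southwest", "northeast"],
    "charges": [16884.92, 4449.46, 8606.22, 24667.50, 3866.86, 13224.06],
})

# Step 1: one-hot encode the categorical variables.
X = pd.get_dummies(df.drop(columns="charges"),
                   columns=["sex", "smoker", "region"])
y = df["charges"]

# Step 2: standardize the numerical features (zero mean, unit variance).
for col in ["age", "bmi", "children"]:
    X[col] = (X[col] - X[col].mean()) / X[col].std()

# Step 3: split into train/validation/test for one of the five seeds.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=313718)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=313718)
```

The split proportions (60/20/20) are an assumption for illustration; the paper's exact ratios may differ.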
2.2. Machine Learning Models
- Ridge Regression: This is a simple model that finds patterns using a straight-line (linear regression) approach. Unlike basic linear regression, however, Ridge Regression includes a penalty term to prevent overfitting [3]. Overfitting happens when a model memorizes the training data instead of learning general patterns, which can lead to poor performance on new data. The penalty helps control this by ensuring the model does not rely too heavily on any single feature.
- Random Forest: This model is made up of multiple decision trees, which are like flowcharts that split the data into smaller groups to make predictions. Instead of using just one tree, Random Forest combines many trees and takes an average of their predictions [6]. This method reduces errors and makes the model more stable, meaning it performs better on new data. Because Random Forest uses multiple trees, it is less likely to overfit compared to a single decision tree.
- XGBoost: This is an advanced model that builds on the idea of decision trees but takes it a step further. It works by making a series of trees, where each tree learns from the mistakes of the previous one. This process, called "boosting," allows the model to continuously improve its predictions by correcting errors along the way. XGBoost is known for being fast and highly accurate, making it one of the best models for complex datasets like insurance cost prediction.
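The three model families can be sketched with scikit-learn as follows. This is a hedged illustration on placeholder data, not the paper's pipeline: scikit-learn's GradientBoostingRegressor stands in for XGBoost (the same boosting idea, where each tree corrects the residual errors of the previous ones), and all hyperparameter values shown are defaults, not the tuned values reported later.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Placeholder regression data standing in for the insurance features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

models = {
    # Linear model with an L2 penalty (alpha) on the coefficients.
    "Ridge Regression": Ridge(alpha=1.0),
    # Average of many decision trees grown on random subsets.
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Boosting: each new tree fits the errors left by the previous trees.
    "Gradient Boosting (XGBoost stand-in)": GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, random_state=0),
}

# Fit each model and record its R² on the training data.
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```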
2.3. Hyperparameter Optimization with Optuna
2.4. Model Evaluation Metrics
- R² Score: Shows how well the model explains the variation in insurance charges. A higher value means better predictions.
- Root Mean Squared Error (RMSE): Measures how far the predictions are from the actual values. A lower RMSE means more accurate predictions.
- Mean Absolute Error (MAE): Tells us the average difference between predicted and actual values. Lower values mean better accuracy.
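The three metrics can be computed with scikit-learn as below. The predicted values here are hypothetical, chosen only to show the calculation, not outputs from the paper's models.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical actual and predicted insurance charges.
y_true = np.array([16884.92, 4449.46, 8606.22, 24667.50])
y_pred = np.array([15000.00, 5000.00, 9000.00, 23000.00])

r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
```

Note that RMSE is always at least as large as MAE, and the gap between them widens when a few predictions are far off.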
2.5. SHAP Explainability and Feature Selection
3. Results
4. Discussion
5. Conclusion
6. Code and Data Availability
References
- Mehrabi, N.; Gowda, S.N.; Morstatter, F. Predicting Medical Costs Using Machine Learning Approaches: A Case Study on Healthcare Claims Data. Health Informatics Journal 2021, 27, 14604582211058018.
- Patel, J.; Doshi, R. Medical Insurance Cost Prediction Using Machine Learning. International Journal for Research in Applied Science & Engineering Technology (IJRASET) 2020.
- Sharma, R.; Singh, A. A Comparative Study of Machine Learning Algorithms for Health Insurance Cost Prediction. International Journal of Computer Science and Information Security 2021.
- Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems (NIPS), 2017.
- Zhang, H.; Wang, Y. Insurance Claim Cost Prediction Using Ensemble Machine Learning Models. IEEE Transactions on Computational Intelligence and AI in Healthcare 2022.
- McCoy, T.H.; Perlis, R.H.; Ghosh, S. Deep Learning for Prediction of Population Health Costs. Nature Medicine 2020.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?" Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- Kaur, H.; Kumari, V. Health Insurance Cost Prediction Using Deep Neural Network. Journal of Big Data 2020.
- Orji, U.; Ukwandu, E. Machine Learning for an Explainable Cost Prediction of Medical Insurance. arXiv preprint 2023, arXiv:2311.14139.
- Choi, M. Medical Cost Personal Datasets, 2018. Kaggle Dataset.


| Model | Seed | R² Score | RMSE | MAE |
| --- | --- | --- | --- | --- |
| Ridge Regression | 313718 | 0.7132 | 7108.67 | 5177.36 |
| Random Forest | 313718 | 0.8198 | 5635.96 | 2984.42 |
| XGBoost | 313718 | 0.8147 | 5714.82 | 2960.39 |
| Ridge Regression | 456789 | 0.7581 | 5791.69 | 4098.25 |
| Random Forest | 456789 | 0.8786 | 4102.73 | 2325.73 |
| XGBoost | 456789 | 0.8875 | 3950.28 | 2228.08 |
| Ridge Regression | 567890 | 0.7079 | 6067.57 | 4326.31 |
| Random Forest | 567890 | 0.8567 | 4249.98 | 2496.16 |
| XGBoost | 567890 | 0.8655 | 4117.20 | 2348.43 |
| Ridge Regression | 678901 | 0.7310 | 6053.31 | 4508.03 |
| Random Forest | 678901 | 0.8483 | 4546.45 | 2534.06 |
| XGBoost | 678901 | 0.8498 | 4523.46 | 2410.98 |
| Ridge Regression | 789012 | 0.7708 | 5665.55 | 4009.71 |
| Random Forest | 789012 | 0.8683 | 4294.57 | 2346.07 |
| XGBoost | 789012 | 0.8778 | 4136.59 | 2333.88 |
| Model | R² Score | RMSE | MAE |
| --- | --- | --- | --- |
| Ridge Regression | 0.7310 | 6053.31 | 4326.31 |
| Random Forest | 0.8567 | 4294.57 | 2496.16 |
| XGBoost | 0.8655 | 4136.59 | 2348.43 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).