Submitted: 22 July 2025
Posted: 23 July 2025
Abstract
Keywords:
1. Introduction
2. Background
3. Related Work
4. Methodology
4.1. Dataset Source
4.2. Dataset Description
| Feature Name | Datatype | Description |
| age | Numerical | Age of the person |
| sex | Categorical | Gender (male or female) |
| bmi | Numerical | Body Mass Index, a measure of body fat |
| children | Numerical | Number of children covered by the insurance |
| smoker | Categorical | Whether the person smokes (yes or no) |
| region | Categorical | Area of residence in the U.S. |
| charges | Numerical | Medical insurance cost (target variable) |
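The schema in the table above can be checked at load time. The following is a minimal sketch using only the standard library; the function name `load_records`, the sample rows, and the assumption that the file is a CSV with exactly these seven headers are illustrative and not taken from the paper.

```python
import csv
import io

# Column names and types follow the dataset description table.
EXPECTED_COLUMNS = ("age", "sex", "bmi", "children", "smoker", "region", "charges")
INT_COLUMNS = ("age", "children")
FLOAT_COLUMNS = ("bmi", "charges")
CATEGORICAL_COLUMNS = ("sex", "smoker", "region")


def load_records(handle):
    """Read CSV rows and coerce each field to the type listed in the table."""
    reader = csv.DictReader(handle)
    if tuple(reader.fieldnames or ()) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    records = []
    for row in reader:
        record = {}
        for name in INT_COLUMNS:
            record[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(row[name])
        for name in CATEGORICAL_COLUMNS:
            record[name] = row[name].strip().lower()
        records.append(record)
    return records


# Illustrative rows only; the values are invented for this sketch.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,24.5,0,no,northwest,4000.0
52,male,31.2,2,yes,southeast,38000.0
"""

records = load_records(io.StringIO(SAMPLE_CSV))
print(records[0])
```

Failing early on an unexpected header keeps later feature engineering from silently operating on a misnamed column.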
4.3. Data Preprocessing
4.4. Feature Engineering
4.5. Exploratory Data Analysis (EDA)
4.5.1. Distribution of Charges
4.5.2. Impact of Categorical Features on Insurance Charges
4.6. Model Building and Evaluation
4.6.1. Model Creation and Implementation
- Pandas and NumPy were utilized for structured data processing and numerical computations.
- Matplotlib and Seaborn were employed to generate visual representations for understanding data distributions and patterns.
- Scikit-learn provided the framework for building and evaluating machine learning models such as Linear Regression and Random Forest.
- XGBoost was used for implementing an advanced gradient boosting regression model.
- TensorFlow and Keras (Sequential) facilitated the development and training of the Artificial Neural Network.
- SHAP was used for model interpretability.
- Jupyter Notebook served as the interactive coding and documentation environment for the entire project.
4.6.2. Model Evaluation
4.7. Explainable AI (XAI)
5. Results & Discussion
- The ANN achieved the highest R² (0.88) and the lowest RMSE (4680.82) of the four models, consistent with its design goal of capturing complex relationships in the data. Its MAPE (37.99%), however, was the highest of the four.
- XGBoost and Random Forest performed comparably, each with an R² of 0.86 and the two lowest MAE values (2677.80 and 2635.00, respectively), indicating a reasonable capacity to learn non-linear trends.
- Linear Regression was consistent but underperformed overall: it had the lowest R² (0.85) and the highest MAE (2873.51) and RMSE (4927.88), although its MAPE (29.32%) was marginally the lowest. The model is simple and interpretable, but these results suggest that linear assumptions are not sufficient to model medical insurance charges accurately.
- Cross-validation scores indicate that the models are consistent and able to generalize. The ANN had the highest mean R² under 5-fold cross-validation (0.9886), which suggests it is robust.
- Ensemble and deep learning models benefited significantly more from the feature engineering applied during preprocessing, particularly binning and encoding.
6. Model Interpretation
7. Conclusion
8. Future Work
- Expand and Diversify the Dataset: Include real-world, large-scale insurance records along with more personal and health-related details to improve model accuracy and generalization.
- Advanced Feature Engineering: Since the key features that affect the ANN model were created through feature engineering, it is important to explore meaningful, transparent feature construction to improve both performance and explainability.
- Model Deployment: Create a user-friendly application, either web or mobile, that allows users to enter personal data and get real-time insurance cost predictions.
- Improve Model Interpretability: Use Explainable AI techniques, including SHAP interaction values or DeepSHAP, to gain a better understanding of complex feature relationships in deep learning models.
- Fairness and Bias Assessment: Review model predictions across different demographic groups to ensure fair predictions and reduce potential bias in real-world applications.
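As one possible starting point for the fairness assessment listed above, prediction error can be compared across demographic groups. The sketch below computes MAE per group; the helper name `mae_by_group`, the `predicted` field, and all values are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict


def mae_by_group(records, group_key):
    """Mean absolute error of predictions within each value of group_key."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of |error|, count]
    for r in records:
        bucket = totals[r[group_key]]
        bucket[0] += abs(r["charges"] - r["predicted"])
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Invented actual and predicted charges, for illustration only.
scored = [
    {"sex": "female", "charges": 4000.0, "predicted": 4500.0},
    {"sex": "female", "charges": 12000.0, "predicted": 11000.0},
    {"sex": "male", "charges": 5000.0, "predicted": 5200.0},
    {"sex": "male", "charges": 30000.0, "predicted": 27000.0},
]

group_mae = mae_by_group(scored, "sex")
# Difference between the worst-served and best-served group.
gap = max(group_mae.values()) - min(group_mae.values())
print(group_mae, gap)
```

The same function can be called with `"region"` or `"smoker"` as the grouping key. A large gap does not by itself establish bias, but it flags where a closer review is warranted.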
Acknowledgments

| Feature Name | Datatype | Description |
| age_group_Adult | Categorical | 1 if age is classified as Adult, 0 otherwise. |
| bmi_category_Obese | Categorical | 1 if BMI category is Obese, 0 otherwise. |
| is_parent | Binary | Indicates whether the person has one or more children (1 = Yes, 0 = No). |
| risk_score | Numeric | Custom score calculated based on age, BMI, and smoking status to indicate risk. |
| region_bmi_avg_diff | Numeric | Difference between individual's BMI and average BMI in their region. |
| region_northwest | Binary | One-hot encoded column for Northwest region. |
| region_southeast | Binary | One-hot encoded column for Southeast region. |
| region_southwest | Binary | One-hot encoded column for Southwest region. |
| sex_male | Binary | One-hot encoded column (1 = Male, 0 = Female). |
| smoker_yes | Binary | One-hot encoded column (1 = Smoker, 0 = Non-smoker). |
| children | Numeric | Number of dependent children covered under the insurance plan. |
| bmi | Numeric | Body Mass Index, a measure of body fat based on height and weight. |
| age | Numeric | Age of the individual (in years). |
| charges | Numeric | Actual medical insurance cost charged. |
| age_group_Senior | Categorical | 1 if age is classified as Senior, 0 otherwise. |
| age_group_Mid-Age | Categorical | 1 if age is classified as Mid-Age, 0 otherwise. |
| bmi_category_Overweight | Categorical | 1 if BMI category is Overweight, 0 otherwise. |
| bmi_category_Underweight | Categorical | 1 if BMI category is Underweight, 0 otherwise. |
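The engineered columns in the table above could be derived roughly as follows. This is a sketch under stated assumptions, not the authors' code. The BMI cut-offs follow the common WHO bands, the age boundaries and the base category name "Young" are hypothetical because the source does not give them, and `risk_score` is omitted because its formula is not stated.

```python
def bmi_category(bmi):
    """WHO-style BMI bands (assumed; the source does not state its cut-offs)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    return "Obese"


def age_group(age):
    """Hypothetical age bins; the source names the groups but not their boundaries."""
    if age < 25:
        return "Young"
    if age < 40:
        return "Adult"
    if age < 55:
        return "Mid-Age"
    return "Senior"


def region_bmi_means(records):
    """Average BMI per region, used for region_bmi_avg_diff."""
    totals = {}
    for r in records:
        total, count = totals.get(r["region"], (0.0, 0))
        totals[r["region"]] = (total + r["bmi"], count + 1)
    return {region: total / count for region, (total, count) in totals.items()}


def engineer(record, region_means):
    """Build the engineered feature row for one person (risk_score omitted)."""
    out = {
        "age": record["age"],
        "bmi": record["bmi"],
        "children": record["children"],
        "is_parent": int(record["children"] > 0),
        "region_bmi_avg_diff": record["bmi"] - region_means[record["region"]],
        "sex_male": int(record["sex"] == "male"),
        "smoker_yes": int(record["smoker"] == "yes"),
    }
    # One-hot columns; the first category of each variable is the dropped baseline.
    for name in ("Adult", "Mid-Age", "Senior"):
        out[f"age_group_{name}"] = int(age_group(record["age"]) == name)
    for name in ("Underweight", "Overweight", "Obese"):
        out[f"bmi_category_{name}"] = int(bmi_category(record["bmi"]) == name)
    for name in ("northwest", "southeast", "southwest"):
        out[f"region_{name}"] = int(record["region"] == name)
    return out


# Invented records for illustration only.
people = [
    {"age": 30, "sex": "female", "bmi": 24.0, "children": 0, "smoker": "no", "region": "northwest"},
    {"age": 60, "sex": "male", "bmi": 32.0, "children": 2, "smoker": "yes", "region": "northwest"},
]
means = region_bmi_means(people)
features = [engineer(p, means) for p in people]
print(features[1])
```

Dropping one category per variable, as the table implies (no `region_northeast` column, for example), avoids perfectly collinear dummy columns in the linear model.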
| Model | MAE | RMSE | R² | MAPE (%) |
| Linear Regression | 2873.51 | 4927.88 | 0.85 | 29.32 |
| Random Forest | 2635.00 | 4842.31 | 0.86 | 31.80 |
| XGBoost | 2677.80 | 4867.41 | 0.86 | 29.44 |
| ANN (Deep Learning) | 2825.85 | 4680.82 | 0.88 | 37.99 |
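For reference, the four metrics reported in the table above can be computed as in the following sketch. It uses the standard textbook definitions in plain Python; the function name `regression_metrics` and the sample values are illustrative, and the paper's own figures were presumably produced with library routines.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R² and MAPE (in percent) for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
        # MAPE is undefined when a true value is zero; charges are assumed positive.
        "MAPE": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Invented values for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

Because MAPE weights each error by the size of the true value, a model can have the lowest RMSE and still have a high MAPE if its relative errors are large on small charges.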
| Model | 5-Fold Cross-Validation (mean R²) |
| Linear Regression | 0.84 |
| Random Forest | 0.84 |
| XGBoost | 0.86 |
| ANN (Deep Learning) | 0.9886 |
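The cross-validation procedure behind the scores above can be sketched as follows. This is a minimal, dependency-free illustration: a one-feature least-squares line stands in for the actual regressors, the data are synthetic, and the helper names (`k_fold_indices`, `fit_line`, `cross_val_r2`) are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def r2_score(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares for one feature; a stand-in for any regressor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_val_r2(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score R² on the held-out fold, and average."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [slope * xs[i] + intercept for i in held_out]
        scores.append(r2_score([ys[i] for i in held_out], preds))
    return sum(scores) / k, scores


# Synthetic data: a linear trend in age plus a small deterministic wobble.
ages = list(range(20, 60))
charges = [250.0 * a + (50.0 if a % 2 else -50.0) for a in ages]
mean_r2, fold_scores = cross_val_r2(ages, charges)
print(mean_r2, fold_scores)
```

Each sample is held out exactly once, so the mean score reflects performance on data the model did not see during fitting.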
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).