Working Paper Article Version 2 This version is not peer-reviewed

A Comparative Analysis of Machine Learning Models for Prediction of Insurance Uptake in Kenya

Version 1 : Received: 8 October 2020 / Approved: 9 October 2020 / Online: 9 October 2020 (08:41:24 CEST)
Version 2 : Received: 11 February 2021 / Approved: 11 February 2021 / Online: 11 February 2021 (10:58:40 CET)

A peer-reviewed article of this Preprint also exists.

Yego, N.K.; Kasozi, J.; Nkurunziza, J. A Comparative Analysis of Machine Learning Models for the Prediction of Insurance Uptake in Kenya. Data 2021, 6, 116.


Insurance plays a major role in financial inclusion and in economic growth. However, low uptake appears to impede the growth of the sector, hence the need for a model that robustly predicts the uptake of insurance among potential clients. In this research, we compared the performance of eight machine learning models in predicting insurance uptake. The classifiers considered were Logistic Regression, Gaussian Naive Bayes, Support Vector Machines, K Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting Machines and Extreme Gradient Boosting. The data used for classification came from the 2016 Kenya FinAccess Household Survey. Because the classes were imbalanced, performance was compared on both upsampled and downsampled data. On upsampled data, the Random Forest classifier showed the highest accuracy and precision of all the classifiers, whereas on downsampled data Gradient Boosting performed best. Notably, on both upsampled and downsampled data, tree-based classifiers were more robust than the others in predicting insurance uptake. Even after hyper-parameter optimization, the area under the receiver operating characteristic curve remained highest for Random Forest among the tree-based models. In addition, the confusion matrix for Random Forest showed the fewest false positives and the most true positives, so it could be regarded as the most robust model for predicting insurance uptake. Finally, the most important feature in predicting uptake was having a bank product, which suggests that bancassurance is a plausible distribution channel for insurance products.
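The abstract names upsampling and downsampling as the two ways the class imbalance was handled, but does not say how they were implemented. As a minimal sketch, assuming simple random resampling (minority rows drawn with replacement, majority rows drawn without replacement), the two strategies could look like the following. The function names, the column names `insured` and `has_bank_product`, and the toy data are all hypothetical and are not taken from the FinAccess survey or the authors' code.

```python
import random
from collections import Counter


def split_by_class(rows, label_key):
    """Group rows (dicts) by the value of their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    return groups


def upsample(rows, label_key, seed=0):
    """Grow every smaller class to the size of the largest class
    by drawing extra rows with replacement."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # k is 0 for the largest class, so nothing is added there.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def downsample(rows, label_key, seed=0):
    """Shrink every larger class to the size of the smallest class
    by drawing rows without replacement."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    target = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))
    return balanced


# Hypothetical toy data: 90 respondents without insurance, 10 with.
toy = [{"has_bank_product": i % 2, "insured": 0} for i in range(90)]
toy += [{"has_bank_product": 1, "insured": 1} for _ in range(10)]

up = upsample(toy, "insured")
down = downsample(toy, "insured")
up_counts = Counter(row["insured"] for row in up)
down_counts = Counter(row["insured"] for row in down)
print("upsampled:", dict(up_counts))
print("downsampled:", dict(down_counts))
```

In a setup like this, resampling would normally be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.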


Insurance Uptake; Machine Learning; Upsample; Downsample


Computer Science and Mathematics, Algebra and Number Theory

Comments (1)

Comment 1
Received: 11 February 2021
Commenter: Nelson Kemboi Yego
Commenter's Conflict of Interests: Author
Comment: Some changes in abstract, keywords and one of the author names (change from Nkrunziza to Nkurunziza).