Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-hoc Interpretability of the Risk Factors

Version 1 : Received: 3 February 2023 / Approved: 7 February 2023 / Online: 7 February 2023 (03:43:16 CET)

A peer-reviewed article of this Preprint also exists.

García-Vicente, C.; Chushig-Muzo, D.; Mora-Jiménez, I.; Fabelo, H.; Gram, I.T.; Løchen, M.-L.; Granja, C.; Soguero-Ruiz, C. Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors. Appl. Sci. 2023, 13, 4119. García-Vicente, C.; Chushig-Muzo, D.; Mora-Jiménez, I.; Fabelo, H.; Gram, I.T.; Løchen, M.-L.; Granja, C.; Soguero-Ruiz, C. Evaluation of Synthetic Categorical Data Generation Techniques for Predicting Cardiovascular Diseases and Post-Hoc Interpretability of the Risk Factors. Appl. Sci. 2023, 13, 4119.

Abstract

Machine Learning (ML) methods have become important to enhance the performance of decision-support predictive models. However, class imbalance is one of the main challenges for developing ML models, because it limits the generalization of these models, and biases the learning algorithms. In this paper, we consider oversampling methods for generating synthetic categorical clinical data aiming to improve the predictive performance in ML models, and the identification of risk factors for cardiovascular diseases (CVDs). We performed a comparative study of several categorical synthetic data generation methods, including Generative Adversarial Networks (GANs). Then, we assessed the impact of combining oversampling strategies and linear and nonlinear supervised ML methods. Lastly, we conducted a post-hoc model interpretability based on the importance of the risk factors. Experimental results show the potential of GAN-based models for generating high-quality categorical synthetic data, yielding probability mass functions that are highly close to real data, maintaining relevant insights, and contributing to increase the predictive performance. The GAN-based model and a linear classifier outperforms other oversampling techniques, improving 2\% the area under the curve. These results demonstrate the capability of synthetic data to help both in determining risk factors and building models for CVD prediction.

Keywords

Synthetic categorical data generation; generative adversarial networks; imbalance learning; CTGAN; interpretable machine learning; cardiovascular disease

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning

Comments (0)

We encourage comments and feedback from a broad range of readers. See criteria for comments and our Diversity statement.

Leave a public comment
Send a private comment to the author(s)
* All users must log in before leaving a comment
Views 0
Downloads 0
Comments 0
Metrics 0


×
Alerts
Notify me about updates to this article or when a peer-reviewed version is published.
We use cookies on our website to ensure you get the best experience.
Read more about our cookies here.