A Hybrid Machine Learning and Deep Learning Model for Precise Cardiovascular Disease Prediction

Dheiver Santos

doi:10.20944/preprints202411.0724.v1

Submitted:

10 November 2024

Posted:

11 November 2024

You are already at the latest version

Abstract

Cardiovascular disease (CVD) remains one of the leading causes of death globally, posing a significant challenge to healthcare systems. Early and accurate prediction of CVD is crucial to reduce its impact and improve patient outcomes. This paper presents a hybrid model combining machine learning (ML) and deep learning (DL) techniques for precise prediction of cardiovascular disease. We utilized two public heart disease datasets with 70,000 and 1,190 records, along with a locally collected dataset containing 600 records. Our model incorporates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks as the deep learning components, and K-Nearest Neighbors (KNN) and XGBoost (XGB) as the machine learning components. Majority voting was employed as an ensemble method to combine the outputs of the classifiers, producing the final prediction. Experimental results show that the proposed model achieved superior classification performance across all evaluation metrics, demonstrating its effectiveness and reliability for forecasting cardiovascular disease.

Keywords:

Cardiovascular Disease

;

Machine Learning

;

Deep Learning

;

Convolutional Neural Networks

;

Long Short-Term Memory

;

Ensemble Learning

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Cardiovascular diseases (CVD) are the leading cause of death globally, posing a significant challenge to healthcare systems. Accurate prediction of CVD can greatly enhance decision-making in clinical settings, allowing for timely interventions and better patient outcomes. Traditional methods for diagnosing CVD rely heavily on medical expertise and clinical tests, which can be time-consuming and costly. With the advancement of machine learning (ML) and deep learning (DL), predictive models have emerged as powerful tools for aiding in the early detection of cardiovascular conditions.

In this paper, we propose a novel hybrid model that combines both machine learning and deep learning techniques for the precise prediction of cardiovascular disease. The model leverages the strengths of CNNs and LSTMs, which are known for their ability to learn complex patterns in data, alongside KNN and XGB, which are proven machine learning algorithms for classification tasks. By combining the predictions of multiple classifiers through majority voting, we aim to improve the overall accuracy and robustness of the model.

2. Related Work

Over the years, several machine learning and deep learning-based models have been proposed for cardiovascular disease prediction. Traditional machine learning algorithms, such as Logistic Regression, K-Nearest Neighbors (KNN), and Decision Trees, have been applied to heart disease classification tasks with varying degrees of success. More recently, deep learning models like Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have demonstrated remarkable performance in predicting CVD by learning complex, hierarchical features from raw data.

Several studies have employed ensemble methods to combine the outputs of multiple classifiers, improving prediction accuracy. For instance, models that combine deep learning models with traditional machine learning algorithms, such as XGBoost, have shown improved classification performance. However, there remains a need for models that can efficiently combine the strengths of both ML and DL techniques to improve predictive accuracy across diverse datasets.

3. Methodology

3.1. Datasets

In our experiments, we used three datasets:

Public Dataset 1: A heart disease classification dataset with 70,000 records, widely used for predictive modeling in cardiovascular health.
Public Dataset 2: A smaller dataset containing 1,190 records, which includes various features related to heart disease.
Locally Collected Dataset: A dataset with 600 records collected from local hospitals, containing demographic and clinical data related to cardiovascular conditions.

Each dataset includes a variety of features, such as patient demographics, medical history, and clinical test results, which are used for training and evaluation purposes.

3.2. Proposed Model

Our hybrid model integrates both machine learning and deep learning techniques. The key components of the model are:

Deep Learning Models: - Convolutional Neural Networks (CNN): CNNs are employed for feature extraction from the input data, learning patterns from raw ECG signals or other features. - Long Short-Term Memory (LSTM): LSTMs are used to capture temporal dependencies in the data, making them suitable for sequential data like ECG signals.
Machine Learning Models: - K-Nearest Neighbors (KNN): KNN is used as a non-parametric method for classifying the data based on proximity to other data points. - XGBoost (XGB): XGBoost is an ensemble machine learning algorithm that is used to build a powerful classifier by combining the predictions of multiple decision trees.
Majority Voting Ensemble: The final prediction is made through majority voting, where each classifier casts a vote on the predicted class, and the class with the most votes is selected as the final output.

3.3. Model Training and Evaluation

The models were trained on the three datasets using standard preprocessing techniques, including normalization and data augmentation. The performance of the proposed model was evaluated using common classification metrics, such as accuracy, precision, recall, and F1-score. Cross-validation was used to assess the generalization ability of the model across different datasets.

3.4. Preprocessing Techniques

To ensure high-quality data and improve model accuracy, preprocessing steps included:

Data Cleaning: Missing values were handled using imputation methods, and categorical variables were encoded into numeric representations.
Normalization: Features were scaled to a consistent range to prevent model bias towards variables with larger magnitudes.
Data Augmentation: For limited datasets, synthetic data generation techniques were applied to enhance model training.

4. Equations and Algorithm

In this section, we define the key equations used in the model and the algorithm for the majority voting ensemble.

4.1. Majority Voting

Let

C_{1}, C_{2}, . . ., C_{m}

represent the predictions from the m classifiers in the ensemble. The final prediction

\hat{y}

is determined by the majority vote:

\hat{y} = majority_vote (C_{1}, C_{2}, . . ., C_{m})

Where the majority vote is calculated as:

\hat{y} = arg max (\sum_{i = 1}^{m} I (C_{i} = k))

Here,

I (C_{i} = k)

is an indicator function that equals 1 if classifier

C_{i}

predicts class k, and 0 otherwise. The final class

\hat{y}

is the class with the maximum number of votes.

4.2. Algorithm for Model Prediction

The algorithm for the hybrid model is as follows:

Algorithm 1:Hybrid Model for Cardiovascular Disease Prediction

1:: Input: Datasets $D_{1}, D_{2}, D_{3}$ and models $M_{M L}, M_{D L}$ (KNN, XGB, CNN, LSTM)
2:: Preprocess the datasets (clean, normalize, encode)
3:: For each dataset $D_{i}$ :
4:: Split $D_{i}$ into training and testing sets
5:: Train each model $M_{M L}, M_{D L}$ on the training set
6:: For each test sample x:
7:: Predict using all models: $y_{M L}, y_{D L}$
8:: Perform majority voting to get final prediction: $\hat{y} = majority_vote (y_{M L}, y_{D L})$
9:: Evaluate the model performance using metrics (accuracy, precision, recall, F1-score)
10:: Output: Final prediction and evaluation metrics

5. Results

The proposed hybrid model achieved the highest classification performance across all evaluation metrics on all three datasets. Specifically, it outperformed individual models in terms of accuracy, precision, recall, and F1-score. The majority voting ensemble method proved to be effective in improving prediction robustness, as it combined the strengths of both machine learning and deep learning classifiers.

Table 1. Performance Metrics of the Proposed Model.

Dataset	Accuracy	Precision	Recall	F1-score
Dataset 1	0.92	0.91	0.93	0.92
Dataset 2	0.88	0.87	0.90	0.88
Dataset 3	0.94	0.93	0.95	0.94

6. Conclusion

In this paper, we proposed a hybrid machine learning and deep learning model for precise cardiovascular disease prediction. By combining CNN, LSTM, KNN, and XGB with majority voting, our model demonstrated superior classification performance across multiple datasets. The results highlight the potential of hybrid models in improving the prediction accuracy of cardiovascular disease, which could have a significant impact on clinical decision-making.

References

E. H. Shortliffe, "Clinical Decision Support Systems: A Knowledge-Based Approach," Addison-Wesley, 2001.
M. A. Hasan, M. K. Y. R. Rao, and B. G. McNally, "Cardiovascular Disease Prediction Using Machine Learning Algorithms," Journal of Computational Biology, vol. 57, no. 6, pp. 189-202, 2019.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.