1. Introduction
Cardiovascular diseases (CVD) are the leading cause of death globally, posing a significant challenge to healthcare systems. Accurate prediction of CVD can greatly enhance decision-making in clinical settings, allowing for timely interventions and better patient outcomes. Traditional methods for diagnosing CVD rely heavily on medical expertise and clinical tests, which can be time-consuming and costly. With the advancement of machine learning (ML) and deep learning (DL), predictive models have emerged as powerful tools for aiding in the early detection of cardiovascular conditions.
In this paper, we propose a novel hybrid model that combines both machine learning and deep learning techniques for the precise prediction of cardiovascular disease. The model leverages the strengths of CNNs and LSTMs, which are known for their ability to learn complex patterns in data, alongside KNN and XGB, which are proven machine learning algorithms for classification tasks. By combining the predictions of multiple classifiers through majority voting, we aim to improve the overall accuracy and robustness of the model.
2. Related Work
Over the years, several machine learning and deep learning-based models have been proposed for cardiovascular disease prediction. Traditional machine learning algorithms, such as Logistic Regression, K-Nearest Neighbors (KNN), and Decision Trees, have been applied to heart disease classification tasks with varying degrees of success. More recently, deep learning models like Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have demonstrated remarkable performance in predicting CVD by learning complex, hierarchical features from raw data.
Several studies have employed ensemble methods to combine the outputs of multiple classifiers, improving prediction accuracy. For instance, models that combine deep learning models with traditional machine learning algorithms, such as XGBoost, have shown improved classification performance. However, there remains a need for models that can efficiently combine the strengths of both ML and DL techniques to improve predictive accuracy across diverse datasets.
3. Methodology
3.1. Datasets
In our experiments, we used three datasets:
Public Dataset 1: A heart disease classification dataset with 70,000 records, widely used for predictive modeling in cardiovascular health.
Public Dataset 2: A smaller dataset containing 1,190 records, which includes various features related to heart disease.
Locally Collected Dataset: A dataset with 600 records collected from local hospitals, containing demographic and clinical data related to cardiovascular conditions.
Each dataset includes a variety of features, such as patient demographics, medical history, and clinical test results, which are used for training and evaluation purposes.
3.2. Proposed Model
Our hybrid model integrates both machine learning and deep learning techniques. The key components of the model are:
Deep Learning Models: - Convolutional Neural Networks (CNN): CNNs are employed for feature extraction from the input data, learning patterns from raw ECG signals or other features. - Long Short-Term Memory (LSTM): LSTMs are used to capture temporal dependencies in the data, making them suitable for sequential data like ECG signals.
Machine Learning Models: - K-Nearest Neighbors (KNN): KNN is used as a non-parametric method for classifying the data based on proximity to other data points. - XGBoost (XGB): XGBoost is an ensemble machine learning algorithm that is used to build a powerful classifier by combining the predictions of multiple decision trees.
Majority Voting Ensemble: The final prediction is made through majority voting, where each classifier casts a vote on the predicted class, and the class with the most votes is selected as the final output.
3.3. Model Training and Evaluation
The models were trained on the three datasets using standard preprocessing techniques, including normalization and data augmentation. The performance of the proposed model was evaluated using common classification metrics, such as accuracy, precision, recall, and F1-score. Cross-validation was used to assess the generalization ability of the model across different datasets.
3.4. Preprocessing Techniques
To ensure high-quality data and improve model accuracy, preprocessing steps included:
Data Cleaning: Missing values were handled using imputation methods, and categorical variables were encoded into numeric representations.
Normalization: Features were scaled to a consistent range to prevent model bias towards variables with larger magnitudes.
Data Augmentation: For limited datasets, synthetic data generation techniques were applied to enhance model training.
4. Equations and Algorithm
In this section, we define the key equations used in the model and the algorithm for the majority voting ensemble.
4.1. Majority Voting
Let
represent the predictions from the
m classifiers in the ensemble. The final prediction
is determined by the majority vote:
Where the majority vote is calculated as:
Here, is an indicator function that equals 1 if classifier predicts class k, and 0 otherwise. The final class is the class with the maximum number of votes.
4.2. Algorithm for Model Prediction
The algorithm for the hybrid model is as follows:
|
Algorithm 1:Hybrid Model for Cardiovascular Disease Prediction |
- 1:
Input: Datasets and models (KNN, XGB, CNN, LSTM)
- 2:
Preprocess the datasets (clean, normalize, encode)
- 3:
For each dataset :
- 4:
Split into training and testing sets
- 5:
Train each model on the training set
- 6:
For each test sample x:
- 7:
Predict using all models:
- 8:
Perform majority voting to get final prediction:
- 9:
Evaluate the model performance using metrics (accuracy, precision, recall, F1-score)
- 10:
Output: Final prediction and evaluation metrics
|
5. Results
The proposed hybrid model achieved the highest classification performance across all evaluation metrics on all three datasets. Specifically, it outperformed individual models in terms of accuracy, precision, recall, and F1-score. The majority voting ensemble method proved to be effective in improving prediction robustness, as it combined the strengths of both machine learning and deep learning classifiers.
Table 1.
Performance Metrics of the Proposed Model.
Table 1.
Performance Metrics of the Proposed Model.
| Dataset |
Accuracy |
Precision |
Recall |
F1-score |
| Dataset 1 |
0.92 |
0.91 |
0.93 |
0.92 |
| Dataset 2 |
0.88 |
0.87 |
0.90 |
0.88 |
| Dataset 3 |
0.94 |
0.93 |
0.95 |
0.94 |
6. Conclusion
In this paper, we proposed a hybrid machine learning and deep learning model for precise cardiovascular disease prediction. By combining CNN, LSTM, KNN, and XGB with majority voting, our model demonstrated superior classification performance across multiple datasets. The results highlight the potential of hybrid models in improving the prediction accuracy of cardiovascular disease, which could have a significant impact on clinical decision-making.
References
- E. H. Shortliffe, "Clinical Decision Support Systems: A Knowledge-Based Approach," Addison-Wesley, 2001.
- M. A. Hasan, M. K. Y. R. Rao, and B. G. McNally, "Cardiovascular Disease Prediction Using Machine Learning Algorithms," Journal of Computational Biology, vol. 57, no. 6, pp. 189-202, 2019.
|
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).