An ensemble data mining approach to discover medical patterns and predict ICU mortality after cardiac surgery using a stacking machine learning method

ABSTRACT The most effective way to reduce disease mortality is to diagnose the disease as early as possible. Data mining with machine learning therefore offers good opportunities to uncover hidden patterns in disease data. An accurate forecast of mortality after heart surgery supports successful medical treatment at lower cost. This research proposes a new stacking predictive model, combined with the random forest feature importance method, to predict mortality after heart surgery on a highly imbalanced dataset using the most practical features. To address the imbalance, a combination of the SVM-SMOTE over-sampling algorithm and the Edited-Nearest-Neighbour under-sampling algorithm is used. The proposed model is compared with several other machine learning classifiers through shuffle hold-out and 10-fold cross-validation strategies; under both validation schemes, it achieved the highest efficiency among the compared models. Furthermore, the Friedman statistical test is applied to examine the differences between models. The results demonstrate that the introduced stacking model reaches the most accurate predictive performance.


Introduction
Today, in medical science, collecting vast amounts of data about different diseases is very important, because discovering relations and hidden patterns among disease features can save many lives and improve quality of life (Bardhan and Thouin 2013). The healthcare industry continuously generates large amounts of data, and there is a wide gap between data collection and data interpretation. Machine learning is a helpful tool that lets the industry move from in-depth data analysis to medical research and scientific decision-making in diagnosis and treatment. Diagnosis and the choice of appropriate treatment are of great importance in medical science. Beyond wasting time and money, choosing the wrong treatment can have detrimental effects and, in some cases, even lead to the patient's death. It is therefore essential to provide a model for finding a suitable treatment (Ratnakar et al. 2013). Machine learning is a powerful tool that uses previously recorded data to predict future events (Banerjee et al. 2021). Risk prediction plays a critical role in clinical decisions for patients undergoing heart surgery. Hospital heart surgery departments produce a remarkable amount of data every day, which data scientists can use to quantify a patient's health and foresee future events (Myslivecek and Benedetto 2020). The most important outcome after heart surgery is mortality, because earlier diagnosis raises the probability of a patient's survival (Tuesta et al. 1999). For this reason, different machine learning methods have been developed over past decades to predict mortality, and scholars have made myriad endeavours to enhance mortality prediction accuracy (Xia et al. 2012). In 1999, J. Martinez-Alario et al. 
investigated the performance of general severity systems and compared them with the Parsonnet score for predicting mortality after cardiac surgery. In 2021, Dimitris Bertsimas et al. used decision trees, valued for their interpretability and accuracy, to predict mortality, postoperative mechanical ventilatory support time (MVST), and hospital length of stay (LOS) for patients who underwent congenital heart surgery, comparing them to logistic regression, random forest, and gradient boosting; optimal classification trees outperformed all three models in terms of area under the curve (Bertsimas et al. 2021). The EuroSCORE is a widely used measure for predicting mortality and the probability of heart failure. Although it often overestimates risk, it continues to be used in the United Kingdom for lack of alternative validated models; machine learning models can help correct this overestimation, so the EuroSCORE is considered one of the essential features in the dataset (Myslivecek and Benedetto 2020). As mentioned, mortality prediction is a critical topic after surgery because, with this information, many efforts can be made to save lives. Xia et al. (2012) recommended an Artificial Neural Network (ANN) approach using information recorded in the first two days in the ICU to foresee mortality risk. In 2000, Thomas G. Dietterich experimentally compared three models, bagging, boosting, and randomisation, for improving the performance of the C4.5 algorithm. His results indicate that with little or no classification noise, randomisation can compete with bagging but is not as accurate as boosting, while with substantial classification noise, bagging performs better than boosting and sometimes better than randomisation. This motivated us to compare different ML models with our new model to enhance its performance (Dietterich 2000). 
In addition to this comparison, we wanted to know whether machine learning can predict mortality after heart surgery, as Umberto Benedetto et al. recently did using a neural network, random forest, naïve Bayes, and retrained LR based on features included in the EuroSCORE (Myslivecek and Benedetto 2020). Researchers have also studied accurate diagnosis and effective treatment of heart disease with ML. In 2017, Pouriyeh et al. surveyed and compared different ML methods, including Bayesian, KNN, SVM, Bagging, Boosting, and Stacking, to predict heart disease (Pouriyeh et al. 2017). Improving the accuracy of mortality prediction can be crucial, and several studies have shown that ensemble methods can improve prediction outcomes (Onan 2020). In 2017, Awad et al. (2017) recommended an ensemble learning random forest (RF) and concluded that the proposed ensemble model predicts better than other classification models. Ghose et al. (2015) and Darabi et al. (2018) reached similar outcomes by using ensemble methods to foresee the risk of mortality in the ICU. These studies demonstrated that combining classifiers can enhance prediction outcomes. Furthermore, according to Ghorbani and Ghousi's review paper (2019) on predictive models in medical diagnosis, scholars have achieved better accuracy and prediction results when using ensemble approaches. Other machine learning classifiers have also been refined and compared with various models. According to research done by Dehkordi and Sajedi in 2017 on predicting depression in older adults, one of the most efficient ensemble methods is the stacking technique. 
That article's outcome indicated that the proposed stacking model, a combination of K-Nearest-Neighbour, logistic regression, support vector machine, and decision tree, had higher accuracy than each of them individually (Dehkordi and Sajedi 2017). Furthermore, this research aims to compare different single and ensemble classifiers with the proposed new stacking model, following Rosaida Rosly's study, which compared different models under 10-fold cross-validation, including three popular ensemble methods, boosting, bagging, and stacking, and single classifiers such as Naïve Bayes, Multi-Layer Perceptron, and Decision Tree (Rosly et al. 2018). In 2019, Yoshihiko Raita et al. used classification models including Lasso regression, random forest, gradient boosted decision trees, and deep neural networks for clinical prediction and compared their performance (Raita et al. 2019). Beyond the scarcity of ensemble models for predicting mortality after heart surgery, two other factors that improve prediction accuracy must be managed: feature importance and imbalanced class distribution. Muhammad Waqar et al. (2021) used SMOTE to handle the imbalanced-data problem for heart attack prediction, a problem that is especially common in mortality prediction; it is an efficient approach and is also used in our research. Furthermore, our research shows that this problem can affect model performance (Fotouhi et al. 2019). Jale Bektas et al. worked on classifying real imbalanced cardiovascular data using feature selection and resampling methods (Bektas et al. 2017). Given the importance of these factors, Roumani et al. (2013) compared the performance of several prevalent machine learning approaches on the imbalanced dataset problem. 
If the dataset has too many features, a high-dimensionality problem may appear. In this situation, some features are not essential or influential enough, so dimensionality reduction is required, as Anna Karen Gárate-Escamila et al. (2020) did for heart disease prediction using PCA and random forest. Mohammad Al Khaldy and Chandrasekhar Kambhampati focused on different feature selection methods and resampling of imbalanced classes for a heart failure dataset (2018). It seems apparent that the scarcity of ensemble models and the handling of imbalanced data should be considered together as a useful approach in the real world. A combination of classifiers can improve prediction results and accuracy compared to a single classifier. This paper proposes a new ensemble model using the stacking approach to develop a robust early mortality forecasting model while solving the highly imbalanced data problem with a combination of two resampling strategies.
Compared to similar research, this paper contains the following innovations and important processes:
• Predicting mortality after heart surgery with a new stacking model.
• Proposing a new ensemble classifier using the stacking approach.
• Solving the problem of a highly imbalanced dataset using a combination of the SVM-SMOTE and Edited-Nearest-Neighbour approaches.
• Applying both simple validation measures and 10-fold cross-validation to implement the validation test.
• Comparing the new stacking classifier with different single and ensemble machine learning classifiers.
• Measuring the performance of the proposed stacking model with different evaluation criteria such as accuracy, area under the ROC curve, Recall, Precision, and F1-Score.
• Using the Friedman test as a statistical test to analyse the differences among models, identify the best one, and validate the results.
The remainder of this article is organised as follows: Section 2 explains the dataset, data cleaning, a new method to handle the imbalanced class problem, and feature selection. Section 3 details the new stacking ensemble model and its comparison with other machine learning models. Section 4 explains the evaluation approaches used to interpret model efficiency. Section 5 presents the results and analysis, showing the acceptable performance of the recommended stacking model compared to other classifiers. Finally, Section 6 concludes and proposes some directions for future research.
A vital ingredient of any research is problem perception. This research aims to develop an ensemble model using a stacking approach to predict mortality after heart surgery. All coding was done in Python, a constructive and beneficial language in the field of machine learning, and all practical experiments were run on a Lenovo machine with a 2.40 GHz Intel Core i7 and 16 GB of RAM. The procedure applied to reach the target of this research is portrayed in Figure 1.

About dataset and data cleaning
In this paper, the chosen dataset was gathered and recorded manually from hospitals affiliated with Shahid Beheshti University of Medical Sciences and Health Services in Iran between 2015 and 2020. The data were recorded during and after the heart surgery (recovery) process while patients were hospitalised; accordingly, the features are considered constant during the entire study. After the data cleaning step, the dataset comprises 1632 records and 46 attributes: 45 features available for mortality prediction plus the response variable, named Mortality. In the following sections, the feature selection step reduces this dimensionality.

Normalisation of features
As most real-world datasets contain features of markedly different sizes, units, and ranges, feature scaling is applied to standardise the independent variables. This step is called normalisation. It is necessary because the performance of many machine learning algorithms is affected by diverse feature scales (Aksoy and Haralick 2001; Ebrahimi et al. 2019). Therefore, in this paper, features are rescaled so that every feature follows a standard normal distribution with µ = 0 and σ = 1, where µ is the average and σ is the standard deviation. For each feature value x, the rescaled value is given by:

x' = (x − µ) / σ
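This z-score rescaling can be sketched with scikit-learn's `StandardScaler`; the toy matrix below is purely illustrative and does not come from the paper's dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy matrix standing in for two numeric patient features (hypothetical values)
X = np.array([[54.0, 120.0],
              [67.0, 145.0],
              [71.0,  98.0],
              [48.0, 132.0]])

scaler = StandardScaler()           # rescales each column to mean 0, std 1
X_scaled = scaler.fit_transform(X)  # applies x' = (x - mu) / sigma per column

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

The scaler is fitted on the training data only and then reused to transform the test set, so test observations never influence µ and σ.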

A new method to handle imbalanced data
Data preprocessing is a necessary part of data mining and machine learning. The learning efficiency of a classifier depends markedly on the quality of the dataset, so preparing the data is a critical step before feeding it to a classifier (Zhang et al. 2003). The introduced dataset contained some missing data, which was handled during the data cleaning process. Before the preprocessing step, which included identifying and handling missing values, normalising numeric features, and deleting ineffective variables, the performance of the classifiers was weak and unreliable. A balanced dataset is essential for creating an efficient training set. Imbalanced class distribution is a prevalent problem in real-world medical data, and specifically in mortality prediction: one of the two classes is significantly under-represented in the dataset (Tripathi et al. 2019), and the majority class highly dominates the minority class. For this reason, machine learning models tend to assign new observations to the majority class (Liu et al. 2019), which leads to poor performance on minority-class prediction (Liu et al. 2018; Scrutinio et al. 2020). The heart surgery dataset is markedly imbalanced: the majority class, surviving patients, has 1583 cases, while the minority class has only 49 cases of deceased patients. Given this high imbalance, handling the problem is imperative for classifiers to perform more efficiently and accurately. Different sampling strategies built to handle the imbalanced data problem can be implemented during this step. Resampling is one of the most valuable: it creates a new dataset by increasing the minority class or decreasing the majority class (Cox 1958). 
This approach encompasses over-sampling and under-sampling strategies. Over-sampling techniques increase the number of minority-class members in the training set; their critical benefit is that no information from the original training set is lost, because all members of both classes are preserved (Onan 2019). The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most prevalent and helpful over-sampling methods for the imbalanced data problem (Cengiz Colak et al. 2017). It produces synthetic minority-class examples according to feature-space similarities among minority-class samples, balancing the dataset (Devi et al. 2017; Ishwaran and O'Brien 2021). This paper solves the imbalanced data problem using a combination of the SVM-SMOTE over-sampling and Edited-Nearest-Neighbour under-sampling algorithms, which are extensively used in machine learning with imbalanced high-dimensional data and increasingly in medicine. SVM-SMOTE has shown more acceptable performance than other resampling methods (Alghamdi et al. 2017) and helps predictive models perform more accurately and reliably (Liu et al. 2006). SMOTE has several extended variants for handling imbalanced data; in 2020, Ghorbani and Ghousi compared them and found that SVM-SMOTE performs better than the others and can improve classifier performance (2020). SVM-SMOTE concentrates on generating new minority-class examples near the borderline identified by an SVM, helping establish boundaries among classes using data and density information, which is crucial for synthesising minority samples. After using the two algorithms individually and comparing them with other balancing methods, we realised that their combination outperformed them across different classifiers. 
Because of the large difference between the sizes of the two classes, it seems better to reduce the majority class and increase the minority class simultaneously; comparing such combined methods with single resampling methods on different datasets is a good direction for future work. This paper combines SVM-SMOTE with the Edited-Nearest-Neighbour method to overcome the imbalanced data problem and its destructive influence on machine learning algorithms, showing how a blend of over-sampling and under-sampling resampling techniques can ameliorate model performance (Hansen 1999). It is fundamental that all resampling is applied only to the training set; otherwise, the synthetic observations can be seen by the classifier, leading to overfitting and unreliable outcomes. Models must be tested only on unseen data. Therefore, the proposed combination of the two resampling strategies, SVM-SMOTE and Edited-Nearest-Neighbour, is applied only to the training set in both the hold-out and 10-fold cross-validation settings. First, 25% of the dataset is separated randomly by the hold-out strategy as the test set, and balancing is performed on the remaining set, referred to as the balanced set in this paper. K-fold cross-validation is a helpful technique for evaluating the performance of classification models and determines how well the statistical analysis outcomes generalise to an independent dataset (Normawati and Ismi 2019). This paper uses random hold-out and shuffle 10-fold cross-validation as two general forms of validation. In the hold-out strategy, 75% of the data is randomly assigned to the training set and the remaining 25% to the test set. In 10-fold cross-validation, the balanced set is randomly divided into ten equally sized subsamples. 
One subsample is then held out while the remaining subsamples are used as the training set, and the held-out subsample is used to evaluate each model; this procedure is repeated ten times. As emphasised above, models should be tested only on unseen data, so after the test set is separated by random hold-out, the combination of SVM-SMOTE and Edited-Nearest-Neighbour is applied only to the training set, even during 10-fold cross-validation.

Feature importance
In this paper, Random Forest feature importance is used for feature selection. In a high-dimensional dataset, some features are more essential than others, and the feature selection process in the data preprocessing step determines which ones; this choice affects a machine learning model's performance (Chen et al. 2020). Feature selection identifies the important and influential features to be used for prediction by machine learning models. It can also reduce the dimension of a high-dimensional dataset by dropping the less effective or noisy features, making classifiers more efficient and precise (Onan 2015). Feature selection techniques are divided into wrapper and filter techniques; a wrapper method uses a classifier to select an optimal subset of features. Random Forest is a nonlinear technique that can be interpreted well and provides feature importance measures by calculating the Gini importance (Saarela and Jauhiainen 2021). In this paper, the dataset had 45 features related to mortality prediction after heart surgery. After applying Random Forest feature importance, the importance of each feature was determined; Table 1 lists the first 18 features, because the other 27 had less than 1 percent importance and only a slight effect on the performance of the machine learning classifiers. Recognising the essential features leads to dimension reduction (Onan 2016), which is helpful in mortality prediction. These 18 features are used to predict mortality status after heart surgery by the different machine learning models and the new stacking model.
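A minimal sketch of this step with scikit-learn, using a synthetic 45-feature stand-in for the real dataset; the 1% importance cut-off follows the text, while everything else (sample data, forest size) is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# stand-in for the 45-feature dataset (the real features are listed in Table 1)
X, y = make_classification(n_samples=1632, n_features=45, n_informative=12,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Gini importances sum to 1; keep only features contributing at least 1%
keep = np.flatnonzero(rf.feature_importances_ >= 0.01)
X_reduced = X[:, keep]
print(f'kept {len(keep)} of {X.shape[1]} features')
```

Sorting `rf.feature_importances_` in descending order reproduces the kind of ranking reported in Table 1.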

Machine learning models
Many machine learning algorithms have been developed to solve classification problems across different fields of science (Ngiam and Khor 2019), such as Logistic Regression (Cox 1958), Naive Bayes (Kumar et al. 2018), Decision Tree (Du et al. 2002; Wang et al. 2007), Random Forest (Prasad et al. 2006), K-Nearest-Neighbour (Nowicki 2019), Gradient Boosting (XG-Boost) (Bauer and Kohavi 1999; Kilic et al. 2020), and Artificial Neural Network (Jain et al. 1996). These algorithms are applied after preprocessing, when the dataset is ready to be learned by the computer. In this paper, several of them are used and compared to the stacking model. Hyperparameters were tuned by running the models several times, comparing the average outcomes, and testing different parameter values on the dataset; the resulting models with their effective parameters are given in Table 2. Since the algorithms, goals, data types, and data volumes change considerably from one project to another, there is no single best choice of hyperparameter values that fits all models and all problems; instead, hyperparameters must be optimised within the context of each machine learning project. Additionally, the role of handling imbalanced data can be seen in the performance of the different models through the good prediction of both classes. It should be mentioned that the neural network was not able to predict the minority class, so it is not compared with the other ML models in this paper.
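The paper does not specify its tuning procedure beyond repeated runs and averaged outcomes; one common way to organise such a search, shown here purely as an illustrative sketch with an arbitrary grid (the paper's actual values are those in Table 2), is scikit-learn's `GridSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)

# small illustrative grid; each candidate is evaluated by cross-validated accuracy
grid = {'n_estimators': [100, 300], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Averaging over cross-validation folds plays the same role as the repeated runs described in the text: it reduces the chance of choosing a parameter set that merely got lucky on one split.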

Stacking model
This research proposes a new combination of models based on the stacking algorithm, which uses a meta-model to combine the predictions of the contributing members. The model combines five single and ensemble classifiers, developed to improve the predictive performance for mortality after heart surgery. Stacking is an efficient ensemble technique because it can balance bias and variance to reduce total error and achieve better prediction; the aim of ensemble classifiers is to reduce the bias and variance of individual classifiers to build a robust classifier with better performance. In the stacking method, the output of the base classifiers (Level 0) is used as training data for another classifier, named the meta-classifier (Level 1), to estimate the same target function. Once the training set for the meta-learner is ready, the meta-learner can be trained in isolation on it, while the base learners are trained on the entire original training dataset. In this paper, the base classifiers of the proposed stacking model are Logistic Regression, Bagging, Decision Tree, Random Forest, and XG-Boost; the meta-classifier is Logistic Regression. All were selected for their better individual performance on certain criteria relative to the other classifiers: Random Forest had the best precision, Logistic Regression the best F1-Score, and Bagging a suitable accuracy and AUC. After testing different combinations of these models, which worked efficiently for some parts of the data, adding DT and XG-Boost produced a stacking model with the best performance on all criteria. Although the original classifiers predict the minority class weakly, they work efficiently for some parts of the data, so each model acts as a booster that sharpens the efficiency of the ensemble (Mienye et al. 2020). In the stacking ensemble approach, the meta-learner tries to discover how best to combine the outputs of the base learners. 
The meta-learner is often simple, providing a smooth interpretation of the predictions made by the base models, which makes LR a good choice for this role. This paper thus links two simple classifiers, Logistic Regression and Decision Tree, as the best single classifiers, with Random Forest, XG-Boost, and Bagging as the three best ensemble classifiers, to build a powerful ensemble machine learning model. Figure 2 demonstrates the levels and output of the proposed stacking model.
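A stacking model of this shape can be sketched with scikit-learn's `StackingClassifier`. Note one substitution: scikit-learn's `GradientBoostingClassifier` stands in for XG-Boost so the sketch needs no extra dependency, and the dataset and all hyperparameter values are illustrative, not the paper's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Level 0: the five base classifiers named in the text
base = [
    ('lr',  LogisticRegression(max_iter=1000)),
    ('bag', BaggingClassifier(random_state=0)),
    ('dt',  DecisionTreeClassifier(random_state=0)),
    ('rf',  RandomForestClassifier(random_state=0)),
    ('gb',  GradientBoostingClassifier(random_state=0)),  # stand-in for XG-Boost
]

# Level 1: Logistic Regression combines the base predictions
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)  # base outputs for the meta-learner come from CV

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(f'stacking test accuracy: {acc:.3f}')
```

The `cv=5` argument makes the meta-learner train on out-of-fold base predictions, which is what keeps Level 1 from simply memorising the Level 0 training fits.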

Assessment methods
The evaluation step is an inseparable part of applying classification models. Various measures can be used to assess the performance and validity of machine learning models; in this paper, Accuracy, Area under the ROC curve (AUC), Recall, Precision, and F1-Score are used as the measurement metrics. In addition, a statistical significance test is applied to examine the differences among the models.
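These five metrics can be computed with scikit-learn as in the sketch below; the labels and probabilities are hypothetical (1 = deceased, 0 = survived), and macro averaging is used so that both classes count equally:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# hypothetical ground truth and predicted probabilities for eight patients
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.8, 0.4, 0.1, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5 for hard labels

print('accuracy ', accuracy_score(y_true, y_pred))
print('AUC      ', roc_auc_score(y_true, y_prob))   # uses probabilities, not labels
# macro averaging penalises weak minority-class prediction more than the default
print('recall   ', recall_score(y_true, y_pred, average='macro'))
print('precision', precision_score(y_true, y_pred, average='macro'))
print('F1       ', f1_score(y_true, y_pred, average='macro'))
```

Note that AUC is computed from the predicted probabilities while the other four metrics use thresholded labels, which is why the text treats it as a complement to accuracy.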

Which model is the best?
There are various classification models; in this paper, the most common and useful ones are compared with the proposed stacking ensemble model. Most assessment criteria indicate that the stacking model performed better and improved mortality prediction after heart surgery. The model validation procedures used in this research are based on the random hold-out and shuffle 10-fold cross-validation techniques. Table 3 demonstrates the performance of the different models and the introduced stacking model under the hold-out strategy. In addition, for easier interpretation of the differences among the model performances, Figure 3 compares the test accuracy results.

Statistical tests
Evaluation methods are the tools for comparing different classification methods. In this research, the Shapiro normality test shows that the data do not follow a normal distribution: the p-value of this test is less than α (α = .05), so the null hypothesis (that the data are normal) is rejected. Therefore, the Friedman test, a nonparametric equivalent of the repeated-measures ANOVA, can be used to examine the differences among machine learning models (Friedman 1937). Its null hypothesis is that all models perform similarly; rejecting it demonstrates that one or more of the paired classifiers perform differently. This paper uses the accuracies obtained by 10-fold cross-validation in the Friedman test, which ranks the models within each fold; the resulting ranks for each model can then detect the most effective classifiers (Friedman 1940).
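Both tests are available in SciPy. In the sketch below, the per-fold accuracies are randomly generated placeholders, not the paper's results; the shape (10 folds × 3 classifiers) mirrors the setup described above:

```python
import numpy as np
from scipy.stats import friedmanchisquare, shapiro

rng = np.random.default_rng(0)
# hypothetical per-fold accuracies: rows = 10 folds, columns = 3 classifiers
acc = np.clip(rng.normal([0.97, 0.965, 0.95], 0.005, size=(10, 3)), 0.0, 1.0)

# Shapiro-Wilk: H0 = the accuracy values are normally distributed
stat_s, p_shapiro = shapiro(acc.ravel())

# Friedman: H0 = all classifiers perform similarly across folds
stat_f, p_friedman = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])

print(f'Shapiro p = {p_shapiro:.4f}, Friedman p = {p_friedman:.4f}')
```

If `p_shapiro` falls below .05 the ANOVA assumption is violated, and if `p_friedman` falls below .05 at least one classifier differs, which is the decision logic the paper follows.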

Hold-Out strategy
In this paper, the hold-out method assigns 75% of the dataset to the training set and the remaining 25% to the test set. As mentioned, a combination of SVM-SMOTE over-sampling and Edited-Nearest-Neighbour under-sampling is introduced to solve the highly imbalanced data problem in this case. Accuracy is the most prevalent criterion for evaluating a classifier's performance; although easy to understand, it is not sufficient on its own to judge a model, because many other factors must be noticed when evaluating performance. According to the results in Table 3, the proposed stacking model performs well on test-set accuracy and improves on the other models. Furthermore, the introduced combination of resampling methods solved the imbalanced data problem efficiently, so the accuracy is more reliable. Because accuracy works with hard ones and zeros, it says nothing about the quality of the predicted probabilities. AUC is therefore a crucial evaluation metric: it indicates how well a model distinguishes between the two classes, and a higher AUC reveals a better capability. With an AUC of 93.10%, the stacking model is the best classifier among the compared models; this means the introduced stacking model distinguishes between surviving and deceased patients with a 93.10% chance, better than the other classifiers, even Random Forest and Logistic Regression. Although Precision and Recall often trade off, with improvements in Precision usually reducing Recall (Cleverdon 1966), both should be considered in model evaluation; the stacking model improved Recall and Precision together. 
It achieved an average Recall of 71%, 3% less than the best Recall (Logistic Regression), and an average Precision of 86.5%, the best among all the models. In other words, the stacking model correctly identifies 71% of surviving and deceased patients on average, and when it classifies a patient as surviving or deceased, it is correct 86.5% of the time on average. The macro average is used to report the models' performance because, when the dataset was imbalanced, the models did not predict the minority class well; the macro average imposes a bigger penalisation and gives a fairer judgement. The F1-score enables a fairer analysis and comparison among the models through Recall and Precision, since it is the harmonic mean of both metrics, and it shows how accurate and authoritative a model's prediction is. In this paper, the macro-averaged F1-score of the introduced stacking model was the highest among the models used. These results indicate the desirable efficiency of the stacking model. Random Forest feature importance shows the effect and importance of each feature in predicting the mortality and survival of patients.
The results demonstrate that it is not necessary to record the other 27 medical features, so recording time can be optimised. The maximum amount of cardiac rehabilitation (Max-CR), the EuroSCORE, and Day-Max-CR are the most essential features; hence, the EuroSCORE, as a criterion of heart attack probability, and cardiac rehabilitation deserve more attention. As expected, age and weight are vital for determining the chance of survival or death after heart surgery. Information about these features is provided in Table 1.

A reliable strategy called K-Fold cross validation
Using K-fold cross-validation helps guarantee a sound assessment of machine learning models due to its structure. Shuffle 10-fold cross-validation is used in this research: the dataset is split into ten subsets, and each time one subsample is used as the test set while the others are used for training (Onan and Korukoğlu 2017). The results of the 10-fold cross-validation, the accuracy obtained in each fold by the different models, are shown in Table 4. According to these results, the introduced stacking model achieved the highest average accuracy across the 10 folds, 96.97%, with a low variance of .7%, and only a slight difference from the hold-out strategy, which means its accuracy is reliable and the model performed acceptably. The lowest fold accuracy, 96.88%, came from the first fold, and the highest, 97.03%, from the third. The Random Forest model achieved the second-highest average accuracy, 96.87%, and according to all the assessment and validation results, it is the second best-performing model. It is worth mentioning that the proposed stacking model ranked best on almost all validation and assessment metrics; the exception was Recall, where it ranked second after the LR model. In addition, Figure 5 compares the machine learning classifiers under the 10-fold cross-validation and hold-out strategies. The difference between the Naïve Bayes results under the two strategies indicates that this model's hold-out accuracy is not reliable and it may not perform as well as it seemed; the other models' 10-fold CV accuracies confirm their hold-out accuracy, especially the stacking, Random Forest, and Logistic Regression models. Cross-validation is often the preferred method because it gives a better demonstration of how well a model will perform on unseen data.
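The shuffle 10-fold scheme can be sketched with scikit-learn; the dataset and model here are illustrative stand-ins for the paper's balanced set and classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# shuffled, stratified 10-fold CV mirroring the validation scheme described above
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring='accuracy')

# per-fold accuracies plus the mean/variance summary reported in Table 4
print(scores)
print(f'mean={scores.mean():.4f}  std={scores.std():.4f}')
```

In the paper's pipeline the SVM-SMOTE/ENN resampling would additionally be refitted inside each training fold (for example via an imbalanced-learn pipeline), so that every held-out fold remains untouched synthetic-free data.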

Interpretation of statistical tests
Statistical significance tests are a valuable way to select the best model. The first assumption of the ANOVA test is that the 10-fold cross-validation samples are drawn from a normal distribution. In this paper that assumption is violated: the Shapiro statistical normality test returned a p-value = 0, which is less than α = .05, so the null hypothesis of normality is rejected and the ANOVA test cannot be used. Table 5 presents the result of the Shapiro test. The non-parametric Friedman test is therefore suitable for comparing the machine learning classifiers in this case. Table 6 reports the results of the Friedman test; its null hypothesis is rejected because the p-value = 0 is less than the significance level of .05, which means that at least one of the classifiers performs differently. Table 7 shows the median and the sum of ranks obtained from the Friedman test. The median is the midpoint value: half the data points lie above it and half below, and the median of all data points is reported as the overall median. The median response for the stacking model is higher than the overall median, and the sum-of-ranks result shows that the stacking model outperforms all classifiers except Logistic Regression.

Figure 3. A comparison of test accuracy results for models based on the 75/25 random hold-out strategy.
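The two-step procedure above (check normality, then fall back to the non-parametric Friedman test) can be sketched with SciPy. The per-fold accuracies below are made-up placeholders, not the paper's reported values:

```python
# Sketch: Shapiro-Wilk normality check, then the Friedman test to
# compare classifiers across cross-validation folds.
from scipy.stats import shapiro, friedmanchisquare

# Hypothetical 10-fold accuracies for three classifiers
stacking = [0.970, 0.969, 0.971, 0.970, 0.969, 0.970, 0.971, 0.970, 0.969, 0.970]
rf       = [0.968, 0.969, 0.968, 0.967, 0.969, 0.968, 0.969, 0.968, 0.967, 0.969]
lr       = [0.965, 0.966, 0.964, 0.966, 0.965, 0.966, 0.964, 0.965, 0.966, 0.965]

# If Shapiro rejects normality (p < 0.05), ANOVA's assumption is violated
_, p_norm = shapiro(stacking + rf + lr)
print(f"Shapiro p-value: {p_norm:.4f}")

# Friedman's null hypothesis: all classifiers perform the same
_, p = friedmanchisquare(stacking, rf, lr)
print(f"Friedman p-value: {p:.4f}")
if p < 0.05:
    print("At least one classifier performs differently.")
```

The Friedman test operates on within-fold ranks rather than raw scores, which is why it needs no normality assumption.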

Conclusion
Nowadays, a vast amount of data is collected and produced through the development of healthcare systems and biomedical equipment. Processing these data and extracting valuable patterns and information is a way to save many lives. Predicting mortality after heart surgery is an essential task in medical data mining: an accurate prediction of the mortality status of patients awaiting heart surgery can provide helpful information to save lives and reduce costs and time, so it should be applied to patients as early as possible. This paper proposes a stacking ensemble model that uses features selected by the Random Forest feature importance technique to provide an early mortality prediction, while solving the unbalanced data problem with a combination of the SVM-SMOTE over-sampling strategy and the Edited-Nearest-Neighbour under-sampling technique. The random hold-out and shuffled 10-fold cross-validation are used as two validation procedures to assess the stability and performance of the machine learning models, and the Friedman statistical test serves as a further measure of performance. After addressing the highly unbalanced data problem, well-known models and the introduced stacking ensemble model are applied to the balanced data using the random hold-out method. According to the assessment outcomes, the introduced stacking model performs acceptably and outperforms all other models across various assessment metrics. Random Forest and Logistic Regression also show acceptable performance.
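The stacking idea summarised above can be sketched with scikit-learn's `StackingClassifier`. The base learners, meta-learner, and synthetic data below are illustrative choices only; the paper's exact configuration, and its SVM-SMOTE plus Edited-Nearest-Neighbour resampling step (available in the separate imbalanced-learn package), are not reproduced here:

```python
# Sketch of a stacking ensemble: base learners' cross-validated
# predictions feed a meta-classifier that produces the final output.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the (resampled) surgery dataset
X, y = make_classification(n_samples=400, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold predictions avoid leaking training labels
)
stack.fit(X_tr, y_tr)
print(f"hold-out accuracy: {stack.score(X_te, y_te):.4f}")
```

Because the meta-learner sees only out-of-fold predictions from the base models, the stack can correct their individual biases rather than simply averaging them.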
The evaluation results obtained using the shuffled 10-fold cross-validation method point to the same choice of best model: the introduced stacking ensemble model achieved higher accuracy than the other models, with an acceptably low variance. The stacking model, which combines several single and ensemble classifiers, improved mortality prediction after heart surgery according to the different validation and assessment techniques. It achieved the best performance on all measures except Recall, in which it ranked second; therefore, the new model can be regarded as the best performer. Random Forest and Logistic Regression were the next two most efficient models after the stacking model. Other studies have likewise found that stacking models outperform other classifiers, such as 'A novel stacking technique for prediction of diabetes' (Kalagotla et al. 2021) and 'A New Hybrid Predictive Model to Predict the Early Mortality Risk in Intensive Care Units on a Highly Imbalanced Dataset'. The maximum amount of cardiac rehabilitation (Max-CR), the Euro score, and Day-Max-CR are the most essential features, so the Euro score, as a criterion of heart attack probability, and cardiac rehabilitation deserve greater attention. As expected, age is vital in determining the chance of survival or death after heart surgery. There are many ways to refine and sharpen this research: future work can build more powerful models, and other combinations of resampling techniques can be implemented to handle the imbalanced data problem; comparing them would be a worthwhile direction.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement
Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.