An Effective Heart Disease Prediction Model based on Machine Learning Techniques

This paper presents an effective heart disease prediction model that detects anomalies, also known as outliers, in healthcare data using the unsupervised K-means clustering algorithm. Most existing approaches to anomaly detection are based on constructing profiles of normal instances. However, such techniques require an adequate number of normal profiles to justify the resulting models. Our proposed model first determines an optimal value of K using the Silhouette method. Next, it locates anomalies, i.e., instances whose distance from their cluster center falls outside a threshold range. Finally, five popular classification techniques, namely K-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machines (SVM), Naive Bayes (NB), and Logistic Regression (LR), are applied to build the resultant prediction model. The effectiveness of the proposed methodology is evaluated on a benchmark heart disease dataset.


Introduction
The modern digital world is overwhelmed by an unprecedented amount of data in various domains [17]. Most organizations have no difficulty capturing large amounts of data; the challenge is to extract meaningful knowledge from this vast amount of data efficiently. Several data mining and machine learning techniques are used to find interesting and meaningful relationships among data [14] [18]. One such technique is anomaly detection. Anomalies are data characteristics that differ from normal behavior [19]. In some cases, such anomalies are regarded as noise or outliers that adversely affect a machine learning based prediction model [15]. Anomaly detection has recently attracted considerable research interest because it is needed in various domains to obtain critical, actionable information from large datasets.
Recently, some studies have focused on anomaly detection in larger datasets [10,21,3,16]. A detailed discussion of these methodologies is given in Section 2. Most existing research works construct a profile of normal instances, which is a challenging task because it requires a sufficient number of normal profiles.
In this study, we present a model to detect anomalies in the healthcare domain based on the K-means clustering algorithm, where the optimal value of K is determined using the Silhouette method. The major advantage of our proposed approach is that it requires neither the construction of normal profiles nor knowledge of previous anomaly records in the heart disease training dataset. Usually, anomalous instances lie in sparse or small clusters and are thus far from the centroids of their respective clusters. We apply five popular machine learning classification techniques [20], namely K-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machines (SVM), Naive Bayes (NB), and Logistic Regression (LR), to measure how effectively the proposed anomaly detection methodology supports heart disease prediction.
The rest of this paper is organized as follows. We present the related work on anomaly detection and heart disease prediction in Section 2. Section 3 provides the details of the proposed anomaly detection model. Our experimental results and discussion are covered in Section 4. Finally, we conclude the paper in Section 5 and outline future directions of this research.

Related Work
A considerable amount of research has been devoted to the problem of anomaly detection. For instance, Janakiram et al. [6] proposed an anomaly detection approach based on Bayesian belief networks. Steven Mascaro et al. [7] proposed a model that detects anomalies using Bayesian networks. Their Bayesian network model learns anomalies from real-world Automated Identification System (AIS) data.
Zhiguo et al. [2] presented a novel anomaly detection framework based on Isolation Forest, in which they used a sliding-window framework and also considered the concept drift phenomenon. Ranjith et al. [11] proposed an unsupervised anomaly detection model using the DBSCAN algorithm. They sought to find anomalies in a traffic dataset, in which a trajectory is considered an anomaly if it does not fit the trained model. Munz et al. [9] also sought to find anomalies in a traffic dataset, using the K-means clustering algorithm.
There are numerous studies on heart disease detection. Safial Islam et al. [1] conducted a comparative study on coronary artery heart disease prediction, in which they applied several data mining techniques such as Logistic Regression (LR), Support Vector Machine (SVM), Deep Neural Network (DNN), Decision Tree (DT), Naive Bayes (NB), Random Forest (RF), and K-Nearest Neighbor (KNN). Mohan et al. [8] proposed a novel method that aims to improve prediction accuracy by finding significant features. Their prediction model considered different combinations of features, and its performance was evaluated by applying several known classification techniques.
Our proposed anomaly detection model is based on an unsupervised approach in which we apply the K-means clustering algorithm, with an optimal number of clusters, to identify anomalies in the heart disease data. The optimal value of K is estimated using the Silhouette method, and classification techniques are then used to predict heart disease effectively after the anomalies have been removed.

Methodology
In this section, we present the details of our proposed anomaly detection model, which has four different modules as presented in Fig. 1.

Data Preparation Module
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 30 November 2020 doi:10.20944/preprints202011.0744.v1
Instances in healthcare datasets typically contain several healthcare features and related facts that can be used to build the model. In this work, we use a benchmark heart disease dataset available on Kaggle [12]. This dataset contains 303 instances. Each instance has 13 features, of which twelve are of "Integer" type and one is of "Float" type. A small number of attributes in our heart disease dataset have missing values, which we handle by imputation. There are several ways to impute missing values. In our approach, we replace each missing value with the mean value of the corresponding attribute.
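The mean imputation step can be illustrated with a minimal sketch. The function name `impute_column_means` and the toy values are hypothetical and are not taken from the heart disease dataset; the sketch only assumes that missing entries are encoded as NaN.

```python
import numpy as np

def impute_column_means(X):
    """Replace NaN entries with the mean of the observed values in the same column."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)       # per-attribute mean, ignoring NaN
    rows, cols = np.where(np.isnan(X))      # positions of the missing entries
    X[rows, cols] = col_means[cols]
    return X

# Toy data (hypothetical values): two attributes, one missing entry in each.
X = np.array([[63.0, 145.0],
              [np.nan, 130.0],
              [41.0, np.nan]])
X_filled = impute_column_means(X)
```

In scikit-learn, `SimpleImputer(strategy="mean")` performs the same per-attribute replacement.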

Clustering Module
K-means Clustering: The K-means algorithm [4] is an unsupervised clustering algorithm. It takes the number of clusters K and the dataset as input and outputs a set of clusters. The K-means algorithm defines the center of a cluster as the mean value of the instances within that cluster. First, it randomly selects K instances from the dataset, each of which serves as an initial cluster center (cluster mean). Each of the remaining instances is assigned to the nearest cluster based on the Euclidean distance between the instance and the cluster mean. Next, it iteratively optimizes the positions of the centers of all the clusters.
For each cluster, it calculates the new mean as the center using the instances assigned to that cluster in the previous iteration. All the instances are then reassigned to the nearest updated mean. The iterations continue until the algorithm has converged, i.e., until the centers of all the clusters no longer need repositioning. Different values of K can lead to different results, so it is important to find an optimal value of K.
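The iterations described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments; the function names, the iteration cap `max_iter`, the handling of empty clusters, and the synthetic data are all assumptions.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center (Euclidean distance) for every instance."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means with randomly chosen instances as initial centers."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct instances as the starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each instance joins the cluster of its nearest center.
        labels = nearest_center(X, centers)
        # Update: each center becomes the mean of its assigned instances
        # (an empty cluster keeps its previous center; this is an assumption).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Convergence: stop once no center needs repositioning.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so that labels always match the returned centers.
    return nearest_center(X, centers), centers

# Synthetic two-group data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
labels, centers = kmeans(X, k=2)
```

Because the initial centers are chosen at random, different seeds can yield different clusterings, which is one reason library implementations typically repeat the procedure from several initializations.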
In this study, we apply the Silhouette method to find the optimal value of K for the heart disease dataset. Given a range of K values, the Silhouette method computes, for each K, the Silhouette score, i.e., the mean Silhouette Coefficient over all the instances. The Silhouette Coefficient for an instance is calculated using Eqn. 1 [13]. In this equation, a indicates the mean distance between the instance and the other instances in its own cluster, and b is the mean distance between the instance and the instances of the nearest neighboring cluster. The Silhouette Coefficient ranges from -1 to +1, where +1 indicates the best cluster fit and -1 the worst.
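For reference, the Silhouette Coefficient of an instance is conventionally defined as s = (b - a) / max(a, b). The following sketch shows one way to select K by the mean Silhouette score with scikit-learn. The helper name `best_k_by_silhouette`, the candidate range of K, the `n_init` and `random_state` settings, and the synthetic data are assumptions made for illustration, not details reported in this paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values):
    """Return the K with the highest mean Silhouette score, plus all scores."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # mean coefficient over all instances
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic data with two well-separated groups (hypothetical stand-in).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
best_k, scores = best_k_by_silhouette(X, range(2, 7))
```

Note that the Silhouette score is only defined for K of at least 2, so the candidate range starts there.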

Anomaly Detection Module
In our proposed approach, we calculate the anomaly score of an instance (as shown in Eqn. 2) based on the distance between the instance and the center of its nearest cluster [5].
In this formula, distance(o, C_o) represents the distance between instance o and its cluster center C_o, whereas L indicates the mean distance between the instances of that cluster and its center. The anomaly score in Eqn. 2 thus measures the ratio of the distance of each instance from its cluster center to the mean distance of that cluster. The further an instance o lies from the center of its cluster, the more likely it is that o is an anomaly. Next, we calculate the minimum anomaly threshold score and the maximum anomaly threshold score for each cluster using Eqn. 3 [22] and Eqn. 4 [22], respectively, where Q1 represents the 25th percentile of the data and Q3 represents the 75th percentile of the data. The interquartile range (IQR) is the difference between Q3 and Q1, as shown in Eqn. 5 [22].
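The displayed equations referenced in this section are not reproduced in this text. Based on the definitions given above, they can be written as follows. The multiplier 1.5 in Eqns. 3 and 4 is an assumption (the conventional choice for IQR-based fences); the surrounding text specifies only Q1, Q3, and the IQR.

```latex
\begin{align}
\text{Anomaly Score}(o) &= \frac{\text{distance}(o, C_o)}{L} \tag{2} \\
\text{MinT} &= Q_1 - 1.5 \times \text{IQR} \tag{3} \\
\text{MaxT} &= Q_3 + 1.5 \times \text{IQR} \tag{4} \\
\text{IQR} &= Q_3 - Q_1 \tag{5}
\end{align}
```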
Finally, every instance with an anomaly score greater than the maximum threshold (MaxT) or less than the minimum threshold (MinT) is detected as an anomaly.
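A minimal sketch of this detection rule is given below. The function name `detect_anomalies`, the `whisker` parameter with its default of 1.5, and the toy data are assumptions; in particular, the text does not state the multiplier applied to the IQR, so the conventional value is used here.

```python
import numpy as np

def detect_anomalies(X, labels, centers, whisker=1.5):
    """Flag instances whose anomaly score lies outside the IQR-based
    thresholds of their own cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    # Distance of every instance to the center of its own cluster.
    dist = np.linalg.norm(X - centers[labels], axis=1)
    scores = np.zeros(len(X))
    is_anomaly = np.zeros(len(X), dtype=bool)
    for j in np.unique(labels):
        members = labels == j
        mean_dist = dist[members].mean()        # L: mean distance in cluster j
        if mean_dist == 0:                      # all members coincide with the center
            continue
        scores[members] = dist[members] / mean_dist
        q1, q3 = np.percentile(scores[members], [25, 75])
        iqr = q3 - q1
        min_t = q1 - whisker * iqr              # assumed form of the minimum threshold
        max_t = q3 + whisker * iqr              # assumed form of the maximum threshold
        is_anomaly[members] = (scores[members] < min_t) | (scores[members] > max_t)
    return scores, is_anomaly

# Toy cluster: 20 points on the unit circle plus one distant point (index 20).
angles = 2 * np.pi * np.arange(20) / 20
X = np.vstack([np.column_stack([np.cos(angles), np.sin(angles)]),
               [[10.0, 0.0]]])
labels = np.zeros(len(X), dtype=int)
centers = X.mean(axis=0, keepdims=True)
scores, is_anomaly = detect_anomalies(X, labels, centers)
```

In this toy example only the distant point is flagged. Removing the detected instances then amounts to keeping `X[~is_anomaly]`.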

Prediction Module
In this module, we apply five different classifiers [20], namely Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), and K-Nearest Neighbor (KNN), to the heart disease dataset with and without anomalies to measure the effectiveness of our model.
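The comparison can be sketched with scikit-learn as follows. The helper name `evaluate_classifiers`, the 70/30 stratified split, the default hyperparameters, the Gaussian variant of Naive Bayes, and the synthetic stand-in data are all assumptions; the paper does not report these settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate_classifiers(X, y, seed=0):
    """Train the five classifiers and report accuracy, precision, and recall."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    models = {
        "KNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(random_state=seed),
        "SVM": SVC(),
        "NB": GaussianNB(),
        "LR": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, model in models.items():
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "accuracy": accuracy_score(y_te, y_pred),
            "precision": precision_score(y_te, y_pred, zero_division=0),
            "recall": recall_score(y_te, y_pred, zero_division=0),
        }
    return results

# Synthetic stand-in with the same shape as the dataset (303 instances, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
results = evaluate_classifiers(X, y)
```

To compare the two settings, the same function would be called once on the full data and once on the data with the detected anomalies removed, for example `evaluate_classifiers(X[~is_anomaly], y[~is_anomaly])`, where `is_anomaly` is a hypothetical boolean mask of detected anomalies.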

Experimental Setup
All experiments were conducted on an Intel Core i5 2.50 GHz processor with 8 GB of RAM. The proposed model was implemented in Python using the scikit-learn package under Windows 10.

Implementation and Experimental Results
As stated earlier, we first preprocess our heart disease data by imputing the missing values. Then, we plot the Silhouette score against K (as shown in Fig. 2) in order to choose the optimal K value before applying the K-means algorithm. Fig. 2 shows that K = 2 yields the highest Silhouette score.
Next, the K-means clustering algorithm is applied to the heart disease data to cluster all data instances. After clustering, we determine the mean of each cluster in order to measure the anomaly score of each data instance. Each instance whose score is greater than MaxT or less than MinT is then detected as an anomaly and removed. A scatter plot of the original heart disease data with anomalies (shown as red squares) is given in Fig. 3a. Fig. 3b shows that our proposed model successfully removes all of the detected anomaly instances.
We apply five classification models to the heart disease data with and without anomalies to measure the performance of our proposed anomaly detection model. The classification models are K-Nearest Neighbor (KNN), Random Forest (RF), Support Vector Machines (SVM), Naive Bayes (NB), and Logistic Regression (LR), and they are evaluated in terms of accuracy, precision, and recall. Table 1 presents the performance comparison of the five classifiers on the heart disease data with and without anomaly instances. We see that the RF, SVM, and LR classifiers perform better on the dataset without anomaly instances than on the original dataset with anomaly instances. The other two classifiers achieve similar accuracy values on the datasets with and without anomalies. In Fig. 4, we also observe that RF outperforms the other classifiers in terms of accuracy, precision, and recall on the dataset without anomalies. Thus, the experimental results demonstrate the effectiveness of our proposed K-means based anomaly detection model for heart disease data.
In addition, we plot the Receiver Operating Characteristic (ROC) curves of the five classifiers for the dataset with and without anomalies to evaluate the performance of our anomaly detection model, as shown in Fig. 5a and Fig. 5b. From Fig. 5b, we observe that RF, SVM, LR, and NB achieve better area under the ROC curve (AUC) values of 0.917, 0.837, 0.925, and 0.900, respectively.
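As an illustration of how an ROC curve and its AUC can be obtained, the following sketch uses scikit-learn with a single classifier. The synthetic data, the 70/30 split, and the choice of Logistic Regression are assumptions for illustration only, and the resulting value is unrelated to the AUC values reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the heart disease dataset).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]     # predicted probability of the positive class

fpr, tpr, _ = roc_curve(y_te, proba)        # points of the ROC curve
auc_value = roc_auc_score(y_te, proba)      # area under that curve
```

The ROC curve requires a continuous score rather than hard class labels, so each classifier must expose predicted probabilities or a decision function.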

Conclusion
We have presented an effective heart disease prediction model based on machine learning techniques. The effectiveness of our model has been evaluated with and without anomalies using various classifiers. The experimental results showed that the RF, SVM, and LR classifiers achieved better accuracy on the dataset without anomalies than on the dataset with anomaly instances, indicating that our anomaly detection model is able to recognize the anomalies in the data effectively. In the future, we will conduct additional experiments to measure the effectiveness of our model and will also study its effectiveness in other application areas such as IoT systems.