Clustering of Cardiovascular Disease Patients Using Data Mining Techniques with Principal Component Analysis and K-Medoids

Cardiovascular disease is the leading cause of death in the world. According to WHO, around 31% of deaths in the world are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries. The examination of patients with cardiovascular disease produces many medical records that can be used for further patient management. This study aims to develop a data mining method that groups patients with cardiovascular disease into two clusters in order to determine the level of patient complications. The method applies principal component analysis (PCA) to reduce the dimensions of the data, followed by cluster analysis with the K-Medoids algorithm. Data reduction with PCA produced five new components with a cumulative proportion of variance of 0.8311. The five new components were then clustered with the K-Medoids algorithm, which produced two clusters with a silhouette coefficient of 0.35. The combination of data reduction by PCA and clustering with the K-Medoids algorithm is a new way of grouping data on patients with cardiovascular disease according to the level of patient complications in each resulting cluster.


Introduction
Cardiovascular disease is the leading cause of death in the world. It is a group of diseases caused by impaired function of the heart and blood vessels. Diseases categorized as cardiovascular disease include coronary heart disease, cerebrovascular disease, peripheral arterial disease, rheumatic heart disease, congenital heart disease, and deep vein thrombosis and pulmonary embolism [1]. According to the Ministry of Health of the Republic of Indonesia, quoting WHO, around 31% of deaths in the world are caused by cardiovascular diseases, and more than 75% of these deaths occur in developing countries [2]. Cardiovascular disease is a non-communicable disease caused by a combination of non-modifiable and modifiable risk factors. Non-modifiable risk factors cannot be changed, such as age and gender. Modifiable risk factors can be changed through individual behavior, such as smoking, alcohol consumption, poor diet and lack of physical activity. The combination of these two kinds of factors causes metabolic disorders, such as increased blood glucose and cholesterol levels, and increases the risk of cardiovascular disease [3]. In the past 25 years, obesity and diabetes mellitus have overtaken cigarette smoking, dyslipidemia, and hypertension as risk factors for coronary heart disease [4]. The author also interviewed practicing doctors. The interviews concluded that cardiovascular disease is related to the body's metabolic system and can be complicated by other organs, such as the kidneys, which filter the blood and dispose of waste through urine.
One problem in the health sector is the large number of documents produced by medical examinations of patients [5]. According to the author's interviews, the examination of patients with cardiovascular disease produces many medical records, because cardiovascular disease can be caused by the functions of other organs. From these medical records, conclusions can be drawn that support medical treatment based on the illness, symptoms and treatment of each patient [5]. It is therefore necessary to develop a system so that medical records, especially those of patients with cardiovascular disease, can be used optimally. The system in question is a computer-based decision-making system built on the process of extracting useful information from a collection of data and turning it into a new structure. This process is called data mining. The new structure generated through data mining can be used for further analysis [6].
One of the techniques in data mining is clustering. Clustering can be defined as the process of dividing a collection of data objects (observations) into subsets so that the objects within each subset are similar to one another. One of the basic approaches to clustering is the partitioning method, which divides a set of objects into a predetermined number of groups. One of the clustering algorithms in this family is K-Medoids, or Partitioning Around Medoids (PAM). The K-Medoids method is a development of K-Means designed to cope with the presence of outliers [7]. It uses an object from the data, called a medoid, as the center of each cluster.
One of the problems in clustering is high-dimensional data, that is, data with many attributes. As the number of dimensions increases, the data become more scattered because the data points are spread across different dimensions [8]. Several methods are used to overcome this problem, one of which is to reduce the dimensions of the data using Principal Component Analysis (PCA) [9]. PCA is a process in which data with many dimensions are linearly transformed into a set of uncorrelated variables sorted in descending order of the variance of each component. With PCA, high-dimensional data can be explained using a smaller number of components [10].
Several studies have combined PCA and K-Medoids for clustering. In the energy field, the combination has been used to increase the effectiveness of wind turbines [11]. In that study, PCA was used to reduce the dimensions of the variables considered to make wind turbines generate energy ineffectively, followed by the selection of the best clustering method. The clustering methods compared were Fuzzy C-Means, K-Medoids and K-Means, evaluated using RMSE. The study concluded that K-Medoids was the best clustering method because it produced the smallest RMSE. In the health sector, a combination of PCA and K-Medoids has been used to improve the medical diagnosis of patients suffering from cardiovascular disease, Parkinson disease and liver disorders [12]. That study used PCA as the initial step and then K-Medoids to group the patients, and concluded that the combination could help improve the medical diagnosis of these patients. On this basis, the present study combines the PCA method and the K-Medoids algorithm to cluster patients with cardiovascular disease. The purpose of using both methods is to determine the characteristics of the patients in each cluster based on the level of complications. This is expected to improve the quality of treatment of patients with cardiovascular disease.

Data Sources and Research Variables
This study uses secondary data obtained from a private hospital in Jakarta. The data comprise 644 observations. The study uses the age variable together with eight blood test variables of patients with cardiovascular disease, presented in Table 1.

Normalization of Z-Score
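Normalization is the process of transforming data so that the ranges of values are brought onto a common scale. Z-score normalization is a normalization method based on the mean and standard deviation of the data [13]. The z-score can be written as follows:

$$ z = \frac{x - \bar{x}}{s} \tag{1} $$

where $z$ is the normalized value, $x$ is the original value, $\bar{x}$ is the mean of the variable, and $s$ is the standard deviation of the variable.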
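The z-score normalization described above can be illustrated with a minimal Python sketch. The function name `z_score` and the sample ages are hypothetical and are not taken from the study's data. The sketch assumes the sample standard deviation (divisor n - 1); the text does not state which form was used.

```python
from statistics import mean, stdev


def z_score(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(x - mu) / sigma for x in values]


# Hypothetical ages, for illustration only.
ages = [45, 52, 60, 67, 71]
z_ages = z_score(ages)
print([round(z, 3) for z in z_ages])
```

After normalization each variable has mean 0 and standard deviation 1, so variables measured in different units contribute on a comparable scale to PCA and to the distance calculations used in clustering.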

Principal Component Analysis
Principal Component Analysis (PCA) is a multivariate analysis technique first introduced in 1901 and later developed by [14]. The main idea of PCA is to reduce the dimensions of a data set that has many interrelated variables while retaining as much as possible of the variation present in the data set. This reduction is achieved by transforming the data into a new set of variables, called principal components, which are uncorrelated and ordered so that the first few components retain most of the variation present in the original variables [15].
The principal components in PCA are formed using eigenvalues and eigenvectors.
The eigenvalues of a matrix, such as a matrix $X$, are determined using the following equation [16]:

$$ \det(X - \lambda I) = 0 \tag{2} $$

where $\lambda$ is an eigenvalue of $X$ and $I$ is the identity matrix. The eigenvectors of matrix $X$ can be calculated as follows:

$$ (X - \lambda I)\,v = 0 \tag{3} $$

where $v$ is the eigenvector corresponding to the eigenvalue $\lambda$. The eigenvalues obtained from equation (2) are used to determine the proportion of variance of each component. The proportion of variance shows how much of the diversity of the data a component explains [17]. It can be calculated using the following formula:

$$ PV_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \tag{4} $$

where:
$PV_i$ = proportion of variance of the $i$-th component
$\lambda_i$ = eigenvalue of the $i$-th component
$p$ = number of components

Several criteria are used as a reference for determining the number of principal components to retain. One of these criteria is to take the components whose cumulative proportion of variance is between 80% and 90% [17]. The cumulative proportion for the first $k$ components can be obtained using the following formula:

$$ CPV_k = \sum_{i=1}^{k} PV_i \tag{5} $$

5. Calculate the total cost (S) of exchanging medoid $o_j$ with $o_{random}$.
6. If S < 0, exchange $o_j$ with $o_{random}$ to form a new set of k observations as medoids.
7. Repeat the second to the fifth step until no exchange occurs.
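The exchange steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it uses Euclidean distance, starts from caller-supplied initial medoids, and tries every non-medoid candidate in turn rather than drawing $o_{random}$ at random. The names `total_cost` and `pam`, and the six sample points, are hypothetical.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, initial_medoids):
    """Swap medoids with non-medoids while a swap lowers the total cost."""
    medoids = list(initial_medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = candidate
                # S: change in total cost if the medoid is exchanged.
                s = total_cost(points, trial) - total_cost(points, medoids)
                if s < 0:
                    medoids = trial
                    improved = True
    # Assign each point to the cluster of its nearest medoid.
    labels = [
        min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, labels


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
medoids, labels = pam(points, k=2, initial_medoids=[0, 1])
print(medoids, labels)
```

Because a swap is accepted only when it strictly lowers the total cost, the loop cannot cycle and stops once no exchange improves the clustering. Each medoid is always one of the observations, which is what distinguishes K-Medoids from K-Means.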

Research Stages
The research process begins with removing incomplete medical record data. The data are then normalized using the z-score. The next step is to reduce the dimensions of the data using Principal Component Analysis to produce new components, which are used for the clustering process with K-Medoids. The research process ends with an evaluation of the clustering results with the interviewees to determine the level of complications of patients suffering from cardiovascular disease. The research process is illustrated in Figure 1.

Results
The entire calculation and analysis process in this study was carried out using R software.
Descriptive statistics (minimum value, quartile 1, median, mean, quartile 3, maximum value and standard deviation) of each variable are presented in Table 2.
... (11,097) indicates that the data values are quite varied. The descriptive statistics of the other variables can be interpreted in the same way.
In this study, boxplots are used to detect the presence or absence of outliers in each variable [19]. The boxplot of each variable is presented in Figure 2.
The next step is to normalize the data using the z-score (1). The first stage of the z-score calculation is to determine the mean and standard deviation of the data, as in the example shown in Table 3. The z-score is calculated for all variables in the same way as in Table 3. The normalized results for the first five observations are presented in Table 4.
The next process is to reduce the dimensions of the normalized data using PCA. PCA produces the eigenvalues, or lambda (λ), together with the proportion of variance and the cumulative proportion of variance presented in Table 5. The proportion of variance in Table 5 shows that PC1 explains 0.2994 of the diversity of the data, PC2 explains 0.1846, and so on up to PC9. The cumulative proportion of variance is 0.2994 for PC1 alone and increases to 0.4839 if PC1 and PC2 are taken. The cumulative proportion is 1 if PC1 to PC9 are all taken. Table 5 shows that PC1 through PC5 together account for 0.8311 of the diversity of the data, so the components PC1 through PC5 are used for the clustering process with K-Medoids.
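The selection of components by cumulative proportion of variance can be sketched in Python as follows. The eigenvalues below are hypothetical round numbers chosen so that the arithmetic is easy to check by hand; they are not the eigenvalues of the study, and the function names are illustrative. The 0.80 threshold reflects the lower end of the 80% to 90% criterion described earlier.

```python
from itertools import accumulate


def explained_variance(eigenvalues):
    """Return per-component and cumulative proportions of variance."""
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalue first
    total = sum(ordered)
    proportions = [lam / total for lam in ordered]
    cumulative = list(accumulate(proportions))
    return proportions, cumulative


def components_needed(cumulative, threshold=0.80):
    """Smallest number of components whose cumulative proportion >= threshold."""
    for k, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return k
    return len(cumulative)


# Hypothetical eigenvalues, for illustration only.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
proportions, cumulative = explained_variance(eigenvalues)
k = components_needed(cumulative)
print(proportions)  # [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(cumulative)   # [0.5, 0.75, 0.875, 0.9375, 1.0]
print(k)            # 3
```

In this hypothetical case the first two components reach only 0.75, so a third is needed to pass 0.80. The same logic applied to the values reported in the study leads to five components, where the cumulative proportion reaches 0.8311.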
The best number of clusters is determined by calculating the silhouette coefficient (6). The resulting values are shown in Figure 3 below:

Figure 3 Value of Silhouette Coefficient
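The silhouette coefficient used to choose the number of clusters can be sketched in Python as follows. The sketch assumes the standard definition, s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from observation i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster; Euclidean distance is also an assumption. The function name, points and labels are hypothetical.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (requires at least 2 clusters)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # Mean distance to each other cluster; keep the nearest one.
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical points and cluster labels, for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]
score = silhouette_coefficient(points, labels)
print(round(score, 3))
```

The coefficient lies between -1 and 1. Values near 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping clusters. Computing it for several candidate numbers of clusters and keeping the highest value is the selection rule applied in Figure 3.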
Figure 3 shows that the best number of clusters is two, because it has a silhouette coefficient of 0.35, the highest among the numbers of clusters evaluated. The next step is to carry out the clustering process using K-Medoids, which assigns 503 patients to cluster 1 and 141 patients to cluster 2. The distribution of each variable in each cluster is presented in Figure 4 and Table 6.
Table 6 shows that the mean and median of the AGE variable are higher in cluster 2 than in cluster 1. This can be a concern because higher age can cause plaque to stick to the walls of the heart and disrupt the bloodstream through it [20]. The mean and median of the UREA variable are higher in cluster 2 than in cluster 1, but in both clusters they are still at normal levels, below 40 mg/dL [21]. The mean and median of the CREA variable are higher in cluster 2 than in cluster 1, but in both clusters they are still at normal levels, below 1.4 mg/dL, as in the previous research conducted by [21]. The mean and median of the UA variable are higher in cluster 2 than in cluster 1. In cluster 1 they are still at normal levels, below 6.3 mg/dL, but in cluster 2 they have passed the normal limit. The mean and median of the CHOL variable are higher in cluster 2 than in cluster 1. In cluster 1 they are at normal levels, below 200 mg/dL, but in cluster 2 they have passed the normal limit.
The mean and median of the TRIG variable are higher in cluster 2 than in cluster 1. In cluster 1 both are normal, below 200 mg/dL, but in cluster 2 the mean is above the normal limit while the median is not. The mean and median of the HDL variable are higher in cluster 1 than in cluster 2, but in both clusters the mean and median of the HDL variable are still at low levels, below 65 mg/dL. The mean and median of the GLU variable are higher in cluster 2 than in cluster 1. In cluster 1 they are at normal levels, below 110 mg/dL, but in cluster 2 they have passed the normal limit. The mean and median of the GLU2J variable are also higher in cluster 2 than in cluster 1. In cluster 1 they are normal, below 140 mg/dL, but in cluster 2 they have passed the normal limit.

Conclusion
Data reduction with PCA, applied to eight blood test variables and the age variable of patients with cardiovascular disease from 644 observations, produces five new components with adequate data diversity. By applying the K-Medoids algorithm to these five components, two clusters of patients with cardiovascular disease are produced, with a silhouette coefficient that indicates a low level of data density. Patients in the first cluster have lower rates of cardiovascular complications than patients in the second cluster, as shown by the lower values of the variables AGE, UREA, CREA, UA, CHOL, TRIG, GLU and GLU2J.
The combination of data reduction with PCA and clustering with the K-Medoids algorithm is a new way of applying data mining to group data on patients with cardiovascular disease and to see the level of patient complications in each cluster. The cluster evaluation remains a limitation: the silhouette coefficient (SC) of 0.35 indicates a weak structure, so further research is needed to develop methods that produce clusters with stronger structures.