Prediction of Female diabetic Patient in India using different Learning Algorithms

Abstract: Diabetes, or diabetes mellitus, is a metabolic disorder that affects blood sugar levels in the human body. It is a major non-communicable disease that carries many serious health risks, and its prevalence is rising rapidly in India. It is a chronic condition that occurs when the body does not produce enough insulin to control the blood sugar level. In this study, different variables that contribute to diabetes are analyzed, and different machine learning algorithms are used to predict whether an unknown sample is diabetic or not. For this purpose, the PIMA diabetes dataset of female patients was used. Ten different classification models are used for prediction. Finally, a detailed analysis of the variables of the PIMA dataset and of the performance of the classification models is presented.

Motivation and Aim of this study. Diabetes is one of the most common and serious health issues today. The number of diabetic patients is increasing day by day, and most of them are female [6]. Several studies have observed that factors such as Body Mass Index, blood pressure, insulin level, and cholesterol are the main causes of diabetes. For female patients, the number of pregnancies is an additional and important factor. This study shows the behavior of the key factors for diabetic patients and the relationships among them.
The aim of this study is to predict whether a patient is diabetic or not, particularly female patients, based on different machine learning algorithms.

II. LITERATURE REVIEW
In one research article, the authors made a comparative analysis of different machine learning algorithms. They evaluated the performance of the algorithms on the PIMA diabetes dataset of female patients in India. They showed that the random forest classifier gives more than 74% accuracy [7].
In another study, the authors used the PIMA diabetes dataset together with another dataset collected from Kurmitola General Hospital in Bangladesh. They used different machine learning algorithms to perform the prediction and showed that the Naïve Bayes algorithm gives the best result among them [8].
Other authors used the PIMA diabetes dataset to predict the disease. They also used different machine learning algorithms and reported the classification accuracy, precision, F1-score, and area under the ROC curve [9].
In one research article, the authors investigated the prediction of diabetes based on the input features FPG and HbA1c. They used five different machine learning algorithms with hierarchical clustering, feature elimination, and feature permutation techniques. They identified different risk factors that are indirectly involved in diabetes classification [10].
In another study, the authors applied diverse machine learning algorithms to the PIMA diabetes prediction dataset. They used Artificial Neural Network (ANN), Naive Bayes (NB), Decision Tree (DT), and Deep Learning (DL) classifiers and achieved 90% to 98% accuracy. They also showed that the deep learning approach achieved the maximum accuracy, 98.04% [11].

III. DATASET OVERVIEW
a. DATASET
The PIMA diabetes dataset [12] of Indian female patients was downloaded from Kaggle. The dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Its purpose is to diagnose whether a patient is diabetic or not based on certain measured parameters. All the patients in this dataset are female, at least 21 years old, and of Pima Indian heritage.

b. DATASET DETAILS
This dataset [13] contains a total of 768 records of female patients. Among them, 500 patients are diagnosed as non-diabetic and 268 as diabetic. The diagnosis is stored as a binary value for each patient alongside the other attributes. Table 1 and Figure 1 show the dataset summary and the diagnosis details of the PIMA dataset.
Figure 2a shows the boxplot of the number of pregnancies of diabetic and non-diabetic female patients. Figure 3a shows the bar plot of diabetic and non-diabetic patients grouped by the number of pregnancies.
4. Skin Thickness. Human skin thickness is determined by collagen, which is produced underneath the skin. Skin thickness also depends on the insulin level. Figure 2d shows the boxplot of skin thickness for the female patients, and Figure 3d shows the percentage level of skin thickness of the diabetic and non-diabetic patients.

5. Insulin. It is a hormone that keeps the blood sugar level in the human body balanced. Two hours after a meal, the normal insulin level is less than 200 μU/ml. Figure 2e shows the box plot of the insulin level of diabetic and non-diabetic female patients, and Figure 3e shows the percentage level.
6. BMI (Body Mass Index). This variable is used to estimate body fat. For a diabetic patient, it is an important parameter. Figures 2f and 3f show the box plot and the percentage level of BMI of the female diabetic and non-diabetic patients.

7. Diabetes Pedigree Function (DPF). This parameter summarizes the patient's family history of diabetes. It captures the relationship between the diabetic status of genetic and non-genetic relatives.
From the correlation matrix plot, it is observed that the dataset has five important predictive variables: 1) number of pregnancies, 2) glucose level, 3) insulin level, 4) BMI (Body Mass Index), and 5) age group. In addition, BMI is correlated with two other variables: 1) skin thickness and 2) blood pressure level. The relationships among all the variables are depicted in Figure 5 (a-d).
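
A correlation matrix of the kind referred to above can be computed from pairwise Pearson coefficients. The following is a minimal sketch, not the authors' code: the column names mirror those commonly found in the Kaggle CSV (an assumption), and the six rows are hypothetical toy values that are not taken from the PIMA dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy values for illustration only (NOT PIMA records).
toy = {
    "Glucose": [90, 150, 120, 170, 100, 140],
    "BMI": [24.0, 35.5, 28.0, 31.0, 22.5, 33.0],
    "Age": [25, 48, 30, 52, 22, 41],
    "Outcome": [0, 1, 0, 1, 0, 1],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    # Print one row of the matrix, rounded for readability.
    print(name.ljust(8), "  ".join(f"{r:+.2f}" for r in row.values()))
```

Each entry lies between -1 and +1, the diagonal is 1, and the matrix is symmetric; variables whose coefficient with `Outcome` is largest in magnitude are the candidates for "important predictive variables" in the sense used above.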

IV. METHODOLOGY
The process of predicting whether a female patient is diabetic or not is depicted in Figure 6. For this study, 10 different machine learning algorithms are used to check the prediction rate. First, the dataset is pre-processed for use with the machine learning algorithms. Then the dataset is split into training and test sets. After that, the algorithms are fitted to the training set to create trained models. Finally, the trained models are applied to the test set to predict the outcome.
[...] algorithms are used to calculate the prediction performance. Finally, a detailed comparative analysis of the algorithms is also performed. Only one dataset is used here, and the number of observations in it is quite small. Therefore, future work is to use a larger dataset and to apply deep learning algorithms for better performance.
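
The workflow described in Section IV (pre-process, split, train, predict) can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the data are synthetic stand-ins with the same size and approximate class balance as reported in the text, the 80/20 split ratio is assumed, and the classifiers shown are examples only, since the ten algorithms used in the study are not listed in this section.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 features, roughly 65/35 class balance.
# These are NOT the PIMA records.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)

# Split into training and test sets (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative classifiers only; the majority-class baseline is included
# as a reference point for the imbalanced classes.
models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Pre-processing (scaling) is fitted on the training set only.
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_train, y_train)
    # The trained model is then applied to the held-out test set.
    results[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} accuracy = {acc:.3f}")
```

Because about 65% of the samples belong to the non-diabetic class, a model that always predicts "non-diabetic" already reaches roughly that accuracy, so reported accuracies are best read against this baseline.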