1. Introduction
Bipolar disorder is a severe mood disorder characterized by alternating episodes of depression and mania [1,2]. During manic episodes, patients may exhibit unusually energetic, happy, or irritable behavior and sleep less than usual. During depressive episodes, patients may cry inexplicably, hold a negative attitude toward life, and make poor eye contact with others. Approximately 6% of patients with bipolar disorder die by suicide, and another 30-40% engage in self-harm. Many patients with bipolar disorder also suffer from other mental illnesses, such as substance abuse and anxiety disorders. According to academic research, people with bipolar disorder account for about 1% of the global population [3]. In the United States, approximately 3% of the population experience bipolar symptoms at some point in their lives, with no significant gender differences [4]. The most common age of onset is between 20 and 25 years, and the younger the age of onset, the worse the prognosis [5].
The combined action of many genetic variations may lead to the development of bipolar disorder [1], and genetic factors account for about 70-90% of the risk of bipolar disorder [6,7]. Environmental risk factors include a history of childhood abuse and chronic stress [1]. In addition, many other psychiatric disorders share symptoms with bipolar disorder, including attention deficit/hyperactivity disorder, schizophrenia, and substance abuse [1]. On the other hand, about one-quarter to one-third of people with bipolar disorder experience economic, social, or occupational problems [1]. According to a WHO survey, bipolar disorder is one of the top 20 causes of disability worldwide and imposes a huge burden on society [8]. Moreover, some diseases have a higher incidence in patients with bipolar disorder than in the general population, including coronary heart disease, metabolic syndrome, migraine, obesity, and type 2 diabetes. Accordingly, people with bipolar disorder have twice the risk of death of the general population [1,5]. In a recent study conducted between January 2018 and January 2020 at a hospital in Turkey, each of 1,148 patients with bipolar disorder was interviewed to investigate the incidence of various target diseases in the patient and his/her first- and second-degree relatives. It was found that when there is a family history of epilepsy, the patient's psychiatric symptoms are more pronounced. Similarly, a family history of diabetes mellitus is strongly associated with bipolar disorder, and a family history of thyroid disease is correlated with co-occurring anxiety disorders. Finally, there exists a comorbid association between bipolar disorder and cerebrovascular disease [9].
There is an intuitive way to improve the performance of machine learning: different models can be trained on the same dataset and their prediction outcomes integrated. However, such ensemble learning consumes considerable computational resources. Moreover, for deep neural networks trained on images, it has been generally observed that the features learned in the earlier layers resemble Gabor filters and color patches; since such features are general rather than task-specific, they can be reused across tasks, which motivated the concept of transfer learning. In computer vision applications, transfer learning is routinely used in problems such as object detection and target segmentation. A common practice for transfer learning is to first train a base convolutional neural network on the source dataset; its convolutional layers (the earlier layers of the network architecture) and/or their connection weights are then duplicated in the target network [10,11]. Similar to the concept of transfer learning, the idea of "knowledge distillation" has been proposed, and its effectiveness has been verified in various studies. In knowledge distillation, a sophisticated model, or an ensemble of models, is first trained using any learning algorithm, such as a deep neural network. The outcomes produced by this group of "teacher models" can be thought of as conditional distributions over the class labels given the input data, and are referred to as "soft labels". These distributions then serve as the learning targets for a "student model", which is trained with a simpler learning architecture [12,13]. Evaluation results have shown that a student model with a simpler architecture can achieve prediction performance close to that produced by complex learning architectures. Alternatively, the soft labels can be used as reference information and combined with the original input data to train the student model. This process can be seen as the student model "distilling" the "knowledge" provided by the group of teacher models [14].
Kernel density estimation (KDE), a nonparametric estimation approach in statistics, has been widely exploited to identify distributions in various types of datasets. A kernel density estimator generates an approximate probability density function (PDF) by computing a linear combination of weighted kernel functions placed at the locations of the data instances in the vector space. Accordingly, regions of the vector space with different PDF values can be identified as distinct distributions of data instances [15,16,17]. In this article, we report how the KDE method performed on a real medical dataset and how it was exploited to identify distributions hidden in the data. Moreover, based on the concept of knowledge distillation, the PDF values produced by the KDE method were transferred as soft labels to construct prediction models of bipolar disorder using the decision tree and artificial neural network learning methods, respectively. According to the evaluation results, using the data distribution information generated by KDE improved the true positive rates and positive predictive values, while the indicators for identifying negative samples were also advanced. In addition, the branching attributes selected by the decision trees were mapped back to specific disease diagnoses, all of which are associated with bipolar disorder. To the best of our knowledge, this study is the first attempt to apply KDE to knowledge distillation for supervised machine learning.
2. Materials and Methods
2.1. The input data
In the early 2000s, the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT) began the project "Integrating Signals, Models and Reasoning in Critical Care". The main goal of this project was to build a large dataset for research on intensive care, the result of which is the database "Medical Information Mart for Intensive Care" (MIMIC). The contents of this database come from the Beth Israel Deaconess Medical Center (BIDMC). MIMIC is a publicly shared medical database containing de-identified information from the electronic medical records of thousands of adult patients admitted to medical/surgical intensive care units and emergency wards. The development of this database was approved by the ethical review boards of BIDMC and MIT, respectively. MIMIC has been used extensively by academic researchers around the world, helping to promote advances in clinical informatics, epidemiology, and machine learning [18].
In the database tables of MIMIC, all the information of the same patient is linked through the field value of "subject_id". In this case-control study, the case group included patients with bipolar disorder and/or related symptoms. The following diagnostic codes were used when selecting case samples from the table "diagnoses_icd": ICD-9 codes 296.40-296.45, 296.50-296.56, 296.60-296.62, 295, and 298; ICD-10 codes F20, F29, and F31. Then 10,000 people were randomly selected from these patients with bipolar disorder to form the case group. The date of the first diagnosis of bipolar disorder for each case patient, i.e., the field value of "admittime", was regarded as the index date. Finally, for each case patient, the subject_id was used to retrieve all of his/her diagnosis records in the database.
On the other hand, the control samples had no diagnoses of bipolar disorder or any associated symptoms in the database. They were matched with the case patients in age and gender, i.e., the field values of "gender" and "anchor_age" from the table "patients". In addition, a control sample had to have at least one diagnosis record in the month of the matched case patient's index date, which indicated a comparable level of healthcare contact. Based on the aforementioned matching conditions, this study selected control samples at ratios of 1:1 and 1:3, respectively. Finally, for each control sample, the subject_id was used to retrieve all of his/her diagnosis records in the database to form the input data. A sketch of this selection procedure is given below.
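The following is a hedged sketch of the case-control selection in pandas, assuming the relevant MIMIC tables have been exported to DataFrames and that the diagnosis table has been joined with the admissions table so that each row carries an "admittime" parsed as a datetime. The ICD code formats and all helper names are illustrative, not the authors' implementation.

```python
import pandas as pd

# ICD-9 and ICD-10 codes quoted in the text for selecting cases from "diagnoses_icd";
# the dotted format is illustrative (MIMIC stores codes without dots).
ICD9 = ([f"296.4{k}" for k in range(0, 6)] +
        [f"296.5{k}" for k in range(0, 7)] +
        [f"296.6{k}" for k in range(0, 3)] + ["295", "298"])
ICD10 = ["F20", "F29", "F31"]
BIPOLAR = set(ICD9 + ICD10)

def build_case_group(dx: pd.DataFrame, n_cases: int = 10000, seed: int = 0) -> pd.DataFrame:
    """One row per case patient: subject_id plus the index date, i.e. the
    'admittime' of the first bipolar-related diagnosis."""
    hits = dx[dx["icd_code"].isin(BIPOLAR)]
    first = (hits.sort_values("admittime")
                 .groupby("subject_id", as_index=False)
                 .first()[["subject_id", "admittime"]])
    return first.sample(n=min(n_cases, len(first)), random_state=seed)

def match_controls(cases: pd.DataFrame, patients: pd.DataFrame,
                   dx: pd.DataFrame, ratio: int = 1, seed: int = 0) -> pd.DataFrame:
    """For each case, draw `ratio` controls with the same gender and anchor_age,
    no bipolar-related code anywhere, and at least one diagnosis record in the
    month of the case's index date."""
    bipolar_ids = set(dx.loc[dx["icd_code"].isin(BIPOLAR), "subject_id"])
    pool = patients[~patients["subject_id"].isin(bipolar_ids)]
    cases = cases.merge(patients, on="subject_id")  # attach gender/anchor_age
    picked = []
    for _, case in cases.iterrows():
        month = case["admittime"].to_period("M")
        active = set(dx.loc[dx["admittime"].dt.to_period("M") == month, "subject_id"])
        cand = pool[(pool["gender"] == case["gender"]) &
                    (pool["anchor_age"] == case["anchor_age"]) &
                    (pool["subject_id"].isin(active))]
        picked.append(cand.sample(n=min(ratio, len(cand)), random_state=seed))
    return pd.concat(picked, ignore_index=True)
```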
2.2. Kernel density estimation
Kernel density estimation (KDE) is the application of kernel smoothing to probability density estimation, i.e., a non-parametric method of estimating the probability density function of a random variable using kernels as weights [15,16]. KDE addresses a fundamental data smoothing problem in which inferences about the population are made from a finite data sample [17]. For the basic definition of KDE, let (x1, x2, …, xn) be independent and identically distributed samples drawn from a distribution with an unknown density f. The goal is to estimate f at any given point x; the kernel density estimator is defined in Formula (1):

$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K(x - x_i; h) \qquad (1)$$
In Formula (1), K(x - xi; h) is the kernel function, which returns non-negative values. A range of kernel functions is in common use, such as the cosine, linear, and normal kernels [15,16]. The positive variable h is called the bandwidth; it is a smoothing parameter and exhibits a strong influence on the resulting estimation. In this study, the class "KernelDensity" from the scikit-learn package was used to perform the KDE analyses. After verification, the exponential kernel (i.e., K(x; h) ∝ exp(-|x|/h)) was chosen for estimating the distributions of the input data in the subsequent computations of knowledge distillation. The smoothing parameter h was set to 0.2, which is the default value given by the scikit-learn package. A minimal usage sketch follows.
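The sketch below illustrates this KDE step with scikit-learn's KernelDensity under the settings stated above; the random data and variable names are illustrative stand-ins for the embedded records.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))   # stand-in for the 8-dimensional embedded records

# Exponential kernel with bandwidth h = 0.2, as described in the text.
kde = KernelDensity(kernel="exponential", bandwidth=0.2).fit(X)

# score_samples() returns log-densities; exponentiating yields the PDF
# values that later serve as the soft label information Xpdf.
pdf_values = np.exp(kde.score_samples(X))
print(pdf_values[:5])
```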
2.3. Embedding vector
In machine learning applications, categorical data must be converted into a special format before subsequent analyses can be performed. In addition to transforming the data into numerical information, these representations should correctly retain the characteristic attributes of the original data contents. The idea of the embedding vector is to represent a categorical data item (such as a word in a text) as a multi-dimensional vector. Each element of the vector is a real number, and the contents of the vector reveal the properties of the original data item [19]. The embedding vectors can be generated by the parameter optimization mechanism of a specific neural network architecture [20,21]. The basic concept of the loss function required in the learning process is defined in Formula (2).
$$P(w_{i-m}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+m} \mid w_i) \qquad (2)$$

Formula (2) represents the conditional probability of correctly judging the context (i.e., the m words before and after wi, which constitute the contents of the sliding window as wi-m, …, wi-1, wi+1, …, wi+m) given the word vector wi as the input premise. This probability value is increased as much as possible through the parameter optimization mechanism. The conditional probability values of all the words in the full text (e.g., a total of N words) are then summed, and the logarithm function is used to simplify the computation. The expected loss function is shown in Formula (3):

$$J = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_{i-m}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+m} \mid w_i) \qquad (3)$$
When implementing this loss function in program code, the data structure of the Huffman tree (as in hierarchical softmax) can be used to improve the computational performance.
The "word2vec" proposed by Google in 2013 is currently the mainstream embedding vector algorithm [
20,
21]. The algorithm combines two learning mechanisms: skip-gram and CBOW (continuous bag of words). In the calculation of skip-gram, the word vector
wi is used as the input premise, and the predictions of
m word vectors before and after
wi, which constitute contents in the sliding window as
wi-m, ……,
wi-1,
wi+1, ……,
wi+m, are respectively produced. On the other hand, in the computation of CBOW, the 2
m word vectors within the sliding window, i.e.
wi-m, ……,
wi-1,
wi+1, ……,
wi+m, are used as the input premises, and the prediction of the word vector
wi is outputted.
2.4. Machine learning algorithms
The decision tree is a hierarchical model with a tree-like structure. In this model, each internal node represents a test on an attribute, and each branch represents an outcome of the test. At the bottom of the structure, each leaf node represents a class label, which is the decision taken after evaluating all of the attribute features [22]. The path from the root node to a leaf represents a specific decision rule, and the conditions along the path form a conjunction of "if-then" clauses [23]. The decision tree is a white-box model because the decision rules produced are easy to understand and interpret. The node branching function used has an impact on the accuracy of the decision tree. Among the various node branching functions, the Gini impurity is commonly used and was chosen in this study. Based on the relative frequencies of the class labels in the dataset, the Gini impurity measures how often a data item would be incorrectly labeled if it were labeled randomly and independently according to those frequencies. For a dataset of items with J class labels and relative frequencies pi, i ∈ {1, 2, …, J}, the probability of correctly recognizing the class label of a data item, assuming it is of class i, is pi. Conversely, the probability of misclassifying that item is 1 - pi. Therefore, the Gini impurity IG(p) is defined as follows:

$$I_G(p) = \sum_{i=1}^{J} p_i (1 - p_i) = 1 - \sum_{i=1}^{J} p_i^{2}$$
IG(p) reaches its minimum value of zero when all data items in the node fall into a single class label. A small computational sketch is given below.
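As a quick illustration of this formula, the following sketch computes IG(p) from a list of class labels; the function and variable names are ours, not from the paper's code.

```python
from collections import Counter

def gini_impurity(labels):
    """I_G(p) = 1 - sum_i p_i^2, with p_i the relative class frequencies."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity([1, 1, 1, 1]))   # 0.0 -> a pure node
print(gini_impurity([1, 0, 1, 0]))   # 0.5 -> maximally mixed binary node
```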
The artificial neural network is a machine learning algorithm that imitates the human nervous system, and its defining formula is as follows [24,25]:

$$Y = \varphi(W X + B)$$

Because the neural network can have a plurality of input and output neurons, they are assembled into the "input layer" and the "output layer", respectively. The matrix X represents the input values of a set of attributes, and the matrix Y simulates the output neurons holding the computation results. The weight matrix W simulates the axons, which connect the input/output neurons and are responsible for transmitting messages; in an application problem, it represents the respective influences of the different attribute features. The matrix B of bias values simulates the synapses and represents the degree to which the output neurons are activated: the higher the bias values, the easier it is for a neuron to be activated and to transmit its message. The symbol φ represents the activation function, which accepts a weighted sum of the input values and performs a special calculation; if the resulting value is greater than the threshold, the output neuron is activated and the message is transmitted. In addition, "hidden layers" can be added to the network architecture, containing nodes that mimic internal neurons. Although hidden layers make the network structure more complicated, they allow the network to handle more kinds of application problems and to simulate the interactions of more complex attribute features. A minimal numerical sketch of this formula follows.
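The following is a minimal numerical sketch of the single-layer computation Y = φ(WX + B) described above, using ReLU as an example activation; all shapes and values are illustrative.

```python
import numpy as np

def relu(z):
    # Example activation function: passes positive values, zeroes the rest.
    return np.maximum(0.0, z)

X = np.array([[0.2], [0.7], [0.1]])        # 3 input attributes (3x1)
W = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.3, -0.4]])          # weights for 2 output neurons (2x3)
B = np.array([[0.1], [-0.2]])              # one bias per output neuron (2x1)

Y = relu(W @ X + B)                        # activations of the output layer (2x1)
print(Y)
```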
2.5. The analysis procedure
This study used the concept of knowledge distillation to construct predictive models of bipolar disorder. After the case patients and control samples were screened from the MIMIC database, all of their diagnosis records in the database were selected as the input data. In the MIMIC database, an average of 20 distinct disease diagnoses are recorded per sample. Using the aforementioned word2vec algorithm, these disease diagnoses were converted into 8-dimensional embedding vectors; therefore, the input data of each sample were stored in a 20×8 matrix structure. The research team then designed the two analysis procedures described below.
Referring to Figure 1a, in the first procedure KDE was used to estimate the probability density function representing the distribution of the input data X. After the data X were fed into the density function, the soft label information Xpdf was produced, representing the likelihood values of the data distribution of the input X. Next, Xpdf was used as the input attributes of the training dataset, with the set Y of class labels as the learning targets. Supervised learning methods, namely decision tree and artificial neural network, were then used respectively to construct the predictive models for bipolar disorder.
Referring to Figure 1b, in the second analysis procedure the KDE method was again used to convert the input data X into the soft label information Xpdf. Next, both X and Xpdf were used as the input attributes of the training dataset, and Y remained the set of class labels for learning. Finally, decision tree and artificial neural network were used respectively to develop the predictive models for bipolar disorder. A compact sketch of both procedures is given below.
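The paper does not spell out exactly how the matrix Xpdf is assembled from the per-sample densities; the sketch below assumes one univariate KDE per input attribute, so that each column of Xpdf carries the likelihood values of one attribute of X (which is what allows branching attributes to be mapped back to diagnoses in Section 3). The array sizes are reduced stand-ins for the flattened 20×8 inputs, and all names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))      # stand-in for the embedded diagnosis records
y = rng.integers(0, 2, size=1000)    # 1 = case patient, 0 = control sample

# One exponential KDE per attribute; exponentiated log-densities form Xpdf.
X_pdf = np.column_stack([
    np.exp(KernelDensity(kernel="exponential", bandwidth=0.2)
           .fit(X[:, [j]]).score_samples(X[:, [j]]))
    for j in range(X.shape[1])
])

# Procedure 1 (Figure 1a) trains on Xpdf alone; procedure 2 (Figure 1b)
# trains on the concatenation of X and Xpdf.
for name, feats in (("Xpdf only", X_pdf), ("X + Xpdf", np.hstack([X, X_pdf]))):
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, test_size=0.2, random_state=0)
    tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)
    print(name, "test accuracy:", round(tree.score(X_te, y_te), 3))
```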
3. Results
The datasets of this study were composed of case patients with bipolar disorder and matched control samples, at ratios of 1:1 and 1:3, respectively. The distributions of these data were computed using KDE to produce the corresponding probability density functions as the soft label information for subsequent knowledge distillation. When using a machine learning algorithm to construct the prediction model for bipolar disorder, a randomly selected 80% of the data samples were used for model training and validation, and the remaining 20% were used as the testing set. When estimating the data distributions with KDE, we used the exponential kernel function. In addition, we set the Gini impurity as the branching function for constructing the decision trees. When training the prediction models with artificial neural networks, we chose ReLU and sigmoid respectively as the activation functions of the network nodes. Finally, cross entropy and the Adam optimizer were set as the loss function and optimization mechanism, respectively, when training and validating the neural network models.
In the following paragraphs of this paper, we define a specific sequence to express the architecture of a neural network. Assuming that the architecture contains three hidden layers with v1, v2, and v3 nodes respectively, we write NN(v1, v2, v3, 1) to represent this architecture. Since the learning models in this study are all binary predictors of bipolar disorder, the final 1 in the sequence represents the single node of the output layer. Three network architectures were evaluated in this study: NN(80, 10, 1), NN(160, 40, 1), and NN(80, 20, 10, 1), all of which were tested and verified empirically. A minimal sketch of one such architecture follows.
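As an illustration of this notation, the sketch below builds NN(80, 10, 1) with scikit-learn's MLPClassifier; the paper does not name the framework actually used. For binary classification, MLPClassifier applies a logistic (sigmoid) output unit and optimizes the cross-entropy loss, and the "relu" and "adam" settings match the hidden-layer activation and optimizer described above.

```python
from sklearn.neural_network import MLPClassifier

# NN(80, 10, 1): two hidden layers of 80 and 10 nodes; the single sigmoid
# output node is implied by binary classification.
nn_80_10_1 = MLPClassifier(hidden_layer_sizes=(80, 10),
                           activation="relu",
                           solver="adam",
                           max_iter=500,
                           random_state=0)
# nn_80_10_1.fit(X_train, y_train) would train the binary predictor.
```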
Because the learning models in this study are all binary predictors of bipolar disorder, we adopt the terminology of the confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The following standard metrics are used to evaluate the performance of the prediction models trained by the various machine learning algorithms: the true positive rate TPR = TP/(TP + FN), the true negative rate TNR = TN/(TN + FP), the positive predictive value PPV = TP/(TP + FP), and the negative predictive value NPV = TN/(TN + FN).
For the dataset of case patients and control samples with a matching ratio of 1:1, the respective probability density functions estimated by KDE are presented as curve charts in Figure 2. From Figure 2, we can see that the probability density functions of the case patients and the control samples are quite different. In other words, they exhibit very different data distributions over the diagnostic records used as characteristic attributes.
Next, we tested whether the data distribution information estimated by KDE is helpful for constructing the learning model. In our first analysis procedure (Figure 1a), the soft label information Xpdf, representing the likelihood values of the data distribution of the input X, was used as the attributes for training and validating the prediction models. The evaluation results on the testing set are shown in Table 1a.
In our second analysis procedure (Figure 1b), both X and Xpdf were used as the input attributes for training and validating the prediction models. The evaluation results on the testing set are shown in Table 1b.
Finally, in order to verify the effectiveness of the soft label information Xpdf, only the data X were used as the attributes for training and validating the prediction models. The evaluation results on the testing set are shown in Table 1c.
Comparing the results shown in Table 1a,c, using only the soft label information Xpdf as the input attributes does not always improve the performance of the predictive models. However, when both X and Xpdf are used for training and validating the prediction models (Table 1b), not only are the TPR and PPV improved, but the TNR and NPV also become better.
In order to confirm that the data distributions generated by KDE play a role in knowledge distillation, we repeated the following experiment 10 times: case patients and matched control samples were randomly selected to form the dataset, KDE was used to generate the soft label data Xpdf, and the Xpdf were utilized to train a decision tree. We then examined the decision rules accompanying the tree structure and counted the features in Xpdf most frequently chosen as branching attributes. In descending order of the frequency with which they were chosen, the disease diagnoses corresponding to these branching attributes are listed below.
For decision rules leading to the positive label of bipolar disorder, the most frequent branching attributes include: hypertension; depressive disorder; anxiety disorder; suicidal ideations; type II diabetes mellitus; hyperlipidemia; esophageal reflux; chest pain; nicotine dependence; asthma; hypercholesterolemia; hypothyroidism; alcohol abuse.
For decision rules leading to the negative label of bipolar disorder, the most frequent branching attributes include: hypertension; hyperlipidemia; type II diabetes mellitus; chest pain; alcohol abuse; esophageal reflux; atrial fibrillation; hypercholesterolemia; depressive disorder; atherosclerosis/coronary heart disease; abdominal pain; urinary tract infection; hypothyroidism; nicotine dependence; headache; syncope and collapse.
For the dataset of case patients and control samples with a matching ratio of 1:3, the respective probability density functions estimated by KDE are presented as curve charts in Figure 3. Again, it can be found that the case patients and control samples exhibit very different data distributions in the diagnostic records.
For this dataset, the evaluation results of the testing set for the prediction models of bipolar disorder trained using the various learning algorithms are presented in Table 2. Comparing the results shown in Table 2a,c, the prediction models using only the soft label information Xpdf as the input attributes consistently perform worse than the models trained using the input data X. However, comparing the results shown in Table 2b,c, using both X and Xpdf as the input attributes for training the prediction models improves all evaluation metrics.
Finally, for the dataset of cases and controls with the matching ratio of 1:3, the decision tree analysis described above was executed again. We similarly examined the decision rules accompanying the produced tree structures and counted the features in Xpdf most frequently chosen as branching attributes. In descending order of the frequency with which they were chosen, the disease diagnoses corresponding to these branching attributes are listed below.
For decision rules leading to the positive label of bipolar disorder, the most frequent branching attributes include: hypertension; depressive disorder; anxiety disorder; suicidal ideations; type II diabetes mellitus; esophageal reflux; hyperlipidemia; nicotine dependence; hypercholesterolemia; asthma; chest pain; hypothyroidism; atherosclerosis/coronary heart disease.
For decision rules leading to the negative label of bipolar disorder, the most frequent branching attributes include: hypertension; hyperlipidemia; type II diabetes mellitus; esophageal reflux; chest pain; depressive disorder; alcohol abuse; hypercholesterolemia; atherosclerosis/coronary heart disease; atrial fibrillation; nicotine dependence; hypothyroidism; headache; urinary tract infection; abdominal pain; syncope and collapse.
4. Discussion
In the evaluation results of this study, the predictive performances of the models trained only with the soft label information Xpdf are not always better than those of the models trained with only the input data X (Table 1a,c). Moreover, we can also observe trade-offs between the PPV and TPR values when increasing the sample size of the matched controls, i.e., from 1:1 to 1:3. In other words, comparing the prediction results of the models trained with Xpdf and X respectively on the testing set, we found that an increase in PPV values is accompanied by a decrease in TPR values, or vice versa (Table 2a,c). In addition, increasing the sample size of the matched controls means that the input data of the negative class increase, so the evaluation indicators NPV and TNR of the prediction models for identifying negative testing samples improve (Tables 1 and 2). Regardless of the matching ratio of case patients and control samples, we can observe that as long as the soft label information Xpdf is combined with the input data X to train the prediction models, the evaluation indicators PPV and TPR for identifying positive testing samples improve, and at the same time the indicators NPV and TNR for identifying negative samples are also advanced (Tables 1 and 2). To sum up, this study used the KDE algorithm to generate the soft label information Xpdf, which makes knowledge distillation work and may improve the predictive performances of the trained models.
In order for knowledge distillation to improve the prediction performance of the trained model, the soft label information must provide accurate distribution conditions of the input data. In the research work of G. Hinton et al., the authors argued that adding a "temperature" variable to the formula that normalizes the predicted outputs can smooth the probability distributions over the class labels. Moreover, they proposed using the probability distribution values produced by a sophisticated deep learning model as the soft labels; feeding this reference information into the training of a shallow neural network model can achieve prediction accuracies close to those of the sophisticated model. They concluded that the "knowledge" of a deep learning model can be transferred to a shallow "distilled" learning model [14]; the temperature-scaled softmax is sketched after this paragraph. In addition, the parameter optimization of an artificial neural network can adopt the concept of conditional probability: under the premises of the input data and the current parameter settings, the predicted conditional probability distributions can approximate the true distributions of the class labels. A typical solution for this problem is Monte Carlo approximation. However, this method needs to construct multiple prediction models and store multiple sets of parameter settings, so more computing resources are required. In view of this, the research work of A. Korattikara Balan et al. proposed an improvement. Firstly, sample data are selected to construct multiple neural network models; these models form the "teacher group" for ensemble learning. The group of teachers produces sets of outcomes, which are presented as conditional probability distributions. These probability distributions are used as the learning targets for the "student" neural network, whose parameters are optimized through the training process. Therefore, the final outcomes of the student network, which are also presented as probability distributions, can be thought of as approximating the conditional probabilities provided by the teacher group. The approach proposed by this work also amounts to the student network "distilling" the knowledge provided by the network models of the teacher group [12]. On the other hand, KDE is essentially a non-parametric method of estimating the distributions of data samples, and it has been applied to estimating the conditional probability distributions of input data in naive Bayes classifiers [17,26]. Referring to the contents of the aforementioned literature, this study was inspired to combine KDE with knowledge distillation to construct prediction models for bipolar disorder.
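For reference, the temperature-scaled softmax of Hinton et al. [14] converts a network's logits zi into softened probabilities qi; T = 1 recovers the ordinary softmax, and larger T produces a smoother distribution over the class labels:

$$q_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$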
When KDE is used for data analysis, attention often focuses on the setting of the bandwidth, since this parameter has a great influence on the accuracy of the estimated data distributions. If the bandwidth is set too small, the under-smoothed distribution will contain many spurious data artifacts; on the contrary, if the bandwidth is set too large, the over-smoothed distribution will obscure much of the underlying structure. Numerous studies have discussed the criteria for setting this parameter [15,16,27]. A novel KDE method developed by our research team has been exploited to identify interesting patterns hidden in datasets. The main features of this method include minimizing the bias part of the mean squared error and elevating the bandwidths of the kernel functions to alleviate the effects of variance. It has been verified that our novel KDE can estimate the distributions of input data more accurately than many traditional KDE methods [28,29,30,31]. Therefore, one of our future works will use this novel KDE for knowledge distillation to construct more accurate predictive models.
In order to further verify the effectiveness of the soft label information Xpdf generated by KDE, we examined the decision rules of the tree structures constructed with Xpdf. Regardless of the matching ratio of case patients and control samples, we found that identical disease diagnoses were selected as the branching attributes in the analysis results. The contents of Xpdf are not categorical disease descriptions but likelihood values of the probability density functions generated by KDE; therefore, the features selected as branching attributes in the decision rules must be mapped back to the categorical disease descriptions. Since identical disease diagnoses are always selected as the branching attributes, Xpdf does provide correct distribution information of the input data. On the other hand, through a survey of the reference literature, we found various associations between bipolar disorder and the disease diagnoses selected by the decision trees. It is known that 6% of patients with bipolar disorder die by suicide, and another 30-40% engage in self-harm [1]. Many patients with bipolar disorder also suffer from other mental illnesses, such as anxiety disorders, schizophrenia, and substance abuse. Furthermore, one typical symptom of the depressive phase of bipolar disorder is fatigue [1]. Moreover, some diseases have a higher incidence in patients with bipolar disorder than in the general population, including metabolic syndrome, migraine, obesity, and type II diabetes [5]. In addition, compared to the general population, patients with bipolar disorder have twice the risk of dying from coronary heart disease [1]. Meanwhile, hypertension, hyperlipidemia, hypercholesterolemia, chest pain, etc., are typical risk factors for and symptoms of coronary heart disease.
Since bipolar disorder and asthma are leading causes of morbidity in the US, a recent cross-sectional analysis explored the clinical characteristics of bipolar disorder with an asthma phenotype and fitted a multivariable regression model; the results indicated that a history of asthma is common among patients with bipolar disorder [32]. Some medical illnesses have clinical presentations similar to symptoms of bipolar disorder, such as the overlapping features of migraine headache and bipolar disorder; some symptoms also need to be differentiated as to whether they are caused by bipolar disorder or by endocrine diseases such as hypothyroidism or hyperthyroidism [33]. Another study conducted in Sweden found higher odds of bipolar disorder comorbidity in patients with gastroesophageal reflux disease [34]. Furthermore, a recent genome-wide pleiotropic association study using genome-wide association summary statistics concluded that the pleiotropic genetic determinants shared by gastrointestinal tract diseases and bipolar disorder are extensively distributed across the genome; these findings support a shared genetic basis underlying the gut-brain axis [35]. In the research work of Benjamin J. S. Al-Haddad et al., a total of 1,791,520 Swedish children born between 1973 and 2014 were followed for up to 41 years using linked population-based registries. The analysis results suggested that fetal exposure to any maternal infection requiring hospitalization, such as urinary tract infection, increases the risks for autism and depression, but not for bipolar disorder or psychosis, during the child's life [36]. However, ketamine has been used to treat bipolar disorder, and it has been reported that long-standing ketamine abuse may cause urinary tract infection [37].
It is known that cerebrovascular reactivity (CVR) represents the ability of cerebral blood vessels to relax in response to vasoactive substances and is a quantitative indicator of cerebrovascular health. The analysis performed by Adam L. Urback's research team showed that adolescents with bipolar disorder had lower CVR values in the posterior cingulate and periventricular white matter than mentally healthy controls. After adjusting for the effect of BMI, further group differences in CVR values were observed in the temporal pole, supramarginal gyrus, and lingual gyrus. In conclusion, this study reported preliminary evidence that bipolar disorder is associated with cerebrovascular dysfunction, pointing to areas of the brain that predispose to cerebrovascular diseases [38]. The research work of Paul J. Harrison et al. compared the incidence of various disorders, including Parkinson's disease, dementia, cerebrovascular disease, and stroke, during a follow-up period of at least one year after the diagnosis of bipolar disorder; several risk factors were taken into account as covariates in the regression analysis. The results showed that bipolar disorder may increase the risk of developing cerebrovascular disease and stroke, although the physiological mechanisms underlying this phenomenon still need further investigation [39]. A recently published study by Sermin Kesebir et al. performed a follow-up assessment of 1,148 bipolar disorder patients admitted to a hospital; each patient was interviewed to investigate the incidence of various target diseases in the patient and his/her first- and second-degree relatives. It was found that a family history of diabetes mellitus was strongly associated with bipolar disorder, and a family history of thyroid disease was correlated with co-occurring anxiety disorders. Finally, this study also observed a comorbid association between bipolar disorder and cerebrovascular disease [9].
To sum up, the soft label information Xpdf generated by KDE provides correct data distributions, which help the decision tree algorithm select appropriate branching attributes for constructing the prediction models. These branching attributes can be mapped back to specific disease diagnoses, all of which are associated with bipolar disorder.