Investigation of Machine Learning Models and Different Feature Sets for the Efficiency of Early Sepsis Prediction from Highly Unbalanced Data

The presented research addresses the problem of early detection of sepsis for patients in the Intensive Care Unit. The PhysioNet/Computing in Cardiology Challenge 2019 facilitated the development of automated, open-source algorithms for the early detection of sepsis from clinical data. A labeled clinical records dataset for training and verification of the algorithms was provided by the challenge organizers. However, the relatively small number of records meeting the Sepsis-3 clinical criteria made the dataset highly unbalanced (only 2% of records carry a sepsis label). Such strongly unbalanced data are a great challenge for machine learning model training and are not suitable for training classical classifiers. To address these issues, a number of models were investigated, and a solution including feature selection and data balancing techniques is proposed in this paper. In addition, several performance metrics were investigated. The results show that, for successful prediction, a dedicated model with fewer or more predictors, selected according to the length of stay in the Intensive Care Unit, should be applied.


Introduction
Sepsis is a syndrome of physiological, pathological, and biochemical abnormalities induced by infection [1]. Conservative estimates indicate that sepsis is a leading cause of mortality and critical illness worldwide [2,3]. The World Health Organization is concerned that sepsis continues to cause approximately six million deaths worldwide every year, most of which are preventable [4]. In its study, the Department of Health in Ireland reported that survival from sepsis-induced hypotension is over 75% if it is recognized promptly, but that every hour of delay causes that figure to fall by over 7%, implying that mortality increases by about 30%.
In this paper, we present our solution for early detection of sepsis, developed by joining the PhysioNet/Computing in Cardiology Challenge 2019 [5]. A detailed explanation of the Challenge data, the participant evaluation metrics, and the primary results is provided there, so we do not repeat it in this paper. However, a few important findings should be shared here in order to better explain why our algorithm was constructed in a particular way.
According to the requirements of the Challenge, our open-source algorithm works on clinical data provided in real time, giving a positive or negative sepsis prediction for every single hour. The algorithm predicts sepsis development for the patient using a pre-trained mathematical model. Therefore, not only must an appropriate model be used, but its training must also be performed in the right way.
Data used in the competition were collected from ICU patients in three separate hospital systems. However, only data from two hospital systems were publicly available for training (40,336 patients in total). Another set of records (24,819 patients in total), obtained from all three hospital systems, was hidden and used only for official scoring by the Challenge organizers. Such separation of the data prevented participants from over-fitting their models. Taking into account that a trained model may learn not only dependencies in the clinical records but also hospital-system-related behavior, we tested different data selection strategies for training: models trained on hospital system A data were tested on data from hospital system B, and vice versa.
The most challenging issue in the available data records was the strong class imbalance. Only 2,932 septic patients were included in the dataset, together with 37,404 non-septic patients. From the perspective of mathematical model training, the balance is even worse. Since the sepsis prediction had to be made on an hourly basis, six hours in advance of the onset time of sepsis specified according to the Sepsis-3 clinical criteria, a number of non-sepsis examples were also taken from the early records of septic patients. After such reorganization of the training data, only about 2% of the 1,484,384 hourly events carried a positive label. Other challenge participants addressed this imbalance in various ways [7]. Sweetly et al. created 54 datasets using the same sepsis data and different non-sepsis data records [8]. He et al. applied random subsampling [9]. An interesting approach was proposed by Li et al., who decided to divide the data into three stages (1-9, 10-49, and above 50 hours of stay in the ICU) [10].
Dealing with missing values is another decision to be made, and it may also influence model training and overall performance. The forward-fill method has been used by several authors [8,11-15]. Singh et al. found in their study that a mean imputation model gave the worst results [16]. Other authors successfully used the mean calculated over the whole dataset [15,17].
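For illustration, the sketch below shows the forward-fill strategy mentioned above on a toy hourly record (a minimal sketch, not the implementation of any of the cited works; the column names and the sentinel value are our own choices):

```python
import numpy as np
import pandas as pd

# One hypothetical hourly record with gaps in two vital signs.
record = pd.DataFrame({
    "HR":   [80.0, np.nan, np.nan, 92.0],
    "Temp": [np.nan, 36.8, np.nan, 37.1],
})

# Forward-fill: carry the last observed value forward in time.
imputed = record.ffill()

# Hours before the first measurement are still missing and need a fallback,
# e.g. a sentinel value (the choice of -1 here is only illustrative).
imputed = imputed.fillna(-1)
print(imputed)
```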
Our proposed algorithm was scored on a censored dataset dedicated to scoring, using a utility function that rewards early predictions and penalizes late predictions as well as false alarms.

Material and Methods
In this section, we address the challenges regarding the problem of early sepsis detection and propose a methodology to overcome them. A labeled clinical records dataset for training and verification of the algorithms was provided by the PhysioNet/Computing in Cardiology Challenge 2019 organizers [5].

The Data
The data contained records of 40,336 ICU patients with up to 40 clinical variables, divided into two datasets based on hospital systems A and B. For each patient, the data were recorded every hour during the stay in the ICU. The records were labeled (on an hourly basis) according to the Sepsis-3 clinical criteria. A total of 1,407,716 hours of data were collected and labeled. The recorded variables included vital signs, laboratory values, and demographic values of the patients. The eight vital signs were heart rate (HR), pulse oximetry (O2sat), temperature (Temp), systolic blood pressure (SBP), mean arterial pressure (MAP), diastolic blood pressure (DBP), respiration rate (Resp), and end-tidal carbon dioxide (EtCO2). A total of 26 laboratory values were included in the dataset. Demographic values included age, gender, hospital identifiers, the time between hospital and ICU admission (Hosp), and ICU length of stay (ICULOS). Data were labeled as positive from 12 hours before to 3 hours after the onset time of sepsis. Sepsis occurred in 2,932 of the 40,336 patients, which is 7.27% of the records. Positive (sepsis) labels were assigned to 27,916 hourly rows, which is only 1.98% of all data.
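For readers working with the public data, a minimal loading sketch is shown below; it assumes the standard challenge distribution, in which each patient is stored as a pipe-separated .psv file, and the file name used here is only a placeholder:

```python
import pandas as pd

# The path below is a placeholder; challenge records are distributed
# as one pipe-separated .psv file per patient.
record = pd.read_csv("training_setA/p000001.psv", sep="|")

vitals = ["HR", "O2Sat", "Temp", "SBP", "MAP", "DBP", "Resp", "EtCO2"]
print(record[vitals].describe())      # hourly vital-sign summary
print(record["SepsisLabel"].sum())    # number of positively labeled hours
```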
Investigation of the data showed large numbers of missing values. The percentage of missing rows of vital signs is shown in Figure 1. Missing values of vital signs make up about 10% of the data, with the exception of Temp (66% missing) and EtCO2 (100% and 92% missing for dataset A and dataset B, respectively). Therefore, EtCO2 was not used as a feature for the model. The percentage of missing rows of laboratory measurements is shown in Figure 2. Missing data for laboratory values ranges from 78% to 100% for all variables, so we did not use laboratory values to develop our model. Average values of the vital signs are shown in Table 1. The measured SBP, MAP, and DBP values are higher in dataset B, while the measured HR is slightly lower in dataset B. Having two datasets collected in separate hospitals allowed us to develop models that are robust to measurement differences arising from the specificity of electronic medical record systems. Thus, the nature of the data increases the difficulty of predicting sepsis. During the development of the model, we had to take into account the high imbalance of positive and negative cases, the large amounts of missing values, and the fact that the data were recorded using two different measurement systems.
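The per-variable missing-data percentages summarized in Figures 1 and 2 can be reproduced with a short sketch such as the one below (the directory path is illustrative and the column layout is assumed to follow the challenge .psv format):

```python
import glob
import pandas as pd

# Concatenate all hourly rows of one hospital system (path is illustrative).
files = glob.glob("training_setA/*.psv")
frames = (pd.read_csv(f, sep="|") for f in files)
rows = pd.concat(frames, ignore_index=True)

# Percentage of missing entries per variable.
missing_pct = rows.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```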

Feature Extraction
The solution proposed in this paper for the early sepsis prediction problem employs information on the ICU length of stay, hospitalization time, age, and seven vital signs: HR, O2sat, Temp, SBP, MAP, DBP, and Resp. We did not use EtCO2 for feature extraction due to its large number of missing values.
We have calculated the mean, the standard deviation, and the max-min difference for the vital sign data, taking the values from the whole duration of the record. Additionally, we considered other measures for our approach, such as kurtosis, entropy, and the standard error; however, after further analysis, we decided to discard these features. Kurtosis can only be calculated from four or more samples, excluding missing values, and it is not a representative statistical estimate for sample sizes below 200 [18]. The entropy value is proportional to the sample size. In the problem investigated in this paper, the sample size changes with each hour of the patient's stay and can be further reduced by missing values for some patients [19]. Therefore, in this case, the entropy mostly represents the number of samples used for its calculation, and it is unlikely to carry useful information for model training. The standard error decreases as the sample size grows, since it is obtained by dividing by the sample size; in our setting it would therefore mostly reflect the number of available samples and add an unwanted load to model training [20].
After this theoretical investigation, we calculated 21 features for each hour. Missing values were excluded when calculating the features. In some cases, a feature could not be calculated due to the small set of available data (e.g., during the first few hours of ICU stay, or due to a large number of missing values); in such cases, we set the value of the feature to '-1'. Finally, we assembled a set of 24 features for model training: the 21 features calculated from vital signs and three demographic values (Hosp, age, ICULOS). The obtained data had different measurement units, measurement errors, and scales. Therefore, we applied data standardization to zero mean and unit standard deviation. The same sample mean and sample standard deviation, obtained from all data of both datasets (40,336 patients in total), were used for every patient.
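A minimal sketch of this feature construction is given below. It is not our exact implementation: the column names (O2Sat, HospAdmTime, Age, ICULOS) follow the challenge record format rather than the abbreviations used in the text, and the handling of very short data windows is simplified.

```python
import numpy as np
import pandas as pd

VITALS = ["HR", "O2Sat", "Temp", "SBP", "MAP", "DBP", "Resp"]

def hourly_features(record: pd.DataFrame, hour: int) -> np.ndarray:
    """Feature vector for one patient at a given hour (0-based row index)."""
    window = record.iloc[: hour + 1]
    feats = []
    for v in VITALS:
        values = window[v].dropna()        # missing values are excluded
        if len(values) < 2:                # too little data for the estimates
            feats += [-1.0, -1.0, -1.0]    # sentinel value, as in the text
        else:
            feats += [values.mean(), values.std(),
                      values.max() - values.min()]
    row = record.iloc[hour]
    feats += [row["HospAdmTime"], row["Age"], row["ICULOS"]]
    return np.asarray(feats, dtype=float)

# Standardization then uses one global mean and standard deviation,
# computed over all patients: X_std = (X - global_mean) / global_std.
```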

Model Training
In our investigation, we have used models based on decision trees, naive Gaussian Bayes, Support Vector Machines (SVM), and ensemble learners. Decision-tree-based models were important at this stage of the problem solution, because they give insights about the relevance of the selected features. However, they tend to overfit the data [21]. Less over-fitting can be expected when using ensemble learner models [22]. We have trained the ensemble-learning-based models using hyperparameter optimization among the Bag, GentleBoost, LogitBoost, AdaBoost, and RUSBoost methods. Gaussian naive Bayes models are known for their simplicity, high bias, and low overfit; typically, good results using naive Bayes are achieved on low-variance data, and these models are not recommended for high-variance data [23]. Models trained using SVM tend to overfit the data less; however, they are not very successful in problems with a high number of missing values in the data [24].
The data of the experiment were strongly unbalanced, as discussed in sub-section 2.1. Additionally, we used the specific scoring system for positive predictions proposed in the PhysioNet Challenge 2019, which weighs positive and negative predictions in accordance with the duration before or after the onset time of sepsis. Positive labels, with various weights, make up only 1.98% of the data. We addressed the data balance issue by adjusting the weight of positive predictions: the investigated models were trained using an adjusted classification cost, where a true positive prediction was rewarded by multiplying its weight by 1, 10, 20, 30, or 100.
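The sketch below illustrates this cost adjustment using scikit-learn class weights as a stand-in for the misclassification cost used in our experiments (the data here is synthetic and only mimics the roughly 2% positive rate):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hourly 24-feature vectors (~2% positives).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 24))
y_train = (rng.random(5000) < 0.02).astype(int)

# Weight the positive class 1, 10, 20, 30 or 100 times more
# than the negative class.
for cost in (1, 10, 20, 30, 100):
    clf = DecisionTreeClassifier(class_weight={0: 1, 1: cost}, random_state=0)
    clf.fit(X_train, y_train)
    # ...each cost setting is then evaluated on the held-out hospital system
```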
We trained each model separately using features estimated on an hourly basis, presented in random order. In order to avoid over-fitting and increase robustness, we trained the models on records taken from a single hospital system (dataset A) and tested them on records from the other hospital system (dataset B, hidden during the training). The models were trained on dataset A using 5-fold cross-validation. With this approach, only half of the available data was used for training. However, as shown in the results of the challenge [5], the proposed solutions performed well on known datasets, even when scoring was done on a hidden part of the same set, and performed marginally worse on the new hospital system C, which was hidden from the challenge contestants. Additionally, the advantage of this way of training and evaluating models is supported by Biglarbeigi [25].
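This evaluation protocol can be sketched as follows: 5-fold cross-validation on hospital system A for model selection, and final scoring on hospital system B, which stays unseen during training (synthetic data is used here in place of the hourly feature vectors described above):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_A = rng.normal(size=(4000, 24))
y_A = (rng.random(4000) < 0.02).astype(int)
X_B = rng.normal(size=(4000, 24))
y_B = (rng.random(4000) < 0.02).astype(int)

clf = DecisionTreeClassifier(class_weight={0: 1, 1: 20}, random_state=0)

# Model selection: 5-fold cross-validation on hospital system A only.
cv_scores = cross_val_score(clf, X_A, y_A, cv=5, scoring="f1")
print("5-fold F1 on dataset A:", cv_scores.mean())

# Final evaluation: hospital system B, never seen during training.
clf.fit(X_A, y_A)
y_pred_B = clf.predict(X_B)
```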

Model Scoring
The models proposed in this paper for sepsis prediction were evaluated using several different metrics. Traditional scoring metrics, such as the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), accuracy, the F-measure, and the Matthews correlation coefficient (MCC), were used. Additionally, the investigated models were scored using a specific scoring function developed by the authors of the dataset, called Utility. Using different scoring metrics allowed a better comparison of the investigated models. AUPRC is recommended over the AUROC measure for imbalanced data [26,27]. The F-measure is the harmonic mean of precision and recall [28]. Recently, the MCC measure was shown to be more advantageous than the F-measure in the binary classification of imbalanced data [29].
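The complementary metrics can be computed as in the minimal sketch below (scikit-learn is used for illustration; AUPRC is approximated here by average precision, and the Utility score itself is computed with the official challenge code [5] and is not reproduced here):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

def score_model(y_true, y_prob, threshold=0.5):
    """Complementary metrics for hourly sepsis predictions."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "AUROC":    roc_auc_score(y_true, y_prob),
        "AUPRC":    average_precision_score(y_true, y_prob),
        "Accuracy": accuracy_score(y_true, y_pred),
        "F1":       f1_score(y_true, y_pred),
        "MCC":      matthews_corrcoef(y_true, y_pred),
    }

# Toy example with five hourly labels and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.2, 0.3])
print(score_model(y_true, y_prob))
```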
The specifically designed scoring function (by the authors of the data) rewards algorithms for early predictions and penalizes them for late or missed predictions and false alarms. Scoring was conducted by predicting each hourly label for each patient. Each positive label had a defined score depending on the correct prediction time relative to sepsis onset. The scoring function rewarded models for correct predictions made at most 12 hours before and up to 3 hours after the onset time of sepsis. It penalized models that predicted a septic state more than 12 hours before the onset time of sepsis and slightly penalized false-positive predictions. True negative predictions were neither penalized nor rewarded. A more detailed description of the scoring can be found in the original paper [5]. The Utility score was used as the reference metric for evaluating the performance of the investigated models. The AUROC, AUPRC, accuracy, F-measure, and MCC scores were used to gain insight into the models (e.g., whether they correlate with any other parameter of the experiment, such as the classification cost, feature reduction, model configuration, or the Utility score).

Results
As noted in subsection 2.3, decision trees, naive Gaussian Bayes, SVM, and ensemble learners were investigated in our experiment. Various parameters of the models were adjusted, and the effect of the classification cost and of feature reduction was investigated. The performance of the developed models was evaluated using the Utility score as the reference metric. Complementarily, several other metrics, such as AUROC, AUPRC, accuracy, F-measure, and MCC, were calculated to compare the investigated models. The models were trained using dataset A and scored using dataset B. The results of the experiment are given in Table 2. Models based on decision trees are labeled from 1 to 14, models based on the naive Bayes algorithm from 21 to 25, and SVM-based models from 31 to 34. Models 41 to 44 used an optimizable ensemble method to search for the best model for the problem.

Decision trees are fast to train and to evaluate, so we started our investigation with these models. The baseline Model1, with default parameters, gave a Utility score of 0.01. Secondly, feature reduction using principal component analysis (PCA) was applied; six features explaining 95% of the variance were kept, and Model2 gave a Utility score of 0.004. For Model3, increasing the feature set to 14 (out of 24) increased the Utility score to 0.0124. The fourth model used 14 features and a modified classification cost ratio of 1:10; Model4 obtained a Utility score of 0.1236. Further increasing the classification cost ratio to 1:100 (Model5) decreased the Utility score to −0.296. Using all available features in the set (24 features), the Utility score slightly improved to −0.242 for Model6. Using 24 features and a classification cost of 1:10, the obtained Utility score was 0.184 for Model7. For Model8, we modified the classification cost to 1:20 and obtained a Utility score of 0.22. Further increasing the classification cost to 1:30 (Model9) decreased the Utility score to 0.216. Model10, with a modified classification cost of 1:20 and the feature set reduced to 20, got a decreased Utility score of 0.157. Next, we limited the tree split criterion to 50; Model11 achieved a Utility score of 0.232. A higher Utility score of 0.233 was achieved by reducing the split criterion to 4 (Model12), and reducing the split criterion to 2 for Model13 gave a similar Utility score of 0.233. Model14, with the split criterion further reduced to 1, achieved the highest Utility score of 0.242. Only one tree branch was used for this model, and the single feature it used was ICULOS. The features used for Model12 and Model13 also included ICULOS, together with mean SBP and Resp.

Models labeled from 21 to 25 were based on the naive Bayes algorithm. Model21, without feature reduction and using a classification cost of 1:20, achieved a Utility score of 0.1334. For Model23, the number of features was reduced to 14 using PCA, and the achieved Utility score was 0.129. Further reducing the number of features to 6 (95% explained variance using PCA) improved the Utility score to 0.150, as shown in the Model24 row of Table 2. Adjusting the classification cost led to reduced scores: Model22 and Model25 used the reduced feature set and modified classification costs of 1:10 and 1:30, yielding Utility scores of 0.097 and 0.143, respectively.
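The two decision-tree variations discussed above, PCA-based feature reduction and a tree limited to very few splits, can be sketched as follows (scikit-learn is used for illustration only; the split limit is expressed through max_leaf_nodes, which plays the role of the maximum-number-of-splits setting used in our experiments, and the data is synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 24))
y = (rng.random(2000) < 0.02).astype(int)

# Feature reduction: keep the principal components explaining 95% of variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
print("components kept:", X_reduced.shape[1])

# A tree limited to a single split (two leaves), analogous to Model14.
stump = DecisionTreeClassifier(max_leaf_nodes=2,
                               class_weight={0: 1, 1: 20},
                               random_state=0).fit(X, y)
print("feature used at the root split:", stump.tree_.feature[0])
```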
SVM models were computed using the Gaussian kernel function. The results of the SVM models are shown in Table 2, under Model31 to Model34. Model31, using a classification cost ratio of 1:20 and 24 features for training, achieved a Utility score of 0.151. Model32, using 6 features explaining 95% of the variance, achieved a Utility score of 0.144. Model33 and Model34, using classification costs of 1:30 and 1:10, respectively, achieved Utility scores of 0.1294 and 0.1302.
Models labeled from 41 to 44 were trained using ensemble methods. Model41, with the classification cost set to 1:20 and the full feature set, achieved a Utility score of 0.082. Using PCA (95% explained variance) reduced the Utility score to 0.008 (Model44). Reducing the split criterion to 10 (Model42) gave an improved Utility score of 0.124, and reducing it to 4 (Model43) further improved the Utility score to 0.173; reducing the split criterion to 1 did not improve the Utility score (0.173).
High AUROC, AUPRC, and accuracy scores among the ensemble learners were achieved by Model41: 0.413, 0.012, and 0.955, respectively. However, this ensemble model achieved a low Utility score (0.082). Conversely, low AUROC (0.347), AUPRC (0.01), and accuracy (0.939) scores, together with the highest Utility score (0.173), were achieved by Model43. The same pattern was observed in all investigated models, namely decision trees, SVM, naive Bayes, and ensemble-based models: either high AUROC, AUPRC, and accuracy with a low Utility score, or low AUROC, AUPRC, and accuracy with a higher Utility score.

Discussion
The highest Utility score was achieved using decision trees with a low number of nodes, and ICULOS was included in all developed decision tree models. Therefore, we believe that future models should be developed based on ICU-stay time: for example, one model would predict sepsis for recently admitted ICU patients, while another would be used once a patient's ICU length of stay reaches a certain hour (a hypothetical sketch of this idea is given below). This approach can also be implemented using three or more temporal divisions. This finding of our investigation is supported by the papers of Lauritsen [30], Vincent [31], and Shimabukuro [32]. Each intervention, vital measurement, intravenous therapy, and the duration of stay in general increase the chance of infection, a direct cause of sepsis.
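The length-of-stay dependent prediction we suggest could take the form of a simple dispatcher such as the one sketched below; the 10-hour boundary and the two-stage split are illustrative only and were not tuned in this study.

```python
def predict_sepsis(features, iculos_h, early_model, late_model, boundary_h=10):
    """Route the hourly feature vector to a stage-specific model.

    `early_model` and `late_model` are any classifiers with a scikit-learn
    style predict(); the 10-hour boundary is an illustrative placeholder.
    """
    model = early_model if iculos_h < boundary_h else late_model
    return model.predict([features])[0]
```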
Regarding the dataset, other papers that tackled this problem proposed methods trained on both public datasets and officially scored on the hidden set C. Most of the challengers performed well on the known hospital systems, obtaining Utility scores of about 0.4. However, Utility scores for the hidden hospital systems were typically negative [5]. One author suggested evaluating the proposed methodology by training on one dataset and testing on another [25]. Our Utility score on held-out data from the known hospital system was also above 0.4 (e.g., the investigated ensemble models achieved Utility scores of around 0.66 on dataset A). However, the Utility score on the hidden set (dataset B) dropped to 0.173.
We assume that the Utility score could be improved slightly by finding a better value for the classification cost, where the reward of a true positive prediction would be multiplied by a value somewhere between 20 and 30. However, this would fit the data and would not solve the general problem of the Challenge. Therefore, we recommend using an arbitrary value between 20 and 30 to increase the robustness of the system. The MCC and F-measure scores gave similar results, increasing and decreasing together with the Utility score. However, the bounds of MCC are from -1 to 1, while the F-measure ranges from 0 to 1; the bounds of the Utility score are from -1 to 1. We support the idea of using the Utility score as a metric for this dataset, and we also showed that MCC and F-measure are effective metrics for this problem. Additionally, due to the nature of the Utility score, results can be difficult to interpret, as Roussel et al. pointed out in their work [33].
The investigated decision trees achieved a Utility score of 0.242, an AUROC score of 0.313, and an MCC score of 0.143 on the hidden set. Models with such results are far from applicable in the clinical setting. Additionally, our investigation showed that increasing AUROC and accuracy usually leads to decreased Utility, F-measure, and MCC scores. Moreover, accuracy is high for all investigated models. Accuracy can be misleading when interpreting model results for this kind of highly unbalanced data with a large number of negatives [22]. When developing methods for this kind of problem, one needs to be careful: an accuracy of about 98% can be achieved just by labeling all rows as negative, as illustrated below. We showed that balancing the data reduces the AUROC and accuracy scores and improves the F-measure, MCC, and Utility scores.
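To make the warning concrete, the all-negative baseline accuracy follows directly from the label counts reported above:

```python
# Accuracy of a trivial classifier that labels every hourly row as non-sepsis,
# using the label counts reported in the data description (27,916 positive
# rows out of 1,407,716).
positives, total = 27_916, 1_407_716
all_negative_accuracy = 1 - positives / total
print(f"{all_negative_accuracy:.1%}")   # ~98.0%, without detecting a single case
```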
There are many more models to experiment with; e.g., k-Nearest Neighbor (kNN) and Long Short-Term Memory (LSTM) models were not tested in our work. LSTM models are more difficult to configure effectively, and they tend to overfit the data. Moreover, even if one successfully tackles the overfitting problem, there is still another downside, which is more important in the current state of the early sepsis prediction problem: the developed model may be hard to interpret and would not reveal much insight into the data [34]. Clustering of the unbalanced data (including sepsis-related records) may give promising results for sepsis prediction. However, kNN overfits data with large variances [35]. On the other hand, a trained kNN model having 1000 or 2000 clusters to represent the data can be expected to be robust. In general, we believe results using these models can be promising, and we encourage future work exploring LSTM and kNN model capabilities.
It is notable that the investigated models do not differ significantly in Utility score when the number of features is reduced, which shows that some features are not useful for the model. On the other hand, our proposed features were relatively simple. We believe that more advanced features are needed to solve the early detection of sepsis problem, and using such features should improve the score. However, feature engineering is a difficult, time-consuming process, which also requires understanding the nature of the data. In this paper, we provide many insights into the nature of the data, different scoring metrics, the advantages of various models, and feature combinations. We believe that the results of our investigation will benefit the fundamental need for early sepsis prediction and will answer some basic questions about the limits of early detection. Our results should benefit the search for advanced combinations of features and ease the use of machine learning tools. With these insights, peer researchers can apply advanced feature engineering techniques and develop more sophisticated and robust models in order to reach reliable results. Better results can be reached through the use of combined models and handcrafted features [36], thus further contributing to this field. The main challenges of this problem, as we revealed, are the highly unbalanced dataset, the large amount of missing data, the insufficient predictive power of simple features calculated from vital signs, and the tendency of the proposed solutions to over-fit; adjusting classification costs helps to address the latter problem. In addition, the insights and conclusions of our experiment may benefit not only machine learning specialists and researchers, but also ICU personnel and scientists in the medical field.

Conclusions
In this study, we provide a comparison of several alternative methods for early sepsis prediction. Our selected models and insights show how to deal with unbalanced data and with a large number of missing values.
The results, obtained during the experimental investigation, are based on publicly available data containing 40,336 patient records with 1,407,716 rows and 40 variables. The results showed:
1. Decision tree models achieved a higher Utility score than the other tested models when simple features were calculated from basic vital signs and demographics.
2. Adjusting the classification cost improves the Utility score of the tested models. The best results on the investigated dataset were achieved when the reward of a true positive prediction was increased 20 times.
3. Feature ranking using PCA, applied to our proposed features, does not always improve the Utility score; the effect of reducing the number of features depends on the investigated model. In some models, such as naive Gaussian Bayes, reducing the number of features improved the Utility score.
4. The performance metrics AUROC, AUPRC, and accuracy do not reflect the Utility score and can be high for models with low Utility scores; they should not be relied on when dealing with the early sepsis prediction problem. In contrast, the F-measure and MCC performance metrics reflect the Utility score.
5. A high Utility score was obtained using decision tree models limited to 50 or fewer splits.
All investigated decision trees chose ICULOS (ICU length of stay) as an important feature. Additionally, reducing the number of tree splits to 4 and then to 1 further increased the Utility score. A Utility score of 0.242 was achieved using only ICULOS as the single feature of the decision tree model.