Three types of tasks were prepared from the cohort: mortality, length of stay, and septic shock. This section describes the tasks, the test results and which features were most important to the results.
3.1. Mortality Task
The predictive efficacy was quantified using a combination of AUROC, AUPRC, and accuracy. Given the class imbalance in the MIMIC-IV dataset, where septic events are significantly less frequent than non-events, the AUPRC serves as a critical metric for evaluating the trade-off between sensitivity and positive predictive value [33]. This was complemented by the AUROC to determine the global discriminative capacity and by accuracy to assess the total proportion of correct classifications, ensuring a comprehensive validation of the model’s clinical utility.
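As an illustration of how the three metrics differ, the following minimal sketch computes them in plain Python on a small hypothetical set of labels and scores. The function names and the toy data are assumptions made for this example, and the AUPRC is approximated by average precision (a step-wise summary of the precision-recall curve); it is not the evaluation code used in the study.

```python
def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive.

    Assumes there are no tied scores; ties would need extra handling.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def accuracy(y_true, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Hypothetical toy data: 1 = event, 0 = non-event.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print("AUROC:", auroc(y_true, scores))
print("Average precision:", average_precision(y_true, scores))
print("Accuracy:", accuracy(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUROC and average precision depend only on the ranking of the scores, which is why they are reported together.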
Table 5 presents the results of the tests for all algorithms.
Comparative analysis of mortality prediction revealed that gradient-boosted decision trees (GBDTs) outperformed both traditional linear models and complex deep learning architectures. XGBoost achieved the highest performance across all metrics, with an AUROC of 0.874 and AUPRC of 0.606. XGBoost demonstrated superior precision in identifying the positive class, as evidenced by its higher AUPRC.
Although XGBoost and LightGBM had the same AUROC, XGBoost achieved a higher AUPRC, indicating that it made fewer errors when predicting death (fewer false positives among the minority-class predictions).
Deep learning models, such as transformers and LSTM, achieve good results but are often outperformed by tree-based models on tabular data [34,35]. Unlike studies utilizing unstructured clinical text [36], our structured approach facilitates integration into standard workflows.
Figure 1 presents the confusion matrices for the four best algorithms in this task. The layout follows the Scikit-Learn Python library standard, where class 0 (negative/survivor) is presented in the first row/column and class 1 (positive/death) in the second.
The confusion matrix can be interpreted as follows:
Upper left quadrant: True Negatives (TN), representing patients who survived (negative), and the model correctly predicted survival.
Upper right quadrant: False Positives (FP), representing patients who survived (negative), but the model incorrectly predicted death (positive). This situation is a false alarm that can be ignored.
Lower left quadrant: False Negatives (FN), representing patients who died (positive), but the model incorrectly predicted survival (negative). This is a critical error.
Lower right quadrant: True Positives (TP), representing patients who died (positive), and the model correctly predicted death.
The main diagonal (TN and TP) contains the correct predictions (starting with ’True’). The secondary diagonal (FN and FP) contains incorrect predictions (starting with ’False’). The first and second rows represent surviving and deceased patients, respectively.
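The quadrant layout described above can be sketched in a few lines of plain Python. The helper name and the toy labels are hypothetical; the sketch only mirrors the row/column convention (rows are actual classes, columns are predicted classes, class 0 first) and does not reproduce the counts in Figure 1.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Build a 2x2 matrix with actual classes as rows and predictions as columns."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical toy labels: 0 = survivor, 1 = death.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1]

matrix = confusion_matrix_2x2(y_true, y_pred)
(tn, fp), (fn, tp) = matrix  # first row: survivors, second row: deceased

print(matrix)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Unpacking the rows in this order gives TN and FP from the first row and FN and TP from the second, matching the quadrant description.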
The random forest model (Figure 1(d)) exhibited highly conservative behavior: although it produced the lowest false-positive count (85), it had a clinically concerning number of False Negatives (776), failing to identify a significant proportion of patients at high mortality risk. On the other hand, the Transformer (Figure 1(c)) identified the highest number of True Positives (380), but it had a higher number of false positives, which could contribute to excess alarms in a clinical setting.
The XGBoost (a) and LightGBM (b) models exhibited the most balanced diagnostic performances. In particular, XGBoost showed superior consistency. It demonstrated better control over false positives (161 versus 177 for LightGBM) and achieved the highest overall discriminative ability, with an AUROC of 0.874 and an AUPRC of 0.606.
Figure 2 shows a summary bar plot for XGBoost mortality prediction. The top-15 predictors are displayed along with their relevance to the model. According to the XGBoost algorithm, the Oxygen Flow Device (serving as a proxy for respiratory severity and need for intervention), Charlson Comorbidity Index, Urea Nitrogen, and the SOFA Score were the main predictors of mortality.
Each predictor variable is indexed by its measurement time; measurements were taken every 4 hours, as explained earlier. For the mortality prediction task, the first 24 hours were used to predict mortality. The interval T-0 represents the most recent measurements, taken 24 hours after sepsis identification, while T-1 refers to measurements within 20 hours of sepsis identification, and T-5 refers to measurements at the time of sepsis identification.
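One possible reading of the T-k labels is sketched below. It assumes six consecutive, non-overlapping 4-hour windows covering the first 24 hours, with T-0 ending at hour 24 and T-5 starting at sepsis identification; the constants and the helper name are hypothetical, and the exact boundaries used in the study may differ.

```python
WINDOW_HOURS = 4     # measurement interval stated in the text
HORIZON_HOURS = 24   # observation period for the mortality task
N_WINDOWS = HORIZON_HOURS // WINDOW_HOURS  # six windows, T-5 ... T-0


def window_bounds(lag):
    """Return (start_hour, end_hour) after sepsis identification for label T-<lag>.

    Assumption: T-0 is the last window and each step back moves 4 hours earlier.
    """
    if not 0 <= lag < N_WINDOWS:
        raise ValueError(f"lag must be between 0 and {N_WINDOWS - 1}")
    end = HORIZON_HOURS - lag * WINDOW_HOURS
    return end - WINDOW_HOURS, end


windows = {f"T-{lag}": window_bounds(lag) for lag in range(N_WINDOWS)}
for label, (start, end) in windows.items():
    print(f"{label}: hours {start} to {end}")
```

Under this assumed mapping, a feature such as "Uo Total T-5" would refer to the earliest window after sepsis identification.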
The SHAP beeswarm plot visualizes the global importance and directional effects of each feature. The variables were ranked vertically according to their overall influence on the model. The horizontal axis represents the SHAP value: positive deviations to the right of the vertical line indicate an increased predicted probability of the outcome, whereas negative values to the left of the vertical line suggest the opposite effect. The color gradients denote the original feature magnitudes: red for high values and blue for low values.
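The vertical ranking can be illustrated with a small sketch that orders features by their mean absolute SHAP value, which is a common way of summarizing global importance. The feature names and SHAP values below are hypothetical, and the ordering rule is an assumption about how the plots were configured.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across all instances."""
    n_rows = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    # Largest average magnitude first, as at the top of a beeswarm plot.
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: one row per patient, one column per feature.
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_rows = [
    [0.30, -0.10, 0.02],
    [-0.20, 0.40, -0.01],
    [0.50, -0.10, 0.03],
]

ranking = rank_by_mean_abs_shap(shap_rows, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Because the absolute value is taken before averaging, a feature that pushes predictions strongly in both directions still ranks high, even though its signed contributions may cancel out.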
The SHAP analysis in Figure 3 reveals that the Oxygen Flow Device, Charlson Comorbidity Index, and Urea Nitrogen are the top predictors of mortality, where high values (red points) correspond to positive SHAP values, indicating an increased risk. In contrast, the Glasgow Coma Scale (GCS) score and Urine Output (Uo) showed an inverse relationship: high values (red) shift towards the negative SHAP region, indicating a reduced risk of mortality, whereas low values (blue), such as a low GCS score representing reduced consciousness, were strongly associated with higher mortality.
The SHAP waterfall plot shows the decision-making process of the model for a single instance, visualizing how each feature changes the prediction from the population baseline (expected value) to the final individual probability. Features are displayed in descending order of impact, where red bars indicate factors that increase the predicted probability and blue bars represent factors that decrease it.
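The additive logic behind a waterfall plot can be sketched as follows: the final model output equals the baseline plus the sum of the per-feature contributions. All numbers and feature names are hypothetical, and the sketch assumes the contributions are expressed in log-odds, which is converted to a probability with a sigmoid; the actual output space depends on how the explainer was configured.

```python
import math


def waterfall(base_value, contributions):
    """Accumulate contributions in descending order of absolute impact."""
    ordered = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    running = base_value
    steps = []
    for name, value in ordered:
        running += value
        steps.append((name, value, running))
    return steps, running


def sigmoid(x):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical baseline and contributions (log-odds, an assumption of this sketch).
base_value = -1.0
contributions = {"feature_a": -0.8, "feature_b": 0.3, "feature_c": -0.5}

steps, final_value = waterfall(base_value, contributions)
for name, value, running in steps:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name} {direction} the output by {abs(value):.2f} -> {running:.2f}")

print("Baseline probability:", round(sigmoid(base_value), 3))
print("Final probability:", round(sigmoid(final_value), 3))
```

In this toy case the negative contributions outweigh the positive one, so the final probability ends up below the baseline.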
In the case represented by
Figure 4, the model estimated a low mortality probability, represented by the vertical line and the function value
, which was significantly lower than the baseline of
.
Despite the presence of static risk factors, such as advanced age () and comorbidities (Charlson Comorbidity Index, ), the prediction was driven down by the dominant protective clinical indicators. Specifically, the absence of high-risk respiratory support requirements (Oxygen Flow Device, ), combined with timely antibiotic administration (Hours Since First Abx, ) and stable physiological parameters (News Score and Urea Nitrogen, each), effectively mitigated the risk, leading to a favorable survival prediction.
3.2. Length of ICU Stay Task
For the regression task of predicting Length of Stay (LOS), the results shown in Table 6 confirm the dominance of gradient boosting algorithms over deep learning architectures in tabular clinical data.
LightGBM achieved the best performance, registering the lowest errors across all metrics (RMSE: 4.801; MAE: 2.599), closely followed by XGBoost. This indicates that tree-based ensemble methods are more effective at capturing non-linear relationships and interactions between clinical variables in tabular data [34,35] than recurrent or attention-based networks.
Complex deep learning models, such as Transformers and LSTMs, performed worse in this regression task, resulting in higher error rates (RMSE of 5.486 and 5.049, respectively). Complexity is not always associated with high accuracy.
Clinically, the Mean Absolute Error (MAE) of approximately 2.6 days achieved by the LightGBM model is a useful reference for resource planning, offering a reasonably accurate margin for bed management and discharge scheduling.
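The two regression metrics can be compared with a short sketch. The stay durations below are hypothetical and chosen only to show why the RMSE is typically larger than the MAE: squaring the errors gives more weight to the few stays that are badly mispredicted.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the error, in days."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical lengths of stay in days (observed vs. predicted).
y_true = [2.0, 5.0, 3.5, 12.0]
y_pred = [2.5, 4.0, 3.5, 8.0]

print("MAE:", mae(y_true, y_pred))
print("RMSE:", round(rmse(y_true, y_pred), 3))
```

In the toy data a single long stay that is underestimated by 4 days dominates the RMSE, which is consistent with the gap between the RMSE and MAE values reported above.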
There are no confusion matrices for LOS because this is a regression task (we are predicting a continuous number of days, e.g., 2.5 days) and not a classification (yes/no) task.
Feature importance analysis for the Length of Stay (LOS) regression task (Figure 5) reveals that indicators of organ dysfunction and therapeutic intensity are the primary drivers of hospitalization duration. LOS prediction is heavily influenced by dynamic physiological responses and resource utilization.
Renal function and fluid balance were the primary determinants of the LOS. Respiratory support followed closely, with the oxygen flow device and mechanical ventilation ranking second and third, respectively. These findings align with clinical reality, confirming that dependence on invasive ventilation and aggressive fluid resuscitation are important drivers of prolonged hospitalization.
The SHAP beeswarm analysis for Length of Stay (LOS), shown in Figure 6, provides insight into how specific feature values influence hospitalization duration. It reveals that the impact of the top predictors, specifically Uo Total, Oxygen Flow Device, and Mechvent, was heavily skewed towards increasing LOS.
The long tails of the red points extending to the right demonstrate that high values of these therapeutic intensity markers specifically drive predictions of significantly extended hospitalization, whereas lower values (blue) cluster near the baseline, indicating standard recovery timelines.
Against the expected LOS baseline of 5.2 days, the instance shown in Figure 7 is an outlier case with a predicted length of stay of 16.6 days. The dominant driver was cumulative urine output (Uo Total T-5), which alone added over 7 days to the estimate, likely serving as a proxy for high-volume fluid resuscitation and physiological recovery. This primary factor is reinforced by the ongoing need for respiratory support (oxygen flow devices and mechanical ventilation).
3.3. Septic Shock Task
For the septic shock prediction task (Table 7), the gradient boosting algorithms demonstrated superior performance compared to deep learning architectures. XGBoost achieved the highest discriminative ability across all metrics, with an AUROC of 0.955 and AUPRC of 0.799. This high Area Under the Precision-Recall Curve is particularly significant for clinical implementation, as it indicates the model’s robustness in minimizing false alarms while accurately detecting the minority class (shock events). Although the transformer model remained competitive (AUROC 0.947), it exhibited a noticeable drop in precision-recall performance (0.742) compared with the tree-based ensembles, reinforcing XGBoost as the most reliable candidate for an early warning system in this context.
In the confusion matrix for the XGBoost model (Figure 8), the model successfully excluded 36,704 patients (True Negatives), generating only 551 False Positives. This implies a high Positive Predictive Value, indicating that when the system signals a risk of septic shock, it is highly reliable.
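As a worked example based on the two counts quoted above, the specificity and the false-positive rate can be computed directly. The True Positive count is not stated in the text, so the Positive Predictive Value itself is not computed in this sketch.

```python
# Counts quoted in the text for the XGBoost septic shock model.
tn = 36704  # True Negatives
fp = 551    # False Positives

# Specificity: share of non-shock patients correctly excluded.
specificity = tn / (tn + fp)

# False-positive rate: share of non-shock patients that triggered an alert.
false_positive_rate = fp / (tn + fp)

print(f"Specificity: {specificity:.4f}")
print(f"False-positive rate: {false_positive_rate:.4f}")
```

With these counts, roughly 1.5% of patients without septic shock would trigger an alert.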
However, the high number of 1,731 false negatives at the standard decision threshold () suggests that the model was conservative. Given the high AUPRC (0.799), this indicates that for clinical screening purposes, the decision threshold could be reduced to capture more high-risk patients without causing an uncontrollable increase in false alarms.
Given the life-threatening nature of septic shock, the decision threshold could be lowered (e.g., to 0.2 or 0.3) to prioritize sensitivity. This adjustment would convert a significant portion of False Negatives into True Positives, ensuring earlier intervention for high-risk patients, although at the cost of a managed increase in alert frequency.
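The effect of lowering the decision threshold can be sketched on hypothetical scores. The labels, scores, and the 0.5 default used for comparison are assumptions made for this example; the counts do not correspond to Figure 8.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TN, FP, FN, and TP after thresholding predicted probabilities."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for label, score in zip(y_true, scores):
        predicted = int(score >= threshold)
        if label == 1:
            counts["tp" if predicted else "fn"] += 1
        else:
            counts["fp" if predicted else "tn"] += 1
    return counts


# Hypothetical labels (1 = septic shock) and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.25, 0.35, 0.60, 0.22, 0.40, 0.55, 0.90]

results = {t: confusion_counts(y_true, scores, t) for t in (0.5, 0.3, 0.2)}
for threshold, counts in results.items():
    sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])
    print(f"threshold={threshold}: {counts}, sensitivity={sensitivity:.2f}")
```

In this toy data, each step down in the threshold converts one False Negative into a True Positive at the cost of one additional False Positive, which is the trade-off described above.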
Global feature importance analysis (Figure 9) confirmed that the XGBoost model prioritizes physiological hallmarks of septic shock. Lactic Acid levels at multiple time points, notably T-5 and T-0, emerged as the dominant predictors. This was closely followed by Mean Arterial Pressure (MAP) features, validating the model’s focus on hemodynamic instability.
The SHAP beeswarm plot (Figure 10) extends this analysis by visualizing the directional impact of these biomarkers. It reveals a distinct pattern for Lactic Acid, where elevated values (represented by red points) consistently shift the predictions toward a higher probability of septic shock.
Conversely, Mean Arterial Pressure (MAP) exhibited a clear inverse relationship: lower values (blue points) aligned with positive SHAP values, correctly identifying hypotension as an important factor. This bidirectional validation (high lactate level and low arterial pressure) confirmed that the XGBoost model effectively learned the characteristics of patients with septic shock.
In this case, as shown in
Figure 11, the model successfully avoided a false-positive alert and calculated a final probability
, which remained slightly below the population baseline
.
The dominant protective effect of Lactic Acid (blue bar) counterbalanced the drops in blood pressure, allowing the algorithm to correctly conclude that, despite hypotension, the patient was not at immediate risk of septic shock.