4. Results and Discussion
This section analyzes model performance, predictive uncertainty, interpretability, and counterfactual explanations.
4.1. Model Evaluation
Table 2 compares the performance of the five models: XGBoost DART, LightGBM, HistGB, TabNet, and the Causal-Guided Stacked Classifier (CGSC). The comparison reveals distinct strengths and weaknesses across the key performance metrics. XGBoost DART performs robustly, with a precision of 0.76, a recall of 0.80, and an F1-score of 0.78, denoting a well-balanced trade-off between identifying true positives and minimizing false positives. Its accuracy of 0.78 and AUC of 0.84 further underscore its reliability as a general-purpose classifier. LightGBM, while exhibiting a slightly lower precision (0.71) and recall (0.76), achieves an exceptional AUC of 0.97, suggesting superior discriminative power in distinguishing between classes, albeit with a lower accuracy of 0.72. HistGB shows moderate performance across all metrics, with precision, recall, and F1-score each around 0.72–0.73, alongside an accuracy of 0.72 and an AUC of 0.78. The ROC curves of the ensemble models are shown in Figure 5.
TabNet stands out with the highest precision (0.79) among the models. This reflects the model’s ability to minimize false positives, though its recall of 0.73 is comparatively lower, which indicates a potential shortfall in capturing all positive instances. Its F1-score of 0.76 and accuracy of 0.77, combined with a strong AUC of 0.86, suggest that it is well suited for applications where precision is paramount. The CGSC model, which integrates Logistic Regression and kNN as base learners with LightGBM as a meta-learner, achieves the highest recall (0.81) but the lowest precision (0.70). This inverse relationship highlights its tendency to prioritize identifying true positives at the cost of increased false positives, which makes it particularly useful in domains like early disease diagnosis, where missing a positive case is more detrimental than occasional false alarms. Its F1-score of 0.75 and accuracy of 0.73 are competitive, while its AUC of 0.80 indicates reasonable discriminative capability.
Figure 6 illustrates the ROC-AUC curve of these two models.
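The F1-scores quoted above can be cross-checked against the quoted precision and recall, since F1 is their harmonic mean. The following is a minimal sketch of that check; the function name and dictionary layout are illustrative choices, and only the three models whose precision, recall, and F1 are all stated explicitly in the text are included.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the text for Table 2.
reported = {
    "XGBoost DART": (0.76, 0.80, 0.78),
    "TabNet": (0.79, 0.73, 0.76),
    "CGSC": (0.70, 0.81, 0.75),
}

# Recompute F1 and round to the two decimals used in the text.
recomputed = {
    name: round(f1_from_precision_recall(p, r), 2)
    for name, (p, r, _) in reported.items()
}

for name, (p, r, f1) in reported.items():
    status = "consistent" if recomputed[name] == f1 else "check"
    print(f"{name}: reported F1={f1}, recomputed F1={recomputed[name]} ({status})")
```

For these three models the recomputed values agree with the reported F1-scores at two decimal places.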
For an early-warning diabetes prediction task using non-clinical features, the optimal model should prioritize high recall while maintaining good precision and a strong AUC. In terms of recall, CGSC is the best model, whereas TabNet and XGBoost DART offer the most balanced performance. Early warning systems benefit most from high recall, because missed cases can lead to preventable complications. CGSC’s recall-driven performance therefore aligns best with this goal, while XGBoost DART provides a safer middle ground if precision cannot be sacrificed entirely.
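Because AUC measures how well a model ranks positive cases above negative ones, independently of any decision threshold, it can diverge from threshold-based metrics such as accuracy. The sketch below computes AUC directly from that ranking definition. The labels and scores are hypothetical toy values, not data from this study, and the pairwise loop is chosen for readability rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (1 = diabetic, 0 = non-diabetic).
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]

toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC on toy data: {toy_auc:.3f}")
```

In this toy case eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9 even though a 0.5 threshold would misclassify the positive scored at 0.35.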
In the uncertainty plots presented in Figure 7, four models (CGSC, XGBoost DART, HistGB, and LGBM) were each evaluated over 100 runs in terms of F1-score variability.
The CGSC model shows a relatively narrow F1-score range of approximately 0.70 to 0.76. Its average performance seems to cluster around 0.735 to 0.74, and although there are a few dips, they are not frequent or severe. This reflects a stable and consistent behavior across repeated executions.
The XGBoost DART model, on the other hand, reaches higher F1-score peaks, ranging from about 0.69 to 0.82, with a mean likely around 0.76 to 0.77. However, it also exhibits significant fluctuations, with frequent drops in performance alongside its high scores. This suggests a higher variance in behavior, which could lead to unpredictable outcomes if not controlled or averaged out.
The HistGB model demonstrates the most erratic performance of the four. Its F1-scores fluctuate between approximately 0.67 and 0.81, with a presumed average close to 0.735. The model suffers from frequent sharp dips and spikes, which denote considerable instability and variance across runs. Such unpredictability could make it less desirable for applications such as diabetes diagnosis that require reliability.
Finally, the LGBM model operates within an F1-score range of roughly 0.66 to 0.78. Its average seems to fall between 0.735 and 0.74, and although it displays performance swings, they are less extreme than those seen in HistGB. Nevertheless, its moderate-to-high variance indicates a certain level of unreliability, albeit to a lesser extent.
Therefore, CGSC emerges as the preferred option for consistency and robustness, with the most stable F1-score across the 100 runs. It performs reliably and avoids dramatic fluctuations, making it well suited for scenarios where predictability is crucial. Conversely, XGBoost DART delivers the highest individual F1-scores and could be the optimal choice when maximizing peak performance is a priority, provided that its higher uncertainty is managed, potentially through ensemble methods or additional tuning. HistGB stands out as the least stable model, and its highly volatile performance makes it a less favorable choice in most practical settings. LGBM sits between the extremes, showing moderate reliability without excelling in either consistency or peak performance.
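The stability comparison above rests on the spread of F1-scores across repeated runs. A minimal sketch of how such a spread could be summarized is given below. The run scores are hypothetical placeholders and are not values read from Figure 7; the names `model_a` and `model_b` are likewise illustrative.

```python
import statistics


def summarize_runs(f1_scores):
    """Summarize the location and spread of F1-scores from repeated runs."""
    return {
        "mean": statistics.mean(f1_scores),
        "std": statistics.stdev(f1_scores),  # sample standard deviation
        "min": min(f1_scores),
        "max": max(f1_scores),
        "range": max(f1_scores) - min(f1_scores),
    }


# Hypothetical F1-scores from five repeated runs of two models.
runs = {
    "model_a": [0.73, 0.74, 0.75, 0.74, 0.73],  # narrow spread
    "model_b": [0.69, 0.82, 0.75, 0.71, 0.80],  # higher peaks, wider spread
}

summary = {name: summarize_runs(scores) for name, scores in runs.items()}

# Rank by stability (smallest standard deviation) and by peak score.
most_stable = min(summary, key=lambda name: summary[name]["std"])
highest_peak = max(summary, key=lambda name: summary[name]["max"])

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.3f}, std={stats['std']:.3f}, "
          f"range=[{stats['min']:.2f}, {stats['max']:.2f}]")
print("most stable:", most_stable, "| highest peak:", highest_peak)
```

Separating the stability ranking from the peak ranking mirrors the distinction drawn in the text between the most consistent model and the model with the highest individual scores.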
4.2. XGBoost DART Global Interpretability
In order to enhance the transparency and trustworthiness of the diabetes prediction framework, SHAP (SHapley Additive exPlanations) is utilized to interpret the XGBoost DART model globally and locally. SHAP values, grounded in cooperative game theory, attribute a contribution value to each feature by estimating how much each one shifts the model’s output from the base value.
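The additive attribution idea can be illustrated with an exact Shapley computation on a toy model. The sketch below is not the TreeSHAP algorithm used for tree ensembles, and the scoring function, feature values, and baseline are all hypothetical. It also simplifies the base value to the prediction at a single baseline point, whereas SHAP uses the expected prediction over a background dataset.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    Features not yet added to the coalition keep their baseline value.
    """
    names = list(instance)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous_output = predict(current)
        for name in order:
            current[name] = instance[name]
            new_output = predict(current)
            phi[name] += new_output - previous_output
            previous_output = new_output
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_risk_score(x):
    """Hypothetical scorer with one interaction term; not the paper's model."""
    return 0.02 * x["BMI"] + 0.3 * x["GenHlth"] + 0.01 * x["BMI"] * x["HighBP"]


baseline = {"BMI": 25.0, "GenHlth": 2.0, "HighBP": 0.0}
instance = {"BMI": 35.0, "GenHlth": 4.0, "HighBP": 1.0}

phi = shapley_values(toy_risk_score, instance, baseline)
base_value = toy_risk_score(baseline)
prediction = toy_risk_score(instance)

# Additivity: base value plus all contributions reproduces the prediction.
reconstructed = base_value + sum(phi.values())
print(phi)
print(f"base={base_value:.2f}, prediction={prediction:.2f}, "
      f"reconstructed={reconstructed:.2f}")
```

The interaction term is split between BMI and HighBP, which is the mechanism by which a feature's attribution can depend on the values of other features.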
The SHAP summary plot in Figure 8 visualizes both the importance and direction of influence of each feature. The horizontal axis represents the SHAP value, which denotes the impact of each feature on the model’s prediction. Features are ordered vertically based on their overall contribution across the dataset.
Each point on the plot corresponds to an individual prediction, where the color represents the feature value (red for high, blue for low). A positive SHAP value implies that the feature increases the probability of predicting diabetes, whereas a negative value denotes the opposite.
As observed, GenHlth (General Health), BMI, Age, and HighBP (High Blood Pressure) are the most influential features. Individuals with poor general health (high feature value, in red) tend to have positive SHAP values, meaning that they are more likely to be predicted as diabetic. Similarly, higher BMI and older age also contribute positively to the model’s prediction. Conversely, higher levels of physical activity (PhysActivity) and lower cholesterol levels (HighChol) are associated with negative SHAP values. These features act as protective factors.
The model effectively captures non-linear dependencies and feature interactions. For example, even though high blood pressure generally contributes positively to diabetes risk, some low-value instances (in blue) occasionally have a small positive SHAP value, reflecting contextual interactions with other features.
This global SHAP interpretation validates the clinical plausibility of the model’s reasoning and identifies the most salient risk factors for diabetes in the dataset. It also ensures that the model decisions align with domain knowledge, which strengthens its credibility in real-world deployment. However, real-world deployment is beyond the scope of the research for now.
4.3. XGBoost DART Local Interpretability
The model’s decision-making process for individual predictions using local SHAP explanations is also investigated to complement the global interpretability analysis.
Figure 9 shows a waterfall plot for a single instance that was classified as not diabetic (class 0) by the XGBoost DART model.
The SHAP framework decomposes the model output into a sum of contributions from each feature relative to the expected prediction. The expected value of the model output (the base value in Figure 9) represents the average model prediction across the dataset. For this specific individual, the feature contributions move the final model prediction strongly towards the non-diabetic class.
Each bar in the waterfall plot corresponds to a feature’s SHAP value contribution. Blue bars indicate features that pushed the prediction lower (towards class 0), whereas red bars indicate features that pushed the prediction higher (towards class 1).
The most influential feature is BMI, whose negative contribution to the model output suggests a relatively low BMI for this individual. It is followed by General Health (GenHlth), which adds a further negative contribution, indicating that the person self-reported good general health. Age has a slight positive contribution, denoting that older age slightly increases the likelihood of diabetes, but not enough to override the strong negative contributions from other features.
Other negative contributors include HighBP, HighChol, and PhysActivity, suggesting that the individual does not have high blood pressure or high cholesterol and is physically active, all of which are consistent with a lower risk of diabetes. The remaining features, such as Stroke, HeartDiseaseorAttack, and PhysHlth, make only slight contributions.
This localized interpretation provides transparency at the individual level and confirms that the model’s prediction aligns with clinical reasoning. It further supports trust in model deployment, especially in sensitive healthcare settings where individualized decisions matter.
Similarly, the local SHAP explanation illustrated in Figure 10 corresponds to an individual who was predicted as diabetic (class 1) by the XGBoost DART model. The final prediction for this instance is significantly higher than the model’s expected output, indicating a high likelihood of diabetes.
The feature with the highest positive contribution is General Health (GenHlth). This suggests that the individual reported poor general health, a strong risk factor associated with diabetes. BMI also contributes positively, which implies an elevated body mass index, another major risk indicator. Age contributes negatively, as category 5 reflects a younger age, but this is outweighed by the stronger positive contributions.
Additional features such as HighBP, PhysActivity, and PhysHlth also pull the prediction higher, suggesting an association of limited physical activity or physical health issues with diabetes risk. Conversely, HighChol contributes negatively, indicating that normal cholesterol somewhat mitigated the prediction, though not enough to change the final outcome.
This SHAP plot reaffirms the model’s decision in an interpretable manner, showcasing how combinations of risk factors (particularly poor general health and high BMI) drive a high-confidence diabetes prediction.
4.4. TabNet Global Interpretability
The bar chart provided in Figure 11 visualizes the global feature importance scores generated by TabNet. The x-axis represents the importance score, a normalized value that sums to 1 across all features, and the y-axis lists the input features used in the model. The longer the bar, the more frequently or more significantly the feature was used in TabNet’s decision-making process.
From the plot, the feature GenHlth (General Health) has the highest importance score, indicating that the model relies heavily on this feature when predicting the likelihood of diabetes. This makes sense in a healthcare context, since general health status often correlates with chronic conditions like diabetes.
BMI and HighBP (high blood pressure) are also highly weighted, consistent with known risk factors for diabetes. On the other hand, features such as PhysActivity and MentHlth have very low importance scores, which implies that they contribute little to the predictive power for early warning of diabetes. The global feature importance in TabNet provides an interpretable and theoretically grounded summary of which input features influence the model’s decisions. This interpretability, driven by the sparse attention mechanism, is intrinsic to the model architecture and does not require post hoc explanations such as SHAP or LIME. As a result, the attribution scores can be used directly to guide feature selection, model understanding, or communication of results to domain experts.
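The normalization described above can be sketched as follows, under the assumption that global scores are obtained by summing per-instance importance scores over the dataset and dividing by the grand total. This is an assumption about the aggregation step, not a description confirmed by the text, and the feature subset and score matrix are hypothetical.

```python
def global_importance(local_scores, feature_names):
    """Aggregate per-instance importance scores into global scores that sum to 1.

    local_scores: list of rows, one per instance, one non-negative score per feature.
    """
    totals = [0.0] * len(feature_names)
    for row in local_scores:
        if len(row) != len(feature_names):
            raise ValueError("each row needs one score per feature")
        for j, value in enumerate(row):
            totals[j] += value
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("all importance scores are zero")
    return {name: total / grand_total for name, total in zip(feature_names, totals)}


# Hypothetical per-instance scores for three features and two instances.
features = ["GenHlth", "BMI", "HighBP"]
local = [
    [2.0, 1.0, 1.0],
    [3.0, 1.0, 0.0],
]

global_scores = global_importance(local, features)
print(global_scores)
print("sum of global scores:", sum(global_scores.values()))
```

Under this assumption, per-instance scores are unnormalized and can exceed 1, while the global scores always sum to 1.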
4.5. TabNet Local Interpretability
Figure 12 illustrates the local feature importance derived from a TabNet model when predicting an instance as diabetic (class 1). The horizontal bar plot displays the importance score of each input feature, showing how strongly each feature contributed to the model’s decision for this particular individual.
From the plot, it is evident that PhysHlth (Physical Health) is the most influential feature in the model’s prediction, with an importance score approaching 3.5. This suggests that the number of physically unhealthy days reported by the individual played a dominant role in classifying them as diabetic. The next most impactful features are BMI (Body Mass Index) and GenHlth (General Health), indicating that higher body mass and poorer self-rated general health are also critical in determining diabetes presence for this subject.
Other features with notable contributions include Smoker and HighBP (High Blood Pressure), which align with known risk factors for diabetes. The variables HighChol, DiffWalk (Difficulty Walking), and MentHlth (Mental Health) show moderate influence, suggesting some relationship but to a lesser extent.
Features such as Age, CholCheck, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, and HeartDiseaseorAttack contributed minimally in this individual case. These variables may be important at a population level, but their low scores here indicate limited relevance in the model’s specific decision for this person.
This analysis emphasizes the personalized interpretability provided by TabNet, which adapts the contribution of features to the unique characteristics of the individual case. It allows for more precise clinical insights and decision-making support.
In contrast, Figure 13 illustrates the local feature importance scores generated by the TabNet model for an individual who was classified as not having diabetes. This interpretation allows us to examine which features contributed most significantly to the model’s prediction for this specific instance.
Among all features, GenHlth (General Health) holds the highest importance score by a considerable margin, indicating that the individual’s self-reported general health status played a dominant role in the model’s decision. The importance score for GenHlth exceeds 2.7, far surpassing that of any other feature, which suggests that better perceived general health likely contributed to the classification as non-diabetic. Other features with moderate importance include HvyAlcoholConsump (Heavy Alcohol Consumption), Fruits (fruit intake), and BMI (Body Mass Index), with importance scores ranging approximately between 0.6 and 0.8. These features likely provided additional evidence supporting the non-diabetic classification, possibly by aligning with healthier lifestyle patterns or lower obesity risk. Additional contributions came from PhysHlth (Physical Health) and HeartDiseaseorAttack, though their influence was lower in magnitude. Interestingly, features such as HighBP (High Blood Pressure), Stroke, Smoker, and PhysActivity registered negligible or near-zero importance, indicating that they did not influence the model’s decision for this particular instance.
This localized explanation highlights TabNet’s ability to assign dynamic importance to features based on the specific characteristics of the input data. In this case, it shows a heavy reliance on self-reported health and lifestyle indicators over clinical history or demographic information when predicting the absence of diabetes.
4.6. Diverse Counterfactual Explanations
The DiCE (Diverse Counterfactual Explanations) framework is employed to better understand the decision boundary of the predictive model.
Figure 14 presents a counterfactual analysis where the goal was to identify minimal yet diverse feature changes that would alter the prediction of a given individual from non-diabetic (class 0) to diabetic (class 1).
The first row of the table corresponds to the original instance classified as non-diabetic. Notably, this individual has a low BMI of 20.0, does not suffer from high blood pressure, stroke, or walking difficulties, and reports excellent general health (GenHlth = 0.0, indicating "excellent"). The physical health burden (PhysHlth) is minimal, and lifestyle indicators such as PhysActivity and Fruits intake appear favorable.
The subsequent rows display counterfactual examples that lead to a prediction of diabetes (class 1) while keeping most features constant. Across all three counterfactuals, the feature BMI increases dramatically, in one case reaching as high as 72.2, which indicates morbid obesity. This is the most prominent change and a likely causal factor in the altered prediction. In one counterfactual, there is also a significant rise in the PhysHlth score (14.3), suggesting more days of poor physical health in the past month, which contributes further to the risk profile.
Interestingly, features like Age, Smoker, HeartDiseaseorAttack, and HighBP remain unchanged, emphasizing that for this particular individual, obesity and physical health degradation alone were sufficient to flip the classification.
This counterfactual analysis supports the earlier local interpretability findings by reinforcing the high sensitivity of the model to variables such as BMI and PhysHlth. Moreover, it provides actionable insight: substantial weight gain and declining physical condition could move an individual from a non-diabetic to a diabetic risk category in the model’s view.
Figure 15 presents examples of counterfactual instances that would flip the prediction from diabetic (class 1) to non-diabetic (class 0).
The original individual is classified as diabetic and exhibits extremely high BMI (72). This high body mass index is a consistent factor driving the diabetic classification. Additional characteristics such as lack of physical activity (PhysActivity = 0), poor general health (GenHlth = 4), and elevated physical health burden (PhysHlth up to 30) further reinforce the model’s diabetic prediction.
Among the three counterfactual instances that revert the outcome to non-diabetic, the third shows a drastic reduction in BMI from these elevated levels down to 14.5 (below the conventional underweight threshold of 18.5). PhysHlth is also slightly reduced, from 30 to 26.6 (approximately 27). These changes appear sufficient to alter the model’s classification. Notably, all other features remain unchanged, including lifestyle factors such as Smoker, PhysActivity, and HvyAlcoholConsump, as well as clinical history such as HighBP, Stroke, and HeartDiseaseorAttack.
This finding suggests that, for this individual, body mass index is a critical determinant in the model’s assessment of diabetes risk. The fact that only a reduction in BMI leads to a change in outcome underscores the model’s strong sensitivity to obesity-related features. It also reinforces the conclusion from the local feature importance analysis and previous counterfactuals, where BMI consistently emerged as a pivotal variable. Such analysis not only provides interpretability but also offers actionable insight: weight reduction alone might suffice to transition an individual’s model-based diabetes risk profile from high to low, at least from the model’s perspective.
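The idea of searching for feature changes that flip a prediction while keeping all other features fixed can be sketched with a simple grid search. This is not the DiCE algorithm: it ranks candidates by proximity only and has no diversity term. The logistic scorer, its coefficients, and the search grids are hypothetical stand-ins for the trained model.

```python
import math
from itertools import product


def toy_predict(x):
    """Hypothetical logistic scorer standing in for the trained model."""
    z = -6.1 + 0.12 * x["BMI"] + 0.05 * x["PhysHlth"] + 0.5 * x["GenHlth"]
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= 0.5 else 0


def find_counterfactuals(predict, instance, grids, k=3):
    """Return the k closest grid points whose predicted class differs from the original.

    Only the features listed in `grids` are varied; all others stay fixed.
    """
    original_class = predict(instance)
    names = list(grids)
    candidates = []
    for values in product(*(grids[name] for name in names)):
        candidate = dict(instance)
        candidate.update(zip(names, values))
        if predict(candidate) != original_class:
            # L1 distance, with each feature scaled by the span of its grid.
            distance = sum(
                abs(candidate[name] - instance[name])
                / (max(grids[name]) - min(grids[name]))
                for name in names
            )
            candidates.append((distance, candidate))
    candidates.sort(key=lambda pair: pair[0])
    return [candidate for _, candidate in candidates[:k]]


# Hypothetical individual predicted as non-diabetic (class 0).
instance = {"BMI": 20.0, "PhysHlth": 0.0, "GenHlth": 0.0}
grids = {
    "BMI": [20.0 + 5 * i for i in range(12)],  # 20, 25, ..., 75
    "PhysHlth": [0.0, 10.0, 20.0, 30.0],
}

counterfactuals = find_counterfactuals(toy_predict, instance, grids)
for cf in counterfactuals:
    print(cf, "->", toy_predict(cf))
```

With these toy coefficients the nearest class-flipping candidates change BMI alone, which loosely echoes the pattern reported for Figures 14 and 15.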
4.7. Causal Inference
Table 3 presents the Average Treatment Effect (ATE) for each feature, sorted by their absolute values. This allows identification of both strong positive and negative contributors to diabetes prediction.
Table 3 indicates that, on average, a one-unit increase in GenHlth (that is, health worsening by one level) increases the probability of diabetes by 13.92 percentage points after adjusting for the corresponding confounders. Similarly, a history of heart disease or stroke increases the probability of diabetes by 13.36 and 12.58 percentage points, respectively. High cholesterol and high blood pressure also increase the probability of diabetes, by 9.75 and 7.79 percentage points, respectively. Surprisingly, an ATE of -0.1875 for the Alcohol feature means that individuals with high alcohol consumption (value = 1) are, on average, 18.75 percentage points less likely to be predicted as diabetic than those with low alcohol consumption (value = 0), controlling for all other confounders. This result may appear counterintuitive, and it raises several possible explanations: confounding effects (younger, healthier individuals might also report higher alcohol use), selection bias (heavy drinkers may underreport symptoms or go undiagnosed), measurement issues, or reverse causality (people with diabetes might reduce drinking). It can also be noticed that having a cholesterol check is associated with a 25.83 percentage point increase in the probability of diabetes. This does not mean that checking cholesterol causes diabetes. Rather, individuals who are at risk, have symptoms, or have comorbidities are more likely to undergo such screenings.
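The confounding explanation offered for the alcohol result can be illustrated with a small stratified (backdoor-adjusted) estimate. The estimator used to produce Table 3 is not specified in this section, so the sketch below shows only one simple adjustment method. All records are synthetic and constructed so that the treatment has no effect within either age group; the field names `z`, `t`, and `y` are illustrative.

```python
from collections import defaultdict


def naive_difference(records):
    """Unadjusted difference in outcome rates between treated and control."""
    treated = [r["y"] for r in records if r["t"] == 1]
    control = [r["y"] for r in records if r["t"] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def stratified_ate(records):
    """ATE by stratification: within-stratum differences weighted by stratum share."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for r in records:
        strata[r["z"]][r["t"]].append(r["y"])
    n = len(records)
    ate = 0.0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            raise ValueError("a stratum lacks treated or control records")
        difference = (sum(groups[1]) / len(groups[1])
                      - sum(groups[0]) / len(groups[0]))
        weight = (len(groups[0]) + len(groups[1])) / n
        ate += weight * difference
    return ate


def make_group(z, t, positives, total):
    """Build `total` synthetic records, the first `positives` having outcome 1."""
    return [{"z": z, "t": t, "y": 1 if i < positives else 0} for i in range(total)]


# Synthetic data: z = 1 for older, t = 1 for heavy alcohol use, y = 1 for diabetes.
# Younger people drink more and have a lower diabetes rate regardless of drinking.
records = (
    make_group(z=0, t=1, positives=3, total=30)
    + make_group(z=0, t=0, positives=1, total=10)
    + make_group(z=1, t=1, positives=4, total=10)
    + make_group(z=1, t=0, positives=12, total=30)
)

naive = naive_difference(records)
adjusted = stratified_ate(records)
print(f"naive difference: {naive:+.3f}")
print(f"age-adjusted ATE: {adjusted:+.3f}")
```

In this synthetic example the unadjusted comparison suggests a protective effect of about 15 percentage points, while the age-adjusted estimate is zero, showing how an unmeasured or poorly adjusted confounder could produce a negative estimate of the kind discussed above.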