2. Materials and Methods
This study used data from the National Health and Nutrition Examination Survey (NHANES) for the period 1988–2018, including biochemical, demographic, and behavioral parameters of respondents. This source provides a representative dataset on the health status of the US population and is widely used for scientific purposes [13,14]. The study included routine biochemical markers, such as ferritin, glucose, γ-glutamyltransferase (γ-GT), and lactate dehydrogenase (LDH), as well as data on lifestyle, body weight, and physical activity level [15]. Comprehensive data preprocessing was carried out: removing outliers, handling missing values, standardizing numerical features, and encoding categorical variables, which is a necessary step for building reliable machine learning models [16,17]. Based on the prepared data, different machine learning models were trained and compared to predict the risk of elevated aspartate aminotransferase (AST) [18].
2.1. Related Work and Literature Review
In recent years, there has been an increase in research aimed at predicting liver enzyme levels, including aspartate aminotransferase (AST), using machine learning methods. Hu et al. [19] found an association between an elevated ALT/AST ratio and the risk of liver fibrosis based on NHANES data but did not focus on predicting individual AST levels. Zhu et al. [20] proposed a Random Forest model to estimate the risk of elevated transaminases in patients with rheumatoid arthritis, achieving high accuracy but limited to a narrow clinical cohort. A broader approach was proposed by Yang et al. [21], who developed machine learning models for diagnosing MASLD using routine data; AST levels were considered only indirectly. Interpretable models are also gaining momentum. Wang et al. [22] utilized SHAP to explain predictions from NAFLD ML models, demonstrating the potential of such solutions for medical interpretation. In turn, Ali et al. [23] confirmed the possibility of diagnosing cardiovascular diseases using routine blood tests and ensemble models, but liver biomarkers were not the subject of analysis. In addition, Yang et al. [24] presented a systematic review of the application of machine learning (ML) in predicting outcomes after liver transplantation, demonstrating advantages over traditional scoring systems; however, they did not address routine screening. Khaled et al. [25] proposed a deep learning system for the early detection of liver diseases, which requires further clinical validation. McGettigan [26] investigated the performance of various ML models on an extensive array of medical data, confirming the potential of the algorithms for liver diagnostics but without detailing specific architectures. Farhadi et al. [27] focused on predicting complications after recovery from hepatitis B, which limits the generalizability of their model. As shown in Table 1, all these studies emphasize the need to develop specialized and interpretable models that focus specifically on the individual risk of elevated AST, utilizing available clinical, demographic, and biochemical data.
As shown in Table 1, the majority of current studies confirm the high efficiency of machine learning methods for analyzing and predicting diseases based on routine medical data. However, only a limited number of studies directly focus on individual prediction of AST levels as a separate biomarker. In addition, few studies use interpretable models, which limits their applicability in clinical practice. The identified scientific gaps, including insufficient generalizability, the lack of multivariate analysis, and the weak integration of behavioral parameters, underscore the need to develop a comprehensive model for predicting the risk of AST elevation based on available and standardized indicators. Despite the active development of machine learning in biomedicine, most existing studies focus on predicting diseases in tissues with high metabolic activity and do not consider aspartate aminotransferase (AST) as a significant marker of cardiovascular risk. This study presents an understudied yet promising approach for predicting cardiovascular diseases using AST and other routine indicators. Distinctive features of this work include:
• Direct regression prediction of the AST level, considered as an independent predictor of cardiovascular risk rather than as a marker of hepatological disorders;
• Integration of routine biochemical, anthropometric, and behavioral parameters, including inflammation, body weight, and lifestyle indicators, which enhances the clinical relevance of the model;
• Use of a stacking ensemble (Stacking v2), which combines the capabilities of modern algorithms and an interpretable meta-model to improve accuracy and stability;
• Use of SHAP and mutual information to analyze feature significance, ensuring the interpretability of the model and its applicability in the clinical environment;
• Validation on a large and representative NHANES dataset (1988–2018) covering a wide range of health data from the US population.
Unlike previous studies limited to narrow clinical cohorts or liver disease diagnostic tasks, our study demonstrates how routine parameters, including AST, can be effectively used to assess cardiovascular disease risk in the general population. The proposed model may become a tool for early screening and personalized prevention in resource-limited settings.
2.2. Dataset Collection
To build a model for predicting aspartate aminotransferase (AST) levels, the open national dataset from the National Health and Nutrition Examination Survey (NHANES) for the years 1988–2018 was utilized. This source provides large-scale information on the health status of the US population, including biochemical analysis data, anthropometric indicators, and behavioral and demographic characteristics. NHANES was chosen due to its representativeness, standardized data collection protocols, and high degree of reliability. From the total data set, records of adult respondents (≥18 years) were selected for whom AST values and other essential indicators were available: ferritin, γ-glutamyl transferase (γ-GT), lactate dehydrogenase (LDH), glucose, body mass index, physical activity, smoking and alcohol consumption habits, as well as inflammation indicators. Incomplete observations and abnormal values were excluded, and data cleaning and standardization procedures were performed. All features were brought to a single format, categorical variables were encoded, and numerical variables were normalized. As a result, a structured sample was formed, suitable for the application of machine learning algorithms. It covers various aspects of the physiological state and lifestyle of respondents, providing a basis for constructing an interpretable prognostic model. The original data are available at https://drive.google.com/drive/folders/1cgyQXj3Kl7FdDoyPlmEkCKyDXNDIv4JB?usp=drive_link (accessed on 06 June 2025).
2.3. Rationale for a Method Selection
Building an effective predictive model required the use of a complex algorithm that included several interrelated stages of data preprocessing, feature selection, and model ensemble training. Each method in this process was selected based on its robustness, efficiency, and applicability to medical data with heterogeneous features.
1. Removing outliers. In the first step, observations with implausibly high target values were excluded from the dataset (only records with AST ≤ 200 U/L were retained), which helps minimize the impact of anomalies and noise on model training. This is especially important when working with biomarkers, where technical or clinical artifacts may cause outliers.
2. Removing gaps. Removing rows with missing values in the target variable (AST) and critical predictors ensures the correctness of the training process. This step is necessary to maintain the quality of predictions and prevent distortions.
3. Transformation of categorical variables. One-hot encoding of categorical features (e.g., demographic and questionnaire data) is applied, which allows them to be efficiently included in machine learning models without violating assumptions about the numerical nature of the input data.
4. Scaling of Numerical Features. Numerical features are normalized (z-transformed) to equalize scales and prevent features with high variance from dominating the analysis. This is especially important for linear and gradient-boosted models that are sensitive to scale.
5. Splitting into training and validation samples. The standard train/test split is used to assess the quality of the model objectively. This makes it possible to control overfitting and tune hyperparameters. A minimal sketch of preprocessing steps 1–5 is given below.
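For illustration, steps 1–5 can be expressed as the following Python sketch; the file name, the 80/20 split ratio, and the abridged feature list are assumptions, while the variable codes follow the NHANES names used later in this paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Abridged feature list; codes follow the NHANES variables used in this study
numeric_cols = ["LBXFER", "LBXSGTSI", "LBXSLDSI", "LBXGLU", "BMXWT", "BMXHT"]
target = "LBXSASSI"  # AST, U/L

df = pd.read_csv("nhanes_merged.csv")  # merged NHANES tables (file name is illustrative)

# Steps 1-2: keep plausible AST values only, drop rows with missing target or predictors
df = df[df[target] <= 200]
df = df.dropna(subset=[target] + numeric_cols + ["RIAGENDR"])

# Step 3: one-hot encode the categorical gender variable
df = pd.get_dummies(df, columns=["RIAGENDR"], drop_first=True)
dummy_cols = [c for c in df.columns if c.startswith("RIAGENDR_")]

# Steps 4-5: hold out a validation set, then z-standardize numeric features (fit on train only)
X, y = df[numeric_cols + dummy_cols], df[target]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train.loc[:, numeric_cols] = scaler.transform(X_train[numeric_cols])
X_val.loc[:, numeric_cols] = scaler.transform(X_val[numeric_cols])
```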
6. Base models (Base regressors). The following algorithms were selected to build a forecast of the AST level:
• Linear Regression — a basic benchmark for estimating linear relationships.
• Random Forest — a stochastic model that is robust to outliers and works well with small samples.
• XGBoost — a powerful gradient boosting that provides high accuracy and control over overfitting.
• CatBoost — an optimized boosting algorithm that works efficiently with categorical features without the need for manual coding.
• LightGBM — a fast and scalable boosting algorithm, especially effective on large and sparse data.
• Extra Trees — a variant of Random Forest that uses additional stochasticity to improve generalization.
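A sketch of how these six base regressors could be instantiated and fitted is shown below; the hyperparameter values are placeholders rather than the tuned settings reported in Table 3, and X_train, y_train come from the preprocessing sketch above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

base_models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=500, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42),
    "CatBoost": CatBoostRegressor(iterations=500, verbose=0, random_state=42),
    "LightGBM": LGBMRegressor(n_estimators=500, random_state=42),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=500, random_state=42),
}
for name, model in base_models.items():
    model.fit(X_train, y_train)  # X_train, y_train from the preprocessing step above
```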
7. Stacking. As shown in Figure 1, Stacking v1 is a simple two-level ensemble scheme in which base models (Linear Regression, Random Forest, and XGBoost) are independently trained on the original features (1):

$$\hat{y}^{(m)}_i = f_m(\mathbf{x}_i), \quad m \in \{\text{LR}, \text{RF}, \text{XGB}\}, \qquad \hat{y}_i = \beta_0 + \sum_{m} \beta_m \hat{y}^{(m)}_i \tag{1}$$

The meta-model, a simple linear regression, receives their predictions as input and minimizes the error on the validation set. The final forecast is formed from the aggregated predictions of the meta-regressor. The advantages of this approach include ease of implementation, high interpretability, accelerated convergence, and a quality gain from combining several models. However, Stacking v1 is limited by the linear nature of the meta-regressor, does not take into account complex dependencies between the outputs of the base models, does not support feeding the original features directly to the meta-level ("passthrough"), and can also be subject to overfitting when a large number of base algorithms are used.
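A minimal sketch of Stacking v1 using scikit-learn's StackingRegressor; the number of cross-validation folds and the seeds are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

stacking_v1 = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(random_state=42)),
        ("xgb", XGBRegressor(random_state=42)),
    ],
    final_estimator=LinearRegression(),  # linear meta-model; no passthrough of original features
    cv=5,
)
stacking_v1.fit(X_train, y_train)
```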
Stacking v2 is an advanced two-level ensemble model that utilizes modern and powerful algorithms as base models, including CatBoost, LightGBM, and ExtraTrees, which provide high accuracy through boosting and stochastic approaches (2):

$$\hat{y}_i = w_0 + \sum_{m} w_m f_m(\mathbf{x}_i), \qquad \mathbf{w} = \arg\min_{\mathbf{w}} \sum_{i} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \lVert \mathbf{w} \rVert_2^2 \tag{2}$$

At the second level, a Ridge regression meta-model is used, which is robust to multicollinearity and regularized, thereby reducing the risk of overfitting and accounting for possible correlation between the predictions of the base models. The final prediction is formed from the outputs of these three ensembles, aggregated using Ridge regression. Among the advantages of Stacking v2 are high accuracy, resistance to overfitting, and good adaptation to nonlinear dependencies (Figure 2). The main disadvantages are the increased complexity of hyperparameter tuning and increased computational costs compared to simpler schemes such as Stacking v1.
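The corresponding Stacking v2 sketch; the ridge penalty alpha, fold count, and seeds are placeholders.

```python
from sklearn.ensemble import ExtraTreesRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

stacking_v2 = StackingRegressor(
    estimators=[
        ("cat", CatBoostRegressor(verbose=0, random_state=42)),
        ("lgbm", LGBMRegressor(random_state=42)),
        ("et", ExtraTreesRegressor(random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),  # regularized meta-model, robust to correlated base outputs
    cv=5,
)
stacking_v2.fit(X_train, y_train)
```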
Figure 3 presents the most advanced and outlier-robust ensemble architecture, which includes powerful base models (CatBoost, LightGBM, XGBoost, Random Forest) and a specialized regressor for predicting high AST values (HighAST Regressor) (3):

$$\hat{y}_i = g\bigl(f_1(\mathbf{x}_i), \dots, f_M(\mathbf{x}_i), \mathbf{x}_i\bigr), \qquad g = \arg\min_{g} \sum_{i} L_{\delta}\bigl(y_i - g(\cdot)\bigr), \tag{3}$$

where $L_{\delta}$ denotes the Huber loss.
All boosted models utilize Huber loss, which ensures robustness to outliers and asymmetric errors. LightGBM is used as a meta-model with the same loss function and a "passthrough" mode, in which the meta-algorithm receives not only the predictions of the base models but also the original features, which allows it to effectively restore complex dependencies and compensate for the weaknesses of individual models. The final forecast is formed based on cumulative information, making this scheme the most robust against various types of errors. Its advantages include high accuracy, robustness in the face of outliers, the ability to utilize rare patterns (via the HighAST model), and a rich feature representation. However, the model requires significant computational resources, careful tuning of hyperparameters, and effective control over overfitting. Despite its complexity, the model did not show substantial advantages on validation data for several key metrics (R², RMSE, MAE), indicating the need for additional analysis and possibly architecture refinement.
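A possible realization of this scheme is sketched below. The HighAST Regressor is not specified in detail in the text, so it is approximated here by an additional Huber-loss LightGBM model; all loss parameters, fold counts, and seeds are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# "high_ast" stands in for the paper's specialized HighAST Regressor, whose exact
# specification is not given in the text; it is approximated by a Huber-loss LGBM model
stacking_v3 = StackingRegressor(
    estimators=[
        ("cat", CatBoostRegressor(loss_function="Huber:delta=5", verbose=0, random_state=42)),
        ("lgbm", LGBMRegressor(objective="huber", random_state=42)),
        ("xgb", XGBRegressor(objective="reg:pseudohubererror", random_state=42)),
        ("rf", RandomForestRegressor(random_state=42)),
        ("high_ast", LGBMRegressor(objective="huber", random_state=42)),
    ],
    final_estimator=LGBMRegressor(objective="huber", random_state=42),
    passthrough=True,  # meta-model also receives the original features
    cv=5,
)
stacking_v3.fit(X_train, y_train)
```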
Overall, the Stacking_v2 architecture was optimal in terms of the combination of accuracy, stability, and interpretability criteria. It is recommended to use this option for predicting the AST level using the presented markers and features.
2.4. Stages of Model Implementation
Proper feature selection plays a key role in building accurate and interpretable machine learning models, especially in the clinical context, where each variable can reflect critical biomedical processes. Based on biochemical, anthropometric, and behavioral characteristics, together with explainable AI approaches, the contribution of each feature to the predicted variable is assessed. This analysis not only improves the quality of prediction but also identifies pathophysiological relationships that are crucial for interpreting results and informing clinical practice.
I. Notations and variables used
n — number of observations (patients);
d — number of initial features;
$X \in \mathbb{R}^{n \times d}$ — feature matrix;
$\mathbf{x}_i \in \mathbb{R}^{d}$ — feature vector for the i-th patient;
$\mathbf{y} \in \mathbb{R}^{n}$ — target variable vector (AST level, LBXSASSI).
Feature variables:
• age (RIDAGEYR);
• gender (RIAGENDR);
• weight (BMXWT);
• height (BMXHT);
• ferritin (LBXFER);
• homocysteine;
• total cholesterol;
• LDL-C (LBDLDL);
• glucose (LBXGLU);
• hemoglobin (LBXHGB);
• creatinine (LBXSCR);
• hs-CRP (LBXCRP);
• alkaline phosphatase, ALP (LBXSAPSI);
• gamma-GT (LBXSGTSI);
• LDH (LBXSLDSI);
• urea (LBXSBU);
• uric acid (LBXSUA);
• leukocytes (LBXWBCSI);
• average number of alcoholic drinks per day (ALQ130);
• physical activity (PAD615).
Figure 4 shows the complete architecture of the algorithm for constructing a predictive model for the aspartate aminotransferase (AST) level, including the stages of data preprocessing, feature selection, training of base models, and formation of the Stacking_v2 ensemble. The algorithm begins by removing outliers and missing values, then encodes categorical variables and scales numerical features. After splitting into training and validation samples, several regressors are trained in parallel (including XGBoost, CatBoost, LightGBM, ExtraTrees, etc.), and the final model is formed using meta-regression based on these models.
This algorithm enables the achievement of high accuracy and robustness by combining various algorithms and normalization methods. Using Stacking_v2 ensures the efficient aggregation of base model predictions, thereby reducing the risk of overfitting and improving the model's generalization ability. The inclusion of stages such as outlier analysis and feature scaling is essential for medical data, which are highly heterogeneous. The algorithm architecture presents a comprehensive and interpretable approach to AST prediction, making it suitable for subsequent application in decision support systems.
II. Steps of data preparation and processing
1. Loading and merging data. The first stage involves loading three tables containing clinical, demographic, and questionnaire data, after which they are combined using a unique patient identifier (SEQN), allowing for the formation of a single data structure for subsequent analysis and model building (4):

$$D = D_{\text{clin}} \bowtie_{\text{SEQN}} D_{\text{demo}} \bowtie_{\text{SEQN}} D_{\text{quest}} \tag{4}$$
2. Removing outliers. Records in which the target variable exceeds the plausible range are removed (5):

$$D \leftarrow \{(\mathbf{x}_i, y_i) \in D : y_i \le 200\} \tag{5}$$
3. Removing gaps. Only those records are kept that have no missing values for the selected features and the target variable (6):

$$D \leftarrow \{(\mathbf{x}_i, y_i) \in D : y_i \ne \text{NA}, \; x_{ij} \ne \text{NA} \;\; \forall j\} \tag{6}$$
4. Transformation of categorical variables. The gender indicator is encoded using the one-hot method (7):

$$x_{i,\text{male}} = \mathbf{1}[\text{RIAGENDR}_i = \text{male}], \qquad x_{i,\text{female}} = 1 - x_{i,\text{male}} \tag{7}$$
5. Scaling of numerical features. For each numerical feature $x_j$, except for categorical ones, standardization is applied (8):

$$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \tag{8}$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ on the training set.
6. Splitting into training and validation sets. The data are split into training ($D_{\text{train}}$) and validation ($D_{\text{val}}$) sets (9):

$$D = D_{\text{train}} \cup D_{\text{val}}, \qquad D_{\text{train}} \cap D_{\text{val}} = \varnothing \tag{9}$$
III. Mathematical description of model training
1. Training base regressors. For each base algorithm $m$, we train a regression function (10):

$$f_m = \arg\min_{f} \sum_{(\mathbf{x}_i, y_i) \in D_{\text{train}}} L\bigl(y_i, f(\mathbf{x}_i)\bigr) \tag{10}$$
IV. Methods of selection and analysis of features
1. Correlation analysis
Stacking v2. The final model is a second-level ensemble combining the predictions of the base models using ridge regression as a meta-algorithm. This approach enables the extraction of advantages from different models, thereby increasing the stability of predictions and reducing error due to aggregation. The use of stacking is especially justified in problems where there is no single universal predictor, and it is necessary to combine knowledge from different sources. Taken together, the described methodological approach enables the construction of an interpretable, robust, and accurate model for predicting the AST level based on available routine data, offering high potential for clinical application.
3. Results
This study was conducted using the publicly available, harmonized NHANES 1988–2018 (National Health and Nutrition Examination Survey) dataset, which combines national data on the health and nutrition status of the US population over 30 years [9]. Thanks to complex preprocessing, this resource ensures high comparability of variables and minimizes the impact of missing and erroneous values, which is critical for building valid machine learning models. The selection of features for analysis was carried out strictly in accordance with the goal of the study: to develop an accurate and accessible model for predicting aspartate aminotransferase (AST) levels based exclusively on low-cost and routine biochemical markers that are available as part of standard medical examinations, without the use of specialized and expensive cardiology tests.
The final dataset for building the prognostic model included the most accessible and clinically significant variables. Demographic characteristics, such as age and gender, were used as baseline covariates to account for norms and individual differences. Anthropometric indicators (body weight and height) reflected the metabolic load and served as indicators of the general physiological state. The set of features included key biochemical markers, such as ferritin, homocysteine, cholesterol, glucose, creatinine, hemoglobin, hs-CRP, urea, uric acid, and leukocytes, which characterize metabolism, inflammatory processes, kidney and liver function, as well as protein and carbohydrate metabolism. Additionally, inexpensive but informative markers of liver function (γ-GT, LDH, alkaline phosphatase) were included as a practical alternative to specialized liver tests. Behavioral parameters, such as the level of physical activity and alcohol consumption, were also included to capture the influence of external factors. This selection of features is designed to maximize the practical applicability of the model, relying solely on data that are readily available in most clinical laboratories without the need for expensive equipment or specialized assays, making the proposed approach suitable for broad population screening. All variables were standardized, and categorical variables were one-hot encoded. Samples were combined by unique participant identifiers, with subsequent cleaning of missing values (Table 2).
To build a predictive model, both basic algorithms and various ensemble schemes were tested. Both interpretable (linear regression) and more powerful ensembles — Random Forest, XGBoost, and CatBoost — were used as single models. In addition, three versions of stacking models were implemented, differing in the composition of the basic models, meta-algorithm, and the use of additional strategies. All models were tuned to optimize the parameters that strike a balance between accuracy and stability.
Table 3 presents brief characteristics of the models used and the key hyperparameters applied in the experiment.
As shown in Table 3, each model had its specific settings and architecture. Simple models, such as linear regression, were primarily used for basic comparison, while ensemble approaches, especially Stacking_v2, yielded the best results across several metrics. Of particular interest are complex configurations, such as Stacking_v3, and a specialized regressor for the subsample with high AST values, focused on handling outliers and complex cases. This multi-level approach allowed us to comprehensively evaluate the potential of various algorithms in predicting biomarkers on routine data. A set of validated metrics was used for a comprehensive assessment of the predictive abilities of the trained models. The forecasting quality was determined by the coefficient of determination (R²), reflecting the proportion of explained variance of the target variable, as well as by the root mean square error (RMSE) and mean absolute error (MAE), which allow us to quantify the average deviations of predictions from the actual values in the original AST units. In addition, the relative error (MAPE) was calculated and expressed as a percentage, ensuring comparability between different ranges of values and increasing the clinical interpretability of the results. Additionally, the explained variance was estimated, characterizing the model's stability in the face of fluctuations and outliers in the data.
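Assuming the fitted models from the sketches in Section 2.3, all five metrics can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_absolute_percentage_error, mean_squared_error, r2_score)

def evaluate(model, X, y):
    pred = model.predict(X)
    return {
        "R2": r2_score(y, pred),
        "RMSE": np.sqrt(mean_squared_error(y, pred)),
        "MAE": mean_absolute_error(y, pred),
        "MAPE_%": 100 * mean_absolute_percentage_error(y, pred),
        "ExplainedVar": explained_variance_score(y, pred),
    }

for name, model in {**base_models, "Stacking_v2": stacking_v2}.items():
    print(name, evaluate(model, X_val, y_val))
```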
Figure 5 shows the values of the determination coefficient (R²) for the training and validation sets for all tested models: Linear Regression, Random Forest, XGBoost, CatBoost, and three stacking options.
The highest value of the coefficient of determination (R²) on the training set was achieved by the XGBoost model (R² = 1.00), indicating that it fully explains the variance of the target variable within the training dataset; however, the decrease in R² to 0.92 on the validation set indicates possible overfitting, despite maintaining high predictive accuracy. The Random Forest and CatBoost models also demonstrated high R² values (0.92 and 0.99 on training and 0.84 and 0.90 on validation, respectively), confirming their generalization ability, albeit inferior to that of XGBoost. Ensemble approaches, especially Stacking_v2, provided the best balance between accuracy on training (0.99) and validation (0.98), demonstrating resistance to overfitting due to the integration of predictions from multiple base models. In contrast, the linear regression model showed significantly lower values (0.48 for training and 0.42 for validation), indicating its insufficient flexibility in the face of complex nonlinear relationships. The Stacking_v3 model, despite its complex architecture, achieved only 0.73 for training and 0.58 for validation, which is likely due to the overcomplication of the structure and a decrease in consistency between the ensemble levels. Thus, the most stable and highly accurate results are achieved when using modern boosting and stacking models with well-chosen parameters and a relevant set of features.
Figure 6 shows a comparison of the root mean square error (RMSE) values on the training and validation samples for different models. RMSE enables us to assess the absolute accuracy of predictions: the lower the metric value, the higher the model's quality.
The RMSE metric analysis revealed the highest error value for the linear regression model (5.94 on the training set and 7.55 on the validation set), indicating its inability to model complex relationships between features and the target variable adequately. At the same time, the Random Forest, CatBoost, and especially XGBoost models showed significantly better results: the minimum RMSE value for validation belongs to XGBoost (2.75), with zero error for training, which, despite the high accuracy, may indicate overfitting. CatBoost demonstrated balanced values (0.90 - training, 3.18 - validation), only slightly inferior to XGBoost. Among the ensemble stacking models, the best performance was shown by the Stacking_v2 architecture (0.63 - training, 1.23 - validation), which indicates its high generalization ability and successful configuration. In contrast, Stacking_v3 showed inflated errors (4.28 — training, 6.43 — validation), likely due to excessive complexity and inefficient coordination between ensemble layers. Thus, XGBoost and Stacking_v2 proved to be the most effective in terms of root mean squared error (RMSE). At the same time, significant discrepancies between training and validation errors for individual models underscore the need for further work on hyperparameter tuning and preventing overfitting.
Figure 7 shows the analysis of mean absolute errors (MAE) for the training and validation sets for different models. MAE reflects the average absolute deviation of predictions from actual values, allowing us to judge the accuracy of the models without considering the direction of error.
The analysis of the mean absolute error (MAE) metric revealed that the linear regression model yields the highest values on both the training set (4.21) and the validation set (4.88), indicating its limited predictive suitability for this task. In contrast, the Random Forest and CatBoost ensemble models demonstrated significantly lower MAE values — 1.56 and 0.39 on training and 2.71 and 1.57 on validation, respectively — confirming their ability to account for complex nonlinear dependencies. The XGBoost model exhibited minimal error values (0.00 on training and 1.03 on validation), demonstrating high accuracy and generalization ability with the correctly selected parameters. Stacking models also performed well: Stacking_v1 (1.31 — train, 2.58 — val) and especially Stacking_v2 (0.40 — train, 0.90 — val), the latter of which provides one of the best accuracy-to-robustness ratios. In contrast, the Stacking_v3 model showed relatively high errors (1.72 — training, 3.02 — validation), which is likely due to the complex architecture and weak synergy between the ensemble components. Overall, the XGBoost and Stacking_v2 models demonstrated the best MAE values in validation, making them preferable for accurate and robust AST activity prediction, especially in the context of medical data with high sensitivity to outliers.
Figure 8 presents a comparison of the relative prediction error (MAPE, %) for various models on both the training and validation sets. This metric is particularly sensitive to the magnitude of the deviation between predictions and actual values. It is critical for clinical and biomedical problems where it is essential to consider the percentage error.
The mean absolute percentage error (MAPE) analysis revealed that the linear regression model performed the worst, with 20.27% on the training set and 19.87% on the validation set, confirming its limited applicability to AST activity prediction tasks. In contrast, ensemble methods, in particular Random Forest, significantly reduced the error rate (7.43% — train, 11.26% — val), demonstrating robustness to data heterogeneity. The XGBoost model achieved the best performance (0.02% — train, 3.77% — val), demonstrating its high ability to model complex relationships accurately. Comparably low values were shown by CatBoost (1.75% — train, 5.85% — val) and Stacking_v2 (1.95% — train, 3.85% — val), highlighting the effectiveness of advanced ensemble architectures. At the same time, Stacking_v1, despite achieving acceptable results in training, demonstrated a significantly higher MAPE in validation (10.94%), which may indicate insufficient model complexity or overfitting. Stacking_v3 is also inferior in accuracy (6.08% — train, 10.13% — val), despite its complex structure. Thus, XGBoost and Stacking_v2 are recognized as the most accurate and robust models for individual AST level prediction, especially in the context of precision medicine and epidemiological studies.
Figure 9 shows a comparison of the Explained Variance scores for different models on the training and validation sets, reflecting the proportion of variance in aspartate aminotransferase (AST) levels explained by the model. Linear regression showed the lowest values of 0.48 for training and 0.43 for validation, indicating its limited ability to capture complex dependencies in medical data. In contrast, the Random Forest and CatBoost models provided significantly higher explained variance, at 0.92 and 0.99 for training and 0.85 and 0.91 for validation, respectively. These results demonstrate the effectiveness of ensemble and boosting approaches in modeling physiological processes with high variability, confirming their applicability to individual medical prediction problems.
The XGBoost and Stacking_v2 models demonstrate the highest Explained Variance values on the training set (1.00) and some of the highest results on the validation set (0.92 and 0.98, respectively), which indicates the high ability of these algorithms to take into account complex nonlinear dependencies between features and to explain almost entirely the variability of the aspartate aminotransferase (AST) level. At the same time, Stacking_v1 and Stacking_v3 show more modest values (0.94/0.87 and 0.74/0.62 for the training and validation sets, respectively), which can be due to either excessive complexity of the ensemble structures or insufficient consistency between the base and meta-levels of the models. The best explained variance values are achieved by XGBoost, CatBoost, and Stacking_v2, which confirms their suitability for high-precision risk stratification, interpretation of significant predictors, and construction of reliable individual predictions in medical analytics tasks.
Based on the conducted analysis, the XGBoost and Stacking_v2 algorithms demonstrate the highest accuracy and stability among the tested models. At the same time, the final choice of the optimal model should take into account not only the metric values on the training set but also stability during validation, the tendency to overfit, and the interpretability of the results. XGBoost shows almost ideal training quality (R² = 1.00, RMSE = 0.00, MAE = 0.00, ExplainedVar = 1.00). However, a moderate decrease in indicators is observed on the validation set (R² = 0.92, RMSE = 2.75, MAE = 1.03), which still indicates a high generalization ability with minimal risk of overfitting. At the same time, the Stacking_v2 model outperforms XGBoost in several validation metrics (R² = 0.98, RMSE = 1.23, MAE = 0.90, MAPE = 3.85), demonstrating stable performance and adaptability to complex data structures. This is achieved through the use of powerful base models (CatBoost, LightGBM, ExtraTrees) in combination with Ridge meta-regression, which reduces overfitting and accounts for the multicollinearity of features. Linear regression yielded the worst results among all the tested models, which is explained by its inability to accurately reflect the complex nonlinear dependencies between biochemical and behavioral parameters. Low R² values (0.48 in training and 0.42 in validation), high RMSE and MAE values, as well as a significant mean absolute percentage error (MAPE), confirm its limited predictive capabilities on medical data. Thus, the set of tests and metrics performed supports the choice of Stacking_v2 as the main predictive model. This architecture provides an optimal balance between accuracy, interpretability, and resistance to overfitting, which is especially important when predicting aspartate aminotransferase levels in screening and clinical studies.
3.1. Importance of Features, Their Interactions, and Correlations
Figure 10 shows a comparative matrix of feature significance using the linear correlation, mutual information, and SHAP methods (for XGBoost and Random Forest), reflecting the contribution of variables to predicting the AST level. The most significant in all approaches were gamma-glutamyl transferase (LBXSGTSI), ferritin (LBXFER), and lactate dehydrogenase (LBXSLDSI), indicating their key role in the biochemical assessment of liver function. Also, anthropometric parameters such as height (BMXHT) and weight (BMXWT) showed high predictive potential, which may be associated with metabolic load.
Moderate significance is observed for parameters such as physical activity level (PAD615), low-density lipoproteins (LDL), as well as several behavioral and biochemical parameters (e.g., ALQ130, LBXCRP, LBXSBU). The least informative variables were RIAGENDR (gender), LBXHGB (hemoglobin), and LBXCRP (hs-CRP), which demonstrated low or negative significance values in most methods, indicating their weak association with AST levels in this population. Visualization of the differences between methods revealed that SHAP (especially for XGBoost) emphasizes the significance of individual biomarkers, such as LBXSGTSI and LBXFER. At the same time, mutual information confirms their relevance from the standpoint of nonlinear dependencies. Linear correlation underestimates the role of markers sensitive to nonlinearity but also confirms the significance of LBXSGTSI and LBXFER. Taken together, this allows us to conclude that the main predictors of AST are markers of liver cytolysis (LBXSGTSI, LBXSLDSI), metabolic parameters (LBXFER), and anthropometric parameters (BMXWT, BMXHT). The consistency of results between methods increases the reliability of the identified associations and confirms their practical value for screening and clinical use.
Figure 11 shows a bee swarm plot of SHAP values demonstrating the contribution of individual features to predicting aspartate aminotransferase (AST) levels using the XGBoost model. The leading predictor is LBXSGTSI (gamma-glutamyl transferase), where high feature values are associated with increased AST, emphasizing its role as a key biomarker of liver cell lysis. Ferritin (LBXFER) and anthropometric parameters (height and weight), reflecting metabolic load, also have a significant impact. The physical activity parameter (PAD615) is positively associated with AST in some cases, which may be due to physiological adaptations or microdamage. An average contribution is observed for LDH, LDL, and hs-CRP, while features such as gender, creatinine, and some biochemical parameters have a minimal impact. This confirms that AST variability in the NHANES cohort is predominantly determined by liver function, metabolism, and individual physiological characteristics.
Overall, the visualization confirms that a combination of liver and metabolic markers, together with parameters characterizing inflammation and lifestyle, makes the most significant contribution to AST variability. The high dispersion of SHAP values for the leading markers emphasizes their importance for risk stratification and the potential for inclusion in clinical predictive models.
Figure 12 shows a diagram of the average modulus of the SHAP value for each feature, which allows us to rank their contribution to the prediction of aspartate aminotransferase (AST) levels by the XGBoost model. The higher the value of the column, the greater the average contribution (importance) of this feature to the final model.
Key findings from the mean SHAP ranking of features indicate that the most significant predictor of aspartate aminotransferase (AST) level is LBXSGTSI, specifically gamma-glutamyl transferase (GGT). This indicator consistently occupies a leading position, which confirms its clinical significance as a sensitive and specific marker of liver cytolysis. Ferritin (LBXFER) and anthropometric parameters (height, BMXHT; weight, BMXWT) also make a significant contribution, reflecting the impact of metabolic disorders and obesity on AST activity. Physical activity (PAD615) demonstrates a pronounced effect, which may be associated with both physiological adaptations of muscle tissue and potential damage that occurs during intense exercise. The next most significant factors are LDH (LBXSLDSI), low-density lipoproteins (LBDLDL), the inflammatory marker hs-CRP (LBXCRP), leukocytes (LBXWBCSI), and the level of alcohol consumption (ALQ130), which indicate a complex regulation of AST involving metabolic, inflammatory, and behavioral factors. Less significant were indicators such as alkaline phosphatase (LBXSAPSI), uric acid (LBXSUA), age (RIDAGEYR), creatinine (LBXSCR), and sex (RIAGENDR), whose contribution may have been absorbed by more informative variables in the multifactorial model. The overall analysis confirms that the key determinants of AST variability are biochemical markers reflecting metabolic and inflammatory status, as well as physical activity indicators, which is consistent with the modern concept of risk stratification and interpretation of liver enzyme activities.
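A minimal sketch of how the SHAP visualizations in Figures 11 and 12 can be produced, assuming the fitted XGBoost model and validation data from the earlier sketches:

```python
import shap

explainer = shap.TreeExplainer(base_models["XGBoost"])  # fitted model from Section 2.3
shap_values = explainer.shap_values(X_val)

shap.summary_plot(shap_values, X_val)                   # bee-swarm plot (cf. Figure 11)
shap.summary_plot(shap_values, X_val, plot_type="bar")  # mean |SHAP| ranking (cf. Figure 12)
```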
3.2. SHAP Interactions
Figure 13 illustrates the evaluation of the top 10 pairwise feature interactions, as determined by SHAP interaction values.
SHAP interaction analysis revealed that the most significant contribution to the AST level prediction was made by the BMXWT × LBXSGTSI feature pair (appearing in both orderings in the symmetric interaction matrix; contribution > 0.19), which combines body weight and gamma-glutamyl transferase activity, key markers of metabolic and liver disorders. Significant interactions were also observed between LBXSGTSI and LBXSLDSI (~0.12), reflecting the synergistic effect of cytolytic enzymes. The LBXGLU × LBXFER interaction (~0.09) indicated a relationship between carbohydrate metabolism disorders and iron-containing proteins with AST activity. Less pronounced but significant pairs included LBXSBU, LBXFER, and BMXHT, emphasizing the importance of careful consideration of even secondary biomarkers. Overall, the identified interactions enhance the interpretability of the model and demonstrate the multifactorial nature of AST regulation.
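A sketch of how the top pairwise interactions can be extracted from SHAP interaction values, reusing the fitted XGBoost model and X_val from the earlier sketches:

```python
import numpy as np
import pandas as pd
import shap

explainer = shap.TreeExplainer(base_models["XGBoost"])
inter = explainer.shap_interaction_values(X_val)  # shape: (n_samples, d, d)

# Mean absolute interaction strength per feature pair (off-diagonal terms only)
mean_abs = np.abs(inter).mean(axis=0)
np.fill_diagonal(mean_abs, 0.0)
top_pairs = (pd.DataFrame(mean_abs, index=X_val.columns, columns=X_val.columns)
               .stack()
               .sort_values(ascending=False)
               .head(10))
print(top_pairs)  # pairs such as (BMXWT, LBXSGTSI) are expected near the top
```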
3.3. Correlation Analysis
Figure 14 shows the matrix of robust correlations between AST and the studied biochemical and clinical markers. The matrix of robust correlations was calculated using four different methods: Spearman, Kendall, DistanceCorr, and MICe. The highest positive correlations with AST are observed for ferritin (LBXFER: 0.35 by DistanceCorr, 0.32 by Spearman), gamma-glutamyl transferase (LBXSGTSI: 0.42 by DistanceCorr, 0.35 by MICe) and lactate dehydrogenase (LBXSLDSI: 0.33 by DistanceCorr, 0.30 by MICe). These results reflect the established clinical relationship between cytolysis markers and AST levels, confirming their crucial role in diagnosing and monitoring liver conditions.
Pronounced correlations are also observed for hemoglobin (LBXHGB: 0.41 by MICe), which is probably due to its indirect effect on tissue respiration and metabolism in the liver. Negative correlations were recorded for parameters such as C-reactive protein (LBXCRP: -0.23 by Spearman), which may indicate complex relationships between inflammation and enzymatic activity of the liver, as well as gender (RIAGENDR: -0.22 by Spearman), which reflects physiological differences between men and women in the structure and functioning of the liver. In general, markers of liver cytolysis, metabolism, and inflammation were the most informative in terms of correlations. This emphasizes the need for careful consideration of these indicators when constructing prognostic models of AST activity and also indicates the high biological validity of the selected features.
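The four robust correlation estimates can be sketched as follows, assuming the third-party dcor and minepy packages for distance correlation and MICe, respectively:

```python
import dcor                      # third-party package: distance correlation
from minepy import MINE          # third-party package: maximal information coefficient (MICe)
from scipy.stats import kendalltau, spearmanr

def robust_correlations(x, y):
    mine = MINE(est="mic_e")
    mine.compute_score(x, y)
    return {
        "Spearman": spearmanr(x, y)[0],
        "Kendall": kendalltau(x, y)[0],
        "DistanceCorr": dcor.distance_correlation(x, y),
        "MICe": mine.mic(),
    }

ast = df["LBXSASSI"].values.astype(float)
for col in ["LBXFER", "LBXSGTSI", "LBXSLDSI", "LBXHGB", "LBXCRP"]:
    print(col, robust_correlations(df[col].values.astype(float), ast))
```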
3.4. Clustering and Dendrogram
Figure 15 shows a dendrogram (Feature + AST Dendrogram, Ward Linkage) representing the hierarchical structure of relationships between the main features and the AST level, obtained using the Ward method. The vertical axis lists all biomarkers, as well as demographic and behavioral variables. The horizontal axis represents the distance between clusters (Ward linkage distance), making it possible to assess their degree of similarity visually.
The dendrogram analysis demonstrates the clustering of features corresponding to their biological and clinical nature. In the lower part, a cluster is revealed that unites biochemical markers of enzymatic activity and cellular damage (LBXSGTSI, LBXSASSI, LBXSLDSI, LBXFER), reflecting the integrative role of AST in assessing both hepatic and systemic cytolysis processes. This group is especially informative for the diagnosis of diseases accompanied by tissue necrosis, including both hematological and cardiac pathologies. The second large cluster includes metabolic and anthropometric indicators (BMXWT, BMXHT, LBDLDL), as well as biomarkers of chronic inflammation and metabolic disorders (LBXSUA, LBXHGB, LBXCRP), emphasizing the systemic effect of lipid and protein metabolism on the AST level. The third cluster combines behavioral and demographic variables (PAD615, ALQ130, RIAGENDR, RIDAGEYR), as well as glucose and related parameters (LBXGLU, LBXSAPSI, LBXWBCSI), highlighting the importance of lifestyle, age, and carbohydrate metabolism in regulating enzyme activity. Minimal distances between cytolysis markers indicate their close relationship and joint contribution to AST variability. More distant groups of features, despite a smaller relationship, also make a significant contribution due to metabolic, inflammatory, and behavioral factors. Thus, the dendrogram structure visualizes the multisystem nature of AST regulation, where the most crucial influence is exerted by enzymatic indicators of tissue damage, followed by metabolic and behavioral parameters. The resulting clusters can be used for more accurate stratification of patients and the construction of interpretable prognostic models in clinical practice.
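A minimal sketch of such a hierarchical clustering, assuming the feature-feature distance is defined as 1 − |Spearman correlation| (the exact distance used in the paper is not stated):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Distance = 1 - |Spearman rho|, so strongly associated variables merge first
corr = df[numeric_cols + ["LBXSASSI"]].corr(method="spearman").abs()
condensed = squareform(1.0 - corr.values, checks=False)
Z = linkage(condensed, method="ward")

dendrogram(Z, labels=corr.columns.tolist(), orientation="left")
plt.xlabel("Ward linkage distance")
plt.tight_layout()
plt.show()
```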
3.5. Assessing Interpretability and Calibration
Figure 16 shows the calibration plot of the LGBM model, which allows us to assess the agreement between the predicted and observed AST levels by quantiles. The dotted line represents a perfect match between predictions and observations, while the calibration curve (blue line) displays the model's actual results. In most intervals of predicted values, there is a relatively high degree of agreement between the predictions and the true values, indicating good calibration of the model in the range of low and medium AST values. However, in the region of high values (from 30 to 40), there is some discrepancy, where the actual values exceed the predicted ones. This indicates a tendency of the LGBM model to slightly underpredict for patients with the most pronounced AST deviations. The reason for this behavior may be both the relative rarity of high AST values in the training set and the difficulty of modeling extreme physiological states.
Overall, the graph confirms the adequacy of the model calibration for most clinically significant AST intervals, which is essential for the practical application of the prognostic model in population studies and medical screenings. Particular attention should be paid to further improving the model by adjusting predictions in the tails of the distribution, thereby improving the accuracy of the forecast for patients with atypically high enzyme values.
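A quantile-based calibration plot of this kind can be sketched as follows (the use of ten quantile bins is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd

pred = base_models["LightGBM"].predict(X_val)
frame = pd.DataFrame({"pred": pred, "obs": y_val.values})
frame["bin"] = pd.qcut(frame["pred"], q=10, duplicates="drop")  # quantile bins of predictions
cal = frame.groupby("bin", observed=True)[["pred", "obs"]].mean()

plt.plot(cal["pred"], cal["obs"], "o-", label="LGBM calibration")
lo, hi = cal["pred"].min(), cal["pred"].max()
plt.plot([lo, hi], [lo, hi], "k--", label="Perfect calibration")
plt.xlabel("Mean predicted AST (U/L)")
plt.ylabel("Mean observed AST (U/L)")
plt.legend()
plt.show()
```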
3.6. Predicting the Risk of Exceeding the AST Threshold
Figure 17 illustrates the ROC curve for the binary classification problem involving patients with an AST level ≥ 40 U/L. The area under the curve (AUC) is 1.000, which reflects the maximum possible discriminatory ability of the model. This result means that the model accurately distinguishes patients with pathologically elevated AST values from all other cases in the validation set. Binarization (AST ≥ 40 U/L) allowed us to test the diagnostic suitability of the models using the standard AUROC and PR-AUC metrics. The ROC curve almost follows the left and upper edges of the plot, indicating the absence of false-positive and false-negative decisions at the selected classification threshold. This level of prediction quality, on the one hand, demonstrates the model's high ability to identify clinically significant cases of elevated AST. On the other hand, it may indicate potential overfitting on the subsample under consideration or high homogeneity of the data structure for this feature. A key practical conclusion is that, with the current configuration of features and training set, the model can be used for screening and early detection of patients with severe liver dysfunction, as indicated by AST. To confirm the sustainability of this result, it is advisable to conduct validation on external independent cohorts.
The obtained AUC value of 1.000 requires careful interpretation, as such ideal values are rare in clinical practice and may indicate features of the data structure or model overfitting. Possible reasons include the high homogeneity of the validation sample or the presence of pronounced features that divide groups by the AST level. To confirm the stability of the model, additional checks are necessary, including cross-validation, repeated random partitioning, and testing on external data. Nevertheless, the high sensitivity and specificity of the model open up prospects for its application not only in diagnostics but also in monitoring therapy and creating interpretable decision support systems in hepatology.
Figure 18 shows the Precision-Recall curve for the binary classification problem of patients with AST ≥ 40 U/L, with an average precision (AP) of 1.000. This result indicates 100% precision and recall in identifying positive cases in the entire sample.
A high AP (Average Precision) value indicates the model's exceptional ability to simultaneously achieve maximum recall (recall = 1.0) and precision (precision = 1.0), suggesting a complete absence of type I and type II errors on the validation set. Such a combination is rare and typically results from the high information content of features, a low noise level in the data, or the clear separability of classes. The characteristic shape of the Precision-Recall curve with a sharp transition confirms the absence of a trade-off between precision and recall, which may also be due to a limited sample size or class imbalance. Such a result requires additional validation: retesting on independent data, cross-validation, and assessment of robustness to changes in the sample structure. Despite the seeming ideality, such high indicators should be interpreted with caution, especially in the context of medical problems, where overfitting can lead to false conclusions. Nevertheless, a high AP metric indicates the model's potential for screening and early detection of patients with abnormally high AST levels.
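Under these definitions, the threshold-based evaluation reduces to binarizing the target and scoring the continuous regression outputs, for example:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = (y_val >= 40).astype(int)    # binarize: clinically elevated AST
y_score = stacking_v2.predict(X_val)  # continuous predictions used as risk scores

print("AUROC:", roc_auc_score(y_true, y_score))
print("PR-AUC (AP):", average_precision_score(y_true, y_score))
```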
3.7. Mediator Analysis
Mediation analysis with ferritin (LBXFER) as a key mediator showed that its contribution to AST change is realized mainly in a direct way, without significant indirect influence through other routine markers.
Table 4 presents the following notations: Direct reflects the direct impact of the mediator on the AST level, Indirect is an indirect or mediated influence through intermediate variables, Total is a combined effect, including both direct and indirect influence, sig indicates the statistical significance of the impact (significant values are highlighted in bold), CI is a 95% confidence interval characterizing the reliability and stability of the assessment.
The results of the mediator analysis indicate that the most pronounced direct and indirect effects on AST levels are exerted by gamma-glutamyl transferase (LBXSGTSI), height (BMXHT), urea (LBXSBU), lactate dehydrogenase (LBXSLDSI), and glucose (LBXGLU). Ferritin (LBXFER) makes a significant contribution, primarily through the direct pathway, which is consistent with its established role as a marker of systemic inflammation and cellular cytolysis. For most other biomarkers, the indirect effect is weak or absent, which emphasizes the dominance of direct impacts in the formation of AST activity. The choice of mediators was motivated by several considerations. First, only statistically significant mediators were included in the analysis: the selection was carried out according to the criterion of the presence of at least one significant pathway (direct, indirect, or total), with a p-value less than 0.05 and a confidence interval not crossing zero. Second, the selected features were characterized by a high degree of association with AST, both according to the ranking of feature importance (SHAP, correlation) and according to the results of the mediator analysis itself. Among them were ferritin, gamma-GT, LDH, urea, glucose, height, and physical activity. Third, mediators with no significant effect were excluded from the final table: features for which all paths were insignificant (p > 0.05) or made a minimal contribution were not included, to avoid excessive detail.
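A single-mediator sketch using the pingouin package is shown below; the choice of body weight as the exposure is an assumption made for illustration (the text treats each biomarker, including ferritin, as a candidate mediator but does not specify the exposure), and the bootstrap settings are placeholders.

```python
import pingouin as pg

# Illustrative single-mediator model: is the association between body weight and AST
# mediated by ferritin? (the exposure choice is an assumption for demonstration)
res = pg.mediation_analysis(
    data=df, x="BMXWT", m="LBXFER", y="LBXSASSI", n_boot=1000, seed=42
)
print(res[["path", "coef", "pval", "sig"]])  # rows include Direct, Indirect, Total effects
```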
4. Discussion
In this study, a comprehensive analysis of the factors determining aspartate aminotransferase (AST) activity was performed using modern machine learning methods, mediator analysis, and assessment of stable correlations. The results obtained enable a thorough characterization of biochemical and demographic predictors, their individual and joint contributions to AST variability, and the validation of the stability of the constructed models, including key AST predictors and their biological significance. The feature importance results for various algorithms (XGBoost SHAP, RF SHAP, and Mutual Info) consistently indicate a group of biochemical markers and metabolic parameters as the primary determinants of AST activity. The most pronounced effect was demonstrated by gamma-glutamyl transferase (LBXSGTSI), ferritin (LBXFER), height (BMXHT), body weight (BMXWT), as well as LDH (LBXSLDSI) and urea (LBXSBU) activity parameters. These data are confirmed by both the generalized importance assessments and the analysis of feature interactions (SHAP interactions), where the most significant pairs included LBXSGTSI, BMXWT, and LBXSLDSI. This importance structure is consistent with the biological role of AST as an integral enzymatic marker of hepatocyte damage. It reflects the contribution of the liver and metabolic pathways to the variability of its level. The presence of anthropometric parameters (height, body weight) among the top predictors highlights the importance of considering physical parameters when interpreting AST and demonstrates the advantage of multivariate analysis, particularly in terms of correlation structure. The results of the stable correlation analysis (Spearman, Kendall, DistanceCorr, MICe) confirm that the strongest positive associations with AST are found for LBXSGTSI, LBXFER, LBXSLDSI, LBXSBU, and several biochemical metabolites (LBXGLU, LBXHGB). At the same time, negative and low correlation links are characteristic of demographic characteristics (RIAGENDR, RIDAGEYR), which indicates the predominance of biochemical factors over socio-demographic ones.
Mediator analysis. Mediator analysis revealed that ferritin (LBXFER) influences AST predominantly via the direct pathway, consistent with its function as an indicator of hepatocyte cytolysis and iron stores. For several other predictors (LBXSGTSI, BMXHT, LBXSLDSI, LBXGLU, LBXSBU), statistically significant direct and indirect effects were observed, indicating multifaceted and indirect pathways of AST level regulation. A detailed mediator analysis table covers all tested features, allowing one to verify the absence of false positive effects. None of the nonspecific or demographic mediators showed a significant total or indirect effect, confirming the specificity of the identified associations.
Quality of forecast models. A comparison of models using R², RMSE, MAE, MAPE, and Explained Variance metrics demonstrated that XGBoost and the Stacking_v2 ensemble model provide the highest accuracy and stability of predictions on both the training and validation sets, with minimal signs of overfitting. Simpler models (Linear Regression, Random Forest) are significantly inferior in accuracy and stability. The weakest results were recorded for the Stacking_v3 version, which is confirmed by an increase in errors during validation. ROC and Precision-Recall curves for the task of classifying increased AST (≥ 40 U/L) confirm excellent diagnostic properties, with AUC = 1.0 and AP = 1.0, indicating the absolute sensitivity and specificity of the model in this set. Additional calibration (LGBM) demonstrates the correspondence between the predicted and observed values across the entire quantile scale, thereby excluding systematic errors and biases.
Assessment of stability and transferability. The results of cluster and dendrogram analysis indicate a clear grouping of key predictors along a single functional axis, emphasizing their integration into the overall biological network. The absence of significant associations among random or low-specific features indicates the stability of the results to random variations and overfitting.
Limitations. Only available laboratory and anthropometric data were used in this study. Some weak correlations or nonspecific indirect effects noted in the mediator analysis did not reach statistical significance, which speaks in favor of the rigor of the findings. External validation on independent cohort samples is recommended for further generalization of the results.