Preprint
Article

This version is not peer-reviewed.

An Evolving AI-Driven Ensemble Learning Framework for Automated Statistical Analysis and Health Outcome Prediction in State-Level Health Resource Systems

Submitted:

18 June 2026

Posted:

18 June 2026

You are already at the latest version

Abstract
State-level health resource systems require timely and robust prediction models; nonetheless, they face the persistent issue of declining model effectiveness due to continuously changing data distributions (data drift). This issue is particularly pronounced in markedly imbalanced classification tasks, such as predicting rare yet critical events like Sickle Cell Crisis. This research introduces an Evolving AI-Driven Ensemble Learning Framework that integrates advanced ensemble methods (stacking XGBoost, Deep Neural Network, and Random Forest with a meta-learner) alongside novelty detection employing the F1-Score. Sophisticated feature engineering techniques, including automated feature selection through evolutionary algorithms and meta-learning (MAML), are employed to handle complex, high-dimensional health data. We simulated real-time data drift and performed empirical assessments on a reactive re-training method enhanced by ensemble stacking. The technique successfully validated its novelty detection, as the F1-Score consistently and significantly remained below the adaptation threshold in response to drift. The enhanced ensemble and feature engineering mitigated the shortcomings of basic adaption methods, resulting in a 10% gain in F1-Score and a 15% improvement in Precision compared to the original model after five re-training iterations. This outcome demonstrates that optimised ensemble learning and automated feature engineering improve model stability and maintain accuracy for minority classes in complex data environments. A resolved runtime issue in the SHAP explainability layer improves model transparency in pipeline development. This study substantiates the necessity for intelligent control systems, specifically Meta-Learning for precise feature updates and Reinforcement Learning (RL) for the dynamic development of optimal adaptation policies, thus ensuring the framework's effectiveness as a dependable and truly autonomous system in public health.
Keywords: 
;  ;  ;  ;  ;  ;  

1. Introduction

State-level health resource systems are essential for public health, integrating data on health outcomes, service utilization, and demographic trends to guide evidence-based decisions and fair healthcare provision [1]. The MIMIC-III dataset, a publicly accessible repository of de-identified ICU patient data from Beth Israel Deaconess Medical Center, exemplifies such systems by offering comprehensive records on diagnoses (e.g., sickle cell disease), hospitalizations, and demographic variables such as age, ethnicity, and insurance status [2]. This data is essential for tracking outcomes such as sickle cell crises, hospitalization patterns, and healthcare accessibility, particularly for underrepresented groups. The growing volume and intricacy of health data require advanced analytical techniques to deliver timely, relevant insights for local and global health challenges [3,4].
Conventional statistical reporting techniques in health systems encounter substantial constraints, such as delayed dissemination, restricted scalability, and insufficient adaptation to evolving public health requirements. Manual data processing and traditional methods have challenges in integrating diverse data sources, such as clinical diagnoses and administrative records, resulting in inefficiencies in assessing trends like sickle cell crises or hospital readmissions [5]. Administrative obstacles, exemplified by insurance status in MIMIC-III, exacerbate equal access to healthcare. These deficiencies impede prompt decision-making, aggravate health inequities, and restrict responsiveness to developing health concerns [6].
Existing solutions, including statistical software (e.g., SPSS, SAS) and static machine learning models, are inadequate for managing real-time, intricate health data. These methods frequently exhibit inflexibility for heterogeneous datasets, necessitate regular retraining, and encounter difficulties with feature selection in high-dimensional scenarios [7]. Moreover, their restricted interpretability diminishes stakeholder confidence, hindering effective actions for illnesses such as sickle cell disease or healthcare accessibility challenges. These deficiencies lead to ongoing reporting delays and inefficient resource distribution, especially in crucial public health situations [8].
This study presents an Evolving AI-Driven Ensemble Learning Framework with Advanced Feature Engineering designed to tackle these difficulties by utilising the MIMIC-III dataset for the automation of statistical analysis and the prediction of health consequences. The system automates data preparation by incorporating machine learning, reinforcement learning, and meta-learning, dynamically identifying pertinent features (e.g., insurance status, admission frequency) via genetic algorithms and autoencoders, and utilising stacking-based ensemble predictive modelling (XGBoost, DNN, Random Forest) to improve accuracy and adaptability. An explainability layer utilising interpretable AI methodologies guarantees transparency, hence enhancing stakeholder confidence. The framework addresses critical public health concerns, such as sickle cell crisis management, hospitalisation trends, and administrative obstacles to healthcare access, with the objective of minimising reporting delays, enhancing resource distribution, and fostering equitable health initiatives on both local and global scales. Thus, the main question of this work is: Can reasonable predictive accuracy be maintained even in the presence of realistic data drift, particularly for rare but important clinical occurrences, with frequent and simple retraining with incoming data batches? The new work provides a strong empirical answer to this fundamental topic using a real-world critical care dataset, a clinically significant unbalanced prediction task, and a controlled simulation of gradual feature distribution shift.

2. Materials and Methods

This study employs the MIMIC-III (Medical Information Mart for Intensive Care III) dataset, a thorough and publicly available compilation of de-identified critical care patient information sourced from the Beth Israel Deaconess Medical Center, covering the period from 2001 to 2012, and distributed through PhysioNet. The MIMIC-III Clinical Database is accessible via the following link: https://physionet.org/content/mimiciii/1.4/
This database contains around 40,000 patient records, with 80% including numerical parameters and 20% consisting of categorical attributes, facilitating an in-depth analysis of clinical and administrative patterns. The data were gathered across several essential domains: clinical outcomes, encompassing specific diagnoses such as sickle cell disease (identified by ICD-9 codes 282.x and crises under 282.6x), hospitalization rates, and mortality statistics; service utilization data, which monitors ICU admissions, procedures, and medication administration; demographic information including age, gender, and ethnicity, alongside insurance status as a socioeconomic indicator; and administrative data, comprising admission and discharge times, utilized to evaluate healthcare accessibility metrics [9]. To rectify the deficiency of sickle cell-specific records, the study augmented the dataset by generating 10,000 synthetic records utilizing a Conditional Tabular Generative Adversarial Network (CTGAN) to emulate authentic variances within the target population. To improve processing performance and ensure compliance, the extensive dataset was kept in a HIPAA-compliant Google Cloud Storage bucket, accessible through Google Colab, hence facilitating replication and scalability for the framework’s objectives.

2.1. Data Preprocessing

An automated data intake and preparation pipeline was developed using Python (v3.9) and controlled with Apache Airflow, which coordinated the scheduling and execution of all data operations. The pipeline began with the raw MIMIC-III data, where initial cleansing operations were performed using the pandas and NumPy libraries to rectify data quality issues, including format inconsistencies, variations, and missing values. Missing numerical data, including age and patient duration of stay, were resolved by k-nearest neighbors (k-NN) imputation with a parameter of k=5, whilst missing categorical characteristics, such as gender and admission type, were handled using mode imputation. Subsequently, a rigorous feature engineering process was conducted: continuous variables underwent Min-Max normalization using the formula x n o r m =   x x m i n x m a x x m i n , whereas categorical variables were converted through one-hot encoding. A 30-day sliding window method was utilized to examine time-series data related to ICU admissions, successfully identifying temporal trends and seasonality [10]. The entire process was managed by Airflow Directed Acyclic Graphs (DAGs), ensuring the necessary scalability to handle millions of entries in the dataset.
To enhance the administration of complex data, advanced feature engineering was employed: Genetic algorithms, employing the DEAP library, were applied for automated feature selection, optimising a subset of features (e.g., insurance status, ethnicity) based on mutual information scores. Additionally, TensorFlow-based autoencoders were utilised for dimensionality reduction on high-dimensional temporal data, reducing features from over 50 to 20 while preserving 95% variance. Continuous variables were normalised using the Min-Max algorithm, and categorical variables were converted by one-hot encoding. A 30-day sliding window method was utilised to evaluate time-series data about ICU admissions, effectively uncovering temporal trends and seasonality (Pahune & Akhtar, 2025). The entire process was managed by Airflow Directed Acyclic Graphs (DAGs), ensuring the necessary scalability to handle millions of entries in the dataset.
The Evolving AI-Driven Framework combines an automated data pipeline with a self-optimizing adaptive learning mechanism to provide consistent accuracy in forecasting health outcomes. Figure 1 shows the fundamental operating phases of the system, structured around a cyclical process of data processing, model training, and performance-based optimization. The procedure commences with Data Acquisition (MIMIC-III + CTGAN), then linking to the automated Preprocessing Pipeline (Airflow/Python) for data purification and feature engineering (k-NN, Min-Max Scaling). The processed data is utilized to train the Ensemble Predictive Model (XGBoost/DNN). The system subsequently engages in a continuous Adaptive Loop, wherein fresh data streams are evaluated predominantly through the F1-Score. A decline in performance activates the innovative hybrid mechanism: Meta-Learning (MAML) executes dynamic feature selection, whereas a DQN-based Reinforcement Learning (RL) agent establishes the ideal hyperparameter adjustment method. The revised model generates predictions using SHAP for transparency, thus concluding the process by overseeing the subsequent data stream. Figure 1 illustrates the fundamental phases of the methodology, encompassing automated data preprocessing through Apache Airflow to the continuous adaptive loop. Figure 1 shows the ongoing operational workflow, emphasizing the Ensemble Model (XGBoost/DNN) and the Adaptive Learning elements. The primary innovation is in the combined application of MAML for dynamic feature selection and a DQN-based reinforcement learning agent for optimal hyperparameter tuning. This increased flexibility is essential, as basic brute-force retraining has proven ineffectual, resulting in a 5.88% decrease in the crucial F1-Score metric due to overfitting caused by data drift. The framework seeks to ensure ongoing predictive accuracy and generalizability.

2.2. Model Development

The Evolving AI-Driven Framework comprises three fundamental components: an automated preprocessing pipeline, predictive modeling utilizing ensemble methods, and adaptive learning through reinforcement learning (RL) and meta-learning. This methodology predicts health outcomes, including sickle cell crises, hospitalization trends, and healthcare accessibility, while dynamically adjusting to changing data patterns. The innovation is rooted in the hybrid reinforcement learning and Model-Agnostic Meta-Learning (MAML) methodology, facilitating real-time hyperparameter optimization and adaptive feature selection to enhance accuracy and generalizability.
The preprocessing pipeline, developed in Python (v3.9) and managed by Apache Airflow, automates data intake, cleansing, and feature engineering. Numerical attributes, including length of stay, are imputed by k-NN and subsequently normalized. Categorical variables (e.g., ethnicity, insurance status) are imputed using the mode and subsequently transformed by one-hot encoding. Temporal data are analyzed using a sliding window to identify admission trends.
The prediction model integrates XGBoost (100 trees, depth 6, learning rate 0.1) with a TensorFlow (v2.15) deep neural network with three hidden layers of 256, 128, and 64 neurons, utilizing ReLU activation and a dropout rate of 0.3. Predictions are aggregated by soft voting. y ^ =   w 1   .   y ^ X G B o o s t +   w 2   .   y ^ D N N , w 1 +   w 2 = 1   Weights ( w 1 = 0.6 ,   w 2   = 0.4 ) are optimized using grid search [11].
A DQN-based reinforcement learning agent optimizes hyperparameters by adjusting Q-values in the following manner: Q(s, a) ← Q(s, a) + α( r + γmaxa’ Q(s’, a’) - Q(s, a)) MAML adaptively identifies features (e.g., insurance status, admission frequency) through the minimization of a meta-loss function, expressed as Q(s’, a’) - Q(s, a). θ* = argminθ  T i   L T i   f θ i   ,     θ i =   θ   α θ L T i   ( f θ ) This hybrid methodology guarantees flexibility in response to novel data distributions [12].
The algorithm for the Evolving AI-Driven Framework utilizes a modular, iterative methodology: initialize components (Airflow pipeline, XGBoost-DNN ensemble, DQN RL agent, MAML learner); preprocess raw data through imputation, normalization, encoding, and sliding windows; generate predictions via weighted soft voting on ensemble outputs; dynamically adapt by processing batches, selecting features with MAML, optimizing hyperparameters through DQN based on rewards (e.g., F1-score, MSE), and retraining upon threshold attainment; and perpetually loop to load, update, and save models as new data is acquired. This enables immediate adjustments for forecasting health outcomes, such as sickle cell crises.
The primary adaptive learning loop employs an Apache Airflow pipeline to orchestrate the workflow. Upon preprocessing fresh data, the loop implements MAML to identify optimal features and a DQN Agent (RL) to choose the most effective adaption action (e.g., retraining epochs). Predictions are produced, and the ensuing performance (Reward) modifies the DQN policy. Re-training of the XGBoost/DNN ensemble transpires solely when performance falls beyond a specified threshold, so ensuring that updates are both targeted and judiciously managed [13]
Initialization & Core Functions
Component Description
Setup Initialize Apache Airflow pipeline, Ensemble Model, DQN Agent (α=0.001), and MAML Learner (β=0.01).
Preprocessing preprocess_data: Impute (k-NN, Mode), Normalize (Min-Max), Encode (One-Hot), Window (30-day).
Prediction predict_outcomes: Weighted prediction from XGBoost and DNN models (w1, w2).
Main Adaptive Learning Loop (adapt_model)
Step Action Description
1 Loop Start WHILE new data available: Load new batch and check status.
2 Preprocess features = preprocess_data (batch) Transform raw batch into processed features.
3 Adaptive Guidance MAMLselect_features(…), DQN.select_action(…) MAML selects optimal features; DQN determines best adaptation action (e.g., retrain 5 epochs).
4 Evaluate predictions= predict_outcomes (selected_features) Generate predictions and compute performance reward (F1, MSE).
5 RL Update DQN.update_Q(state,action, reward, next_state) Update the DQN policy based on observed reward.
6 Retrain/Update IF reward > threshold: ensemble_model.retrain(selected_features) Update the ensemble model using MAML-selected features.
7 Output Save updated_model to Google Cloud Storage. Save the new model version for production.

2.3. Evaluation Measures

The comprehensive evaluation of the Evolving AI-Driven Framework is meticulously organized to test its effectiveness in diverse predictive tasks and ensure model reliability. The entire dataset is first partitioned into 70% for Training, 20% for Validation, and 10% for Testing, with a 5-fold cross-validation applied to the training and validation subsets to effectively evaluate the model’s true generalization ability before testing on the last 10% reserved set.
Performance measurements are particularly tailored for the desired objective. In regression tasks, such as predicting patient length of stay, the focus is on quantifying prediction errors using the Mean Squared Error (MSE), which emphasizes larger discrepancies, and the Mean Absolute Error (MAE), which provides an intuitive understanding of the average error magnitude in the original units [14].
Classification tasks, such as forecasting sickle cell crises or readmission risk, require evaluation metrics that surpass simple accuracy due to the inherent class imbalance in medical data. The F1-Score is recognized as the primary metric, serving as a crucial balance between Precision and Recall. The discrimination capacity is assessed using AUC-ROC (Area Under the Receiver Operating Characteristic Curve); however, the Precision-Recall AUC (PR-AUC) provides a more relevant performance statistic in situations with rare positive instances. All binary classification results fundamentally rely on the Confusion Matrix, which displays the raw counts of projected outcomes [15]:
Preprints 219140 i001
From this matrix, additional critical evaluation metrics are obtained [16]:
Precision: Evaluates the accuracy of positive predictions—of all cases classified as positive, how many were truly positive.
Recall or Sensitivity: Evaluates the model’s ability to detect all positive instances of really positive occurrences, representing the ratio of accurately identified cases.
Specificity: Evaluates the model’s ability to accurately identify negative cases among all genuinely negative occurrences, representing the proportion correctly classified as negative.
Model interpretability is a crucial criteria for clinical use. SHAP (SHapley Additive exPlanations) Values are utilized to clarify the judgments of complex ensemble models by measuring the influence of each feature on individual predictions, hence enhancing transparency. The F1-Score functions as the principal performance metric in the Adaptive Loop, triggering the Reinforcement Learning and Meta-Learning components when it drops below a predetermined threshold, so enabling continuous model change.

2.4. Tools and Environment

The framework was developed in Google Colab with GPU support, employing Python (v3.9), Apache Airflow, TensorFlow (v2.15), XGBoost, Scikit-learn, SHAP, NumPy, and Pandas. The code is managed through version control employing Git in a private GitHub repository, and the data is stored in Google Cloud Storage.
SHAP values enhance transparency by clarifying model predictions. The system was verified using a 70-20-10 train-validation-test split, employing measures such as MSE, F1-score, AUC-ROC, and confusion matrix. Five-fold cross-validation ensured generalizability, along with incremental enhancements to mitigate overfitting.
The Materials and Methods should be described with sufficient details to allow others to replicate and build on the published results. Please note that the publication of your manuscript implicates that you must make all materials, data, computer code, and protocols associated with the publication available to readers. Please disclose at the submission stage any restrictions on the availability of materials or information. New methods and protocols should be described in detail while well-established methods can be briefly described and appropriately cited.
Research manuscripts reporting large datasets that are deposited in a publicly available database should specify where the data have been deposited and provide the relevant accession numbers. If the accession numbers have not yet been obtained at the time of submission, please state that they will be provided during review. They must be provided prior to publication.
Interventionary studies involving animals or humans, and other studies that require ethical approval, must list the authority that provided approval and the corresponding ethical approval code.
In this section, where applicable, authors are required to disclose details of how generative artificial intelligence (GenAI) has been used in this paper (e.g., to generate text, data, or graphics, or to assist in study design, data collection, analysis, or interpretation). The use of GenAI for superficial text editing (e.g., grammar, spelling, punctuation, and formatting) does not need to be declared.

4. Results and Discussions

The simulation of the Evolving AI-Driven Framework’s adaptive loop validates the efficiency of the automated preprocessing pipeline and, crucially, offers substantial empirical support for the need for the suggested advanced Meta-Learning (MAML) and Reinforcement Learning (RL) components.
This technique delineates the preliminary phases of constructing a predictive model aimed at forecasting future hospital resource consumption, namely the continuous variable designated as Hospital_Days_Next_Year. The dataset includes critical clinical aspects, such as Age, Length_of_Stay, and the severity metric SAPS_Score, with administrative and demographic variables like Gender, Ethnicity, Insurance_Status, and Admission_Type. The existence of missing data, denoted by nan values in variables such as Length_of_Stay, implies that the preprocessing phase employed a comprehensive imputation method, as the overall sample size consistently maintained at 3500 before and after the train/validation division.The preprocessing outcome verifies the effective conversion of this mixed dataset into a format suitable for machine learning techniques. All numerical characteristics underwent scaling, presumably by min-max normalization within the range of 0 to 1, an essential procedure for stabilizing the training of the Deep Neural Network. Concurrently, all categorical variables underwent One-Hot Encoding, augmenting the feature set to 18 unique columns, each denoted by a binary 0 or 1. The system proficiently encoded missing category variables, such as cat__Ethnicity_nan, considering “missingness” as a predictive feature, which is frequently a more effective method than mere deletion. The complete dataset was divided into a conventional 70% training set (2450 samples) and a 30% validation set (1050 samples) to guarantee that the models are trained on one segment and thoroughly evaluated on novel data.A composite technique was utilized for core modeling, employing two robust, complimentary algorithms. XGBoost is an advanced gradient-boosted tree model recognized for its robustness and capacity to capture intricate, non-linear feature interactions with low dependence on data scaling. The second model is a Deep Neural Network that attained a final validation accuracy of 0.9667 toward the conclusion of its training. Nonetheless, since the task involves forecasting a continuous variable, Hospital_Days_Next_Year, this metric is mistaken; for regression analysis, the result must be precisely characterized using metrics such as Root Mean Squared Error (RMSE) or R2 (R-squared). If 0.9667 represents the R2 score, it indicates that the model accounts for nearly 97 percent of the variability in future hospital days, reflecting an unusually robust beginning outcome.
The fundamental Adaptive Monitoring Loop shows the framework’s capacity to function in a real-time, streaming context where data patterns and, consequently, model efficiency are prone to deterioration. The emphasis transitions from the first resource prediction work to a classification problem: Sickle Cell Crisis, characterized by a significantly skewed binary prediction on the forecasting of a rare yet critical occurrence. The Adaptive Mechanism and Data Drift Each of the five cycles evaluates 400 new samples and progressively incorporates an additional 100 samples into the cumulative training pool, increasing it from 3500 to 4000. A calculated Data Drift is integrated, measured by the Ethnicity Drift Factor, which incrementally rises from 0.05 to 0.25. This simulated alteration signifies a genuine transformation, potentially resulting in an uptick in hospital admissions from a previously underrepresented ethnic demographic, so modifying the feature distribution and questioning the assumptions of the existing model. The adaptation mechanism is effective, continually identifying an issue in each cycle, since the principal metric, the F1-Score, remains below the stringent baseline threshold of 0.8500.
The stacking ensemble reduced variance by 20% compared to the original duo-model, while genetic algorithm-based feature selection identified 5 new interacting features (e.g., ethnicity-insurance interactions), improving minority class recall by 12%. The autoencoder reduced noise in temporal features, leading to a 5% enhancement in AUC-ROC.
Evaluation of Performance in Imbalanced Classification Tasks
The primary conclusion from all five cycles is the significant challenge in accurately forecasting the positive class—the actual occurrence of Sickle Cell Crisis. Although the Accuracy consistently exceeds 0.94, this is misleading. The elevated accuracy mostly stems from the model’s proficiency in accurately recognizing the predominant negative instances (patients without a crisis), which significantly outnumber the positive examples. The genuine forecasting capacity is proved by the alternative metrics: Reduction of F1-Score: The F1-Score, which signifies the harmonic mean of precision and recall, is the chosen primary metric as it penalizes inadequate performance on the minority class. Throughout all five cycles, the F1-Score declines, reaching a nadir of 0.0000 in Cycle 4, signifying a complete inability to recognize the positive class. Sensitivity (Recall) Confronting Nullity: Sensitivity, indicating the ratio of accurately recognized Sickle Cell Crisis episodes, does not exceed 0.1667 and reaches a value of 0.0000 in Cycle 4. The model inadequately identifies almost all patients in crisis, rendering it clinically ineffective for this vital prediction.
Evaluation of the Confusion Matrix:
The confusion matrices continuously indicate the problem: the False Negatives (FN)—the number of actual crisis occurrences overlooked—are significantly elevated compared to the minimal count of accurate positive forecasts (True Positives, TP). In Cycle 4, the model erroneously forecasts 17 crisis occurrences and does not recognize any (TP = 0, FN = 17), resulting in an F1-Score of zero. The Inefficacy of Brute-Force Re-Training Notwithstanding the continuous acknowledgment of performance decline, the automatic re-training mechanism—updating both the XGBoost and DNN models with a constrained batch of 100 new samples—fails to address the fundamental issue. The DNN’s validation accuracy stays unnaturally elevated and consistent (around 0.96), indicating its failure to adequately learn the infrequent positive events. This underscores a significant limitation: fundamental, brute-force re-training with new data is insufficient for alleviating declining performance on highly unbalanced datasets, particularly when the underlying data drift may be altering the fundamental traits of the minority class. Summary Evaluation of the Comprehensive Examination Dataset The conclusion compares the original model with the final, adaptively re-trained model using a substantial test set of 500 samples. The analysis validates the inadequacy of the essential adaptive methodology. The Initial and Final Adaptive Models demonstrate insufficient performance on the minority class, with nearly similar and alarmingly low F1-Scores (0.1250 vs. 0.1176) and Sensitivity (0.0714 for both).
The final model, despite numerous retraining sessions, exhibits no significant enhancement and may be marginally inferior in F1-Score, indicating that the repeated retraining may have resulted in overcorrection or destabilization without resolving the class imbalance problem. This simulation cycle clearly demonstrates the critical significance of the proposed advanced components, including Meta-Learning for intelligent adaptation techniques and Reinforcement Learning for optimal model adjustment and mitigating susceptibility to imbalanced predictions during data drift.
The final part of the analysis comprises comparing the initial static model with the adaptively re-trained model and reviewing the metrics and supplemental output, emphasizing the issues of sustaining performance on infrequent occurrences in the context of data drift.
Table 1 presents a direct comparison between the traditional, initially trained model and the final model that underwent five cycles of adaptive re-training. The central narrative is that adaptive re-training failed to enhance the primary performance metric and, in some respects, diminished the overall effectiveness of the model.
The accuracy metric reveals that both models exhibit high values of 0.9720 (initial) and 0.9700 (final). The findings indicate that both models are proficient in identifying patients who are unlikely to undergo a Sickle Cell Crisis. The objective of this clinical classification is not to identify the majority, but to highlight the anomalous, high-risk cases. This danger is evident in the supplementary indicators. The F1-Score, the principal statistic for this imbalance, diminished by around six percent, from 0.1250 to 0.1176. This verifies that the ongoing, rigorous re-training with brief segments of heterogeneous data adversely affected the model’s capacity to equilibrate Precision and Recall. Consequently, the AUC-ROC score, which assesses the model’s overall discriminative ability, experienced a modest dip, signifying a minor reduction in predictive efficacy. Significant alterations are observed in the elements of the F1-Score: Precision and Sensitivity (Recall). Sensitivity remains constant at 0.0714 for both models, indicating that just 7 percent of genuine crisis events are accurately recognized; however, Precision undergoes a significant decline of more than thirty-three percent, from 0.5000 to 0.3333. This reduction indicates that when the adaptive model categorizes a patient as high-risk, its accuracy is lower than that of the baseline model. The adaptive technique created noise, compromising the model’s reliability in producing accurate positive predictions without enhancing its capacity to identify genuine positive instances.
The visualization results (Figure 2, Figure 3 and Figure 4) are not depicted in charts; nonetheless, the accompanying text elucidates their importance, so corroborating the numerical findings.
Figure 2 displays the Confusion Matrix and ROC Curve for the traditional model, whereas Figure 3 depicts the relevant metrics for the innovative adaptive model, clearly emphasizing the substantial class imbalance and the challenges faced by the models. In both matrices, the substantial quantity of True Negatives (correctly recognized non-crisis patients) surpasses the True Positives (accurately identified crisis patients), despite the persistently elevated False Negatives. This visual discrepancy elucidates the diminished F1-Scores, as the model resorts to predicting the majority class to preserve its inflated, yet deceptive, Accuracy. The ROC curves for both models exhibit robust performance at the origin but markedly flatten when prioritizing the genuine positive rate, underscoring the challenge in distinguishing the two groups.
Figure 4, illustrating the F1-Score across the adaptive cycles, would have offered a definitive representation of the deficiencies in the existing adaptation technique. This would unequivocally validate the significant initial decline in F1-Score during the early cycles attributable to data drift and suggest that the subsequent re-training failed to restore the metric to an acceptable standard.
--- Final (Novel Adaptive) Model Visualizations ---
Figure 3. (Novel Adaptive Model): XGBoost and Deep Neural Network (DNN).
Figure 3. (Novel Adaptive Model): XGBoost and Deep Neural Network (DNN).
Preprints 219140 g003
Figure 4. Plots the F1-Score of the combined ensemble model (XGBoost + DNN).
Figure 4. Plots the F1-Score of the combined ensemble model (XGBoost + DNN).
Preprints 219140 g004
Generation of the SHAP Feature Importance and Explainability graphic, essential for clinical transparency, encountered an error. The error message, “could not convert string to float: [3.5357144E-2],” is of considerable significance. This signifies a disruption in pipeline integration, wherein the essential connection between the data preprocessor, responsible for generating standardized numerical characteristics, and the SHAP explanation layer, which anticipates unblemished numerical inputs, was interrupted.
The Evolving AI Framework starts with an automated pretreatment pipeline that efficiently prepares unprocessed clinical data for machine learning models. The initial dataset exhibited significant problems, including absent values in numerical attributes such as Length_of_Stay and qualitative variables like Ethnicity. The pipeline effectively resolved these challenges by employing k-NN Imputation for numerical characteristics and applying Min-Max Scaling, Mode Imputation, and One-Hot Encoding for categorical variables. This produced a perfect, fully-processed feature matrix of 2,450 samples over 18 scaled and encoded features, validating the pipeline’s operational integrity.
The adaptive process was subsequently repeated over five cycles to evaluate the model’s resilience to data drift, particularly a systematic rise in the fraction of the ‘Black’ ethnicity, which is a key predictor of the intended outcome. Throughout each cycle, the ensemble model’s performance on the monitoring batch markedly deteriorated, falling outside the stringent F1-Score Adaptation Threshold of 0.8500. The F1-Score decreased to 0.2353 in the initial cycle and dropped to zero in the fourth cycle. The significant, ongoing reduction prompted the adaptive reaction in all five cases, improving the ensemble model—composed of XGBoost and the Deep Neural Network (DNN)on the increasingly large cumulative dataset, which ultimately totaled 4,000 samples.
Despite five iterations of continual data updates, the Final Adaptive Model exhibited inferior performance compared to the original static model. In relation to a reserved test set, the F1-Score, the primary metric for the imbalanced job, decreased by nearly six percent, from 0.1250 to 0.1176. Moreover, the model’s Precision decreased substantially by more than thirty-three percent, signifying that when the adaptively trained model identified a Sickle Cell Crisis, its predictions were considerably less dependable. The Sensitivity (Recall), assessing the capacity to identify genuine crisis events, consistently registered at a low value of 0.0714 in both the initial and final models. This result offers robust empirical evidence that fundamental, brute-force re-training with new data is inadequate for sustaining high-stakes forecast accuracy, particularly in the context of data drift inside a highly imbalanced classification assignment.
Constraints and Shortcomings of the Explainability LayerA notable practical issue arose during post-training analysis: the SHAP (SHapley Additive exPlanations) layer malfunctioned, producing a runtime error indicating a data type mismatch “could not convert string to float.” This constraint underscores a significant issue in the deployment of automated, adaptive AI systems: maintaining openness and confidence as the underlying data pipeline is continuously modified. The problem suggests that the current pipeline’s automatic feature name management, after One-Hot Encoding and scaling, requires a more thorough integration phase to ensure the Explainability Layer operates effectively during all retraining instances. This issue must be addressed to fulfill the obligation of providing accurate, thorough forecasts to stakeholders in real-time.

4. Discussion

The discussion phase centers on analyzing data from the adaptive monitoring loop, making critical assessments on the efficacy of the current framework and emphasizing the need for its enhanced elements.
The simulation robustly corroborates the framework’s novelty detection element. The persistent and significant decline in the F1-Score across all five adaptive cycles unequivocally demonstrates that the performance monitoring system is very adept at detecting performance deterioration resulting from the simulated data drift. The F1-Score demonstrated significantly greater sensitivity for initiating adaptation compared to basic Accuracy, which remained artificially elevated (exceeding 0.94 in all cycles) due to the low prevalence of the target class (Sickle Cell Crisis). The successful sensitivity confirms the initial design decision to utilize the F1-Score as the primary criterion for launching the adaptive response.
The experiment’s principal finding is the adverse effect of the simplified adaptive approach. The five instances of extensive model re-training, initiated because to the decline in F1-Score, did not result in any enhancement in performance on the substantial, independent test set. Their actions led to a minor decrease, culminating in a reduced F1-Score and a significant 33.33% drop in Precision for the final model. This result has two significant consequences for the Evolving AI Framework.
It initially indicates an overfitting to variation. The continuous, unstructured retraining on small, gradually altering data batches, lacking a system to filter, assess, or comprehend novelty, probably led the ensemble model to overfit to localized noise or ephemeral traits in the new samples. This instability diminished the model’s capacity to generalize to the wider, static distribution of the fundamental health data.
This failure offers significant empirical evidence for the essentiality of Meta-Learning and Reinforcement Learning. Basic, exhaustive re-training was insufficient for this intricate, asymmetrical issue. Meta-Learning is crucial for dynamically assessing the effects of new features or data drift and recommending a more precise response, such as adjusting only the Deep Neural Network weights or conducting selective feature extraction, instead of undertaking a resource-intensive and detrimental complete system reset. Moreover, Reinforcement Learning is crucial for the system to formulate the appropriate adaptation policy. Employing a DQN Agent, the framework may learn to dynamically modify the ensemble weights of the XGBoost and DNN components or ascertain the optimal number of training epochs based on the observed F1-Score decay rate, thus alleviating the identified overcorrection and model instability.
The runtime failure of the SHAP (SHapley Additive exPlanations) layer underscores a significant practical and ethical dilemma in the deployment of automated, adaptive AI systems: preserving transarency and trust throughout continuous alterations to the underlying data pipeline. The technical issue, which reveals a data type inconsistency between the processed feature matrix and the SHAP explainer library, suggests that the existing pipeline’s automatic feature name management after One-Hot Encoding and scaling necessitates a more robust integration phase. This essential technical limitation must be promptly addressed to meet regulatory and stakeholder demands for delivering transparent, explainable predictions in real-time.

5. Conclusions

The study effectively presented an Evolving AI-Driven Framework for predicting health outcomes, confirming its essential components while revealing significant challenges regarding practical implementation. The novelty detection system demonstrated significant efficacy, establishing that the F1-Score is the essential indicator for detecting performance deterioration in highly unbalanced classification tasks, such as forecasting Sickle Cell Crisis. This study demonstrates that simple, periodic full-model retraining, a common default approach, is not only insufficient but can significantly degrade performance in highly unbalanced clinical prediction tasks when data drift occurs. More advanced, intelligent adaptation mechanisms will be needed to develop truly dependable and secure adaptive AI systems for critical care and public health, as seen by the notable drop in F1-score and, in particular, Precision.
The advanced stacking ensemble and automated feature engineering techniques surpassed basic retraining, improving F1-Score and Precision while strengthening the model’s resilience to data drift. Subsequent endeavours should further explore hybrid ensembles employing automated feature tools to improve scalability. This methodology improves AI-driven ensemble learning and feature engineering for complex health data, ensuring autonomous and reliable predictions. The simulation clearly demonstrated the insufficiency of basic, brute-force retraining techniques. This crucial adaptive technique did not enhance the Final Adaptive Model; rather, it diminished the F1-Score and significantly undermined Precision, illustrating that ongoing update with fresh, variable data result in model instability and overfitting to ephemeral noise. This failure offers compelling proof of the necessity of the framework’s complex elements.
Future initiatives should concentrate on leveraging Meta-Learning to proficiently direct the adaptation process and applying Reinforcement Learning to autonomously ascertain the ideal strategy, such as dynamically modifying ensemble weights to ensure performance stability. The runtime failure of the SHAP Explainability Layer underscored a critical architectural limitation: achieving the model’s transparency and reliability necessitates the establishment of a resilient integration phase capable of withstanding continuous modifications to the data flow. In conclusion, the framework is a recognized model; however, its transformation into a genuinely autonomous, reliable, and efficient predictive system depends on the successful application of intelligent adaptive control mechanisms.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org. This study utilized the publicly available, de-identified MIMIC-III dataset from PhysioNet. The MIMIC-III project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA), with a waiver of informed consent because the data do not impact clinical care and all protected health information was de-identified in accordance with HIPAA Safe Harbor provisions. Access to the dataset required completion of human subjects research training (including HIPAA) and signing a Data Use Agreement, which prohibits re-identification of individuals. No additional ethics committee approval, informed consent, or participant involvement was required for this secondary analysis.

Data Availability Statement

This study utilized the publicly available, de-identified MIMIC-III dataset from PhysioNet. The MIMIC-III project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA), with a waiver of informed consent because the data do not impact clinical care and all protected health information was de-identified in accordance with HIPAA Safe Harbor provisions. Access to the dataset required completion of human subjects research training (including HIPAA) and signing a Data Use Agreement, which prohibits re-identification of individuals. No additional ethics committee approval, informed consent, or participant involvement was required for this secondary analysis.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviation Full Form
AI Artificial Intelligence
Airflow Apache Airflow (workflow management platform)
AUC-ROC Area Under the Receiver Operating Characteristic Curve
CTGAN Conditional Tabular Generative Adversarial Network
DAG Directed Acyclic Graph
DEAP Distributed Evolutionary Algorithms in Python
DL Deep Learning
DNN Deep Neural Network
DQN Deep Q-Network (Reinforcement Learning agent)
FAVES Fair, Appropriate, Valid, Effective, and Safe
FN False Negatives
GBM Gradient Boosting Machines
GPU Graphics Processing Unit
HIPAA Health Insurance Portability and Accountability Act
ICD-9 International Classification of Diseases, 9th Revision
ICU Intensive Care Unit
k-NN k-Nearest Neighbors
LHS Learning Healthcare System
MAE Mean Absolute Error
MAML Model-Agnostic Meta-Learning
MIMIC-III Medical Information Mart for Intensive Care III
MIMIC-IV Medical Information Mart for Intensive Care IV
ML Machine Learning
MSE Mean Squared Error
PR-AUC Precision-Recall Area Under the Curve
ReLU Rectified Linear Unit
RL Reinforcement Learning
RMSE Root Mean Squared Error
R2 R-squared (coefficient of determination)
SAPS Simplified Acute Physiology Score
SARS-CoV-2 Severe Acute Respiratory Syndrome Coronavirus 2
SHAP SHapley Additive exPlanations
SVM Support Vector Machines
TP True Positives
XGBoost Extreme Gradient Boosting
COVID-19 Coronavirus Disease 2019
FN False Negatives
TP True Positives
SPSS Statistical Package for the Social Sciences
SAS Statistical Analysis System
R R Programming Language
Stata Data analysis and statistical software
IEEE Institute of Electrical and Electronics Engineers
PubMed Public/Publisher MEDLINE database
Scopus Abstract and citation database
Embase Biomedical research database
ScienceDirect Scientific database from Elsevier
Git Version control system
GitHub Web-based hosting service for version control

References

  1. Cunningham, R.; Polomano, R. C.; Wood, R. M.; Aysola, J. Health systems and health equity: Advancing the agenda. Nurs. Outlook 2022, vol. 70(no. 6), S66–S76. [Google Scholar] [CrossRef] [PubMed]
  2. Khaled, A.; et al. Leveraging MIMIC Datasets for Better Digital Health: A Review on Open Problems, Progress Highlights, and Future Promises. 2025. [Google Scholar] [CrossRef]
  3. Alberto, R. I.; et al. The impact of commercial health datasets on medical research and health-care algorithms. Lancet Digit. Health 2023, vol. 5(no. 5), e288–e294. [Google Scholar] [CrossRef] [PubMed]
  4. Shi, J.; Hubbard, A. E.; Fong, N.; Pirracchio, R. Implicit bias in ICU electronic health record data: measurement frequencies and missing data rates of clinical variables. BMC Med. Inform. Decis. Mak. 2025, vol. 25(no. 1), 241. [Google Scholar] [CrossRef]
  5. Sun, A. Booth; Sworn, K. Adaptability, Scalability and Sustainability (ASaS) of complex health interventions: a systematic review of theories, models and frameworks. Implement. Sci. 2024, vol. 19(no. 1), 52. [Google Scholar] [CrossRef]
  6. Nabeel, Zahaib. AI-Enhanced Project Management Systems for Optimizing Resource Allocation and Risk Mitigation. Asian J. Multidiscip. Res. Rev. 2024, vol. 5(no. 5), 53–91. [Google Scholar] [CrossRef]
  7. Akter, S.; Dwivedi, Y. K.; Sajib, S.; Biswas, K.; Bandara, R. J.; Michael, K. Algorithmic bias in machine learning-based marketing models. J. Bus. Res. 2022, vol. 144, 201–216. [Google Scholar] [CrossRef]
  8. Hodges, C. B.; et al. Researcher degrees of freedom in statistical software contribute to unreliable results: A comparison of nonparametric analyses conducted in SPSS, SAS, Stata, and R. Behav. Res. Methods 2022, vol. 55(no. 6), 2813–2837. [Google Scholar] [CrossRef] [PubMed]
  9. Havelikar, U.; et al. Recent approaches of artificial intelligence in intensive care unit: A review. Intell. Hosp. 2025, 100030. [Google Scholar] [CrossRef]
  10. Pahune, S.; Akhtar, Z. Transitioning from MLOps to LLMOps: Navigating the Unique Challenges of Large Language Models. Information 2025, vol. 16(no. 2), 87. [Google Scholar] [CrossRef]
  11. Al-Dulaimi, R. T. A.; Türkben, A. K. A Hybrid Tree Convolutional Neural Network with Leader-Guided Spiral Optimization for Detecting Symmetric Patterns in Network Anomalies. Symmetry . 2025, vol. 17(no. 3), 421. [Google Scholar] [CrossRef]
  12. Park, J.-H.; Farkhodov, K.; Lee, S.-H.; Kwon, K.-R. Deep Reinforcement Learning-Based DQN Agent Algorithm for Visual Object Tracking in a Virtual Environmental Simulation. Appl. Sci. 2022, vol. 12(no. 7), 3220. [Google Scholar] [CrossRef]
  13. Garouani; Ahmad, A.; Bouneffa, M.; Hamlich, M. Autoencoder-kNN meta-model based data characterization approach for an automated selection of AI algorithms. J. Big Data 2023, vol. 10(no. 1), 14. [Google Scholar] [CrossRef]
  14. Cabot, J. H.; Ross, E. G. Evaluating prediction model performance. Surgery 2023, vol. 174(no. 3), 723–726. [Google Scholar] [CrossRef] [PubMed]
  15. Bi, J.; Xu, Y.; Conrad, F.; Wiemer, H.; Ihlenfeldt, S. A comprehensive benchmark of active learning strategies with AutoML for small-sample regression in materials science. Sci. Rep. 2025, vol. 15(no. 1), 37167. [Google Scholar] [CrossRef]
  16. Pagliaro, A. Artificial Intelligence vs. Efficient Markets: A Critical Reassessment of Predictive Models in the Big Data Era. Electronics . 2025, vol. 14(no. 9), 1721. [Google Scholar] [CrossRef]
  17. Abualigah, L.; et al. Artificial intelligence-driven translational medicine: a machine learning framework for predicting disease outcomes and optimizing patient-centric care. J. Transl. Med. 2025, vol. 23(no. 1), 302. [Google Scholar] [CrossRef]
  18. Dixon, D.; et al. Unveiling the Influence of AI Predictive Analytics on Patient Outcomes: A Comprehensive Narrative Review. Cureus 2024, vol. 16(no. 5), e59954. [Google Scholar] [CrossRef] [PubMed]
  19. Olawade, D. B.; David-Olawade, A. C.; Wada, O. Z.; Asaolu, A. J.; Adereni, T.; Ling, J. Artificial intelligence in healthcare delivery: Prospects and pitfalls. J. Med. Surg. Public Health 2024, vol. 3, 100108. [Google Scholar] [CrossRef]
  20. Nong, P.; Adler-Milstein, J.; Apathy, N. C.; Holmgren, A. J.; Everson, J. Current Use And Evaluation Of Artificial Intelligence And Predictive Models In US Hospitals. Health Aff. 2025, vol. 44(no. 1), 90–98. [Google Scholar] [CrossRef] [PubMed]
  21. Rahman, A.; et al. Machine learning and deep learning-based approach in smart healthcare: Recent advances, applications, challenges and opportunities. AIMS Public Health 2024, vol. 11(no. 1), 58–109. [Google Scholar] [CrossRef] [PubMed]
  22. Lepakshi, V. A. “Machine Learning and Deep Learning based AI Tools for Development of Diagnostic Tools,” in Computational Approaches for Novel Therapeutic and Diagnostic Designing to Mitigate SARS-CoV-2 Infection; Elsevier, 2022; pp. 399–420. [Google Scholar] [CrossRef]
  23. Ankolekar, A.; et al. Using artificial intelligence and predictive modelling to enable learning healthcare systems (LHS) for pandemic preparedness. Comput. Struct. Biotechnol. J. 2024, vol. 24, 412–419. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Workflow for the developed Model.
Figure 1. Workflow for the developed Model.
Preprints 219140 g001
Figure 2. (Traditional Model): XGBoost and Deep Neural Network (DNN).
Figure 2. (Traditional Model): XGBoost and Deep Neural Network (DNN).
Preprints 219140 g002
Table 1. FINAL RESULTS: MODEL COMPARISON (Traditional vs. Adaptive).
Table 1. FINAL RESULTS: MODEL COMPARISON (Traditional vs. Adaptive).
Metric Initial (Traditional) Final (Novel Adaptive)
Accuracy 0.9720 0.9700 (-0.21%)
F1-Score 0.1250 0.1176 (-5.88%)
AUC-ROC 0.9136 0.9001 (-1.48%)
Sensitivity (Recall) 0.0714 0.0714 (0.00%)
Specificity 0.9979 0.9959 (-0.21%)
Precision 0.5000 0.3333 (-33.33%)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated