This study aimed to develop a learning model using the Stacking technique to analyze key factors associated with credit risk. A total of 14 factors were considered using secondary data obtained from the Ministry of Industry. The dataset was balanced using four techniques: SMOTE, ADASYN, SMOTEENN, and SMOTETomek. The performance of nine predictive methods was compared: Decision Tree, Support Vector Machine (SVM), Gradient Boosting, K-Nearest Neighbors (KNN), Naïve Bayes, Improved Logistic Regression, Improved Gradient Boosting, Improved Extreme Gradient Boosting, and Multilayer Perceptron Neural Network. Model performance was evaluated using Accuracy, Precision, Recall, F-measure, and Area Under the ROC Curve (AUC).
These metrics were selected because together they capture classification performance under class imbalance. In particular, the F-measure (F1-score) is the harmonic mean of Precision and Recall and therefore reflects the trade-off between them, while AUC evaluates the model's ability to discriminate between the two classes across all decision thresholds.
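As a minimal sketch of how these metrics relate to one another, the snippet below computes them from confusion-matrix counts and from ranked scores. The counts and scores are made-up illustrative values, not results from this study, and the function names are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall and F1 for the positive (NPL) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative counts only (assumed, not taken from the paper's tables).
example = classification_metrics(tp=40, fp=10, fn=20, tn=130)
example_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example, example_auc)
```

Accuracy uses all four cells of the confusion matrix, whereas Precision, Recall, and F1 ignore true negatives, which is why they are more informative when the negative class dominates.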
Compared with the baseline accuracy of 0.734, which is obtained by always predicting the majority class, the stacking-based models showed substantial improvement both in detecting the minority class and in the overall balance of the performance metrics.
3.1. Results of Stacking-Based Learning Models
The original dataset used in this study was imbalanced. The target variable had two classes: Class 0 (Non-NPL) with 742 instances and Class 1 (NPL) with 272 instances. An imbalance of this size can bias a model toward predicting the majority class, as illustrated in Figure 3.
To address the class imbalance issue, four resampling methods were applied: SMOTE, ADASYN, SMOTEENN, and SMOTETomek. The outcomes are shown in Table 6 and Figure 4. After balancing, SMOTEENN not only increased the sample size of the minority class but also removed potentially confusing instances, which helped improve model performance. The effectiveness of each technique varies depending on the nature of the data and the objective of the analysis.
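The core idea behind SMOTE-style oversampling can be sketched as follows: each synthetic minority point is an interpolation between a minority instance and one of its nearest minority neighbours. This is a simplified illustration on random data, not the implementation used in the study; the function name `smote_sketch` and all parameters are assumptions, and the cleaning steps of SMOTEENN and SMOTETomek (removal of ambiguous instances) are not shown. In practice, libraries such as imbalanced-learn provide these resamplers.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority samples only.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base point and one of its k nearest neighbours for each new sample.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Class sizes follow the text (742 Non-NPL, 272 NPL); the features are random.
n_majority, n_minority, n_features = 742, 272, 14
X_minority = np.random.default_rng(1).normal(size=(n_minority, n_features))
synthetic = smote_sketch(X_minority, n_new=n_majority - n_minority)
balanced_minority = np.vstack([X_minority, synthetic])
print(balanced_minority.shape)
```

Because every synthetic point lies on a segment between two existing minority points, oversampling of this kind cannot create values outside the observed range of each feature.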
The study implemented the Stacking technique to enhance model performance by combining base learners. These models were trained on resampled datasets using SMOTE, ADASYN, SMOTEENN, and SMOTETomek. Evaluation metrics included Accuracy, Precision, Recall, F-measure, and AUC.
The base models were selected to cover a variety of learning biases: Decision Tree for interpretability, SVM for margin-based classification, Gradient Boosting for capturing non-linear interactions, KNN for instance-based learning, and Naïve Bayes for probabilistic modeling.
The stacking-based meta-learners (META-LR, META-GB, META-XGB, and META-MLP) demonstrated improved performance across all metrics, with a better balance between Recall and Precision than the base models, as shown in Table 7, Table 8, Table 9 and Table 10.
Logistic Regression was selected as a baseline meta-model due to its interpretability and efficiency, while Gradient Boosting and XGBoost provide powerful ensemble capabilities with gradient optimization. The Multilayer Perceptron (MLP) was included to explore non-linear decision boundaries through deep learning. The outputs used for training the meta-model were probability estimates from each base model rather than discrete class labels. This approach enables the meta-learner to capture uncertainty and subtle differences in model predictions, leading to improved final classification.
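A stacking setup of this kind can be sketched with scikit-learn's `StackingClassifier`. The example below uses synthetic data and default hyperparameters, both of which are assumptions, and it shows only the Logistic Regression meta-model; it is an illustration of the architecture described above, not the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data: 14 features, roughly 27% positives.
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           weights=[0.73, 0.27], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The five base learners named in the text; settings here are assumed defaults.
base_learners = [
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("nb", GaussianNB()),
]

# stack_method="predict_proba" feeds class probabilities, not hard labels,
# to the meta-model; cv controls how out-of-fold predictions are produced.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
stack.fit(X_train, y_train)
test_proba = stack.predict_proba(X_test)
print(test_proba[:3])
```

Swapping `final_estimator` for a gradient boosting model or an `MLPClassifier` would give analogues of the other meta-learners, again under assumed settings.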
All reported performance metrics represent the average values across five folds using Stratified K-Fold Cross-Validation to ensure robust and unbiased evaluation. The test set remained untouched during the cross-validation process and was used solely for the final evaluation of the trained ensemble model.
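The evaluation protocol can be sketched as below: a stratified test split is set aside first, and the five-fold averages are computed only on the remaining data. The synthetic data, the 80/20 split ratio, and the single Logistic Regression model are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split

# Synthetic stand-in data; the real study used 14 credit-risk factors.
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           weights=[0.73, 0.27], random_state=0)

# Hold out a test set that cross-validation never sees.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Stratification keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
results = cross_validate(LogisticRegression(max_iter=1000), X_dev, y_dev,
                         cv=cv, scoring=scoring)

# Average each metric over the five validation folds.
mean_scores = {m: float(np.mean(results[f"test_{m}"])) for m in scoring}
print(mean_scores)
```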
Under SMOTE, meta-models consistently outperformed traditional base models across all evaluation metrics. META-MLP and META-LR achieved the highest Accuracy (0.808 and 0.805) and AUC (0.890), with full feature sets providing a better balance between Precision and Recall. In contrast, stepwise selection enhanced performance in META-GB, improving Accuracy to 0.795 and AUC to 0.850. Among base models, stepwise selection notably improved Recall and AUC for Decision Tree, SVM, and Naïve Bayes. Despite lower Precision, these models demonstrated enhanced sensitivity. Overall, meta-models under SMOTE showed strong robustness, particularly when combined with full feature sets.
Under ADASYN, meta-models outperformed base models across all performance metrics. META-MLP and META-LR full models achieved the highest Accuracy (0.831 and 0.814), F-measure (0.837 and 0.814), and AUC (0.910 and 0.900), demonstrating a strong balance between Precision and Recall. Stepwise selection slightly improved META-GB’s Accuracy (0.777) and META-XGB’s AUC (0.830), though gains were marginal. Among base models, stepwise variants improved Recall and AUC, particularly for Naïve Bayes and SVM. However, Precision remained low across all base models. Overall, full models under ADASYN showed greater robustness, especially in meta-learning architectures.
SMOTEENN significantly enhanced model performance, especially for meta-learners. Stepwise variants of META-MLP, META-GB, and META-XGB achieved top-tier results, with META-MLP Stepwise recording the highest Accuracy (0.942), Recall (0.990), F-measure (0.953), and AUC (0.990). META-LR Full Model also performed well, with slightly lower but comparable metrics. Base models showed moderate gains in Recall and AUC using Stepwise, particularly for SVM and Naïve Bayes. However, they continued to lag in Precision and overall balance. Overall, SMOTEENN combined with Stepwise feature selection proved especially effective for meta-models.
SMOTETomek showed significant improvement in model performance, especially in meta-learners. META-LR Full Model achieved the highest Accuracy (0.852), Precision (0.833), Recall (0.878), F-measure (0.855), and AUC (0.920). Stepwise selection improved performance for META-GB and META-XGB, with the former showing better Accuracy (0.835) and Recall (0.859) than the Full Model. META-MLP showed balanced performance, with slight gains in Recall and F-measure in the Stepwise variant. Base models generally improved Recall with Stepwise, but still underperformed in Precision and AUC. In conclusion, SMOTETomek favored meta-models, particularly for boosting overall accuracy and recall while maintaining strong AUC values.
The comparison chart illustrates the performance of the best base model (Gradient Boosting) against the various META models under the SMOTEENN balancing technique (Figure 4). The results highlight the superior predictive capability of the META-MLP model, which achieved the highest scores across most evaluation metrics. This demonstrates the effectiveness of Stacking in enhancing model performance, particularly in handling imbalanced data. The integration of multiple models through Stacking significantly improves accuracy and robustness in predicting credit risk.
These findings suggest that META-MLP has strong potential to identify non-performing loans (NPLs) more accurately, as shown in Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9. An open question is whether the model's performance could be further improved by incorporating additional data types, such as time-series payment behavior. Future research may explore this direction to expand the model's predictive depth and practical applicability.
3.2. Comparison of Model Performance Before and After Feature Selection
A comparison was made between the performance of models using all variables (Full Model) and those using variables selected through the Stepwise Selection method (Stepwise Model). The objective was to evaluate the impact of reducing the number of variables on the predictive capability for credit risk assessment. Both model groups were tested in conjunction with various data balancing techniques such as Random Over-Sampling, SMOTE, SMOTEENN, and SMOTETomek to enhance accuracy and mitigate bias caused by class imbalance.
The analysis results revealed that the Stepwise Models generally performed as well as or better than the Full Models in many cases, particularly when considering performance metrics such as Accuracy, Recall, F1-score, and AUC, which indicate the model’s ability to correctly classify data. A notable difference was that the Stepwise Models tended to yield higher F1-scores and AUC values than the Full Models, especially when used with the SMOTEENN technique, which enhances the quality of training data.
Moreover, the use of selected variables in the Stepwise Models helped eliminate unnecessary features, reduce the risk of multicollinearity, and lower model complexity without significantly compromising overall performance. This also facilitated clearer interpretation of the results and made the models more applicable in policy-making contexts. In summary, variable selection prior to model construction is an important step in improving model performance. It also contributes positively to real-world applicability in terms of accuracy, simplicity, and interpretability of the outcomes.
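The Full Model versus Stepwise Model comparison can be illustrated with a greedy forward search. The sketch below uses scikit-learn's `SequentialFeatureSelector`, which adds features by cross-validated score; this is a stand-in and may differ from the Stepwise Selection procedure used in the study, whose criterion is not specified here. The synthetic data and the choice of six retained features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data with 14 candidate features, some of them redundant.
X, y = make_classification(n_samples=400, n_features=14, n_informative=5,
                           n_redundant=3, random_state=0)

# Greedy forward selection: at each step add the feature that most
# improves the cross-validated AUC of the estimator.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=6,   # assumed target size, for illustration only
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # indices of retained features
X_reduced = selector.transform(X)              # the reduced feature matrix
print(selected, X_reduced.shape)
```

The reduced matrix can then be passed to the same training and evaluation pipeline as the full feature set, so that the two variants are compared on equal terms.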
3.3. Comparison of Model Performance with Baseline Approach (Non-Model)
In this subsection, we compare the performance of all models against a baseline approach that predicts the dependent variable by always selecting the most frequent class (Non-NPL).
Table 11 presents the performance comparison between the baseline approach and all models under the SMOTEENN resampling technique, using variables selected via Stepwise Selection. The baseline approach (mode classifier) achieved an Accuracy of 0.734 but could not provide meaningful values for Precision, Recall, F1-score, or AUC because it predicts a single class only.
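The behaviour of the mode classifier can be reproduced from the class counts alone, as in the sketch below. It uses the full-dataset counts reported earlier (742 Non-NPL, 272 NPL), which give an accuracy of about 0.732; the reported 0.734 was presumably measured on the evaluation split, whose exact composition is not stated here.

```python
from collections import Counter

# Labels built from the reported class counts: 0 = Non-NPL, 1 = NPL.
labels = [0] * 742 + [1] * 272

# The mode classifier always predicts the most frequent class.
majority_class = Counter(labels).most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# No instance is ever predicted as NPL, so Recall is zero and
# Precision is undefined (zero predicted positives).
true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
predicted_pos = sum(p == 1 for p in predictions)
recall = true_pos / labels.count(1)
precision = true_pos / predicted_pos if predicted_pos else None

print(majority_class, round(accuracy, 3), recall, precision)
```

A high accuracy with zero Recall is what makes this baseline uninformative for NPL detection, and it is why the comparison relies on the full set of metrics rather than Accuracy alone.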
Among the base models, Support Vector Machine and Naïve Bayes showed relatively high Recall (0.870 and 0.889, respectively) but had low Precision and F-measure, reflecting a high false positive rate. Gradient Boosting outperformed other base models in overall balance.
In contrast, all Stacking-based models (META-LR, META-GB, META-XGB, and META-MLP) significantly outperformed the base models and baseline approach across all metrics. META-MLP achieved the highest performance, with an Accuracy of 0.942, F-measure of 0.953, and AUC of 0.990, indicating a strong ability to distinguish between NPL and non-NPL cases effectively and consistently.