Submitted:
21 May 2026
Posted:
22 May 2026
Abstract
Keywords:
1. Introduction
2. Literature Review
3. Materials and Methods
3.1. Dataset Description
3.2. Attribute Description
3.3. Data Preparation
3.4. Algorithm Selection
3.5. Experimental Setup
3.6. Model Evaluation and Performance Metrics
- True Positive (TP) – the model correctly predicts the positive class.
- True Negative (TN) – the model correctly predicts the negative class.
- False Positive (FP) – the model predicts the positive class even though the actual value is negative (a Type I error).
- False Negative (FN) – the model predicts the negative class when the true value is positive (a Type II error).
- Area Under the ROC Curve (AUC)
- Accuracy
- F1 score
- Precision
- Recall
- Matthews Correlation Coefficient (MCC)
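The four confusion-matrix outcomes and the threshold-based metrics listed above are linked by standard textbook formulas. The sketch below is a minimal illustration for the positive class of a binary problem. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the study. AUC is left out because it requires ranked prediction scores rather than the four counts.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard confusion-matrix metrics for the positive class."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to illustrate the call.
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in example.items():
    print(f"{name:9s} {value:.3f}")
```
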
3.6.1. Area Under the ROC Curve (AUC)
3.6.2. Accuracy
3.6.3. F1 Score
3.6.4. Precision
3.6.5. Recall
3.6.6. Matthews Correlation Coefficient (MCC)
4. Results
4.1. Feature Importance Analysis
4.2. Results of the Models
4.3. Stability of the Models
4.4. Confusion Matrix
4.5. Statistical Analysis of the Models
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| AUC | Area Under the ROC Curve |
| FN | False Negative |
| FP | False Positive |
| GB | Gradient Boosting |
| LR | Logistic Regression |
| MCC | Matthews Correlation Coefficient |
| NN | Neural Network |
| RF | Random Forest |
| TN | True Negative |
| TP | True Positive |
References
- Sharma, S.; Panda, S.P.; Verma, S. Modeling and Predicting Student Enrolment: A Meta-Learning Perspective. Int. J. Inf. Technol. 2026.
- Abiola-Adams, O.; Otokiti, B.O.; Olinmah, F.I.; Abutu, D.E.; Okoli, I.; Imohiosen, C. Building Performance Forecasting Models for University Enrollment Using Historical and Transfer Data Analytics. J. Front. Multidiscip. Res. 2021, 2, 162–168.
- APatungan, A.J.; Francia, M.L.M. A Machine Learning Modeling Prediction of Enrollment among Admitted College Applicants at University of Santo Tomas. AIP Conf. Proc. 2022, 020004.
- Soltys, M.; Dang, H.D.; Reilly, G.R.; Soltys, K. Enrollment Predictions with Machine Learning. Strateg. Enroll. Manag. Q. 2021, 9, 11–18. Available online: https://eric.ed.gov/?id=EJ1311204 (accessed on 3 February 2026).
- Yang, L.; Feng, L.; Zhang, L.; Tian, L. Predicting Freshmen Enrollment Based on Machine Learning. J. Supercomput. 2021, 77, 11853–11865.
- Shao, L.; Ieong, M.; Levine, R.A.; Stronach, J.; Fan, J. Machine Learning Methods for Course Enrollment Prediction. Strateg. Enroll. Manag. Q. 2022, 10, 11–29. Available online: https://eric.ed.gov/?id=EJ1356234 (accessed on 4 February 2026).
- Ayasi, B.; Saleh, M.; García-Vico, A.M.; Carmona, C. Predicting Course Enrollment with Machine Learning and Neural Networks: A Comparative Study of Algorithms. In Studies on Social and Education Sciences 2023; Bilen, Ö., Shaaban, E., Eds.; ISTES Organization: Monument, CO, USA, 2023; pp. 157–182. Available online: https://www.aaup.edu/about-university/faculty-members/bahgat-waleed-deeb-ayasi/publications/predicting-course-enrollment (accessed on 4 February 2026).
- Brianorman, Y.; Sucipto, S. Prediction of Prospective New Students Using Decision Tree, Random Forest, and Naive Bayes. BAREKENG J. Ilmu Mat. Dan Terap. 2024, 18, 1433–1446.
- Raftopoulos, G.; Davrazos, G.; Kotsiantis, S. Fair and Transparent Student Admission Prediction Using Machine Learning Models. Algorithms 2024, 17, 572.
- Ramos, M.C. Machine Learning-Based Enrollment Prediction for a Higher Education Institution. Asia Pac. J. Manag. Sustain. Dev. 2024, 12, 15–29.
- Yang, L.; Du, S.; Chen, Y.; Zhang, L. Research on Prediction of College Students’ Registration Based on Machine Learning and Voting Model. Membr. Technol. 2024, 5, 185–195.
- He, S.; Yousefpoori-Naeim, M.; Cui, Y.; Cutumisu, M. Predicting College Enrollment for Low-Socioeconomic-Status Students Using Machine Learning Approaches. Big Data Cogn. Comput. 2025, 9, 99.
- Dey, D.; Haque, Md.S.; Islam, Md.M.; Aishi, U.I.; Shammy, S.S.; Mayen, Md.S.A.; Noor, S.T.A.; Uddin, Md.J. The Proper Application of Logistic Regression Model in Complex Survey Data: A Systematic Review. BMC Med. Res. Methodol. 2025, 25, 15.
- Margalina, V.-M.; Kreienbaum, C.; Hair, J.F.; Becker, J.-M.; Ringle, C.M. Multiple Linear and Logistic Regression Analysis: A SmartPLS 4 Software Tutorial. J. Mark. Anal. 2026.
- O’Connell, N.S.; Jaeger, B.C.; Bullock, G.S.; Speiser, J.L. A Comparison of Random Forest Variable Selection Methods for Regression Modeling of Continuous Outcomes. Brief. Bioinform. 2025, 26, bbaf096.
- Van Jaarsveld, B.; Hauswirth, S.M.; Wanders, N. Machine Learning and Global Vegetation: Random Forests for Downscaling and Gap Filling. Hydrol. Earth Syst. Sci. 2024, 28, 2357–2374.
- Rizkallah, L.W. Enhancing the Performance of Gradient Boosting Trees on Regression Problems. J. Big Data 2025, 12, 35.
- Gunasekara, N.; Pfahringer, B.; Gomes, H.; Bifet, A. Gradient Boosted Trees for Evolving Data Streams. Mach. Learn. 2024, 113, 3325–3352.
- Viswanathan, G.; Samdani, G.; Dixit, Y.; Gopalan, R. Deep Learning. World J. Adv. Eng. Technol. Sci. 2025, 14, 512–527.
- Eshraghian, J.K.; Ward, M.; Neftci, E.O.; Wang, X.; Lenz, G.; Dwivedi, G.; Bennamoun, M.; Jeong, D.S.; Lu, W.D. Training Spiking Neural Networks Using Lessons From Deep Learning. Proc. IEEE 2023, 111, 1016–1054.
- Garouani, M.; Barhrhouj, A.; Teste, O. XStacking: An Effective and Inherently Explainable Framework for Stacked Ensemble Learning. Inf. Fusion 2025, 124, 103358.
- Wu, W.; Tang, L.; Zhao, Z.; Teo, C.-P. Enhancing Binary Classification: A New Stacking Method via Leveraging Computational Geometry. 2024.
- Orange Data Mining. Orange Data Mining Toolbox. Available online: https://orangedatamining.com/ (accessed on 8 March 2026).
- Xu, J.; Liu, X.; Gu, Z.; Xiao, G. A Rapid Cross-Validation Computing for Three-Way Decisions in Imbalanced Data. Inf. Sci. 2025, 707, 122016.
- Qiu, J. An Analysis of Model Evaluation with Cross-Validation: Techniques, Applications, and Recent Advances. Adv. Econ. Manag. Polit. Sci. 2024, 99, 69–72.
- Zeng, G. Invariance Properties and Evaluation Metrics Derived from the Confusion Matrix in Multiclass Classification. Mathematics 2025, 13, 2609.
- Sathyanarayanan, S. Confusion Matrix-Based Performance Evaluation Metrics. Afr. J. Biomed. Res. 2024, 4023–4031.
- Rainio, O.; Teuho, J.; Klén, R. Evaluation Metrics and Statistical Tests for Machine Learning. Sci. Rep. 2024, 14, 6086.
- Sujon, K.M.; Hassan, R.; Choi, K.; Samad, M.A. Accuracy, Precision, Recall, F1-Score, or MCC? Empirical Evidence from Advanced Statistics, ML, and XAI for Evaluating Business Predictive Models. J. Big Data 2025, 12, 268.
- Silwattananusarn, T.; Kanarkard, W.; Tuamsuk, K. Enhanced Classification Accuracy for Cardiotocogram Data with Ensemble Feature Selection and Classifier Ensemble. J. Comput. Commun. 2016, 04, 20–35.
- StatsKingdom. Wilcoxon Signed-Rank Test Calculator. Available online: https://www.statskingdom.com/175wilcoxon_signed_ranks.html (accessed on 31 March 2026).




| Author, Year | Problem | Aim of the Study | Data (Dataset) | Algorithms/Model | Evaluation Metrics | Key Results |
|---|---|---|---|---|---|---|
| Soltys et al., 2021 [4] | Student enrollment prediction | Predict student enrollment under competitive and budget constraints | CSUCI admissions data (2018–2020) | XGBoost | Accuracy, Precision, Recall, Specificity | Optimal threshold = 0.09; FN = 25%, FP = 39% |
| Yang et al., 2021 [5] | Freshman enrollment prediction | Predict freshman enrollment using Machine Learning methods | Guangzhou University data (2009–2012), 10,382 records | Decision Tree, Random Forest, BP Neural Network | Accuracy, Precision, Recall, F1-score | Decision Tree: ACC 62.94% (F1 0.6963); BP Neural Network: ACC 62.07% (F1 0.6967); Random Forest: ACC 60.60% (F1 0.6660) |
| Shao et al., 2022 [6] | Course enrollment prediction | Improve course enrollment prediction using CART and Random Forest | SDSU student data (2010–2019), ~83,000 records | CART, Random Forest, Conditional Probability Analysis | Error rate | Random Forest: 0.8% (best); CART: 15.5%; Flowchart: 21.2% |
| Ayasi et al., 2023 [7] | Course enrollment prediction | Evaluate Machine Learning and Neural Networks for course enrollment prediction | AAUP data (2018–2021), ~9,000 students, 137,000 records, imbalanced (10:1) | Logistic Regression, Stochastic Gradient Descent, k-Nearest Neighbors, CART, Gradient Boosting, Bagging, Support Vector Machine, Random Forest, MLP | Accuracy, Precision, Recall, F1-score | Random Forest: ACC 94%, F1 86% (best); MLP: ACC 91%, F1 79% |
| Brianorman & Sucipto, 2024 [8] | Admission prediction | Predict admission likelihood using classification models | Pontianak University data (2020), 1,892 records | Decision Tree, Random Forest, Naive Bayes | Accuracy, Precision, Recall, F1-score | Random Forest: ACC 59.2%, F1 0.574 (best); Decision Tree: ACC 59.1%; Naive Bayes: ACC 58.1% |
| Raftopoulos, Davrazos & Kotsiantis, 2024 [9] | Admission prediction | Predict admission outcomes with fair and interpretable Machine Learning models | MBA (synthetic, 6,194 cases), LSAC Law School (1,991 cases, 12 features), College Admission (7 features) | Logistic Regression, Decision Tree, Naive Bayes, Ensemble methods | Accuracy, AUC, Precision, Recall, F1-score, Kappa, MCC; Selection Rate, Disparate Impact, TPR, FPR | Logistic Regression: ACC 0.9097, F1 0.9516 (Law, best); Recall 0.9722 (MBA); Naive Bayes: high precision (consistent) |
| Ramos, 2024 [10] | Enrollment prediction | Develop a machine learning model for enrollment prediction and key factor analysis | BSIT data (2019–2023); survey of 76 students | Linear Regression | Accuracy | Predicted: increase in 1st-year (up to 74) and 3rd/4th-year; decrease in 2nd-year (24 to 8) |
| Yang et al., 2024 [11] | Freshman enrollment prediction | Optimize freshman enrollment prediction using a voting ensemble | Chinese university data (6 years), 17,652 records (9,839 / 7,813), 18 features | Decision Tree, Random Forest, BP Neural Network (ensemble) | Accuracy, Precision, Recall, F1-score | Soft Voting: ACC 64.67%, F1 0.69 (best; outperforms single models) |
| He et al., 2025 [12] | Enrollment factor analysis | Identify key factors influencing enrollment (low-SES students) | HSLS:09 (5,223 low-SES 9th-grade students), 28 features | Logistic Regression, k-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest | Accuracy, Macro Precision, Macro Recall, Macro F1-score, ROC-AUC | Random Forest: ACC 67.73%, F1 0.6999 (best); no overfitting |
| Name | Type | Category | Role in the Model | Description |
|---|---|---|---|---|
| Gender | categorical | demographic | input | Gender of the candidate applying for enrollment |
| Place of Residence | categorical | geographic | input | District of the candidate’s permanent residence |
| Residence in the Institution’s City | categorical | geographic | input | Information indicating whether the candidate resides in the same city as the higher education institution |
| Distance from Residence to Higher Education Institution | categorical | geographic | input | Discretized distance between the candidate’s place of residence and the higher education institution, divided into categories |
| Secondary School Type | categorical | educational | input | Type of secondary school completed by the candidate |
| Average Secondary School Grade | numeric | educational | input | Average grade achieved by the candidate during secondary education |
| Attendance of Preparatory Classes | categorical | educational | input | Information indicating whether the candidate attended preparatory classes for the entrance examination |
| Entrance Exam Score | numeric | educational | input | Number of points achieved by the candidate on the entrance examination |
| Total Enrollment Score | numeric | educational | input | Total number of points used for candidate ranking |
| Enrollment Period | categorical | administrative | input | The enrollment period in which the candidate applies |
| Application to Multiple Higher Education Institutions | categorical | administrative | input | Information indicating whether the candidate applied to multiple higher education institutions |
| Study Program | categorical | administrative | input | Study program for which the candidate applies |
| Tuition Funding Status | categorical | social | input | Information indicating whether the candidate is funded through the state budget or self-financed |
| Eligibility for Student Housing | categorical | social | input | Information indicating whether the candidate is eligible for student housing |
| Enrollment Status | categorical | Candidate enrollment decision | target | Indicates whether the candidate completed the enrollment process or not |
| Name | Description |
|---|---|
| Logistic Regression (LR) | LR is a widely used statistical method for analyzing and predicting binary outcomes. In LR, the dependent variable takes two values, such as “yes” or “no.” The model estimates the probability of an event using a logistic function, which maps a linear combination of predictors to the interval [0, 1]. Although the relationship between the predictors and the outcome probability is nonlinear, LR assumes a linear relationship in the log-odds space. The model coefficients are interpreted as changes in log-odds and are often transformed into odds ratios for easier interpretation [13,14]. |
| Random Forest (RF) | RF is a popular machine learning algorithm that uses an ensemble of decision trees for classification and regression tasks. This ensemble method combines predictions from multiple trees trained on different data subsets, using majority voting for classification and averaging for regression, and thereby typically achieves higher accuracy than a single tree. The model is recognized for its ability to handle complex, nonlinear relationships while reducing the risk of overfitting. In addition to prediction, RFs are useful for identifying the most important factors in the data through variable importance measures [15,16]. |
| Gradient Boosting (GB) | GB is an efficient ensemble machine learning method that sequentially builds a set of weak models, most commonly decision trees, to form a single strong predictive model. The key mechanism of this technique is an iterative process in which each new model attempts to correct the errors (residuals) of previous models by using information about the gradient, i.e., the slope of the loss function. In this manner, the algorithm gradually minimizes the overall loss and improves prediction accuracy through an additive process, making it well-suited for solving complex classification and regression problems [17,18]. |
| Neural Network (NN) | NN is a mathematical model that processes input data to generate accurate predictions. Its structure consists of layers of interconnected neurons—input, hidden, and output layers—that serve as universal approximators for solving complex problems. Each neuron computes a weighted sum of its inputs and applies an activation function, enabling the model to recognize and interpret complex relationships in the data. Learning occurs through iterative adjustment of connection weights (most commonly via the backpropagation algorithm), thereby gradually minimizing prediction error [19,20]. |
| Ensemble Model Based on the Stacking Method | The Stacking method is an advanced ensemble learning technique that combines multiple base models to achieve better predictive performance than any individual model. The process occurs at two levels: first, a set of base models is trained, and their predictions or probabilities serve as input features for a new dataset. In this new data space, a meta-model is trained to learn how to optimally integrate the base models’ outputs into the final result. The advantage of this method lies in its ability to synthesize complementary patterns and reduce variance and bias, making the final model more robust and accurate across a wide range of applications [21,22]. |
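The log-odds interpretation of LR described in the table above can be made concrete with a short sketch. The intercept, coefficients and candidate values below are hypothetical and chosen for illustration only; they are not fitted values from the study, and the two feature names merely echo attributes from the attribute table.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p: float) -> float:
    """Inverse of the logistic function (the logit)."""
    return math.log(p / (1.0 - p))


# Hypothetical model parameters and one hypothetical candidate.
intercept = -1.0
coefs = {"entrance_exam_score": 0.04, "attended_prep_classes": 0.6}
candidate = {"entrance_exam_score": 50.0, "attended_prep_classes": 1.0}

# The model is linear in log-odds space ...
z = intercept + sum(coefs[k] * candidate[k] for k in coefs)
# ... and nonlinear in probability space.
p = sigmoid(z)

# exp(coefficient) is the odds ratio for a one-unit increase in a predictor.
odds_ratios = {k: math.exp(b) for k, b in coefs.items()}

print(f"log-odds = {z:.2f}, probability = {p:.3f}")
for name, ratio in odds_ratios.items():
    print(f"odds ratio for {name}: {ratio:.3f}")
```
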
| Model | Hyperparameter | Value |
|---|---|---|
| Logistic Regression (LR) | Regularization strength (C) | 0.900 |
| | Class balancing | Enabled |
| Random Forest (RF) | Number of trees | 300 |
| | Class balancing | Enabled |
| | Maximum tree depth | 3 |
| | Minimum number of instances for splitting | 5 |
| Gradient Boosting (GB) | Number of trees | 300 |
| | Learning rate | 0.050 |
| | Reproducible training | Enabled |
| | Maximum tree depth | 3 |
| Neural Network (NN) | Number of neurons in hidden layers | 100 |
| | Maximum number of iterations | 200 |
| | Reproducible training | Enabled |
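For readers who want to approximate this configuration in code, the hyperparameters above can be mapped onto scikit-learn estimators. This is a sketch under several assumptions. The reference list points to Orange Data Mining, so scikit-learn is a swapped-in toolkit here. The mapping of "class balancing" to `class_weight="balanced"`, of "reproducible training" to a fixed `random_state`, the single hidden layer of 100 neurons, the `max_iter=1000` setting for LR, and the choice of meta-learner and internal folds for stacking are all not given in the table.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

SEED = 0  # assumption: any fixed seed stands in for "reproducible training"

base_models = {
    "LR": LogisticRegression(
        C=0.9,
        class_weight="balanced",  # assumed equivalent of "class balancing"
        max_iter=1000,            # assumption: not listed in the table
    ),
    "RF": RandomForestClassifier(
        n_estimators=300,
        max_depth=3,
        min_samples_split=5,
        class_weight="balanced",
        random_state=SEED,
    ),
    "GB": GradientBoostingClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=3,
        random_state=SEED,
    ),
    "NN": MLPClassifier(
        hidden_layer_sizes=(100,),  # assumption: one hidden layer
        max_iter=200,
        random_state=SEED,
    ),
}

# Assumption: the table does not name the meta-learner or its settings.
stack = StackingClassifier(
    estimators=list(base_models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
```

Each estimator can then be passed to `sklearn.model_selection.cross_validate` with a stratified splitter to obtain per-fold scores.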
| Model | AUC | Accuracy | F1 | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| Logistic Regression (LR) | 0.754 | 0.693 | 0.699 | 0.716 | 0.693 | 0.368 |
| Random Forest (RF) | 0.738 | 0.675 | 0.682 | 0.710 | 0.675 | 0.351 |
| Gradient Boosting (GB) | 0.739 | 0.719 | 0.707 | 0.708 | 0.719 | 0.350 |
| Neural Network (NN) | 0.751 | 0.719 | 0.708 | 0.709 | 0.719 | 0.352 |
| Stacking Ensemble | 0.759 | 0.700 | 0.706 | 0.722 | 0.700 | 0.382 |
| Model | AUC | Accuracy | F1 | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| 5-fold cross-validation | | | | | | |
| Logistic Regression (LR) | 0.755 | 0.693 | 0.699 | 0.717 | 0.693 | 0.371 |
| Random Forest (RF) | 0.740 | 0.673 | 0.681 | 0.710 | 0.673 | 0.349 |
| Gradient Boosting (GB) | 0.736 | 0.716 | 0.704 | 0.705 | 0.716 | 0.343 |
| Neural Network (NN) | 0.752 | 0.721 | 0.712 | 0.712 | 0.721 | 0.361 |
| Stacking Ensemble | 0.760 | 0.699 | 0.705 | 0.722 | 0.699 | 0.382 |
| 10-fold cross-validation | | | | | | |
| Logistic Regression (LR) | 0.754 | 0.693 | 0.699 | 0.716 | 0.693 | 0.368 |
| Random Forest (RF) | 0.738 | 0.675 | 0.682 | 0.710 | 0.675 | 0.351 |
| Gradient Boosting (GB) | 0.739 | 0.719 | 0.707 | 0.708 | 0.719 | 0.350 |
| Neural Network (NN) | 0.751 | 0.719 | 0.708 | 0.709 | 0.719 | 0.352 |
| Stacking Ensemble | 0.759 | 0.700 | 0.706 | 0.722 | 0.700 | 0.382 |
| 20-fold cross-validation | | | | | | |
| Logistic Regression (LR) | 0.754 | 0.693 | 0.699 | 0.716 | 0.693 | 0.368 |
| Random Forest (RF) | 0.739 | 0.676 | 0.683 | 0.713 | 0.676 | 0.355 |
| Gradient Boosting (GB) | 0.737 | 0.710 | 0.697 | 0.698 | 0.710 | 0.328 |
| Neural Network (NN) | 0.751 | 0.718 | 0.708 | 0.708 | 0.718 | 0.351 |
| Stacking Ensemble | 0.759 | 0.700 | 0.706 | 0.722 | 0.700 | 0.382 |
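One simple way to read stability from the table above is the spread (maximum minus minimum) of a metric across the three fold counts. The sketch below does this for MCC, with values copied from the table. Using the spread as a stability indicator is an illustrative choice, not necessarily the criterion applied in the study.

```python
# MCC for 5-, 10- and 20-fold cross-validation, copied from the table above.
mcc_by_folds = {
    "Logistic Regression (LR)": (0.371, 0.368, 0.368),
    "Random Forest (RF)": (0.349, 0.351, 0.355),
    "Gradient Boosting (GB)": (0.343, 0.350, 0.328),
    "Neural Network (NN)": (0.361, 0.352, 0.351),
    "Stacking Ensemble": (0.382, 0.382, 0.382),
}

# Spread across fold counts; rounding removes floating-point noise.
mcc_spread = {
    model: round(max(values) - min(values), 3)
    for model, values in mcc_by_folds.items()
}

most_stable = min(mcc_spread, key=mcc_spread.get)

for model, spread in sorted(mcc_spread.items(), key=lambda kv: kv[1]):
    print(f"{model:26s} MCC spread = {spread:.3f}")
print("Smallest spread:", most_stable)
```

With these values the Stacking Ensemble shows no change in MCC across fold counts, while Gradient Boosting shows the largest spread (0.022).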
| | Predicted 0 | Predicted 1 | ∑ |
|---|---|---|---|
| Actual 0 | 387 | 171 | 558 |
| Actual 1 | 308 | 731 | 1039 |
| ∑ | 695 | 902 | 1597 |
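The confusion matrix above can be turned back into summary metrics as a consistency check. The sketch assumes that precision, recall and F1 are reported as class-weighted averages (weighted by the number of actual instances per class). That convention is an assumption, since the surviving text does not state it, although the identical Accuracy and Recall columns in the results tables are what such averaging would produce.

```python
import math

# Rows are actual classes, columns are predicted classes (from the table above).
cm = [[387, 171],
      [308, 731]]

n = sum(map(sum, cm))                                 # 1597 instances
support = [sum(row) for row in cm]                    # actual counts per class
predicted = [cm[0][j] + cm[1][j] for j in range(2)]   # predicted counts per class

accuracy = (cm[0][0] + cm[1][1]) / n


def weighted(per_class):
    """Average per-class values, weighted by class support."""
    return sum(s * v for s, v in zip(support, per_class)) / n


precision = [cm[c][c] / predicted[c] for c in range(2)]
recall = [cm[c][c] / support[c] for c in range(2)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy  {accuracy:.3f}")
print(f"precision {weighted(precision):.3f}")
print(f"recall    {weighted(recall):.3f}")
print(f"f1        {weighted(f1):.3f}")
print(f"mcc       {mcc:.3f}")
```

Under this assumption the values round to 0.700, 0.722, 0.700, 0.706 and 0.382, which coincide with the Stacking Ensemble row of the 10-fold results.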
| Fold | Logistic Regression (MCC) | Stacking Ensemble (MCC) |
|---|---|---|
| 1 | 0.443 | 0.443 |
| 2 | 0.367 | 0.430 |
| 3 | 0.337 | 0.376 |
| 4 | 0.417 | 0.387 |
| 5 | 0.281 | 0.316 |
| 6 | 0.357 | 0.373 |
| 7 | 0.350 | 0.297 |
| 8 | 0.394 | 0.469 |
| 9 | 0.352 | 0.397 |
| 10 | 0.404 | 0.343 |
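The reference list cites a Wilcoxon signed-rank calculator, so the paired per-fold MCC values above were presumably compared with that test. The sketch below computes only the rank sums by hand, using the values from the table. It drops zero differences and assigns average ranks to ties, which is one common convention and may differ from the calculator's handling. No p-value is computed here; that would come from an exact table or a statistics package.

```python
# Per-fold MCC values copied from the table above.
lr = [0.443, 0.367, 0.337, 0.417, 0.281, 0.357, 0.350, 0.394, 0.352, 0.404]
stacking = [0.443, 0.430, 0.376, 0.387, 0.316, 0.373, 0.297, 0.469, 0.397, 0.343]

# Paired differences (stacking minus LR), rounded to remove float noise.
diffs = [round(s - l, 3) for s, l in zip(stacking, lr)]
# Convention assumed here: zero differences are discarded.
diffs = [d for d in diffs if d != 0]

ordered = sorted(abs(d) for d in diffs)


def average_rank(value: float) -> float:
    """1-based rank of an absolute difference; ties share the mean rank."""
    first = ordered.index(value) + 1
    count = ordered.count(value)
    return first + (count - 1) / 2


w_plus = sum(average_rank(abs(d)) for d in diffs if d > 0)
w_minus = sum(average_rank(abs(d)) for d in diffs if d < 0)
w_stat = min(w_plus, w_minus)

print(f"n = {len(diffs)}, W+ = {w_plus}, W- = {w_minus}, W = {w_stat}")
```
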
| Model | MCC Mean | MCC Standard Deviation |
|---|---|---|
| Logistic Regression (LR) | 0.370 | 0.044 |
| Stacking Ensemble | 0.383 | 0.052 |
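The summary values above can be recomputed from the per-fold MCC table. The sketch uses the population standard deviation (`statistics.pstdev`) because that is the variant that reproduces the tabulated 0.044 and 0.052. The sample standard deviation (`statistics.stdev`) would give roughly 0.046 and 0.055 instead. Which variant the authors intended is not stated, so this is an inference from the numbers.

```python
import statistics

# Per-fold MCC values copied from the per-fold table.
fold_mcc = {
    "Logistic Regression (LR)": [0.443, 0.367, 0.337, 0.417, 0.281,
                                 0.357, 0.350, 0.394, 0.352, 0.404],
    "Stacking Ensemble": [0.443, 0.430, 0.376, 0.387, 0.316,
                          0.373, 0.297, 0.469, 0.397, 0.343],
}

summary = {
    model: {
        "mean": statistics.mean(scores),
        "pstdev": statistics.pstdev(scores),  # divides by n
        "stdev": statistics.stdev(scores),    # divides by n - 1
    }
    for model, scores in fold_mcc.items()
}

for model, stats in summary.items():
    print(f"{model:26s} mean = {stats['mean']:.3f}  "
          f"pstdev = {stats['pstdev']:.3f}  stdev = {stats['stdev']:.3f}")
```
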
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.