Submitted: 12 January 2025
Posted: 14 January 2025
Abstract
Keywords:
1. Introduction
1.1. Background and Business Goal

1.2. Project Goal
1.2.1. Data Exploration
1.2.2. Feature Identification
1.2.3. Predictive Modeling
1.2.4. Insights and Recommendations
2. Data Set Description


3. Data Preprocessing
| Data Type | Feature | Description |
| Numeric | age | The age of the student in numerical value. |
| Numeric | absences | The number of school absences in numerical value. The range is 0-93. |
| Numeric | G1 | First period grade achieved by the student. The range is 0-20. |
| Numeric | G2 | Second period grade achieved by the student. The range is 0-20. |
| Numeric | G3 | Final year grade achieved by the student. This is the target variable. The range is 0-20. |
| Numeric | Fedu | The overall educational qualification of the student’s father represented as a number 0-4: 0 = no education; 1 = primary education till 4th grade; 2 = 5th to 9th grade education; 3 = secondary education; 4 = higher education. |
| Numeric | Medu | The overall educational qualification of the student’s mother represented as a number 0-4 (same levels as Fedu). |
| Numeric | travel time | Time taken for students to travel to school, measured in range 1-4: 1 = less than 15 minutes; 2 = 15-30 minutes; 3 = 30 minutes to 1 hour; 4 = more than 1 hour. |
| Numeric | failures | Number of times the student has failed in a subject, measured in range 0-4: 0 = no failures; 1 = failed one class; 2 = failed two classes; 3 = failed three classes; 4 = failed four or more classes. |
| Numeric | health | The health situation of students, measured in numbers from 1-5, where 1 means very bad and 5 means very good. |
| Numeric | Walc | Students’ weekend alcohol consumption, measured in range 1-5, where 1 is very low and 5 is very high. |
| Categorical | sex | The sex (gender) of the student. "M" stands for Male and "F" stands for Female. |
| Categorical | study time | Total amount of time spent studying per week, measured in range 1-4: 1 = less than 2 hours per week; 2 = 2-5 hours per week; 3 = 5-10 hours per week; 4 = more than 10 hours per week. |
| Categorical | activities | Whether the student participates in extracurricular activities, recorded as “Yes” or “No”. |
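The coded ranges in the feature table lend themselves to a simple sanity check before modeling. The following is a minimal sketch, not the authors' preprocessing code: the names `VALID_RANGES`, `BINARY_CODES`, and `check_and_encode` are hypothetical, the 0/1 coding of `sex` and `activities` is an assumption, and the sample record is made up. No range is checked for `age` because the table does not state one.

```python
# Illustrative schema transcribed from the feature table; the names below
# (VALID_RANGES, BINARY_CODES, check_and_encode) are hypothetical.
VALID_RANGES = {
    "absences": (0, 93),
    "G1": (0, 20),
    "G2": (0, 20),
    "G3": (0, 20),
    "Fedu": (0, 4),
    "Medu": (0, 4),
    "travel time": (1, 4),
    "failures": (0, 4),
    "health": (1, 5),
    "Walc": (1, 5),
    "study time": (1, 4),
}

# Assumed 0/1 coding for the two-level categorical features.
BINARY_CODES = {
    "sex": {"F": 0, "M": 1},
    "activities": {"No": 0, "Yes": 1},
}


def check_and_encode(record):
    """Return (encoded_record, problems) for one student record (a dict)."""
    encoded = dict(record)
    problems = []
    # Flag any coded feature that is missing or outside its documented range.
    for feature, (low, high) in VALID_RANGES.items():
        value = record.get(feature)
        if value is None or not low <= value <= high:
            problems.append(f"{feature}={value!r} outside {low}-{high}")
    # Map the two-level categorical features to integers.
    for feature, codes in BINARY_CODES.items():
        value = record.get(feature)
        if value in codes:
            encoded[feature] = codes[value]
        else:
            problems.append(f"{feature}={value!r} is not one of {sorted(codes)}")
    return encoded, problems


# Made-up record for illustration only.
student = {
    "age": 17, "absences": 4, "G1": 12, "G2": 13, "G3": 13,
    "Fedu": 2, "Medu": 3, "travel time": 1, "failures": 0,
    "health": 4, "Walc": 2, "sex": "F", "study time": 2,
    "activities": "Yes",
}
encoded, problems = check_and_encode(student)
print(encoded["sex"], encoded["activities"], problems)
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file in a single pass.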
4. Target Variable and Data Quality
4.1. High-Level Statistics

| Percentile | Study Time |
| 25% (first quartile) | 1 |
| 50% (second quartile) | 2 |
| 75% (third quartile) | 2 |

| Percentile | Absences |
| 25% (first quartile) | 0 |
| 50% (second quartile) | 4 or fewer (≤ 4) |
| 75% (third quartile) | 8 or fewer (≤ 8) |
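Quartiles like those tabulated above can be computed with Python's standard library. This is a minimal sketch under stated assumptions: the `absences` list is synthetic, chosen only so that the result is easy to follow, and the interpolation method the authors used is not stated in the text, so `method="inclusive"` is an assumption.

```python
import statistics

# Synthetic absence counts, made up for illustration (not the study's data).
absences = [0, 0, 0, 2, 4, 6, 8, 10, 16]

# method="inclusive" treats the sample minimum and maximum as the 0th and
# 100th percentiles and interpolates linearly between sorted values.
q1, q2, q3 = statistics.quantiles(absences, n=4, method="inclusive")
print(f"Q1={q1}, median={q2}, Q3={q3}")
```

For the coded study time feature the quartiles are level codes (1-4), not hours, so they are best read against the level definitions in the feature table.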
5. Proposed Methodology
5.1. Data Collection and Description
5.2. Data Preprocessing
5.3. Exploratory Data Analysis (EDA)
5.4. Predictive Modeling
5.5. Linear Regression
5.6. Model Validation
5.7. Insights and Recommendations
6. Identification of Data Issues

7. Refining the Dataset: Correlation, Outliers, and Techniques
7.1. Correlation Analysis
- Lower Bound: $Q_1 - 1.5 \times \mathrm{IQR}$
- Upper Bound: $Q_3 + 1.5 \times \mathrm{IQR}$
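The two bounds above can be turned into an outlier filter in a few lines. The sketch below is illustrative only: `iqr_bounds` is a hypothetical helper, the data are synthetic, and the quartile interpolation method (`"inclusive"`) is an assumption, since the text does not say how Q1 and Q3 were computed.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences computed from Q1, Q3 and the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Synthetic absence counts, made up for illustration (not the study's data).
absences = [0, 0, 0, 2, 4, 6, 8, 10, 40]
lower, upper = iqr_bounds(absences)

# A value is flagged when it falls outside the closed interval [lower, upper].
outliers = [v for v in absences if v < lower or v > upper]
print(lower, upper, outliers)
```

Here Q1 = 0 and Q3 = 8, so IQR = 8, the fences are -12 and 20, and only the value 40 is flagged. Because absences cannot be negative, the lower fence has no practical effect on this feature.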


7.1. Exploratory Data Analysis (EDA)
- Pair Plots: Highlighted relationships between variables, such as the correlation between G3 (target variable) and other predictors.
- Box Plots: Used for identifying outliers through the IQR method.
- Histograms: Showed the distributions of key features, such as absences and G3, so that they could be compared.
- Residual Plots: Confirmed the absence of systematic bias and validated model assumptions.



7.2. Correlation Analysis
7.3. Predictive Modeling
7.4. Rationale for Technique Selection
7.6. Model Validation


8. Conclusion
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).