Submitted:
17 August 2024
Posted:
20 August 2024
Abstract
Keywords:
1. Introduction
2. Literature Review
- Integration of Big Data and Machine Learning for Real-Time Monitoring: Unlike many studies that focus solely on predictive modelling, this research integrates Big Data analytics with machine learning to enable real-time monitoring and interventions. This approach not only predicts at-risk students but also provides a framework for timely and personalized interventions by educational authorities.
- Focus on Early-Year Subjects: By identifying early-year subjects such as Mechanics and Materials, Design of Machine Elements, and Instrumentation and Control as critical factors influencing attrition, the study highlights the importance of early intervention. This focus on the longitudinal impact of specific subjects provides actionable insights for curriculum designers and educators.
- Systematic Exploration and Hyperparameter Optimization: The study conducts preliminary trials to refine machine learning models, establish evaluation standards, and optimize hyperparameters systematically. This rigorous approach ensures the robustness and reliability of the predictive model.
- Application of Random Forest Algorithm: The use of the random forest algorithm, known for its high prediction accuracy and ability to handle large datasets with many features, is another key contribution. The study justifies the selection of this algorithm and demonstrates its effectiveness in reducing overfitting and improving prediction accuracy.
3. Materials and Methods
3.1. Introduction
- Elective courses are not taken into consideration.
- For students who remodule, only the final grade is recorded; the failing grades are therefore not registered on this datasheet.
- Failed grades are not registered.
- A student who dropped out may have attempted and failed further modules, but because failed grades are not registered, these attempts do not appear in the data.
- Students granted exemptions were assigned a B grade in the datasheet to reflect the average performance of the course.
- The cohort only included students enrolled from year 1, with entry requirements of A-levels, university foundation programmes, or equivalent.
- Data Completeness: We assume that the datasets from educational institutions are complete and accurately reflect student performance and demographics. This is crucial, as the model’s accuracy depends significantly on the quality of input data. In practice, we mitigate the risk of incomplete data by applying data imputation techniques and liaising with educational institutions to understand and fill gaps in data collection processes.
- General Education Framework: The model presupposes a relatively uniform educational structure within regions being analysed. This assumption allows us to generalise the predictive factors across different institutions within the same educational system. We validate this assumption through preliminary analysis of educational systems and curricula before model deployment.
- Consistency in Course Impact: We posit that certain courses have a more pronounced impact on student attrition rates across different institutions. This is based on historical data showing consistent patterns of student performance in key subjects that correlate with dropout rates. To ensure this assumption holds, we continuously update and recalibrate our model as new data becomes available, ensuring it reflects the most current educational trends.
- Student Behaviour Consistency: The model assumes that student behaviour and their impacts on attrition are consistent over time. While this may not capture new emerging trends immediately, the model includes mechanisms for periodic reassessment to integrate new behavioural patterns and external factors affecting student engagement and success.
- Socioeconomic Factors: It is assumed that socioeconomic factors influencing student attrition are similar within the data sample. This assumption allows us to apply the model across similar demographic groups but requires careful consideration when applying the model to regions with differing socioeconomic landscapes. We justify this by conducting localised studies to understand the socioeconomic dynamics before applying the model in a new region.
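The imputation step mentioned under the data-completeness assumption can be sketched as follows. This is a minimal illustration only: the column names (`cohort`, `grade_points`) and values are hypothetical, and the paper does not specify which imputation technique was used; cohort-mean filling is one simple choice.

```python
import pandas as pd

# Hypothetical student records; one grade_points value is missing
# to illustrate imputation.
df = pd.DataFrame({
    "cohort": [1, 1, 2, 2],
    "grade_points": [3.0, None, 2.5, 3.5],
})

# Fill each missing value with the mean of its cohort.
df["grade_points"] = df.groupby("cohort")["grade_points"].transform(
    lambda s: s.fillna(s.mean())
)

print(df["grade_points"].tolist())  # [3.0, 3.0, 2.5, 3.5]
```

Grouped imputation preserves between-cohort differences that a global mean would wash out, which matters when cohorts differ systematically in performance.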
3.2. Procedure of Solution
Overview
Computational Tools and Libraries
- NumPy: Utilized for its comprehensive support for large, multi-dimensional arrays and matrices. This library enhances the efficiency of the mathematical operations essential to machine learning and data analysis, making it indispensable for handling the volume and complexity of the dataset used, and for the efficient data manipulation the scale of our educational data requires.
- Pandas: Employed for data manipulation and cleaning. This library’s functionality allows for the efficient handling of missing data, transformation, and preparation of the dataset by enabling operations such as filtering, merging, and data type conversions which are crucial for preparing the dataset for analysis and predictive modeling.
- Matplotlib and Seaborn: These libraries are used for visualizing data. Visualization aids the preliminary analysis by helping identify patterns, outliers, and the distribution of data, which are critical for strategic decision-making in model training, in this case the prediction of dropout rates.
- Scikit-learn: Chosen for implementing the random forest classifier. This library provides versatile tools for data mining and data analysis, allowing for robust model building, training, and validation. This was selected for its effectiveness in reducing overfitting and improving prediction accuracy in diverse datasets.
Justification for the Use of Random Forest and Parameter Selection
Data Cleaning Process
- Conversion and Cleaning: The raw data, initially in Excel format, was converted to a CSV format to standardize the data input process for use in Python. This step was crucial as it ensured compatibility with the Pandas library for subsequent manipulations.
- Error Checking and Noise Reduction: The dataset was meticulously checked for errors such as misspellings, incorrect punctuation, and inconsistent spacing which could lead to inaccuracies in the model. Irrelevant data points such as unnecessary identifiers were removed to streamline the dataset and focus the model on relevant features.
- Normalization and Encoding: Numerical normalization and categorical data encoding were performed to standardize the scales and transform categorical variables, such as grades and nationality, into a format suitable for machine learning analysis.
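The cleaning steps above can be sketched as follows. The records, column names, and grade values are invented for illustration, and the specific transformers (label encoding, min-max scaling) are one plausible choice; the paper does not name the exact encoding scheme used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Illustrative raw records mimicking the issues described above.
raw = pd.DataFrame({
    "student_id": [101, 102, 103],   # irrelevant identifier, to be removed
    "grade": [" B+", "A-", "b+ "],   # inconsistent spacing and casing
    "nationality": ["SG", "MY", "SG"],
    "score": [68.0, 81.0, 70.0],
})

# Noise reduction: drop identifiers, normalize spacing and casing.
df = raw.drop(columns=["student_id"])
df["grade"] = df["grade"].str.strip().str.upper()

# Encode categoricals and normalize the numeric scale to [0, 1].
for col in ["grade", "nationality"]:
    df[col] = LabelEncoder().fit_transform(df[col])
df["score"] = MinMaxScaler().fit_transform(df[["score"]]).ravel()

print(df)
```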
Model Training and Testing
- Demographic Information: Understanding the socioeconomic and cultural background of students helps tailor the predictive capabilities.
- Academic Performance Data: Access to comprehensive performance metrics across various subjects is vital to identify at-risk students early.
- Institutional Data: Information on the educational system’s structure, including course offerings and academic policies, is necessary for contextual adaptation.
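Assuming the three data categories above are assembled into a single feature matrix, the train/test split and model fit might look like the sketch below. Synthetic data stands in for the real records, and the 80/20 split shown is illustrative (the effect of varying this ratio is examined in Section 4.2).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined demographic/academic feature matrix.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold out 20% of students for testing; fit on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```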
4. Results and Discussion
4.1. Introduction
4.2. Effect of Varying Test Size versus Training Size
4.3. Average Grades of Each Module
4.4. In-Depth Analysis of Subjects Per Classification
4.5. Cohort Performance Analysis
4.6. Cross-Validation of Results for Accuracy
4.7. Longitudinal Effect on Student Attrition
4.8. Pros and Cons of Using Machine Learning
Advantages of Machine Learning
- Enhanced Predictive Accuracy: ML algorithms are capable of processing and learning from vast amounts of data, detecting complex patterns that are not apparent through manual analysis or simpler models. This capacity significantly improves prediction accuracy over traditional methods, which often rely on fewer variables and assume linear relationships.
- Automation and Efficiency: ML automates the analysis of large datasets, reducing the reliance on manual data handling and analysis. This can lead to significant time savings and resource efficiency, particularly valuable in educational settings where early detection of at-risk students can lead to timely and effective interventions.
- Scalability: Unlike traditional methods that may become cumbersome as dataset sizes increase, ML algorithms excel at scaling with data, maintaining their effectiveness across larger and more diverse datasets.
Limitations of Machine Learning
- Risk of False Positives: One of the significant challenges with ML is the risk of generating false positives, i.e., incorrectly predicting that a student may drop out. This can lead to unnecessary interventions, potentially wasting resources and adversely affecting the students involved.
- Data Privacy and Security: The need for substantial data to train ML models raises concerns about privacy and data security, especially when sensitive student information is involved. Ensuring the integrity and security of this data is paramount but can be resource-intensive.
- Complexity and Resource Requirements: ML models, particularly those like random forests or neural networks, are complex and require significant computational resources and expertise to develop, maintain, and interpret. This may pose barriers for institutions without sufficient technical staff or infrastructure.
Comparison with Other Methods
Justification for Discussing Pros and Cons
- Data Collection: Detailed the process of gathering data from institutional databases, including student demographics, academic records, and survey responses.
- Feature Selection: Employed statistical methods and domain expertise to select relevant features, such as socio-economic background, psychological factors, and academic performance indicators.
- Machine Learning Algorithms: Utilized Random Forests due to their robustness and ability to handle large datasets with missing values. Other algorithms considered include Support Vector Machines and Neural Networks.
- Hyperparameter Optimization: Implemented grid search and cross-validation techniques to fine-tune hyperparameters, ensuring optimal model performance and avoiding overfitting.
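The grid-search-with-cross-validation step described above can be sketched as follows. The parameter grid shown is a small illustrative example; the paper's actual search space and tuned values are not specified here, and synthetic data stands in for the student records.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared feature matrix and dropout labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Illustrative grid; a real search would cover more hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation guards against overfitting
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)
```

Each grid point is scored as the mean accuracy over the five folds, so the selected hyperparameters reflect generalization performance rather than fit to a single split.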
5. Conclusion
6. Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chen, X. (2015). STEM attrition among high-performing college students: Scope and potential causes. Journal of Technology and Science Education, 5(1), 41-59. [CrossRef]
- Christle, C. A., Jolivette, K., & Nelson, C. M. (2007). School characteristics related to high school dropout rates. Remedial and Special Education, 28(6), 325-339. [CrossRef]
- Lee, J., Lapira, E., Bagheri, B., & Kao, H. A. (2013). Recent advances and trends in predictive manufacturing systems in big data environment. Manufacturing letters, 1(1), 38-41. [CrossRef]
- Del Bonifro, F., Gabbrielli, M., Lisanti, G., & Zingaro, S. P. (2020). Student dropout prediction. In International Conference on Artificial Intelligence in Education.
- Sanders, M. (2009). STEM, STEM education, STEMmania. The Technology Teacher.
- Martín-Páez, T., Aguilera, D., Perales-Palacios, F. J., & Vílchez-González, J. M. (2019). What are we talking about when we talk about STEM education? A review of literature. Science Education, 103(4), 799-822. [CrossRef]
- Merrill, C., & Daugherty, J. (2009). The future of TE masters degrees: STEM.
- Zollman, A. (2012). Learning for STEM literacy: STEM literacy for learning. School Science and Mathematics, 112(1), 12-19. [CrossRef]
- Favaretto, M., De Clercq, E., Schneble, C. O., & Elger, B. S. (2020). What is your definition of Big Data? Researchers’ understanding of the phenomenon of the decade. PloS one, 15(2), e0228987. [CrossRef]
- Kitchin, R., & Lauriault, T. P. (2015). Small data in the era of big data. GeoJournal, 80(4), 463-475. [CrossRef]
- Dash, S., Shakyawar, S. K., Sharma, M., & Kaushik, S. (2019). Big data in healthcare: management, analysis and future prospects. Journal of Big Data, 6, 54. [CrossRef]
- Wang, L., & Alexander, C. A. (2015). Big data in design and manufacturing engineering. American Journal of Engineering and Applied Sciences, 8(2), 223.
- Wu, J., Guo, S., Li, J., & Zeng, D. (2016). Big data meet green challenges: Big data toward green applications. IEEE Systems Journal, 10(3), 888-900. [CrossRef]
- El Naqa, I., & Murphy, M. J. (2015). What is machine learning? (pp. 3-11). Springer International Publishing.
- Xie, J., Yu, F. R., Huang, T., Xie, R., Liu, J., Wang, C., & Liu, Y. (2018). A survey of machine learning techniques applied to software defined networking (SDN): Research issues and challenges. IEEE Communications Surveys & Tutorials, 21(1), 393-430. [CrossRef]
- Aung, K.H.H.; Kok, C.L.; Koh, Y.Y.; Teo, T.H. An Embedded Machine Learning Fault Detection System for Electric Fan Drive. Electronics 2024, 13, 493. [CrossRef]
- Kok, C.L.; Ho, C.K.; Tan, F.K.; Koh, Y.Y. Machine Learning-Based Feature Extraction and Classification of EMG Signals for Intuitive Prosthetic Control. Appl. Sci. 2024, 14, 5784. [CrossRef]
- Chen, J.; Teo, T.H.; Kok, C.L.; Koh, Y.Y. A Novel Single-Word Speech Recognition on Embedded Systems Using a Convolution Neuron Network with Improved Out-of-Distribution Detection. Electronics 2024, 13, 530. [CrossRef]
- Kok, C.L.; Ho, C.K.; Dai, Y.; Lee, T.K.; Koh, Y.Y.; Chai, J.P. A Novel and Self-Calibrating Weighing Sensor with Intelligent Peristaltic Pump Control for Real-Time Closed-Loop Infusion Monitoring in IoT-Enabled Sustainable Medical Devices. Electronics 2024, 13, 1724. [CrossRef]
- Kok, C.L.; Dai, Y.; Lee, T.K.; Koh, Y.Y.; Teo, T.H.; Chai, J.P. A Novel Low-Cost Capacitance Sensor Solution for Real-Time Bubble Monitoring in Medical Infusion Devices. Electronics 2024, 13, 1111. [CrossRef]
- Kok, C.L.; Ho, C.K.; Lee, T.K.; Loo, Z.Y.; Koh, Y.Y.; Chai, J.P. A Novel and Low-Cost Cloud-Enabled IoT Integration for Sustainable Remote Intravenous Therapy Management. Electronics 2024, 13, 1801. [CrossRef]
- C. Romero and S. Ventura, “Educational data mining: A review of the state of the art,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 6, pp. 601-618, 2010. [CrossRef]
- G. Siemens and R. S. J. d. Baker, “Learning analytics and educational data mining: towards communication and collaboration,” in Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, 2012, pp. 252-254.
- R. S. J. d. Baker and K. Yacef, “The state of educational data mining in 2009: A review and future visions,” Journal of Educational Data Mining, vol. 1, no. 1, pp. 3-17, 2009.
- R. S. J. d. Baker and G. Siemens, “Educational data mining and learning analytics,” in Cambridge Handbook of the Learning Sciences, 2nd ed., Cambridge University Press, 2014.
- A. Pardo and G. Siemens, “Ethical and privacy principles for learning analytics,” British Journal of Educational Technology, vol. 45, no. 3, pp. 438-450, 2014. [CrossRef]
- C. A. Shaffer et al., “The role of educational data mining in improving learning outcomes: A case study,” in Proceedings of the 4th International Conference on Educational Data Mining, 2011, pp. 11-20.
- H. Drachsler and W. Greller, “Privacy and analytics: it’s a DELICATE issue,” in Proceedings of the 6th International Conference on Learning Analytics & Knowledge, 2016, pp. 89-98.
- G. Siemens et al., “Open learning analytics: an integrated & modularized platform,” Proceedings of the 1st International Conference on Learning Analytics and Knowledge, 2011.
- S. K. D’Mello, “Improving student success using educational data mining techniques: Predictive modeling and intervention development,” IEEE Transactions on Learning Technologies, vol. 9, no. 2, pp. 108-114, 2016.
- C. Romero and S. Ventura, “Educational data mining: A survey from 1995 to 2005,” Expert Systems with Applications, vol. 33, no. 1, pp. 135-146, 2007. [CrossRef]
- J. A. Rice, Learning analytics: Understanding, improving, and applying insights from educational data, Taylor & Francis, 2017.
- R. Ferguson, “The state of learning analytics in 2012: A review and future challenges,” Technical Report, 2012.
- Delen, D. (2011). Predicting student attrition with data mining methods. Journal of College Student Retention: Research, Theory & Practice, 13(1), 17-35. [CrossRef]
- Barramuño, M., Meza-Narváez, C., & Gálvez-García, G. (2022). Prediction of student attrition risk using machine learning. Journal of Applied Research in Higher Education, 14(3), 974-986. [CrossRef]
- Binu, V. S., Mayya, S. S., & Dhar, M. (2014). Some basic aspects of statistical methods and sample size determination in health science research. Ayu, 35(2), 119. [CrossRef]
- Hao, J., & Ho, T. K. (2019). Machine learning made easy: a review of scikit-learn package in python programming language. Journal of Educational and Behavioral Statistics, 44(3), 348-361. [CrossRef]
- L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001. [Online]. Available: https://link.springer.com/article/10.1023/A:1010933404324. [CrossRef]
- J. Brownlee, “Why Use Random Forest for Machine Learning?,” Machine Learning Mastery, 2020. [Online]. Available: https://machinelearningmastery.com/why-use-random-forest/.
- W. McKinney, “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython,” 2nd ed., O’Reilly Media, 2017.
| NO. | GRADE |
|---|---|
| 0 | A+ |
| 16 | A+ |
| 13 | A- |
| 26 | B+ |
| 48 | B |
| 37 | B- |
| 12 | C+ |
| 7 | C |
| 38 | C- |
| Classification | A | B | C |
|---|---|---|---|
| COMPUTER SCIENCE | 27 | 71 | 99 |
| ELECTRICAL AND ELECTRONIC ENGINEERING | 36 | 65 | 96 |
| GENERAL ENGINEERING | 45 | 68 | 84 |
| MANAGEMENT AND OPERATIONS | 39 | 64 | 95 |
| MATERIALS | 35 | 58 | 98 |
| MATHEMATICS | 51 | 55 | 100 |
| MECHANICS | 30 | 56 | 72 |
| THERMOFLUIDS | 36 | 69 | 92 |
| THESIS | 41 | 74 | 82 |
| COHORT | SD | Mean |
|---|---|---|
| 3 | 2.18 | B |
| 6 | 2.19 | B+ |
| 4 | 2.19 | B |
| 11 | 2.41 | B |
| 5 | 2.41 | C+ |
| 10 | 2.42 | B |
| 2 | 2.45 | B |
| 8 | 2.74 | B- |
| 1 | 2.78 | B- |
| 7 | 2.80 | B |
| 12 | 2.89 | B |
| 9 | 2.94 | B |
| 13 | 3.05 | B |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).