Submitted:
23 July 2025
Posted:
24 July 2025
Abstract
Keywords:
1. Introduction
- We constructed a real-world dataset of 2,238 AI-related courses collected from Udemy using multiple web scraping sessions, followed by rigorous cleaning and preprocessing to ensure data quality.
- A novel hybrid recommendation architecture is introduced, combining TF-IDF for lexical feature extraction, BERT embeddings for contextual semantic representation, and a Random Forest classifier to enhance predictive accuracy.
- The proposed system addresses the cold-start problem effectively by relying solely on course metadata, eliminating the need for historical user interaction data.
- Empirical evaluation shows that the proposed approach outperforms the baseline recommenders compared in Section 5.4, achieving a recommendation accuracy of 91.25% and an F1-score of 90.77%.
- The entire system is implemented and deployed as a real-time interactive web application using Flask, providing users with immediate and highly relevant AI course recommendations in a user-friendly interface.
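As a rough illustration of the metadata-only idea in the contributions above, the following minimal sketch ranks courses against a free-text query using TF-IDF weights and cosine similarity. It covers only the lexical branch: the BERT embeddings and the Random Forest stage are omitted. The course records, the `text` field, and all function names are hypothetical, and the smoothed IDF, log((1 + N) / (1 + df)) + 1, is one common variant assumed here rather than the weighting used in the paper.

```python
import math
from collections import Counter

# Hypothetical course metadata; "text" stands in for a cleaned title/description.
COURSES = [
    {"name": "Deep Learning with Python", "text": "deep learning neural networks python"},
    {"name": "NLP Fundamentals", "text": "natural language processing text python"},
    {"name": "Computer Vision Basics", "text": "image recognition computer vision opencv"},
]


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector (dict); terms unseen in the corpus are ignored."""
    counts = Counter(tok for tok in tokens if tok in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(query, courses, top_k=2):
    """Return up to top_k course names with non-zero similarity to the query."""
    docs = [course["text"].lower().split() for course in courses]
    idf = build_idf(docs)
    query_vec = vectorize(query.lower().split(), idf)
    scored = [
        (cosine(query_vec, vectorize(doc, idf)), course["name"])
        for doc, course in zip(docs, courses)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


print(recommend("neural networks deep learning", COURSES))
```

Because the ranking depends only on course text, a newly added course can be scored without any interaction history.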
2. Literature Review
3. Theoretical Backgrounds
3.1. Techniques Used in Feature Extraction
3.1.1. TF-IDF Vectorization
3.1.2. Bidirectional Encoder Representations from Transformers (BERT)
3.2. Random Forest
4. Methodology
4.1. Data Collection
4.2. Data Preprocessing
- Loading the Dataset: The dataset was loaded with Python and the Pandas library, which made it straightforward to handle course attributes such as name, description, instructor name, price, and rating. This step prepared the dataset for further processing.
- Data Cleaning: Several steps ensured that the dataset was consistent and accurate, including text normalization, removal of extra spaces and special characters, and standardization of numerical fields. These procedures standardized textual and numerical values so that later stages could process them faster.
- Handling Missing Values: While processing the dataset, we identified missing information, including instructor names and course descriptions, which required correction.
- Removing Duplicates: Some courses were duplicated because several scraping sessions were combined. We initially used the Pandas drop_duplicates() function to remove duplicate rows, but some courses still appeared twice because they differed slightly in other columns. Since those courses shared the same URL, we removed duplicates based on the course URL (Course-URL), which ensured that each course appeared only once and kept the dataset accurate and well structured.
- Text Preprocessing: We applied natural language preprocessing to the course titles and descriptions so that similarities between them could be calculated. The techniques were:
  - Lemmatization & Tokenization: The course names were tokenized into individual words and lemmatized to reduce each word to its base form (e.g., “learning” → “learn”).
  - Stop Word Removal: Common words that carry little descriptive meaning in context were removed to make the dataset more informative and efficient for machine learning models.
- Final Dataset and Export: The cleaned dataset was used for feature extraction and the development of the recommendation model.
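The preprocessing steps above can be sketched as follows. This is a minimal, standard-library stand-in for the Pandas workflow described in the text, not the authors' code: the field names, the example rows, and the short stop word list are assumptions, and lemmatization is omitted because it needs an external NLP library.

```python
import re

# Tiny illustrative stop word list (an assumption; a real pipeline would use a fuller one).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def normalize_text(text):
    """Lowercase, replace special characters with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into tokens and drop stop words."""
    return [tok for tok in normalize_text(text).split() if tok not in STOP_WORDS]


def drop_duplicate_urls(rows, key="course_url"):
    """Keep only the first row seen for each course URL."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Hypothetical scraped rows; the first two are the same course from two sessions.
courses = [
    {"course_url": "https://example.com/course/nlp-basics",
     "course_name": "Learning NLP: The Basics!", "course_rating": "4.5"},
    {"course_url": "https://example.com/course/nlp-basics",
     "course_name": "Learning NLP: The Basics!", "course_rating": "4.6"},
    {"course_url": "https://example.com/course/intro-ml",
     "course_name": "  Intro to   Machine Learning ", "course_rating": "4.2"},
]

unique_courses = drop_duplicate_urls(courses)
for course in unique_courses:
    course["tokens"] = tokenize(course["course_name"])
    course["course_rating"] = float(course["course_rating"])  # standardize numeric field

print(unique_courses)
```

The two rows for the first course differ only in rating, so exact row-level deduplication would keep both, whereas keying on the URL keeps one. In Pandas, the same effect is obtained with `drop_duplicates(subset=...)` on the URL column.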
4.3. Feature Extraction
4.3.1. Term Frequency-Inverse Document Frequency (TF-IDF)
4.3.2. BERT Embeddings
4.3.3. Fuzzy Matching for User Queries
4.4. Random Forest Model
4.5. Content-Based Filtering
4.6. Evaluation Metrics
- Accuracy: the number of courses classified correctly divided by the total number of courses. It provides a rough overall estimate of the system’s performance. The formula is: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives.
- Precision: the proportion of recommended courses that were relevant, i.e., true positives divided by all predicted positive cases. The formula is: $\text{Precision} = \frac{TP}{TP + FP}$
- Recall: the proportion of relevant courses that were recommended, i.e., true positives divided by the total number of actually relevant courses. The formula is: $\text{Recall} = \frac{TP}{TP + FN}$
- F1-Score: the harmonic mean of precision and recall, which provides a trade-off between the two. It is most useful when both false positives and false negatives matter. The formula is: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- Mean Squared Error (MSE): the mean of the squared differences between the predicted and true relevance values of the courses. It estimates how far the recommendations deviate from the expected relevance. The formula is: $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the true relevance, $\hat{y}_i$ the predicted relevance, and $n$ the number of courses.
- Mean Absolute Error (MAE): the mean of the absolute differences between the predicted and true relevance values: $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
- Mean Relative Error (MRE): the mean absolute difference between predicted and true values expressed as a fraction of the true value: $\text{MRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i|}$. It is useful when the size of the error relative to the true values matters more than the absolute error itself.
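The metrics above can be computed with a few lines of Python. The sketch below uses their standard definitions. The function names and the toy labels are illustrative assumptions, and skipping zero-valued targets in the MRE is a choice made here to avoid division by zero, not something stated in the paper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def error_metrics(y_true, y_hat):
    """MSE, MAE, and MRE between true and predicted relevance values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_hat)]
    # Assumption: targets equal to zero are skipped when computing the relative error.
    relative = [abs(e) / abs(t) for e, t in zip(errors, y_true) if t != 0]
    return {
        "mse": sum(e * e for e in errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "mre": sum(relative) / len(relative) if relative else 0.0,
    }


# Toy data, purely illustrative.
cls = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
err = error_metrics([4.0, 2.0, 5.0, 1.0], [3.0, 2.0, 4.0, 2.0])
print(cls)
print(err)
```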
4.7. Flask Web Application
5. Results and Discussion
5.1. Results of the Feature Extraction and Representation
5.1.1. TF-IDF (Term Frequency-Inverse Document Frequency)
5.1.2. BERT (Bidirectional Encoder Representations from Transformers) Embeddings
5.2. Model Training and Evaluation
5.2.1. Performance Analysis
5.3. Deployment of the Flask Web Application
5.4. Comparing the Proposed System with Other Recommenders
5.5. Use Case Scenario
6. Conclusions and Future Works
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gm, D., Goudar, R. H., Kulkarni, A. A., Rathod, V. N., & Hukkeri, G. S. A digital recommendation system for personalized learning to enhance online education: A review. IEEE Access, 2024; 12: 34019-34041. [CrossRef]
- Yurchenko, A., Drushlyak, M., Sapozhnykov, S., Teplytska, A., Koroliova, L., & Semenikhina, O. Using Online IT-Industry Courses in Computer Sciences Specialists’ Training. International Journal of Computer Science & Network Security, 2021; 21(11): 97-104. [CrossRef]
- Madhavi, A., Nagesh, A., & Govardhan, A. A study on E-Learning and recommendation system. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science), 2022; 15(5): 748-764. [CrossRef]
- Urdaneta-Ponte, M. C., Mendez-Zorrilla, A., & Oleagordia-Ruiz, I. Recommendation systems for education: Systematic review. Electronics, 2021; 10(14): 1611. [CrossRef]
- Algarni, S., & Sheldon, F. Systematic Review of Recommendation Systems for Course Selection. Machine Learning and Knowledge Extraction, 2023; 5(2): 560-596. [CrossRef]
- Hassan, R. H., Hassan, M. T., Sameem, M. S. I., & Rafique, M. A. Personality-Aware Course Recommender System Using Deep Learning for Technical and Vocational Education and Training. Information, 2024; 15(12): 803. [CrossRef]
- Zhang, S., Yao, L., Sun, A., & Tay, Y. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 2019; 52(1): 1-38. [CrossRef]
- Zhong, M., & Ding, R. Design of a personalized recommendation system for learning resources based on collaborative filtering. International Journal of Circuits, Systems and Signal Processing, 2022; 16(1): 122-131. [CrossRef]
- Guo, Q., Zhuang, F., Qin, C., Zhu, H., Xie, X., Xiong, H., & He, Q. A survey on knowledge graph-based recommender systems. IEEE Transactions on Knowledge and Data Engineering, 2020; 34(8): 3549-3568. [CrossRef]
- Burke, R. Hybrid Recommender Systems: Survey and Experiments. User Model User-Adap Inter, 2002; 12: 331–370. [CrossRef]
- Ramzan, B., Bajwa, I. S., Jamil, N., Amin, R. U., Ramzan, S., Mirza, F., & Sarwar, N. An intelligent data analysis for recommendation systems using machine learning. Scientific Programming, 2019; 2019(1): 5941096. [CrossRef]
- Usman, A., Roko, A., Muhammad, A. B., & Almu, A. Enhancing personalized book recommender system. International Journal of Advanced Networking and Applications, 2022; 14(3): 5486-5492. [CrossRef]
- Thakkar, A., & Chaudhari, K. Predicting stock trend using an integrated term frequency–inverse document frequency-based feature weight matrix with neural networks. Applied Soft Computing, 2020; 96: 106684. [CrossRef]
- Zalte, J., & Shah, H. Contextual classification of clinical records with bidirectional long short-term memory (Bi-LSTM) and bidirectional encoder representations from transformers (BERT) model. Computational Intelligence, 2024; 40(4): e12692. [CrossRef]
- Selva Birunda, S., & Kanniga Devi, R. A review on word embedding techniques for text classification. Innovative Data Communication Technologies and Application: Proceedings of ICIDCA 2020,(03 February 2021), (pp. 267-281), 2021. [CrossRef]
- Parmar, A., Katariya, R., & Patel, V. A review on random forest: An ensemble classifier. In International conference on intelligent data communication technologies and internet of things (ICICI) 2018, (21 December 2018). (pp. 758-763), 2018. [CrossRef]
- Monsalve-Pulido, J., Aguilar, J., Montoya, E., & Salazar, C. Autonomous recommender system architecture for virtual learning environments. Applied Computing and Informatics, 2024; 20(1/2): 69-88. [CrossRef]
- Chen, W., Shen, Z., Pan, Y., Tan, K., & Wang, C. Applying machine learning algorithm to optimize personalized education recommendation system. Journal of Theory and Practice of Engineering Science, 2024; 4(01): 101-108. [CrossRef]
- Tian, Y., Zheng, B., Wang, Y., Zhang, Y., & Wu, Q. College library personalized recommendation system based on hybrid recommendation algorithm. Procedia CIRP, 2019; 83: 490-494. [CrossRef]
- Dai, Y., Takami, K., Flanagan, B., & Ogata, H. Beyond recommendation acceptance: Explanation’s learning effects in a math recommender system. Research and Practice in Technology Enhanced Learning, 2024; 19: 1-21. [CrossRef]
- Li, Q.; Kim, J. A Deep Learning-Based Course Recommender System for Sustainable Development in Education. Appl. Sci. 2021, 11, 8993. [CrossRef]
- Guruge, D.B.; Kadel, R.; Halder, S.J. The State of the Art in Methodologies of Course Recommender Systems—A Review of Recent Research. Data 2021, 6, 18. [CrossRef]
- Lee, E.L.; Kuo, T.T.; Lin, S.D. A Collaborative Filtering-Based Two Stage Model with Item Dependency for Course Recommendation. In Proceedings of the 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Tokyo, Japan, 19–21 October 2017; pp. 496–503. [CrossRef]
- Pawar, A., Patil, P., Hiwanj, R., Kshatriya, A., Chikmurge, D., & Barve, S. Language Model Embeddings to Improve Performance in Downstream Tasks. In 2024 IEEE 16th International Conference on Computational Intelligence and Communication Networks (CICN), (December 2024), (pp. 1097-1101), 2024.
- Javed, U., Shaukat, K., Hameed, I. A., Iqbal, F., Alam, T. M., & Luo, S. A review of content-based and context-based recommendation systems. International Journal of Emerging Technologies in Learning (iJET), 2021; 16(3): 274-306. [CrossRef]
- Ghatora, P. S., Hosseini, S. E., Pervez, S., Iqbal, M. J., & Shaukat, N. Sentiment Analysis of Product Reviews Using Machine Learning and Pre-Trained LLM. Big Data and Cognitive Computing, 2024; 8(12): 1-18. [CrossRef]
- Sultan, L. R., Abdulateef, S. K., & Shtayt, B. A. Prediction of student satisfaction on mobile learning by using fast learning network. Indonesian Journal of Electrical Engineering and Computer Science, 2022; 27(1): 488-495. [CrossRef]
- Kiran, R., Kumar, P., & Bhasker, B. DNNRec: A novel deep learning based hybrid recommender system. Expert Systems with Applications, 2020; 144. [CrossRef]
- Shuwandy, M.L.; Alasad, Q.; Hammood, M.M.; Yass, A.A.; Abdulateef, S.K.; Alsharida, R.A.; Qaddoori, S.L.; Thalij, S.H.; Frman, M.; Kutaibani, A.H.; Abd, N.S. A Robust Behavioral Biometrics Framework for Smartphone Authentication via Hybrid Machine Learning and TOPSIS. J. Cybersecur. Priv. 2025, 5, 20. [CrossRef]










| No. | Course-Card-Image | Course-URL | Course-Name | Course-Caption | Course-Instructor | Course-Price | Course-Reviews | Course-Hours | Course-Lectures | Course-Level | Course-Rating | Course-Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | https://img-b.udemycdn.com... | https://www.udemy.com... | IBM Watson for Artificial Intelligence... | Build smart, AI, and ML applications and … | Packt Publishing | $69.99 | 82 | 15 total hours | 77 | Beginner | 3.4 | Cognitive Computing |
| 2 | https://img-c.udemycdn.com... | https://www.udemy.com... | Cognitive Behavioral … | Become a Certified Behavioral … | Kain Ramsay | $79.99 | 35548 | 31.5 total hours | 121 | All Levels | 4.6 | Cognitive Computing |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 2237 | https://img-b.udemycdn.com... | https://www.udemy.com... | Mastering Employee … | Design & Implement Effective Employee … | GenMan Solutions | $19.99 | 106 | 2 total hours | 27 | All Levels | 4.2 | Speech Recognition |
| 2238 | https://img-b.udemycdn.com... | https://www.udemy.com.../ | Sentiment Analysis... | Sentiment Analysis | Taimoor khan | $49.99 | 92 | 8.5 total hours | 79 | All Levels | 4.2 | Speech Recognition |

| Study | Accuracy (%) | F1-score (%) | MSE |
|---|---|---|---|
| Proposed system | 91.25 | 90.77 | 0.10 |
| [8] | 81.9 | 79.4 | 0.18 |
| [25] | 88.2 | 86.7 | 0.13 |

| Participant | Ease of Use (Q1) | Relevance (Q2) | Satisfaction (Q3) |
|---|---|---|---|
| Student 1 | 5 | 5 | 5 |
| Student 2 | 4 | 5 | 4 |
| Student 3 | 5 | 4 | 4 |
| Student 4 | 4 | 4 | 5 |
| Student 5 | 5 | 5 | 5 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).