MACHINE LEARNING AND DEEP LEARNING FOR SENTIMENT ANALYSIS OVER STUDENTS’ REVIEWS: AN OVERVIEW STUDY

With the world still in the grip of the COVID-19 pandemic, many schools have moved teaching from physical classrooms to online platforms. It is highly important for schools and online learning platforms to investigate student feedback in order to gain valuable insights into the online teaching process, so that both platforms and teachers can learn which aspects they can improve to achieve better teaching performance. However, processing the reviews expressed by students manually would be very laborious, and it is unrealistic to handle large-scale feedback from e-learning platforms by hand. To address this problem, recent research applies both machine learning algorithms and deep learning models to automatically process students' reviews and extract the opinions, sentiments, and attitudes expressed by the students. Such studies may play a crucial role in improving various interactive online learning platforms by incorporating automatic analysis of feedback. We therefore conduct an overview study of sentiment analysis in the educational field as presented in recent research, to help readers gain an overall understanding of this line of work. Based on the literature review, we identify three future directions that researchers can focus on in automatic feedback processing: high-level entity extraction, multi-lingual sentiment analysis, and handling of figurative language.


Introduction
As a result of the COVID-19 pandemic, many schools and universities have pivoted from traditional, in-person classes towards online courses. Technological development in recent years has improved the technology behind online courses, and as a result more students in developing countries and remote areas are able to take courses from top universities through their computers or mobile phones. This presents an opportunity for potentially reducing education inequality across the globe [1]. There are already multiple platforms providing free online courses in the form of Massive Open Online Courses (MOOCs), such as Coursera, Khan Academy, Udemy and edX, offering lectures on a vast variety of subjects [2] [3].
Many online learners do not primarily rely on online courses to earn a credential, but rather to enhance the effect of conventional learning techniques, to meet new classmates, or to review specific topics [4]. As such, it is not appropriate to use completion rates as an indicator to evaluate the effectiveness of MOOCs [5] for learners with needs other than completing their course and earning a certificate [6]. The authors in [7] make the point that the effectiveness of MOOCs could be mischaracterized if completion rates are overemphasized. The same applies to drop-out rates [8,9] and other issues relating to MOOCs [10].
It is therefore very important that learning institutions examine feedback from students regarding their experiences with online learning platforms, so that both professors and the platforms themselves can learn which aspects need to be altered and improved. For example, the paper [11] indicates that feedback from students helped the platform implement a co-creation process over the course of the project's life cycle. Additionally, professors can use student feedback to understand student behavior and refine course contents [12] [13].
Student feedback is usually structured to include not only closed-ended questions but also open questions allowing students to express their thoughts about various aspects of teaching [14]. It is important to examine students' sentiments about specific aspects of this feedback, as seeking out the opinions of others is a very common practice in decision making [15].
Often the amount of data from these student reviews is so large that processing the reviews manually would be impractical. There is also the potential for language barriers to complicate the task, for example the terms and abbreviations used by student-aged people on the web [16]. Analyzing this sort of textual feedback accurately requires state-of-the-art technology, such as traditional machine learning or deep neural networks. Additionally, this task comes at a time when the practice of opinion mining is increasingly controversial.
On the one hand, the development of deep learning technology has made enormous contributions to Natural Language Processing [17]. There are several deep neural network architectures, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), and BERT, which carry out the task of opinion mining efficiently. These deep learning models have been widely employed for opinion mining in various domains, including movies [18], social media platforms [19], e-commerce [20], eLearning [11], and tourism [21], to name just a few. However, such deep learning models usually need a large amount of training data to achieve competitive classification performance. Additionally, the input text usually has to pass through preprocessing steps, as raw sentences are not structured for direct input to neural models. The preprocessing steps usually include "tokenizing" and "normalizing". Tokenizing splits sentences into smaller elements called tokens using common delimiters, and then converts the sentence attributes into a set of numeric attributes representing word occurrence information [22]. Normalizing replaces words that have a similar meaning with a single word; for example, words like "studied", "studies" and "studying" could be replaced with the word "study". Stemming and lemmatization are two commonly used normalization techniques. On the other hand, there are multiple classical machine learning algorithms such as Bernoulli Naive Bayes, Support Vector Machines (SVM), LinearSVC, and Random Forest. Most of these algorithms can be found in the Python library scikit-learn. Before the data is fed into a neural network or machine learning algorithm, it is necessary to perform the preprocessing step in order to improve the performance of the algorithms.
Real-world data are often noisy, incomplete and inconsistent, so it is important to pre-process and condense the data before the dataset can be used for machine learning [23].
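The tokenizing and normalizing steps described above can be sketched in a few lines of pure Python. This is a minimal illustration only: the `simple_stem` suffix rules below are invented toy rules, not a real Porter stemmer or lemmatizer such as those in NLTK.

```python
import re

def tokenize(sentence):
    # Split on whitespace and punctuation delimiters, lowercasing everything.
    return [t for t in re.split(r"[^a-z0-9']+", sentence.lower()) if t]

def simple_stem(token):
    # Toy suffix-stripping rules mapping e.g. "studied"/"studies"/"studying"
    # to the single normalized form "study".
    for suffix, repl in (("ying", "y"), ("ies", "y"), ("ied", "y"),
                         ("ing", ""), ("ed", ""), ("s", "")):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + repl
    return token

def preprocess(sentence):
    return [simple_stem(t) for t in tokenize(sentence)]

print(preprocess("Studied studies studying"))  # → ['study', 'study', 'study']
```

A production pipeline would replace `simple_stem` with NLTK's stemmers or lemmatizers and feed the resulting tokens into a vectorizer.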
Many research studies have addressed automatic processing of student reviews using both conventional machine learning algorithms and deep learning models [26][27][28][29][30][31][32][33][34][35][36][37][38][39][40]. Studies like these could play a vital role in improving various interactive [24] eLearning platforms [25], LMSs, and MOOCs [26] by incorporating feedback from various sources into their automatic analysis. It is therefore useful to provide an overview of recent research on these machine learning and deep learning algorithms.
The rest of this article is structured as follows: Section 2 presents a review of the relevant literature. Section 3 discusses the challenges inherent to classification and the need to solve them in the future. Section 4 presents possible avenues for future research. Section 5 concludes the work.

Literature Review
In the past ten years, with the rapid development of GPU computing power and the growth of educational resources [27] and frameworks [28], there has been more and more sentiment analysis research based on deep learning and traditional machine learning algorithms.
Sindhu et al. [29] built a learning model with two LSTM layers to analyze the sentiment polarity of students' reviews. The two layers work as different classifiers: one layer is used for aspect extraction while the other classifies the sentiment of the reviews as positive, negative or neutral. The aspects output by the first layer are used as the input of the second layer. The model is trained on students' reviews of physical classroom courses at their university, and a public dataset of restaurant reviews is used as an additional test set. The results indicate that the two-LSTM-layer model achieves 91% accuracy in aspect extraction and 93% in sentiment classification on the students' reviews. For the public restaurant reviews, the model achieves 82% accuracy in aspect extraction and 85% in sentiment classification. The paper [14] compared conventional machine learning and deep learning algorithms for sentiment analysis on 21,940 students' reviews from an e-learning platform. In their experiments, the authors used four traditional machine learning algorithms (SVM, Naive Bayes, Boosting, and Decision Tree) and built one 1D-CNN deep learning model to extract aspects and analyze sentiment. Their results show that the 1D-CNN achieved better performance on sentiment analysis, reaching an F1 score of 88.2%, but that the traditional machine learning models are better at aspect extraction.
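The kind of classical-baseline comparison reported in [14] is easy to reproduce at toy scale with scikit-learn, the library mentioned earlier. This is a hedged sketch, not the paper's setup: the four reviews and their labels below are invented for illustration, and only two of the classical algorithms are shown.

```python
# Compare two classical classifiers on TF-IDF features of toy reviews.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

reviews = ["great lecturer, very clear", "boring lessons and bad slides",
           "clear explanations", "bad audio, hard to follow"]
labels = ["pos", "neg", "pos", "neg"]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)  # vectorize, then classify
    model.fit(reviews, labels)
    print(type(clf).__name__, model.predict(["very clear slides"]))
```

On a real dataset one would hold out a test split and compare F1 scores, as [14] does, rather than inspecting single predictions.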
Anna et al. [30] conducted a survey containing several free-text questions and received 204 answers. They classified the 204 pieces of student feedback into 162 positive reviews and 42 negative reviews. In their experiment, the traditional machine learning algorithms Naive Bayes and K-nearest neighbors are used to predict whether the students' reviews are positive or negative. Besides, cosine similarity is utilized to measure similarity, and leave-one-out cross-validation is used to validate the results. The authors compared their results with the Recursive Neural Tensor Network (RNTN) [31] method. The paper [30] indicates that RNTN [31] has better Precision but worse Recall and Accuracy.
Katragadda et al. [23] researched sentiment analysis using several supervised machine learning algorithms and one deep learning model in order to classify feedback as positive, negative or neutral. Their dataset includes thirty thousand feedback entries containing anonymized personal information, reviews, and students' emotions. The dataset is divided into two categories: a linear dataset with homogeneous properties, and a non-linear dataset. The article shows that the machine learning models get better results on the linear dataset. More specifically, the Naive Bayes model reaches 50% accuracy once precision and recall are taken into account, the SVM algorithm achieves 60.8% accuracy, and the deep learning model reaches 88.2% accuracy, far beyond the conventional machine learning algorithms.
In the research work [32], a big data mining framework was built to enable real-time monitoring of students' satisfaction on online learning platforms. The framework contains both Data Management and Data Analytics techniques, aiming to fill the gap between the limited focus of big data on educational fields and the rapid development of big data techniques, and it is designed to connect easily to various kinds of data sources. Class discussions in the forum and students' reviews are the two main data sources used in this research. The central part of the framework, called the Analysis Engine, is used to analyze sentiment, cluster, and classify the reviews. During the pre-processing steps, text features are weighted using the TF-IDF [33] equation

TF-IDF = TF × log(N / DF),

where TF is the number of occurrences of the keyword in the current file, N is the total number of files in the experiment, and DF is the number of files containing the keyword. The dataset contains 15,000 balanced textual reviews. The balanced dataset is fed into a Linear SVM model for training, and cross-validation is used to validate the results. The supervised SVM uses a "one-against-one" strategy with a "max wins" voting scheme. For clustering, the authors first apply the TF-IDF algorithm to transform the textual reviews into numbers, and then use K-means models to extract clusters from the survey forms and the feedback collected from the forum. Finally, a controlled experiment involving a few e-learning students and a small test dataset is conducted, demonstrating the functionality of the framework and its potential value. The authors also point out future directions of the research: on the one hand, the connection between the sentiment of the lessons and students' final marks could be established; on the other hand, other information, e.g. login information, admitted lessons, as well as content posted on social media, could be incorporated.

The research work in [6] investigated the factors influencing students' MOOC satisfaction and gives an extended general understanding of those factors. In their research, students' satisfaction is regarded as a crucial metric defining the success of a MOOC. They classify the independent variables into learner-level and course-level variables and use both to predict student satisfaction, evaluating how these independent variables affect the dependent variable, MOOC student satisfaction. The dataset was downloaded from Class Central, a public course website where people can download class metadata and feedback for each course; the authors collected feedback from 6,391 students. During the experiment, several traditional machine learning models, e.g. k-nearest neighbors regression [34], gradient boosting trees [35], support vector machines [36], logistic regression [37], and naive Bayes [38], are used to classify the sentiment polarity. The algorithm achieving the best performance among all models was then chosen to predict aspect labels for the remaining unlabeled reviews. Their results show that the gradient boosting tree performed best among all traditional machine learning models. To compute sentiment polarity scores of the input text, TextBlob, a free and open-source text processing library, is utilized; its output score for each review ranges from -1.0 to 1.0. In the conclusion, the authors identify three crucial factors with statistically strong associations with learner satisfaction in terms of learner sentiment: content, assessment, and instructor. However, there are no direct connections between course structure, video, or interaction and MOOC students' satisfaction [6].
There are two limitations of this sentiment analysis research. On the one hand, eighty percent of the feedback on Class Central is written by students who finished the whole course. On the other hand, because of intrinsic differences in the data, the lack of true randomness in the relationship between the dependent variable, MOOC satisfaction, and the independent variables was not addressed.
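The TF-IDF weighting used in the Analysis Engine of [32] can be computed in a few lines. This pure-Python sketch assumes the standard logarithmic form TF × log(N/DF), with N the total number of documents and DF the number of documents containing the term; the toy documents are invented for illustration.

```python
import math
from collections import Counter

docs = [["great", "teacher", "great", "course"],
        ["boring", "course"],
        ["great", "slides"]]

N = len(docs)          # total number of documents
df = Counter()         # document frequency: how many docs contain each term
for d in docs:
    df.update(set(d))

def tfidf(term, doc):
    tf = doc.count(term)               # term frequency in this document
    return tf * math.log10(N / df[term])

# "great" occurs twice in docs[0] and appears in 2 of the 3 documents.
print(round(tfidf("great", docs[0]), 3))  # → 0.352
```

Terms that appear in every document get a weight of zero, which is exactly why TF-IDF down-weights uninformative filler words before clustering or classification.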
Lwin et al. [22] conducted research on rating scores as well as open-text reviews. The dataset was obtained from an online survey of the university's students, in which all questions are rating-scale questions except the last, a free-text question used to collect feedback on classes and teachers. In the textual comment analysis, the feedback was classified into two categories, negative and positive, while rating scores were classified into five types: Worse, Bad, Neutral, Good and Excellent. To label the dataset, K-means clustering is used to pre-label the huge amount of feedback data. The labeled dataset is then fed into multiple conventional machine learning models; six algorithms, Logistic Regression, Multilayer Perceptron, Simple Logistic Regression, Support Vector Machine, LMT and Random Forest, are compared in terms of performance. The procedure is different for the sentiment analysis of textual comments: the authors first label each sentence as positive or negative manually, and then conduct pre-processing steps on the reviews using the open-source library NLTK. The authors conclude that SVM gets the best results on rating score classification and the Naive Bayes algorithm yields the best performance for textual comment analysis.
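The K-means pre-labeling step described in [22] can be sketched with scikit-learn: cluster TF-IDF vectors of unlabeled comments so that each cluster can be inspected and named once, instead of labeling every comment by hand. The four toy comments below are invented for illustration, and the cluster count of 2 is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = ["great teacher and clear lectures",
            "clear and helpful teacher",
            "slides were confusing and boring",
            "boring course, confusing slides"]

# Vectorize the comments, then cluster them into 2 groups.
X = TfidfVectorizer().fit_transform(comments)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

A human would then read a handful of comments per cluster, tag one cluster "positive" and the other "negative", and use those tags as the pre-labels for supervised training.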
The research work [16] conducted experiments using 8 conventional machine learning models, 5 deep learning models and one evolutionary model; the fourteen algorithms used are presented in Table 1. Two datasets, crawled from the HTML of YouTube and other online learning platforms, are used: eduSERE and SentiTEXT. eduSERE represents learning-centered emotions such as engaged, excited, disappointed, and bored, while SentiTEXT has only two polarities: positive and negative. The authors build a sentiment dictionary connecting words in the text with the emotion of the reviews, and propose an algorithm based on word counts to classify the sentiment and the learning-centered emotion, in order to pre-label the feedback they collected. Reviews that were hard to classify were removed by the authors when checking the pre-labeling results. They choose accuracy as the metric to evaluate the performance of the models and algorithms. Their research shows that BERT and EvoMSA achieve the best performance, with 93% accuracy on SentiTEXT classification and decent accuracies of 84% and 83% on eduSERE classification. They then integrate the models into an intelligent learning environment developed in Java. In the final part of the article, the authors conclude that the evolutionary EvoMSA model has the best performance after incorporating additional knowledge and being optimized for macro-F1 in order to address the unbalanced dataset problem.
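A word-count, dictionary-based pre-labeling rule like the one used in [16] amounts to counting lexicon hits per polarity. This minimal pure-Python sketch uses an invented toy lexicon; the real sentiment dictionary in [16] is far larger and also covers learning-centered emotions.

```python
# Tiny illustrative lexicons -- not the dictionary from [16].
POSITIVE = {"great", "excited", "engaged", "clear"}
NEGATIVE = {"bored", "boring", "disappointed", "confusing"}

def prelabel(review):
    tokens = review.lower().split()
    # Score = positive hits minus negative hits.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unclear"  # ambiguous reviews like these were removed in [16]

print(prelabel("great lectures and engaged students"))  # → positive
```

The "unclear" branch mirrors the authors' manual step of discarding reviews the rule could not confidently pre-label.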
The researchers in [39] propose a fusion deep learning model to analyze learners' reviews. The dataset is the Vietnamese Students' Feedback Corpus (UIT-VSFC) [40], which contains 16,000 reviews from Vietnamese students that are machine-translated into English for later use. The fusion model consists of a multi-head attention layer and an LSTM layer, and its structure is shown in Fig. 1. First, the feedback is fed into two different pre-trained embeddings: GloVe and CoVe. Through the embeddings, the words in each sentence are converted into vector representations. Multiple attention blocks then compute a weighted sum over several attention heads rather than relying on a single attention; the attention mechanism assigns weights to context words to find the words that determine the sentiment of the input sentence. The researchers apply different dropout rates to avoid over-fitting, and combine the outputs of the two embedding models (1024 features) as the input to the LSTM. Finally, through a dropout and a dense layer, one of three sentiments is predicted: positive, negative, or neutral. The results show that the proposed multi-head attention fusion model outperforms the single LSTM model, the LSTM + attention model, and the multi-head attention model.
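The shape of this embedding → multi-head attention → LSTM → dense pipeline can be sketched in PyTorch. This is a structural illustration only, not the architecture of [39]: the vocabulary size, embedding dimension, head count, and dropout rate below are invented, and a single trainable embedding stands in for the paper's fused GloVe and CoVe embeddings.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Shape-level sketch: attention over token embeddings, then an LSTM."""
    def __init__(self, vocab=1000, dim=64, heads=4, classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.drop = nn.Dropout(0.3)
        self.out = nn.Linear(dim, classes)

    def forward(self, ids):              # ids: (batch, seq_len) token indices
        x = self.emb(ids)
        x, _ = self.attn(x, x, x)        # self-attention over the tokens
        _, (h, _) = self.lstm(x)         # final LSTM hidden state
        return self.out(self.drop(h[-1]))  # (batch, classes) sentiment logits

logits = FusionSketch()(torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # → torch.Size([2, 3])
```

The three output logits correspond to the positive, negative, and neutral classes; training with cross-entropy loss over labeled reviews would complete the sketch.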

Research Challenges
In the paper [22], the authors point out that many students will not give directly negative reviews of the lessons or the teacher even when there are difficulties in the studying process. Learners tend to give plenty of positive feedback first and point out shortcomings only at the end of the review, so a single review may cover two aspects. Consider, for example, the review: "She is extremely good at the techniques of computer science but becomes impatient easily." This feedback evaluates both the teacher's knowledge and her behavior. The authors in [29] indicate that one reason for misclassifications could be the occurrence of more than one aspect within one piece of feedback when connectors such as "but" and "and" are absent; such a review is often classified under only one aspect, which is a major setback when calculating the accuracy of the system. There are public, free libraries such as OpenNLP that can break a review sentence on connectors, but when there are no connectives such as "and" or "but", these tools no longer work well. One research study [41] tries to use BERT to solve this multiple-aspect problem. The model, called target-dependent BERT (TD-BERT) [41], uses the output at the positions of the target terms instead of only the first token, since one review may have multiple aspects, each with its own context. A max-pooling layer follows before the output is passed to the next fully-connected layer. The TD-BERT model is able to extract several aspects from one review at the same time and then analyze the sentiment of the feedback by combining the predicted aspects. Their study shows that there is little value in simply combining TD-BERT with complicated neural networks, which sometimes even leads to worse results than vanilla BERT-FC (fully connected); however, accuracy is improved when the aspect information is adopted.
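The connector-based splitting that tools like OpenNLP perform can be sketched with a regular expression: break the review on "but"/"and" so each clause can be classified for its own aspect. This toy sketch also exhibits the limitation noted above, since a review with no connector comes back as a single clause.

```python
import re

def split_on_connectors(review):
    # Split on the connectors "but" and "and" (whole words, any case),
    # then strip stray spaces and punctuation from each clause.
    clauses = re.split(r"\b(?:but|and)\b", review, flags=re.IGNORECASE)
    return [c.strip(" ,.") for c in clauses if c.strip(" ,.")]

review = ("She is extremely good at the techniques of computer science "
          "but becomes impatient easily.")
print(split_on_connectors(review))
```

For the example review this yields one clause per aspect (knowledge, behavior); a connector-free multi-aspect review would defeat it, which is precisely the gap TD-BERT [41] targets.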
The studies [17,42] indicate that it is uncommon for models to pay much attention to semantics. BERT and other deep learning algorithms are not uniformly good at natural language processing tasks, much less at natural language understanding. Such issues can be addressed by employing ontologies, better vector space representation models [43,44], and objective and semantic metrics [45,46]. Furthermore, models able to handle multi-lingual reviews are largely absent, and most current sentiment analysis models can only process reviews written in English. For example, the paper [22] points out that the Myanmar language also appears in the students' feedback; machine translation is used to translate all reviews into English, but much of the sentiment is lost in the translation process. The authors in [39] likewise simply translated all the Vietnamese reviews into English using machine translation. These examples demonstrate the lack of effective models for handling multi-language feedback, which remains a research challenge to be solved.

Future Directions
• Better automatic high-level entity extraction models [47], covering entities such as teacher, lesson, and content, as well as finer-grained curriculum aspects including lesson structure and teacher experience. As mentioned above, a comment may mention more entities than are already defined in the model. The authors of [47] point out that one of the main challenges of natural texts is to extract the entities mentioned in them, which can be named entities referring to individuals or abstract concepts. In most sentiment analysis studies, researchers conduct a survey in advance to gather all possible aspects and select the most common aspects occurring in the reviews. How to automatically extract important aspects is a focus for future research.
• Extend the most advanced models to multilingual sentiment analysis. Many student reviews are not in English and may involve multiple languages. For convenience, many studies simply use machine translation to translate all comments into English, and sentiment may be lost in the translation process. Therefore, it is necessary to implement multi-language processing models.
• Ability to handle figurative language such as sarcasm and irony. According to [48], irony and sarcasm are sophisticated forms of speech in which authors use language to signify the opposite meaning. Even for humans, recognizing irony can be quite difficult and complex, which leads to many misunderstandings in daily life. Detecting sarcasm and irony is challenging work for natural language processing, especially for sentiment analysis, as sarcastic sentences can mislead data mining activities into wrong classifications [49]. The authors in [50] built a model using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In the future, slang dictionaries and emojis can be integrated into models to better detect sarcasm or irony and obtain more accurate sentiment analysis results.
• Addressing the data imbalance problem by adopting GANs. Using a GAN to generate synthetic students' reviews is a future direction for improving the accuracy of sentiment analysis. Deep learning methods usually require huge datasets, while datasets acquired from the real world are generally unbalanced. For example, there may be more reviews about the teacher or the subject, which may skew the training result, or there are often plenty of positive reviews but not enough negative and neutral ones.
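A GAN-based review generator is beyond a short sketch, but the imbalance problem itself, and the simplest baseline remedy, can be illustrated in pure Python: random oversampling duplicates minority-class samples until every class matches the majority class size. The toy data below are invented; a GAN would instead generate genuinely new synthetic reviews rather than duplicates.

```python
import random
from collections import Counter

random.seed(0)

def oversample(samples, labels):
    # Group samples by class, then duplicate minority-class samples at
    # random until every class reaches the majority class size.
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(v) for v in by_class.values())
    out = []
    for y, group in by_class.items():
        extra = [random.choice(group) for _ in range(target - len(group))]
        out += [(s, y) for s in group + extra]
    return out

data = oversample(["good"] * 8 + ["bad"] * 2, ["pos"] * 8 + ["neg"] * 2)
print(Counter(y for _, y in data))  # → Counter({'pos': 8, 'neg': 8})
```

Oversampling balances class counts but adds no new information, which is exactly the gap a GAN-generated minority class would aim to fill.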

Conclusion
In this paper, we conducted a literature review of sentiment analysis research using machine learning and deep learning methods. The scope we investigated is restricted to the educational field and is based on students' reviews, containing feedback both from physical classrooms and from online learning platforms such as MOOCs.
At the beginning of the paper, we introduced the reasons why it is necessary to analyze student feedback in both conventional and online learning settings. It is an effective way for both teachers and platforms to learn where they need to improve in order to provide a better teaching experience, especially now that the whole world is still under the pandemic and many schools have moved from physical to online teaching. We then introduced the common techniques for automatically processing reviews, covering traditional machine learning and deep learning. Among the machine learning methods, Bernoulli Naive Bayes, Support Vector Machines (SVM), LinearSVC, Random Forest, Decision Tree, and K-means clustering are the commonly used algorithms for classifying the sentiment of students' reviews; most of them can be implemented easily using the Python library scikit-learn. Among the deep learning approaches, RNN, LSTM, CNN, BERT, word embeddings, and TF-IDF are the common and popular techniques or models.
Finally, the further research directions arising from this article concern better models for automatic high-level entity extraction, because one review often contains more than one entity. In addition, the ability to handle figurative language such as sarcasm and irony can significantly improve sentiment analysis accuracy, since figurative language signifies the opposite meaning and even humans sometimes cannot recognize irony in sentences correctly. Lastly, regarding the challenge of multiple languages, further work could be devoted to implementing models that can handle reviews in multiple languages, since sentiment may be lost during machine translation.