Effect of Non-Academic Parameters on Student’s Performance

With the rapid growth of technology and its expanding areas of application, it has become vital to incorporate it in education. One such application is Knowledge Discovery in Databases (KDD), which is closely related to data mining. KDD deals with extracting useful information and previously unknown, meaningful patterns from a database. This study is a detailed application of KDD and focuses on analyzing why a particular set of students performed better than others and what factors influenced the results. The study is conducted on a dataset of 480 students with 16 features. The authors implemented 4 major classification techniques, namely Logistic Regression, Decision Tree, Random Forest, and XGB Classifier. The key features that have a major impact on student performance are obtained from the top-performing ML algorithms and taken as a baseline for further analysis. Further data analysis highlights patterns in the data. The study concludes that many non-academic factors influence the overall performance of a student and should be taken into consideration by universities and other relevant bodies.


I. INTRODUCTION
Educational Data Mining (EDM) is a small but significant subfield of Data Mining and Knowledge Discovery in Databases. EDM concentrates on the development of techniques to exploit the data from educational databases. The main aim is to detect useful patterns in student databases. These databases contain information such as academic performance, gender, financial conditions, etc. Previously unknown patterns, relationships, mathematical algorithms, and statistical models are generated from the data for implementation in the educational system for the overall betterment of the student. As the dataset is specific to the educational area, it has intrinsic semantic information, relationships with other data, and multiple levels of meaningful hierarchy [2]. Researchers in this field focus on discovering useful knowledge either to help educational institutes manage their students better or to help students manage their education and enhance their performance [1]. The target outcomes of EDM can be roughly classified into 4 major categories, namely: improving student performance models, improving domain models, studying and strengthening the pedagogical support provided by the learning software or mentor, and scientific research into learning and teaching [2]. A lot of work has been done regarding EDM on web-based learning and distance education, and recent trends are more focused on intelligent platforms that evaluate and guide students on what to learn based on their interests [3]. The primary goal has always been to equip students with the knowledge and skills needed to transition into successful areas within a specific period, and EDM helps achieve that [5]. A similar approach of ensemble methods was implemented by the author in [4], who used Random Forest, Artificial Neural Networks, and Naïve Bayes. In [6] the author used a wrapper-based feature selection method called Boruta.

II. RELATED WORK
In past years, researchers have tried to apply the developed algorithms to EDM. Over time, these algorithms and techniques have evolved for better performance. In a paper published in 2003, the authors of [8] considered factors affecting students' dropout rate. These factors are conditions related to the students before admission, factors related to the students during the study periods in the university, and all factors including the target value to be predicted for factor analysis. The authors used a tree-based classification algorithm, J48 or C4.5, and Naïve Bayes to analyze the data. A study conducted by Ibtissem Daoudi et al. [15] in 2021 using Crisis Management Serious Games (CMSG) showed the potential of such games for teaching people both technical and soft skills related to managing crises in a safe environment while reducing training costs. In summary, various studies [9] [10] [12] have investigated solving educational problems using data mining techniques. However, very few studies shed light on students' behavior during the learning process and its impact on their academic success [16].

III. DATASET
The dataset in this paper has been taken from Kaggle [17] [18], which is a large repository of datasets available for training machine learning algorithms. The dataset features can be roughly divided into 3 categories: (i) Demographic features such as gender and nationality, (ii) Academic background features such as educational stage, grade level, section, etc., and (iii) Behavioral features such as raised hands, visited resources, answering survey by parents, and school satisfaction. In this paper, the main analysis focuses on the "class" feature of the dataset. Apart from that, various parameters are analyzed with respect to gender and the factors influencing the "class", i.e. the overall performance of the student.
The first step in pre-processing the dataset and preparing it for analysis is checking for null values (Fig.1). The dataset is then converted into a machine-readable format, because the algorithms cannot generate results from non-numerical data and will fail to converge. The data has been converted into a working format using binary encoding, ordinal encoding, and one-hot encoding. For these encodings to work, we first had to select the features that best fit each encoding. The features with non-numeric values were selected and broadly classified into 3 categories. The first category, binary, contains the features that take only 2 values, like "gender" (M/F) or "semester" (S/F). The second category, ordinal, contains the features in which the order of the values matters, like "StageID" or "GradeID". The third and final category, nominal, contains the features that take more than 2 values whose order does not matter, like "Nationality", "PlaceofBirth", etc. The target column was selected as "class" (Fig.2). The encoding functions used are shown in Fig.3 and some sample encoding results are shown in Fig.4. In Fig.4, the "gender" feature is encoded with binary encoding, while "StageID" and "GradeID" are encoded with ordinal encoding. After the dataset has been prepared, it is important to analyze the dependence of features on one another. There are various methods to analyze the interdependence of features, but one of the most widely used is plotting a heat map. Fig.5 is a heat map that shows the relationship between various features and makes the features of the dataset easier to visualize. It also gives a broad, basic understanding of the dataset and highlights the key features for further analysis.
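The pre-processing steps described above (null check, the three encodings, and the correlation matrix behind a heat map) can be sketched with pandas as follows. This is only an illustrative sketch and not the encoding functions of Fig.3: the four sample rows, the category labels, the choice of which value maps to 0 or 1, and the "L"/"M"/"H" labels for the target are all assumptions made for the example.

```python
import pandas as pd

# Tiny made-up sample. Column names follow the features named in the text;
# the rows and category labels are illustrative assumptions.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "StageID": ["lowerlevel", "MiddleSchool", "HighSchool", "MiddleSchool"],
    "Nationality": ["KW", "Jordan", "KW", "Egypt"],
    "raisedhands": [15, 70, 88, 30],
    "class": ["L", "M", "H", "M"],
})

# Step 1: count null values per column.
null_counts = df.isnull().sum()

# Step 2a: binary encoding for two-valued features (0/1 assignment is arbitrary).
df["gender"] = df["gender"].map({"M": 0, "F": 1})

# Step 2b: ordinal encoding where the order of the values matters.
stage_order = {"lowerlevel": 0, "MiddleSchool": 1, "HighSchool": 2}
df["StageID"] = df["StageID"].map(stage_order)
df["class"] = df["class"].map({"L": 0, "M": 1, "H": 2})

# Step 2c: one-hot encoding for nominal features (one 0/1 column per value).
df = pd.get_dummies(df, columns=["Nationality"], dtype=int)

# Step 3: pairwise correlation matrix, i.e. the values a heat map would display.
corr = df.corr()
print(corr.round(2))
```

The resulting `corr` frame could be passed to a plotting routine such as seaborn's `heatmap` to produce a figure of the kind shown in Fig.5.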

IV. MACHINE LEARNING ANALYSIS
There are various techniques available for data mining which are also used in knowledge discovery in databases (KDD), such as classification, clustering, association rule learning, A.I., etc. Classification is one of the most important and widely used data mining techniques. Researchers use and study classification because it is easy to use [1]. This paper focuses on 4 classification algorithms, namely Logistic Regression, Decision Tree, Random Forest, and XGB Classifier. To evaluate the performance of the algorithms, we used 5 metrics and compared the results of the classification algorithms based on them. The metrics used are (i) Precision (the fraction of relevant instances among the retrieved instances), (ii) Recall (the fraction of relevant instances that were retrieved), (iii) F-1 Score (the harmonic mean of precision and recall), (iv) Support (the number of actual occurrences of the class in the specified dataset), and (v) Accuracy (the percentage of correct predictions on the test data).
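To make the five evaluation metrics concrete, the following is a minimal sketch that computes them per class from a list of true and predicted labels. The function name `classification_metrics` and the toy label lists are assumptions for illustration, not the paper's data. The F-1 score is computed with the standard formula 2PR/(P+R).

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return ({label: precision/recall/f1/support}, overall accuracy)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # actual occurrences of each class
    report = {}
    for label in labels:
        # True positives: predicted as `label` and actually `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = support[label]
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return report, accuracy


# Toy labels for the three classes "0", "1", "2" (illustrative only).
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
report, accuracy = classification_metrics(y_true, y_pred)
print(report, accuracy)
```

In this toy example 5 of the 8 predictions are correct, so the accuracy is 0.625, while class "0" has a precision and recall of 0.5 each.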

A. Logistic Regression
Logistic regression is a Machine Learning algorithm used for classification problems. For the dataset used in this paper, the training set and the test set were taken from the same database and divided in a ratio of 70:30 respectively. This split is kept the same for all algorithms. With logistic regression as the base algorithm, the results obtained are shown in Fig.6.
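A 70:30 split followed by a logistic regression baseline could be sketched with scikit-learn as below. Since the actual dataset is not reproduced here, the sketch assumes a synthetic stand-in with the same shape (480 samples, 16 features, 3 classes). The random seed for the split, the stratification, and `max_iter=1000` are assumptions as well, so the numbers it prints are not the results of Fig.6.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset (assumption).
X, y = make_classification(
    n_samples=480, n_features=16, n_informative=6, n_classes=3, random_state=52
)

# 70:30 train/test split; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=52, stratify=y
)

# Baseline classifier; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)  # precision, recall, F1, support
print(report)
print(f"accuracy: {accuracy:.3f}")
```

Reusing the same `X_train`/`X_test` arrays for every classifier keeps the comparison between algorithms on an equal footing.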

B. Decision Tree
A decision tree is a supervised learning technique that can be used for both classification and regression problems but is mostly used for solving classification problems. The results of running the decision tree algorithm on our dataset are depicted in Fig.7.

C. Random Forest
Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and aggregates their predictions to improve predictive accuracy. The results of running the random forest classifier on our dataset are depicted in Fig.8. For the parameters, the random state was chosen to be 52 and the number of estimators was 150.
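The following sketch shows how a random forest with the stated parameters (150 estimators, random state 52) could be trained with scikit-learn, and how its prediction arises from averaging the individual trees. The data is again a synthetic stand-in of the same shape as the dataset, and the split settings are assumptions, so the output does not correspond to Fig.8.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset (assumption).
X, y = make_classification(
    n_samples=480, n_features=16, n_informative=6, n_classes=3, random_state=52
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=52, stratify=y
)

# Parameters as stated in the text: 150 trees, random state 52.
forest = RandomForestClassifier(n_estimators=150, random_state=52)
forest.fit(X_train, y_train)

# predict_proba averages the class probabilities of the individual trees;
# predict returns the class with the highest averaged probability.
proba = forest.predict_proba(X_test)
y_pred = forest.predict(X_test)
accuracy = forest.score(X_test, y_test)

# Impurity-based feature importances, ranked from most to least important.
ranked = sorted(
    enumerate(forest.feature_importances_), key=lambda kv: kv[1], reverse=True
)
top5 = ranked[:5]
print(f"accuracy: {accuracy:.3f}")
print("top 5 (feature index, importance):", top5)
```

The `feature_importances_` attribute used at the end is one common way to obtain a ranking of features from a fitted forest.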

D. XGB Classifier
XGB is a decision tree-based ensemble ML algorithm that works on a gradient boosting framework. For our dataset the parameters used were max-depth=4, learning-rate=0.10, n-estimators=50, and seed=52. The results are depicted in Fig.9. Table I shows the results (accuracy) of all the algorithms used. From the comparison of the accuracies of the different algorithms, it is clear that Random Forest and XGB Classifier performed best, and their accuracies are remarkably similar. These two algorithms are therefore selected for further analysis.

E. Results
After obtaining the accuracies, we plot the feature importance graphs for the two best-performing algorithms and compare them in order to identify the key features in the dataset. Fig.10 shows the feature importance for the Random Forest Classifier and Fig.11 shows the feature importance for the XGB Classifier. Table II compares the top 5 features for both algorithms. Most of the features are roughly the same, but for better visualization we take the features that are common to both and examine their impact on the overall grade, i.e. the "class". The features taken into consideration are: Visited resources, Raised hands, Discussion, and Announcement views. Over the years, researchers have used various techniques for data analysis and visualization. There are many methods for visualizing data, such as swarm plots, bar plots, graphs, heatmaps, etc. As the ML analysis showed which features had the most impact on the overall performance, i.e. the "class", we keep them as the baseline for our analysis. We also keep our study focused on class "2", because it is the highest class and consists of the top-performing students. Fig.12 shows a plot of performance by gender. It is evident from the graph that for classes "0" and "1" male students performed better than female students, but in the final class "2", i.e. the highest one, female students performed better than male students. This sets the direction for further analysis: we need to analyze the reason behind this result and why female students scored higher than male students.
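The selection of features common to the top 5 of both importance rankings can be sketched in a few lines. The helper names `top_k` and `common_top_features` are hypothetical, and the importance scores below are placeholder values chosen only to exercise the code. They are not the values of Fig.10, Fig.11, or Table II, and the `other_*` entries stand in for the remaining features.

```python
def top_k(importances, k=5):
    """Names of the k features with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def common_top_features(imp_a, imp_b, k=5):
    """Features in the top k of both rankings, in the order of the first."""
    top_b = set(top_k(imp_b, k))
    return [name for name in top_k(imp_a, k) if name in top_b]


# Placeholder scores (NOT the paper's results).
rf_importance = {
    "visited_resources": 0.24, "raised_hands": 0.20, "announcement_views": 0.15,
    "discussion": 0.12, "other_1": 0.10, "other_2": 0.08, "other_3": 0.05,
}
xgb_importance = {
    "raised_hands": 0.22, "visited_resources": 0.21, "discussion": 0.14,
    "other_2": 0.13, "announcement_views": 0.11, "other_1": 0.09, "other_3": 0.04,
}

common = common_top_features(rf_importance, xgb_importance)
print(common)
```

With these placeholder scores, the two top-5 lists differ in one entry each, and the four shared features are the ones kept for the subsequent analysis.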
Starting with one of the important features of our dataset, i.e. the number of "visited resources", let's look at the visualization results. Fig.13 is a swarm plot that gives a rough picture of the distribution of students by class and by gender over the visited resources feature. As we can see, for class "0" most students are concentrated towards the lower part of the graph, and for classes "1" and "2" students are concentrated towards the upper part of the graph. When examined closely, we can see that for class "2" the concentration of female students (depicted by orange dots) is high in the upper bound. This plot suggests that students who visit more resources perform better overall and achieve a higher class.
Fig. 13. Comparison on visited resources. Note the concentration of students in the upper bound of the distribution for classes 1 and 2.
Let's now look at another key feature that affects the class of our students, i.e. "raised hands". Fig.14 is a swarm plot that shows the distribution of students by class and by gender relative to raised hands. It is clear from the plot that the distribution across class "1" is roughly even, but with a higher concentration of male students than female students. In class "2", however, the distribution is highly uneven and concentrated in the upper part of the range, and the frequency of female students is higher than that of male students. We noticed a similar relationship in the previous result for visited resources.
Let us study the remaining two key factors, "Discussions" and "Announcements View", and comment on their results. Fig.15 shows a swarm plot of "Discussions" and Fig.16 shows a swarm plot of "Announcements View". Both swarm plots show a basic difference between the performance of the two genders. Both display similar features, i.e. a higher frequency of female students in the higher class, and class "2" looks highly similar across the two features.
An interesting observation on the results for "Discussions" and "Announcements View" is that both have roughly the same frequency of male and female students in class "2", unlike the previous results. In the case of "raised hands" and "visited resources", the frequency of female students in class "2" was noticeably higher than that of male students. Apart from that, the pattern of distribution is roughly the same, which makes it a unique observation. Let's look at these features from another visualization perspective. Fig.17 and Fig.18 display joint plots that show the distribution via contour maps as well as the frequency distribution across the range. Fig.17 refers to the "Announcements View" feature and Fig.18 refers to the "Discussions" feature. If we examine the "Announcements View" plot, we can see that in the horizontal graph, towards the end, i.e. from 1.5 to 2.5, the frequency distributions of male and female students overlap. The same holds for the "Discussions" feature: in the horizontal plot, towards the end, the distributions overlap for both genders. But when we look at the contour map of "Announcements View", we can see that for class "2" the number of female students is higher compared to class "1" of the same distribution. The contour map of "Discussions" shows similar results.
The authors conducted a detailed study with all the key features and compared their results with each other. For all the features, the authors found a common trait: if the student is a high performer, then the values of the key features tend to be high. Throughout the study, the authors saw that the number of female students was high in the upper bound of the key features. This resulted in female students performing better and achieving a higher class compared to their male counterparts.

VI. CONCLUSION
After a thorough analysis, we can conclude that to perform better and score well, students need to focus on the key aspects of their fields. The study also found that students' performance depends on other non-academic factors such as "student absence days", "parents answering survey", "Nationality", etc. To conclude, the main aim of this study was to focus on the various factors that affect students and their performance so that they can improve. The study also aimed to motivate more research in the field of data mining and to find interesting patterns that could help students and universities in a constructive way.
Using this database and this study, one can find more information and better patterns. A more interesting approach would be to include Convolutional Neural Networks for training and generating results. Similarly, other A.I.-oriented approaches could be used for analysis. Universities can make data mining an integral part of their evaluation scheme, which would help them make better-informed decisions in favor of the students.