2.4. Data Preprocessing:
Data preprocessing is the process of preparing the raw data for training and testing the ML model. As a first step in protecting student confidentiality, we recommend that each student record be assigned a randomized ID and that the list of names with their assigned random IDs be stored securely per the IRB standards at your institution. This can be done in Python using a for loop with a random number generator function from the built-in random module. A for loop is a block of code that repeatedly applies, or iterates, a function; in this case, we repeat the function 400 times, which equals the number of student records in our dataset. All Python code for this step is available at https://github.com/RUSVMCenter4/Veterinary_Education_ML_Tutorial and shows how to generate random student IDs and then add the IDs as a column to the dataset. This step is not included here because we simulated the datasets. Additionally, this manuscript does not address all of the steps and considerations required to make a dataset fully anonymous, so we recommend consulting an expert if the outcomes of the ML project will be made public.
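The repository above contains the full version of this step; the following is only a minimal sketch of the approach described, in which the file path and the column name "Student_Name" are hypothetical placeholders.
#Minimal sketch of the random-ID step described above; the repository code is the authoritative version
import random
import pandas as pd

dataset = pd.read_excel(r'C:\location_of_data/student_records.xlsx')  #hypothetical file of student records

random.seed(23)  #fix the seed so the same IDs could be regenerated if needed
random_ids = []
used_ids = set()
for _ in range(len(dataset)):            #one iteration per student record (400 in our dataset)
    new_id = random.randint(100000, 999999)
    while new_id in used_ids:            #re-draw if this ID has already been assigned
        new_id = random.randint(100000, 999999)
    used_ids.add(new_id)
    random_ids.append(new_id)

dataset['Random_ID'] = random_ids                    #add the IDs as a new column
id_key = dataset[['Student_Name', 'Random_ID']]      #name-to-ID key; store securely per IRB standards
dataset = dataset.drop(columns=['Student_Name'])     #remove identifying names from the working dataset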
We use the random forest algorithm provided in the Python package scikit-learn [31]. To prepare the datasets for this ML algorithm, categorical data such as gender and ethnicity/race must be converted to numerical values. This is accomplished through one-hot encoding or dummy encoding. One-hot encoding converts a categorical variable into numerical data without assigning a rank to its categories: it adds a new binary variable for each unique categorical value and removes the original variable (Figure 4). Dummy encoding also uses binary variables but creates a number of columns equal to the number of categories minus one (Figure 4).
#To one-hot encode the race column, use get_dummies() from the pandas package
import pandas as pd
#We assign the transformed data to a new variable called dataset_OneHot
#The argument drop_first must be False in order to perform one-hot encoding
dataset_OneHot = pd.get_dummies(dataset, columns=["Race"], drop_first=False)
dataset_OneHot.head() #To view the first several rows and column names
It is important to note that Python functions have different arguments, or parameters, and when an argument is assigned a value, that value is used when the function runs. To perform dummy encoding (Figure 3B), we set the argument drop_first to True.
#To dummy encode the gender column
dataset_OneHot = pd.get_dummies(dataset_OneHot, columns=["Gender"], drop_first=True)
print(dataset_OneHot.head()) #To view the first several rows and column names
For continuous variables such as age, pre-admission GPA, and GRE score, it may be necessary to scale the variables, commonly known as feature scaling, or to standardize them. The goal of feature scaling is to put all of the variables on the same, or nearly the same, scale. Feature scaling does not change the distribution of the data; it will not transform non-parametric data into normally distributed data. For example, within our dataset age ranges from 20 to 40, whereas GPA ranges from 3.00 to 4.00 and GRE ranges from 260 to 330. If we were using an unsupervised ML model that clusters records based upon relationships and groupings [7], leaving these values unscaled could affect the results, because many such models rely on the Euclidean distance, or the distance between two data points.
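Although we do not scale features for the random forest used in this tutorial, the following minimal sketch shows how scaling could be done with scikit-learn's StandardScaler for readers using distance-based methods; the column names 'Age', 'GPA', and 'GRE' are assumed for illustration.
#Minimal sketch of feature scaling (not applied in this tutorial); column names are assumed
from sklearn.preprocessing import StandardScaler

scale_cols = ['Age', 'GPA', 'GRE']      #assumed names of the continuous columns
scaler = StandardScaler()               #rescales each column to mean 0 and standard deviation 1
dataset_OneHot[scale_cols] = scaler.fit_transform(dataset_OneHot[scale_cols])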
We kept age as a continuous variable because random forests are not distance-based classifiers, are robust to outliers, and do not require parametric data [22]. We also kept pre-admission GPA and GRE scores continuous because random forests handle high non-linearity between independent variables [32]. Non-linear variables typically do not affect the performance of decision tree models because the splitting at each decision node is based upon an absolute "yes" or "no" condition (Figure 4), rather than on the numerical value of the feature itself [19,20,22,32,33].
GRE is a continuous variable, and two of the simulated datasets contain missing GRE scores. It is important to determine whether data are missing not at random (MNAR) for a specific reason (e.g., a student does poorly on the GRE and therefore does not report the result), missing at random (MAR), or missing completely at random (MCAR). MAR data are missing randomly, but the missingness can be explained by another observed variable (e.g., in a dataset containing medical absences and course exam grades, a missing exam grade could be explained by a medical absence). MCAR data are missing values that are randomly distributed across the variable and are not related to the other variables (e.g., a few student records lack GPA scores because of human error when entering the scores into the student record database). GRE was chosen as the example variable for missing data because, for many veterinary professional programs, the GRE is now only recommended or no longer required, amid concerns that it may hinder diversity and inclusion efforts and may be a burden for low-income students [34,35]. How the veterinary educator decides to handle these data will impact the results of the ML model, as will be shown when reviewing the results of the ML models created using the three datasets.
Unfortunately, there are currently no established methods for handling MNAR data in medical education, and therefore no guidelines on imputation methods, i.e., replacing missing values in the dataset with an alternative or predicted value. Our code below demonstrates loading one of the two biased GRE datasets and applying listwise deletion (deleting each student record containing NAs) and substitution (replacing missing values with the column mean or median) [36,37], two common methods reported in education that are often the default methods in R and Python packages.
#Import the first dataset with missing GRE values using the function pd.read_excel()
biased_dataset = pd.read_excel(r'C:\location_of_data/name_of_excel_datafile.xlsx', sheet_name='name')
#Encode the categorical columns as shown above before modelling; we call the encoded dataframe biased_dataset_OneHot
#Option 1: listwise deletion, dropping each student record that does not have a GRE score reported
#The "empty" GRE value will be read as an "na" in Python, therefore we use dropna()
#The argument axis=0 means the row with the "na" will be dropped
#The argument how='any' means that any "na" in a row will result in the row being deleted
#The argument inplace=True means that a new dataframe will not be created
biased_dataset_OneHot.dropna(axis=0, how='any', inplace=True)
#Option 2 (alternative to deletion): substitution, replacing each missing GRE score with the mean GRE value
biased_dataset_OneHot['GRE'] = biased_dataset_OneHot['GRE'].fillna(biased_dataset_OneHot['GRE'].mean())
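In practice, a quick count of the missing values, run immediately after loading the dataset and before any deletion or substitution, can confirm which columns are affected; the short check below is a sketch and is not part of the original listing.
#Quantify missing values per column before choosing a handling strategy
print(biased_dataset_OneHot.isna().sum())           #number of missing values in each column
print(biased_dataset_OneHot['GRE'].isna().mean())   #proportion of records missing a GRE score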
The full code to create the base random forest models using the two missing-GRE datasets is available on GitHub. These datasets were created to illustrate MNAR and MCAR data and how commonly accepted methods for handling these data types in the literature can affect the model; therefore, only base random forest models were created and no hyperparameter search was conducted.
2.6. Model Creation and Performance Evaluation:
ML algorithms learn from the input dataset, which is typically divided into training and testing datasets. The training dataset is used to train the ML model, and the model is then evaluated with the testing dataset. Splitting the dataset is commonly done to help reduce the risk of over-fitting. If over-fitting occurs, the model performs well only on the data used to train it, and its performance drops when it is applied to new data [41]. We recommend training multiple models with different parameters and comparing them to find the best candidate model for the identified educational research question. This is commonly done through what is termed an ML pipeline. The parameters within the pipeline can be adjusted in any of the steps: (1) data pre-processing, (2) feature extraction, (3) model training, and (4) model evaluation. Model evaluation is completed by comparing a variety of metrics, which may include accuracy, F-scores, receiver operating characteristic (ROC) curves, and others that we expand upon in the experimental section.
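As an illustration only (not part of the original code listing), a scikit-learn Pipeline can chain such steps so that their parameters can be adjusted together; the steps and values below are assumptions made for the sketch.
#Minimal sketch of an ML pipeline in scikit-learn; the steps and values shown are illustrative only
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),                          #(1) data pre-processing
    ('model', RandomForestClassifier(random_state=23)),   #(3) model training
])
#Parameters of any step can be adjusted by name, for example:
pipeline.set_params(model__n_estimators=200)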
2.6.1. Generation of Base Random Forest Model:
The first step of creating an ML model, including a random forest, is to separate the dataset into one dataframe (a table with rows and columns) containing the variables and a second dataframe containing the target column. The target column is the outcome we want to predict. In our example we are creating a classification model, so the outcome is binary: either a "Pass" or a "Fail." In the "Fail" target column, a 0 indicates a student did not fail a course and a 1 indicates a student failed a course.
#X is our variable dataframe and y is our target dataframe
#create dataframe without target, [rows, columns], the : indicates to select all rows
X = dataset_OneHot.loc[: , dataset_OneHot.columns != 'Fail']
y = dataset_OneHot['Fail'] #target variable for prediction
We are most interested in determining which variables lead to the outcome, or target, column. The simulated dataset is composed of a majority and a minority class (the target variable does not have an equal number of 0s and 1s), which means the dataset has an imbalance bias, one of the three main types of recognized biases in ML [42]. The two main approaches to dealing with an imbalanced target variable are oversampling techniques, which create additional minority-class records (student records with a course failure), and undersampling techniques, which randomly delete majority-class records (student records without a course failure). We chose to balance our dataset using an oversampling method called the synthetic minority oversampling technique (SMOTE), a common approach in models predicting student success in higher education [43,44,45]. After performing SMOTE, the number of students in our dataset who failed a course equals the number of students who did not fail a course.
#Import required python function of SMOTE from Python package imblearn
from imblearn.over_sampling import SMOTE
#Oversampling to allow 0 and 1 target to be equal
#Assigning a value to the random state argument ensures that anyone can generate the same set of random numbers again
X_resampled, y_resampled = SMOTE(random_state=23).fit_resample(X, y)
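To confirm the oversampling worked as described, the class counts before and after SMOTE can be compared; this short check is not part of the original listing.
#Compare class counts before and after oversampling; the two classes should now be equal
print(y.value_counts())            #original, imbalanced target
print(y_resampled.value_counts())  #balanced target after SMOTE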
With a balanced target variable, we are now ready to split our dataset into training and testing data. As our code shows below, we used train_test_split() from the scikit-learn Python package (version 1.1.1) to split the dataset. We also provide code to check the data structure, as the dataframe containing the variables must be the same length as its counterpart containing the target; otherwise, the random forest model cannot be constructed.
#Import required functions:
from sklearn.model_selection import train_test_split
#Use the balanced data to create testing and training datasets with 70% of the data being training and 30% of the data being testing.
X_trainSMOTE, X_testSMOTE, y_trainSMOTE, y_testSMOTE = train_test_split(X_resampled, y_resampled, stratify=y_resampled, test_size=0.3, random_state=50)
#Check the sizes of the arrays to make sure they match each other
print('Training Variables Shape:', X_trainSMOTE.shape)
print('Training Target Shape:', y_trainSMOTE.shape)
print('Testing Variables Shape:', X_testSMOTE.shape)
print('Testing Target Shape:', y_testSMOTE.shape)
Once we have our training and testing dataset, we are ready to construct the base random forest model with the arguments, or parameters, being left at their default values. The model must first be built and then trained using the training data.
#Import required functions:
from sklearn.ensemble import RandomForestClassifier
#Build base model without any changes to default settings
forest_base = RandomForestClassifier(random_state=23)
#Train the model via fit()
forest_base.fit(X_trainSMOTE, y_trainSMOTE) #using training data
2.6.2. Evaluation of The Base Random Forest Model:
Once the base ML model is created, we evaluate how well it performs when shown new data, the test data. ML model performance can be assessed using different techniques, which are discussed individually below. Each of the performance metrics ranges from zero to one, with one indicating perfect performance and zero indicating that all predictions were wrong. We recommend that cross-validation (CV) be included in each of the performance metric calculations.
CV is used to help detect overfitting while evaluating the performance of the ML model on new data. We used k-fold cross-validation, one of the most common cross-validation techniques [46]. In k-fold cross-validation, the dataset is divided into folds (consider these subset datasets) based upon the assigned k-value. One fold is held out as a validation dataset and the other folds are used to train the model. As Figure 5 shows, this process is repeated multiple times, with each repeat holding out a different fold for validation of the model. The results from the validation sets are averaged to produce the final performance value. Typical k-values are 3, 5, and 10, but there are no established rules guiding the selection of these values.
To prepare our model for the performance metrics, we must define the k-value for the CV and input our testing dataset into the trained random forest model. The trained random forest ML model will predict the target category or outcome of whether a student failed a course or not based upon what it learned from the training dataset.
#Make predictions using testing data set
y_predictions = forest_base.predict(X_testSMOTE)
y_trueSMOTE = y_testSMOTE #Rename the test target dataframe
#Import required function
from sklearn.model_selection import KFold
#Defining the cross-validation to be able to compute the performance metrics using the k-fold CV
kf = KFold(shuffle=True, n_splits=5)
Overall accuracy, recall (also known as sensitivity or the true positive rate), specificity (the true negative rate), precision, the F-score (or F1-score), and receiver operating characteristic (ROC) curves are performance metrics computed from the true negative (TN), false negative (FN), false positive (FP), and true positive (TP) values. A confusion matrix summarizes the results of the random forest classification algorithm and defines these values as shown in Table 2.
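For reference, the confusion matrix summarized in Table 2 can be generated directly from the base model's test-set predictions; this short snippet is a sketch and is not part of the original code listing.
#Build the confusion matrix (Table 2) from the base model's predictions on the test data
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_trueSMOTE, y_predictions)  #rows = actual class, columns = predicted class
print(conf_matrix)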
The overall accuracy is the proportion of student predictions that were correct out of all predictions made and can be calculated using equation 1:
Accuracy = (TP + TN) / (TP + TN + FP + FN)   (Equation 1)
#Import required function
from sklearn.model_selection import cross_val_score
#To calculate the accuracy of the model using k-fold cross validation
score_accuracy_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv=kf, scoring='accuracy').mean()
print(score_accuracy_mean) #View the mean of the CV validation results for accuracy of the model.
The true positive rate (TPR), also known as sensitivity or recall, is the proportion of students who failed a course who were correctly classified as having failed a course and was calculated using equation 2:
TPR = TP / (TP + FN)   (Equation 2)
#To calculate the recall of the model using k-fold cross validation
recall = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv=kf, scoring='recall').mean()
print(recall) #View the mean of the CV validation results for recall of the model
The true negative rate (TNR), also known as specificity, is the proportion of students who did not fail a course who were correctly classified as not failing and was calculated using equation 3:
TNR = TN / (TN + FP)   (Equation 3)
There is no specificity or TNR option in cross_val_score(), so we must define specificity using the make_scorer() function.
#Import required functions
from sklearn.metrics import make_scorer, recall_score
#Define specificity as recall computed on the negative (0) class
scoring = make_scorer(recall_score, pos_label=0)
#Use our defined specificity as the type of score that is calculated
score_specificity_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv=kf, scoring=scoring).mean()
print(score_specificity_mean) #View the mean of the CV validation results for specificity of the model
Precision is the number of correct predictions of a student failure out of all students who were classified as experiencing a failure and was calculated using equation 4:
Precision = TP / (TP + FP)   (Equation 4)
# To calculate the precision of the model using k-fold cross validation
score_precision_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv=kf, scoring='precision').mean()
print(score_precision_mean) #View the mean of the CV validation results for precision of the model
The F-score (F1) is the harmonic mean of precision and recall and is calculated using equation 5:
F1 = 2 × (Precision × Recall) / (Precision + Recall)   (Equation 5)
#To calculate the F1-score of the model using k-fold cross validation
score_f1_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv=kf, scoring='f1').mean()
print(score_f1_mean) #View the mean of the CV validation results for the F1-score of the model
Receiver operating characteristic (ROC) curves are created by plotting the sensitivity (true positive rate) against 1 − specificity (the false positive rate) at different cut points for binary classification models [47]. The area under the ROC curve (AUC) is calculated from the ROC curve and is the last validation metric we use to assess each model. This single numerical score is considered superior to accuracy when evaluating the performance of prediction models [48].
#To calculate the ROC curve AUC of the model using k-fold cross validation
score_auc_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv=kf, scoring='roc_auc').mean()
print(score_auc_mean) #View the mean of the CV validation results for ROC curve AUC of the model
2.6.3. Tuning of the Random Forest Model:
Now that we know our model results, we can use data-driven approaches to try to improve the model's performance by tuning its hyperparameters, the parameters that are specified before training the model rather than learned during training. First, we define a grid of candidate hyperparameter values to search. We demonstrate two common data-driven hyperparameter tuning approaches: (1) random search [49] and (2) grid search [50]. We first use random search, as it requires less computational power and tests a user-specified number of randomly chosen combinations from the hyperparameter grid. Once we have the best estimates from the random search, we define a new hyperparameter grid with values closer to the selected outputs from the random search. We then use grid search, which evaluates every possible combination in the hyperparameter grid.
##Assess hyperparamters to try to improve upon base model:
#Import required functions:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
# Create the hyperparameter grid for first the random search function
hyper_grid = {# Number of trees to be included in random forest
'n_estimators': [150, 200, 250, 300, 350, 400],
# Number of features to consider at every split
'max_features': ['sqrt'],
#Maximum number of levels in a tree
'max_depth': [10, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
# Minimum number of samples required to split a node
'min_samples_split': [2, 4, 6, 8, 10],
# Minimum number of samples required at each leaf node
'min_samples_leaf': [1, 2, 4, 6, 8, 10],
# Method of selecting samples for training each tree
'bootstrap': [True, False]}
#Initiate random forest base model to tune
best_params = RandomForestClassifier(random_state=(23))
#Use random grid search to find best hyperparameters, uses k-fold validation as cross validation method
#Search 200 different combinations
best_params_results = RandomizedSearchCV(estimator=best_params, param_distributions= hyper_grid, n_iter=200, cv=kf, verbose=5, random_state=(23))
#Fit the random search model
best_params_results.fit(X_trainSMOTE, y_trainSMOTE)
#Find the best parameters from the grid search results
print(best_params_results.best_params_)
#Build another hyperparameter grid using narrowed down parameter guidelines from above
#Then use GridSearchCV method to search every combination of grid
new_grid = {'n_estimators': [250, 275, 300, 325, 332, 350, 375],
'max_features': ['sqrt'],
'max_depth': [160, 165, 170, 175, 180, 185, 190, 195],
'min_samples_split': [2, 3, 4, 5, 6],
'min_samples_leaf': [1, 2, 3],
'bootstrap': [True]}
#Initiate random forest base model to tune
best_params = RandomForestClassifier(random_state=(23))
#Use GridSearchCV method to search every combination of grid
best_params_grid_search = GridSearchCV(estimator=best_params, param_grid=new_grid, cv=kf, n_jobs=-1, verbose=10)
#Fit the gridsearch model
best_params_grid_search.fit(X_trainSMOTE, y_trainSMOTE)
#Get the results of the search grid from the random forest model
best_params_grid_search.best_params_
#Using the results of the best parameters, we will create a new model and show the specific arguments.
best_grid_model = RandomForestClassifier(n_estimators=375, max_features='sqrt', max_depth=(160), min_samples_split=2, min_samples_leaf=2, bootstrap=True)
#Best model based upon grid
best_grid_model.fit(X_trainSMOTE, y_trainSMOTE)
After creation of the model, the performance metrics can be calculated, and the veterinary educator will need to decide which model is best, the base model, or the model using the new parameters.
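For example, the same cross-validated scoring used for the base model can be repeated with the tuned model; the sketch below shows one metric, and the other scoring options shown earlier can be substituted.
#Re-compute a performance metric for the tuned model so it can be compared with the base model
tuned_accuracy_mean = cross_val_score(best_grid_model, X_testSMOTE, y_trueSMOTE, cv=kf, scoring='accuracy').mean()
print(tuned_accuracy_mean)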
2.6.4. Determining the Most Important Features of the Random Forest Model:
After selecting the best model based upon the performance metrics, we need to determine which features contribute most to the model's ability to predict a student's outcome. Each feature has a score: a higher score means the feature contributes more to the model's prediction, whereas a lower score indicates a smaller contribution. There are a variety of approaches to calculating feature importance values; some depend on the ML algorithm selected, whereas others can be used across a variety of ML algorithms [11].
We use two methods to assess feature importance: the Gini importance (or mean decrease in Gini) and the visual Shapley additive explanations (SHAP). Gini importance is the most common method used to determine the relative depth, or rank, of a feature used as a decision node within the random forest model [19,51]. The most important features have larger values and are most often located at decision nodes near the top of the individual trees (Figure 1). Being near the top of a tree means a larger percentage of the input samples pass through that decision node, so the feature contributes more to the final prediction than features at decision nodes lower in the tree [31,51]. Features of importance determined by SHAP are based upon classic game-theoretic Shapley values. SHAP measures local feature interaction effects and helps provide a better understanding of the overall model [52] by combining the explanations for each predicted student outcome.
#Most important features from best performing random forest model, Gini importance
feature_imp = pd.Series(best_grid_model.feature_importances_, index=X.columns)
feature_imp = feature_imp.sort_values(ascending=False)
print(feature_imp)
#Import required package
import shap
#Most important features from best performing random forest model, SHAP values
shap_feature_imp = shap.TreeExplainer(best_grid_model)
shap_values = shap_feature_imp.shap_values(X_testSMOTE)
shap.summary_plot(shap_values, X_testSMOTE) #Shows results in a plot
Model Employment:
The best performing model is selected and used to address the educational research question or problem.
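As an illustrative sketch only: once chosen, the model can be applied to newly collected student records that have been pre-processed (encoded and, if used, scaled) exactly like the training data; here the held-out test set stands in for new data.
#Illustrative only: apply the selected model to new, pre-processed student records
#(the held-out test data stand in for new records here)
new_predictions = best_grid_model.predict(X_testSMOTE)          #1 = course failure predicted, 0 = no failure predicted
new_probabilities = best_grid_model.predict_proba(X_testSMOTE)  #predicted probability of each outcome
print(new_predictions[:5])
print(new_probabilities[:5])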