The approach aims to predict the risk of attrition and to generate pipeline suitability scores for Naval student aviators. Its steps are typical of a data science workflow: the identification and ingestion of relevant and available datasets, the pre-processing and extraction of features of interest, and the training and testing of a number of machine-learning models. Each of these steps is discussed in detail below.
3.1. Datasets
Primary training stage academic and flight item grades were provided in .csv format for nearly 8,000 naval student aviators who were trained between 2012 and 2019. The academic and flight tests given to students depend on the syllabus followed, and changes to the syllabus over the years resulted in differences in the number and type of available flight and academic graded items. Information about training outcomes for the student aviators in the primary, intermediate, and advanced stages of training was also made available as additional .csv sheets. Each student in these tables was identified by a unique "ID_CODE" number, which was used when merging related sheets. An overview of the different datasets leveraged for the purpose of this work is provided in Figure 5 and further discussed in the sections below.
3.1.1. Flight Grades Data
Ten .csv sheets contained primary stage data, including primary stage syllabus status, attrition reason (if any), syllabus-event names, flight hours, flight-item grades, and the Maneuver Item File ("MIF") data. The primary stage training included flight events such as "abort take-off," "arcing," and about 140 others, with small differences from one syllabus to another. The ten tables were concatenated using the student IDs. A small number of students had been trained in more than one syllabus and were eliminated from the analysis. The flight grades were given on a scale of 1 to 5.
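The joining and filtering described above can be sketched as follows. This is a minimal illustration on toy rows, not the authors' code: only "ID_CODE" comes from the text, while the other column names (SYLLABUS, STATUS, EVENT, LANDING) and all values are hypothetical. In practice the rows would be read from the .csv sheets, for example with `csv.DictReader`.

```python
from collections import defaultdict

# Toy stand-ins for rows read from two primary-stage sheets.
# Every column except ID_CODE is a hypothetical name.
status_rows = [
    {"ID_CODE": "1001", "SYLLABUS": "A", "STATUS": "Complete"},
    {"ID_CODE": "1002", "SYLLABUS": "A", "STATUS": "Attrite"},
    {"ID_CODE": "1003", "SYLLABUS": "A", "STATUS": "Complete"},
    {"ID_CODE": "1003", "SYLLABUS": "B", "STATUS": "Complete"},
]
grade_rows = [
    {"ID_CODE": "1001", "EVENT": "C2101", "LANDING": 4},
    {"ID_CODE": "1001", "EVENT": "C2102", "LANDING": 5},
    {"ID_CODE": "1002", "EVENT": "C2101", "LANDING": 2},
    {"ID_CODE": "1003", "EVENT": "C2101", "LANDING": 3},
]


def students_with_one_syllabus(rows):
    """Return the IDs that appear under exactly one syllabus."""
    syllabi = defaultdict(set)
    for row in rows:
        syllabi[row["ID_CODE"]].add(row["SYLLABUS"])
    return {sid for sid, seen in syllabi.items() if len(seen) == 1}


def join_on_id(left_rows, right_rows, keep_ids):
    """Attach the per-student record to each matching per-event row."""
    left_by_id = {r["ID_CODE"]: r for r in left_rows if r["ID_CODE"] in keep_ids}
    return [
        {**left_by_id[r["ID_CODE"]], **r}
        for r in right_rows
        if r["ID_CODE"] in left_by_id
    ]


keep = students_with_one_syllabus(status_rows)
merged = join_on_id(status_rows, grade_rows, keep)
```

Student 1003 appears under two syllabi and is therefore dropped before the join, so `merged` contains only the event rows of students 1001 and 1002.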
3.1.2. Academic Grades Data
For each student, about 100 entries were available, each corresponding to a syllabus event such as C2101. In many instances, a syllabus event was repeated because it was left incomplete or because the first instance was a warm-up event. In each entry, instructors tested and graded only one or a small number of flight-graded items. Hence, most of the columns in each row were empty, and the data was extremely sparse. To address this challenge and facilitate the use of the data by machine-learning models, aggregation and feature extraction were performed, which required input from subject matter experts.
Eight .csv sheets with academic test grades were available. Most students in these tables had a single row with many columns filled, one for each academic grade obtained. However, academic test grades were not considered in the analysis for two reasons: (1) the large variance in the number of grades available for each student, and (2) academic failure not being the primary reason for attrition.
3.1.3. Training Outcomes Data
Three tables with information on training outcomes at the primary, intermediate, and advanced stages of training were available. These tables included information such as student ID, aircraft pipeline assigned, syllabus completion status, and Naval Standardized Scores (NSS). For the students who were unsuccessful, the reason for attrition was also provided. For the purpose of this effort, the completion status (attrite or successful) from the different training stages was used to generate the targets for the attrition risk prediction and the pipeline recommendation models. Overall, about 10,000 unique student IDs were recorded, with approximately 9,000, 3,000, and 6,000 student IDs available in the primary, intermediate, and advanced training datasets, respectively. Among them, 2,243 unique IDs were present in both the primary and intermediate datasets and 5,262 unique IDs were present in both the primary and advanced datasets, as shown in Figure 6. The entries that are common to the flight grades and outcomes datasets provide both features and targets to train the machine-learning models. However, the syllabus completion status in each training stage is not directly used. The objective of the classification machine-learning models is to differentiate students who would complete all stages of training from those who would not. Identifying the particular training stage at which a student would drop is not addressed as part of this effort. Finally, the flight proficiency and skills required differ from one aircraft pipeline to another. Consequently, using intermediate and advanced stage syllabus completion status as targets, without pipeline information, is not optimal. As a result, pipeline recommender models, which indicate the suitability of a student for a particular aircraft, were also developed. These models are specific to each pipeline and use the corresponding training completion status as targets.
3.2. Data Cleansing and Features Extraction
The datasets originated from different sources and required standardization and cleaning. The data contained both numeric and nominal features, as well as spelling errors and multiple formatting styles, which were standardized. For example, some entries in the advanced stage syllabus track are “Adv_Stk”, while others are “ADV_STK”. Such corrections were made through exploratory data analysis, and outliers and erroneous inputs were removed. One-hot integer encoding was used for categorical features.
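As a rough illustration of this cleaning step, the sketch below collapses formatting variants of a nominal value and then one-hot encodes the result. It assumes that the variants differ only in case and surrounding whitespace; the actual corrections may have been broader. The category "ADV_HELO" is invented for the example.

```python
def normalize_label(value):
    """Collapse case and whitespace variants such as 'Adv_Stk' and 'ADV_STK'."""
    return value.strip().upper()


def one_hot(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    cleaned = [normalize_label(v) for v in values]
    categories = sorted(set(cleaned))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in cleaned:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


# "ADV_HELO" is a made-up second category, used only for illustration.
raw_track = ["Adv_Stk", "ADV_STK", " adv_stk", "ADV_HELO"]
categories, encoded = one_hot(raw_track)
```

Normalizing before encoding matters here: without it, the three spellings of the same track would each receive their own indicator column.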
The data that would be available for a student aviator at different points in training was identified along with the corresponding target(s) for prediction. At any selected point during training, a set of features and a target provide the data necessary to train a supervised machine-learning model.
For the attrition prediction models, the target was the syllabus completion status, which takes one of two values, “Complete” or “Attrite”, according to whether or not a student successfully completed all stages of training.
For the pipeline recommender models: each machine-learning model pertained to one aircraft pipeline. For a selected aircraft pipeline, the student aviators that were successful were given a positive label and all other students were given a negative label, indicating that they were not suitable for that pipeline. Models trained using data labeled in this way would try to mimic the current selection process.
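The two labeling rules above can be written down compactly. The sketch below assumes a hypothetical per-student record with a list of stage statuses and an assigned pipeline; the field names and the example students are illustrative, not taken from the datasets.

```python
def attrition_target(student):
    """1 (attrite) if any recorded stage ended in attrition, otherwise 0."""
    return int(any(status == "Attrite" for status in student["statuses"]))


def pipeline_target(student, pipeline):
    """1 only if the student was assigned to `pipeline` and completed every stage."""
    completed = all(status == "Complete" for status in student["statuses"])
    return int(student["pipeline"] == pipeline and completed)


# Hypothetical students; statuses are listed in stage order.
students = [
    {"id": "A", "pipeline": "strike", "statuses": ["Complete", "Complete", "Complete"]},
    {"id": "B", "pipeline": "strike", "statuses": ["Complete", "Complete", "Attrite"]},
    {"id": "C", "pipeline": "rotary", "statuses": ["Complete", "Complete", "Complete"]},
    {"id": "D", "pipeline": None, "statuses": ["Attrite"]},
]

attrition_labels = [attrition_target(s) for s in students]
strike_labels = [pipeline_target(s, "strike") for s in students]
```

Note that student C is successful overall but still receives a negative strike label, which is what makes a model trained this way mimic the current selection process.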
Eighty-five percent of the 90 million cells in the concatenated flight grades table (600,000 rows and 140 columns) were empty. As a result, an aggregation and feature extraction strategy was needed so as to not lose key information. It is expected that a grade given by an instructor for a flight-graded item is indicative of the proficiency for that maneuver/skill, such as landing, take-off, headwork, and nearly 140 others. Following inputs from subject matter experts, the data was aggregated according to flight-graded items to capture maneuver/skill-specific proficiency levels as observed by flight instructors.
In particular, statistical features were generated from the multiple non-zero entries in each column for each student aviator. Five features (average, count, minimum, maximum, and trend over time) were calculated and stored for each flight-graded item column. Other features, such as total flight hours, number of days in training, total number of events, and failure rate, were extracted from other columns, as shown in Figure 7. Overall, this resulted in about 700 features for each of the 7,465 student aviators in the primary flight grades datasets. Still, 10% of the cells were empty, which can be accommodated by some machine-learning models. The reduction in empty cells achieved by the aggregation and feature extraction processes is shown in Figure 8, where white areas represent empty cells. Further, columns with only one unique value were eliminated and empty cells were filled in with the mean values of the corresponding columns.
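A minimal sketch of the aggregation and cleaning steps follows. Two points are assumptions rather than details given in the text: empty cells are represented as `None`, and "trend over time" is taken to be the least-squares slope of the grades against their chronological order. The column names in the toy table are hypothetical.

```python
from statistics import mean


def trend(values):
    """Least-squares slope of the grades against their order (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return None  # a slope needs at least two observations
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def item_features(grades):
    """Aggregate one student's grades for one flight item; None marks an empty cell."""
    seen = [g for g in grades if g is not None]
    if not seen:
        return {"avg": None, "count": 0, "min": None, "max": None, "trend": None}
    return {"avg": mean(seen), "count": len(seen), "min": min(seen),
            "max": max(seen), "trend": trend(seen)}


def clean_columns(table):
    """Drop columns with at most one distinct value, then mean-fill the gaps."""
    cleaned = {}
    for name, column in table.items():
        present = [v for v in column if v is not None]
        if len(set(present)) <= 1:
            continue  # constant (or entirely empty) column carries no signal
        fill = mean(present)
        cleaned[name] = [fill if v is None else v for v in column]
    return cleaned


# One student's chronological grades for a single item (mostly empty cells).
landing = item_features([None, 3, None, 4, 5])

# Toy feature table: one list per feature column, one entry per student.
features = clean_columns({
    "landing_avg": [4.0, None, 2.0],
    "landing_count": [3, 3, 3],
    "flight_hours": [10.0, 12.0, None],
})
```

Aggregating per item turns a long, sparse run of per-event rows into a fixed-length vector per student, which is why the share of empty cells drops so sharply.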
Feature datasets were also generated using flight grades available only from the first quarter, and up to the second and third quarters, of primary training. The split into quarters was based on the average number of events required by a student aviator to successfully complete primary-stage training. This allows for more continuous monitoring of attrition risk. It also allows the potential cost savings achieved through accurate attrition prediction to be calculated more precisely, as attrition during Q-3 or Q-4 (the third or fourth quarter) of the primary stage is more expensive than attrition during Q-1 or Q-2. For example, machine-learning models estimating the risk of attrition between the end of Q-1 and the end of the advanced training stage were trained using only Q-1 grades. Machine-learning models were also trained with all of the primary flight grade data and additional pipeline information from the intermediate and advanced stages, when available, to estimate the risk of attrition. Flight grades from the intermediate and advanced stages, if available, should be used in the end-of-intermediate and end-of-advanced stage models to predict attrition in the advanced stage and the Fleet Replacement Squadron (FRS) stage, respectively. The different models and the timeline at which they are to be used are depicted in Figure 9.
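One way the quarter split could be realized is sketched below. The text states only that the split was based on the average number of events needed to complete primary training; the specific rule used here (cutting at 25%, 50%, and 75% of that average and truncating each student's chronological record) is an assumption, and the event counts are invented.

```python
from statistics import mean


def quarter_cutoffs(event_counts_of_completers):
    """Event-count boundaries for the end of Q-1, Q-2, and Q-3."""
    avg_events = mean(event_counts_of_completers)
    return [round(avg_events * q / 4) for q in (1, 2, 3)]


def events_up_to(events, cutoff):
    """Keep only the first `cutoff` events of one student's chronological record."""
    return events[:cutoff]


# Hypothetical numbers of events taken by students who completed primary training.
cutoffs = quarter_cutoffs([96, 100, 104])

# A student who has logged 40 events so far; a Q-1 model sees only the first 25.
student_events = list(range(1, 41))
q1_view = events_up_to(student_events, cutoffs[0])
```

Features for a Q-1 model would then be aggregated from `q1_view` only, so that the model never sees grades recorded after its decision point.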
Table 1 summarizes the data which was available and leveraged as part of this effort, along with the targets that the models would predict at the different training timelines.
3.3. Training Machine-Learning Models
The attrition risk prediction is approached as a binary supervised classification problem, with the probability of a positive classification (0 to 1) being used as the attrition risk score. In binary classification problems, class imbalance can be a challenge. Class imbalance refers to a significant difference in the number of positive class labels (student attrition) and negative class labels (successful students) as targets. The attrition rates in each of the three (primary, intermediate, and advanced) training stages varied between 3 and 8 percent in the provided datasets. Since machine-learning models perform better with close to equal distribution of class labels, advanced sampling strategies such as adaptive-synthetic (ADASYN) oversampling [22] and random undersampling (RUS) [23] were used in different models to reduce the class imbalance. Similarly, the pipeline recommender models were also framed as binary supervised classification problems, and undersampling and oversampling techniques were tested.
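Random undersampling is simple enough to sketch directly. The function below is an illustrative stand-in, not the implementation used in this work, and the 5% attrition rate and target share are example values. ADASYN is more involved (it synthesizes new minority samples from nearest neighbors) and is not reproduced here; library implementations exist, for instance in the imbalanced-learn package.

```python
import random


def random_undersample(features, labels, positive_share, seed=0):
    """Drop random majority (label 0) rows until positives make up `positive_share`."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives that leaves the requested share of positives.
    n_neg = round(len(pos) * (1 - positive_share) / positive_share)
    kept_neg = rng.sample(neg, min(n_neg, len(neg)))
    keep = sorted(pos + kept_neg)
    return [features[i] for i in keep], [labels[i] for i in keep]


# Toy data with a 5% attrition rate: 5 positives among 100 students.
labels = [1] * 5 + [0] * 95
features = [[i] for i in range(100)]
X_bal, y_bal = random_undersample(features, labels, positive_share=0.25)
```

All positive rows are retained; only the majority class is thinned, so the balanced set here has 5 positives and 15 negatives.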
Many different types of machine-learning classification models can be used to demonstrate the aforementioned approach and objectives. Results for models ranging from relatively simple classifiers to advanced models combined with sampling techniques are generated and reported in this paper. Tested models include logistic regression, support vector machines (SVM), k-nearest neighbors (k-NN), decision trees, random forests, gradient boosting, XGBoost, light gradient boosting machines, and multi-layer perceptrons (MLP). A five-fold data split that randomly allocates 80% of the data to train the models and 20% to test them was implemented. To evaluate and compare the performance of the models, performance metrics are needed that not only consider the accurate identification of attriting students (true positives) but also penalize false positives. If a policy of proactive removal of students with a high risk of attrition is implemented, a false positive prediction removes a student who would have gone on to be successful, and thus leads to additional costs equal to those needed to train another student to the same stage. In order to more precisely calculate the savings and additional costs due to accurate attrition prediction or false positives, true positive and false positive metrics were further classified based on when the attrition occurred or when proactive removal would have been implemented. Cost savings and additional costs were calculated for each model’s predictions based on the estimated cost of training a student aviator in different stages of training. Other more direct machine-learning model performance metrics such as Area Under the Receiver Operating Characteristic Curve (AUROC), F1 score, and Matthews Correlation Coefficient (MCC) were also calculated and reported.
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the trade-off between the true positive rate (TPR) and false positive rate (FPR) of a binary classifier, as the classification threshold is varied. The AUROC is the area under this curve, ranging from 0 to 1, where a value of 0.5 indicates random guessing and a value of 1.0 indicates perfect discrimination between positive and negative classes.
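The AUROC also has a rank-based reading: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up scores; it is meant to illustrate the metric, not to replace a library routine.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up attrition risk scores for five students (1 = attrite).
y_true = [1, 1, 0, 0, 0]
risk = [0.9, 0.4, 0.5, 0.3, 0.1]
score = auroc(y_true, risk)
```

Here five of the six positive-negative pairs are ordered correctly, giving 5/6. A model that assigns every student the same score gets 0.5, which matches the random-guessing baseline.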
The F1 score ranges from 0 to 1 and is the harmonic mean of precision and recall, two metrics that measure different aspects of the performance of a binary classifier. Precision is the proportion of true positives among all the instances that are predicted as positive. Recall is the proportion of true positives among all the instances that are actually positive. The F1 score combines these two metrics to provide a single score that balances the trade-off between precision and recall.
The MCC is another metric used to evaluate binary classification models. It considers all four possible outcomes of a binary classification problem: true positives, true negatives, false positives, and false negatives. The MCC ranges from -1 to +1, where a value of -1 indicates total disagreement between the predicted and true labels, 0 indicates performance no better than random guessing, and +1 indicates perfect agreement between the predicted and true labels. The MCC is less sensitive to class imbalance than other metrics such as accuracy and the F1 score, and it can be a better metric to use when the classes are imbalanced.
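To make these definitions concrete, the sketch below computes the F1 score and the MCC from confusion-matrix counts. The convention of returning 0 when a denominator vanishes is a common choice and is assumed here; the example counts are invented and use a 5% attrition rate.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 1 as the positive (attrition) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (0 if either is undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient (0 if the denominator vanishes)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# 100 students, 5 of whom attrite; a model that never predicts attrition.
y_true = [1] * 5 + [0] * 95
y_never = [0] * 100
tp, fp, fn, tn = confusion_counts(y_true, y_never)
accuracy_never = (tp + tn) / len(y_true)
f1_never = f1_score(tp, fp, fn)
mcc_never = mcc(tp, fp, fn, tn)

# A model that flags 5 students and gets 3 of them right.
f1_model = f1_score(tp=3, fp=2, fn=2)
mcc_model = mcc(tp=3, fp=2, fn=2, tn=93)
```

The never-predict model reaches 95% accuracy while scoring 0 on both F1 and MCC, which illustrates why accuracy alone is uninformative at these attrition rates.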
Other data split ratios such as 70-30, 60-40, and 50-50 were also utilized. Oversampling and undersampling techniques were utilized to generate training datasets with 10%, 25%, and even 50% (equal distribution) positive class labels. Only the training data was resampled, so that evaluation was performed only on real instances and the reported performance metrics are representative. After evaluating the trained models on some initial datasets, the top-performing machine-learning algorithms were identified and further utilized. These included XGBoost, random forests, gradient boosting, and light gradient-boosting machines.
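The point that only the training portion is resampled is easy to get wrong, so a small sketch may help. It builds five folds (each holding out about 20% of the rows) and then rebalances the training indices alone. For brevity it uses plain random duplication of positive rows rather than ADASYN; the function names, the 5% attrition rate, and the 25% target share are illustrative assumptions.

```python
import random


def five_fold_indices(n_rows, seed=0):
    """Yield (train, test) index lists; each fold holds out about 20% of the rows."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[k::5] for k in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def oversample_positives(train_idx, labels, positive_share, seed=0):
    """Duplicate random positive rows, drawing from the training indices only."""
    rng = random.Random(seed)
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    if not pos:
        return list(train_idx)  # nothing to duplicate
    target_pos = round(len(neg) * positive_share / (1 - positive_share))
    extra = [rng.choice(pos) for _ in range(max(0, target_pos - len(pos)))]
    return list(train_idx) + extra


# Toy labels with a 5% attrition rate.
labels = [1] * 10 + [0] * 190
splits = list(five_fold_indices(len(labels)))

train0, test0 = splits[0]
balanced_train0 = oversample_positives(train0, labels, positive_share=0.25)
share0 = sum(labels[i] for i in balanced_train0) / len(balanced_train0)
```

The held-out indices in `test0` are never touched, so every metric computed on them reflects real students at the original class ratio.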
Pipeline recommender models were demonstrated by identifying strike pipeline-suitable students from among all student aviators. The strike pipeline is one of the key pipelines in which the U.S. Navy and others are facing an acute shortage, and it is also one of the most demanding. Currently, an NSS of 50 is used as a cutoff to qualify for the strike pipeline, and the available slots are filled in descending order of these scores. To train strike pipeline recommender/suitability models, all students were given a negative label except those who were selected for and were successful in the strike pipeline. Recommender models for other pipelines can be trained similarly but were outside the scope of this effort. These models can be trained with all of the primary training stage data or with flight grades data only up to the end of Q-1, Q-2, or Q-3. The choice depends on where continuous tracking of these estimates is the most useful, or on how early CNATRA would like to be informed about the number of students suitable for the strike or other pipelines. Based on the suitability scores between 0 and 1, student aviators can be ranked according to their probability of success in that pipeline and selected in that order instead of by their NSS values.
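The difference between the two selection rules can be shown on a toy roster. All students, scores, and the number of slots below are invented, and treating the NSS cutoff of 50 as inclusive is an assumption.

```python
def select_for_pipeline(students, slots, key):
    """Return the IDs of the top `slots` students in descending order of `key`."""
    ranked = sorted(students, key=lambda s: s[key], reverse=True)
    return [s["id"] for s in ranked[:slots]]


# Hypothetical roster: NSS and model suitability score for the strike pipeline.
roster = [
    {"id": "A", "nss": 62, "suitability": 0.41},
    {"id": "B", "nss": 55, "suitability": 0.88},
    {"id": "C", "nss": 58, "suitability": 0.73},
    {"id": "D", "nss": 47, "suitability": 0.64},
]

# Current practice: apply the NSS cutoff, then fill slots by descending NSS.
eligible = [s for s in roster if s["nss"] >= 50]
by_nss = select_for_pipeline(eligible, slots=2, key="nss")

# Alternative: fill the same slots by descending suitability score.
by_model = select_for_pipeline(roster, slots=2, key="suitability")
```

With two slots, the NSS rule picks A and C, whereas the suitability ranking picks B and C; whether the NSS cutoff should still be applied before ranking by suitability is a policy choice that this sketch leaves open.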
3.4. Attrition Costs Modeling and Savings Estimation
The costs of attrition at each of the primary stage quarters and at the intermediate and advanced training stages were estimated from the literature for the strike pipeline (Figure 4). The cost savings achieved by proactively removing high-attrition risk student aviators in different scenarios were also calculated using data pertaining to these past students. The magnitude of the savings depends on the earliest time a machine-learning model predicted this outcome and on when the attrition actually occurred.
Figure 10. Attrition cost modeling and savings estimation Excel tool with sample results.
First, annual attrition costs to the U.S. Navy were estimated assuming 1,100 students in training, which is approximately the annual throughput of the U.S. Navy’s pilot training. The total cost of training differs by aircraft (F-35, F-18, P3-P8, tilt rotors, and others). The average cost to train aviators in different pipelines (including the FRS stage) was estimated to be $6M. The cost of intermediate and advanced training was approximated accordingly. The attrition costs were then estimated as approximately $100M per year using the observed attrition rates in the primary (by quarter), intermediate, and advanced training stages.
Potential cost savings to the U.S. Navy were estimated using the machine-learning models’ true positive and false positive performance metrics for each primary quarter and for the intermediate and advanced stages. For example, if an attrition that occurred at the end of the primary stage was predicted at the end of the first quarter of primary training, the direct cost savings are equal to the cost of training a student aviator in the second, third, and fourth quarters of the primary stage. Each primary stage quarter was assumed to require an equal amount of resources. Assuming equal cost provides a conservative cost savings estimate as, in reality, later parts of primary training require more flight hours than earlier parts. For the intermediate and advanced stages, the exact information on when a student aviator left the training program is not available, and attrition is assumed to occur at the end of those stages. An Excel-based cost savings calculator was developed into which a machine-learning model’s performance metrics can be entered. This sheet calculates the cost savings due to true positive predictions, the added costs due to false positives, and the net cost benefit or loss. Using this tool, at each decision point in time, the model with the best performance metrics, and hence the greatest net cost benefit (if any), is chosen as the best one. After all such models’ results are entered, the cost savings per training stage as well as the net value are calculated.
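The arithmetic behind such a calculator can be sketched in a few lines. The cumulative cost figures below are placeholders chosen only to make the example readable; they are not the estimates used in this work. The sketch follows the stated assumptions: primary quarters cost the same, a true positive saves the training that would otherwise have been delivered between the prediction and the actual attrition, and a false positive costs the training of a replacement student up to the point of removal.

```python
# Placeholder cumulative cost (in $K) invested in one student by the end of each
# decision point. These values are hypothetical, with equal-cost primary quarters.
CUMULATIVE_COST = {"Q1": 50, "Q2": 100, "Q3": 150, "Q4": 200, "INT": 600, "ADV": 1500}


def tp_saving(predicted_at, attrited_at):
    """Training cost avoided by removing at `predicted_at` a student who would
    otherwise have left at `attrited_at`."""
    return max(0, CUMULATIVE_COST[attrited_at] - CUMULATIVE_COST[predicted_at])


def fp_cost(predicted_at):
    """Cost of bringing a replacement student up to the point of removal."""
    return CUMULATIVE_COST[predicted_at]


def net_benefit(true_positives, false_positives):
    """true_positives: (predicted_at, attrited_at) pairs; false_positives: removal points."""
    savings = sum(tp_saving(p, a) for p, a in true_positives)
    added = sum(fp_cost(p) for p in false_positives)
    return savings, added, savings - added


# Example: three correct early warnings and two wrongly flagged students.
tps = [("Q1", "Q4"), ("Q1", "ADV"), ("Q2", "INT")]
fps = ["Q1", "Q1"]
savings, added, net = net_benefit(tps, fps)
```

The first true positive reproduces the example in the text: a prediction at the end of Q-1 for an attrition at the end of the primary stage saves the cost of the second, third, and fourth quarters. Because false positives at early decision points are comparatively cheap under this model, the earlier a model can flag a student reliably, the larger the net benefit.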