1. Introduction
CrossFit® (CF) is a conditioning and training program that has garnered recognition and piqued the interest of the physically active population. Initially designed for military training, it has gradually gained popularity among civilians. This program is founded on a series of intricate exercises encompassing running, weightlifting, Olympic gymnastics, and ballistic movements [1]. The rapid rise in the popularity of the training program has also generated increased scientific interest over the past decade [2].
Training is organized into daily sessions referred to as ‘workouts of the day’ (WODs). These sessions encompass a variety of exercise modalities, including calisthenics (such as pull-ups, muscle-ups, and burpees), metabolic conditioning (involving cardiovascular exercises like running and rowing), and weightlifting [3]. The weightlifting component covers both Olympic movements, including the snatch and clean and jerk, as well as powerlifting movements, such as the squat, bench press, and deadlift [4,5]. This comprehensive approach has resulted in notable physical conditioning improvements, and the number of practitioners has significantly increased in recent years, with over 15,000 affiliates worldwide [6,7].
Numerous studies have been conducted in the field of CF. Several systematic reviews have demonstrated that CF workouts exhibit consistent anaerobic and aerobic characteristics, leading to significant metabolite accumulation, such as elevated blood lactate concentrations ranging from 6 to 18 mmol/L. These workouts also result in increased markers of muscle damage, including creatine phosphokinase (CPK), interleukin-6 (IL-6), and IL-10, as well as muscle fatigue, as evidenced by reduced countermovement jump (CMJ) values, mean power output (MPO), and plank time [8,9,10,11].
Furthermore, multiple studies have delved into examining the physiological adaptations resulting from CF training, particularly within the cardiovascular, respiratory, and muscular systems. These investigations have revealed associations between various factors, including body mass [12], strength and anaerobic power [12,13,14,15,16], aerobic capacity [9], sport-specific skill [14,16], and experience [15], and both CF workout performance and competitive ranking. These findings collectively suggest that athletes must focus on training across these diverse attributes to excel in competition [17].
Research dedicated to analyzing and predicting CF performance has also increased in recent years. One such study had a small group of 16 participants perform CF workouts and analyzed the resulting data to predict performance [18]. Other studies predicted performance from surveys of participants’ athletic experience and physical fitness [19,20].
However, it’s important to note that these studies employed very limited sample sizes. These smaller datasets possess limitations when it comes to predicting results, as they may be influenced by external environmental factors and individual characteristics.
The amelioration of these limitations and the accurate prediction of objective CF performance necessitate the utilization of comprehensive big data analysis. The significant potential of big data in the realm of sports is already undergoing validation across a diverse spectrum of athletic disciplines [
21].
Unlike previous research, we harness the power of machine learning techniques, specifically designed for handling extensive datasets. Machine learning is a burgeoning field at the intersection of computer science and statistics. It empowers algorithms to learn patterns and make predictions from data.
The validity of employing big data analysis using machine learning is underscored by its remarkable success in predicting sports performance across various events [
22,
23,
24]. Machine learning techniques have revolutionized our ability to glean meaningful insights from vast datasets. Their application in sports analytics has demonstrated impressive accuracy, and they have the potential to uncover nuanced relationships in CF performance data that traditional approaches might overlook.
To the best of our knowledge, no previous work has explored CF performance prediction using a dataset as large as the one leveraged in our research (over 400,000 records). This study addresses the limited sample sizes of previous research by applying big data analysis and machine learning techniques. Furthermore, the insights gained from this study can serve as a foundation for evidence-based training programs tailored for CF athletes, aiming to maximize their physical capabilities while minimizing injury risks.
In this study, our research methodology employs machine learning techniques, specifically Random Forest (RF) and Multiple Linear Regression (MLR), to develop predictive models. These models will be instrumental in forecasting CF performance. Additionally, we will utilize feature importance analysis to evaluate the significance of various attributes in influencing performance outcomes. This approach allows us to gain insights into the key factors that contribute to an athlete’s success in CF, advancing our understanding of performance determinants in this context.
The primary objective of this research is to leverage machine learning techniques to construct a robust performance prediction model using the CF athlete dataset. This model’s primary goal is to provide accurate forecasts and insights into specific aspects of CF performance, as detailed in the subsequent sections of this paper.
2. Materials and Methods
Figure 1 illustrates the data processing workflow, encompassing raw data processing, data preprocessing, exploratory data analysis (EDA), model learning, evaluation, and the derivation of final data-driven products.
2.1. Data Preprocessing and Exploratory Data Analysis
To better understand the characteristics and performance of CF athletes, a comprehensive dataset has been collected from competitions and events around the world. By analyzing the data, we can identify factors, such as age, gender, training volume, and exercise selection, that may be associated with higher levels of performance.
We utilized the Kaggle dataset [
25], an open-source data repository, which contains an extensive collection of athlete data. This dataset encompasses essential athlete demographics, including age, gender, and training experience, in addition to performance metrics such as time, weight lifted, and repetitions completed. Notably, the dataset encompasses a wide range of CF events and competitions, providing a wealth of information on athlete performance in various contexts and settings. In its entirety, the dataset comprises 423,006 rows and 27 columns.
From the initial dataset consisting of 27 columns, we meticulously curated a subset of 12 columns to form the foundation for our performance prediction model. This process involved deliberately removing columns unrelated to performance, such as ID, name, team, and affiliates. The selected subset encompasses a wide range of factors, including crucial demographic variables like age, gender, height, and weight. Moreover, it encompasses performance metrics relevant to distinct exercise categories: ‘Strength’ exercises, comprising back squat and deadlift, ‘Power’ exercises, which include clean & jerk and snatch, and ‘Conditioning’ exercises, specifically fran and grace. Additionally, our dataset incorporates essential insights into the athlete’s environment and training experience, thereby offering a holistic perspective on the multifaceted determinants of CF performance.
In our demographic data preprocessing, we retained only adults over 18 years of age and removed entries whose gender was not recorded as male or female. We retained heights within the range of 4 to 8 feet and excluded implausible body weights, such as entries recorded as ‘1750lbs’ and ‘2113lbs’. In the processing of exercise movement data, we excluded records expressed as negative numbers and any entries that exceeded world records. Subsequently, to maintain uniformity and ensure appropriate scaling, we applied the standard scaler to all the records. Lastly, we categorized the training environment as ‘house’ or ‘gym’ and grouped training experience into five categories: less than 6 months, 6-12 months, 1-2 years, 2-4 years, and over 4 years. To ensure compatibility with the model, we encoded these categorical variables into numeric format using one-hot encoding.
All of the methods used in this study, including data preprocessing and machine learning model training, were implemented in Python (version 3.9.12).
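The filtering, scaling, and encoding steps described in this subsection can be sketched as follows. This is a minimal illustration in plain Python, not the pipeline actually used in the study: the field names, the sample rows, and the two cut-off constants are hypothetical, and "over 18" is assumed to mean 18 or older.

```python
from statistics import mean, pstdev

# Hypothetical records; field names and values are illustrative only.
rows = [
    {"age": 25, "gender": "Male", "height_ft": 5.9, "weight_lb": 180, "deadlift": 405, "howlong": "1-2 years"},
    {"age": 17, "gender": "Female", "height_ft": 5.4, "weight_lb": 130, "deadlift": 185, "howlong": "6-12 months"},
    {"age": 31, "gender": "Female", "height_ft": 5.5, "weight_lb": 140, "deadlift": 255, "howlong": "2-4 years"},
    {"age": 40, "gender": "Male", "height_ft": 6.0, "weight_lb": 2113, "deadlift": 365, "howlong": "over 4 years"},
    {"age": 28, "gender": "Male", "height_ft": 5.8, "weight_lb": 175, "deadlift": -10, "howlong": "less than 6 months"},
    {"age": 35, "gender": "--", "height_ft": 5.7, "weight_lb": 160, "deadlift": 300, "howlong": "1-2 years"},
]

EXPERIENCE_LEVELS = ["less than 6 months", "6-12 months", "1-2 years", "2-4 years", "over 4 years"]
MAX_BODY_WEIGHT_LB = 1000  # assumed cut-off for implausible body weights
MAX_DEADLIFT_LB = 1200     # assumed placeholder for a world-record cap

def keep(row):
    """Apply the demographic and exercise filters to one record."""
    return (
        row["age"] >= 18                              # adults only (assumed: 18 or older)
        and row["gender"] in ("Male", "Female")       # gender must be specified
        and 4 <= row["height_ft"] <= 8                # plausible height range
        and 0 < row["weight_lb"] <= MAX_BODY_WEIGHT_LB
        and 0 <= row["deadlift"] <= MAX_DEADLIFT_LB   # no negatives, nothing above the cap
    )

def standardize(values):
    """Zero-mean, unit-variance scaling using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]

clean = [r for r in rows if keep(r)]
scaled_deadlift = standardize([r["deadlift"] for r in clean])
encoded_experience = [one_hot(r["howlong"], EXPERIENCE_LEVELS) for r in clean]
```

Only the first and third hypothetical rows survive the filters here; the others fail the age, body weight, negative-value, or gender checks.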
2.2. Prediction Models
2.2.1. Random Forest (RF)
RF, a robust ensemble learning technique, combines multiple decision trees to bolster predictive accuracy while mitigating overfitting. The essence of RF lies in the construction of numerous decision trees, each trained on distinct data subsets generated through bootstrapping. These individual trees together form a "forest" of models. Notably, RF introduces an element of randomness in feature selection during the tree-building process, adding a further layer of diversity to the model. This diversity, coupled with the aggregation of predictions from the individual trees, culminates in a highly accurate and resilient predictive model [26].
In the context of predicting sports performance, RF has garnered considerable attention for its suitability and effectiveness [
27]. CF performance is inherently multifaceted, influenced by a multitude of variables spanning physical attributes, athletic experience, location, and movement skills. The flexibility of RF allows it to accommodate this complexity, making it well-suited for modeling the intricate relationships between these variables and performance outcomes [
28]. The advantages of RF include its ability to handle high-dimensional data, capture nonlinear relationships, and provide valuable insights into feature importance [
29]. These attributes make RF an invaluable tool for CF performance prediction, enabling the development of models that not only deliver accurate forecasts but also shed light on the key factors influencing an athlete’s performance.
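To make the two sources of randomness concrete (bootstrap samples and random feature subsets), the sketch below builds a toy forest in plain Python. It is a simplification and not the model used in the study: each "tree" is a single-split stump, and the data, feature meanings (snatch and back squat predicting clean & jerk), and function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y, features):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for j in features:
        # Excluding the largest value guarantees that both sides of the split are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every candidate feature is constant in this sample
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_forest(X, y, n_trees=50, max_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        feats = rng.sample(range(d), max_features)   # random subset of features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_stump(s, row) for s in forest)

# Hypothetical data: columns are [snatch, back squat], target is clean & jerk (lb).
X = [[95, 185], [115, 225], [135, 255], [155, 295], [185, 335], [205, 365]]
y = [135, 155, 175, 205, 235, 255]

forest = fit_forest(X, y, n_trees=100, max_features=1)
pred_light = predict_forest(forest, [100, 200])
pred_heavy = predict_forest(forest, [200, 360])
```

A full implementation would grow deeper trees and tune the number of trees and the feature-subset size; the sketch only shows why averaging many differently trained trees yields a smoother prediction than any single tree.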
2.2.2. Multiple Linear Regression (MLR)
Multiple Linear Regression (MLR) is a robust statistical method employed for investigating intricate relationships between a dependent (response) variable and a multitude of independent (explanatory) variables. MLR serves as an extension of the fundamental concept of simple linear regression, which primarily addresses scenarios involving a single predictor variable, to scenarios marked by the presence of multiple predictors [
30].
The utility of MLR transcends disciplinary boundaries, finding application in diverse domains encompassing public health and sports science [
31,
32,
33]. It thrives in the modeling of intricate interactions among multiple variables and facilitates the generation of predictive models rooted in these intricate relationships. Notably, MLR contributes essential insights into the identification of pivotal independent variables that exert significant influence in elucidating the variances observed in the dependent variable.
As such, MLR serves as a fundamental tool for in-depth analysis. It enables us to meticulously explore intricate relationships within our domain of study, providing a structured approach to understanding the complex interactions among multiple variables.
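As an illustration, MLR models the response as an intercept plus a weighted sum of the predictors, and the weights that minimize the squared error can be obtained from the normal equations. The sketch below does this in plain Python on synthetic data constructed to be exactly linear; the helper names and the data are hypothetical and do not come from the study.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pick the largest pivot
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * pc for a, pc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_mlr(X, y):
    """Return [intercept, b1, ..., bp] solving the normal equations (Z'Z) b = Z'y."""
    Z = [[1.0] + list(row) for row in X]  # prepend a column of ones for the intercept
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(ZtZ, Zty)

def predict_mlr(beta, row):
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

# Synthetic data generated as y = 10 + 0.5 * x1 + 0.25 * x2 (no noise).
X = [[100, 200], [120, 260], [140, 240], [160, 320], [180, 300]]
y = [110, 135, 140, 170, 175]

beta = fit_mlr(X, y)
prediction = predict_mlr(beta, [150, 280])
```

Because the synthetic data contain no noise, the fitted coefficients recover the generating values; with real data the coefficients would instead describe the best linear approximation.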
2.3. Performance Metrics
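To evaluate the performance of each algorithm, this study utilizes two metrics:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $\hat{y}_i$ is the model’s prediction, $y_i$ is the true value, $\bar{y}$ is the average true value, and $n$ is the total number of predictions.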
Mean Squared Error (MSE) measures the average of the squared errors, and R squared (R2) is the percentage of the total variation in the dependent variable that is accounted for by the independent variables [34]. MSE serves as one of our primary performance metrics, providing an estimate of the average squared error of the model’s predictions [35]. The value of MSE is never negative, and a value closer to zero indicates a better regression model.
We also use the R2 metric as a central tool for evaluating the effectiveness of our models in predicting CF performance. R2 quantifies the proportion of variance in the dependent variable explained by the independent variables in a regression model. It typically ranges from 0 (no explanatory power) to 1 (perfect fit), with higher values indicating a better model fit to the data, although it can fall below 0 on held-out data when a model predicts worse than the mean of the true values. R2 is a standard measure for assessing model goodness of fit in regression analysis [36,37], aligning with our study’s objective of uncovering the predictive power of our models.
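Both metrics are simple to compute directly. The sketch below uses the standard definitions (MSE as the mean of squared differences between true and predicted values, and R2 as one minus the ratio of the residual sum of squares to the total sum of squares); the four value pairs are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """One minus residual sum of squares divided by total sum of squares."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up true and predicted lifts (lb); every prediction is off by 10 lb.
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]

error = mse(y_true, y_pred)   # 400 / 4 = 100
fit = r2(y_true, y_pred)      # 1 - 400 / 12500 = 0.968
```

Here the residual sum of squares is 400 and the total sum of squares is 12,500, so the predictions account for 96.8% of the variance in the true values.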
2.4. Feature Importance Analysis
To gain insights into the significance of individual variables in shaping CF performance, we employed feature importance techniques.
2.4.1. Random Forest (RF)
RF facilitates the assessment of feature importance, aiding in the identification of key variables that significantly impact model performance [
38]. Following a previous study that employed feature importance for sports performance analysis, we use feature importance to investigate the relationships between variables in order to understand and predict CF performance [
39]. By utilizing feature importance, we aim to unveil the influence of individual variables and their significance in shaping performance outcomes. This approach capitalizes on the flexibility, interpretability, and exceptional predictive capabilities inherent to the RF model, enhancing the depth of our insights into CF performance dynamics.
2.4.2. Extreme Gradient Boosting (XGBoost) and Adaptive Boosting (AdaBoost)
These two boosting algorithms adopt an iterative approach to model building, aiming to enhance overall accuracy while enabling us to pinpoint pivotal variables that play a substantial role in influencing sports performance. They are notably popular choices for predicting sports performance with diverse sets of variables due to their ability to capture complex relationships within the data [
40,
41].
XGBoost is celebrated for its robustness and exceptional predictive capabilities. This algorithm functions by iteratively training a sequence of decision tree models, with each tree focusing on rectifying the prediction errors of the previous one. By aggregating these so-called "weak learners," XGBoost constructs a formidable predictive model renowned for its proficiency in capturing intricate patterns and relationships within the data. Within the context of CF performance prediction, XGBoost empowers us to delve deep into the complex dependencies that exist between various factors and performance outcomes. Moreover, its adaptability and capability to handle missing data significantly augment our feature importance analysis [
42].
On the other hand, AdaBoost stands out for its aptitude in refining predictive models. This technique operates through the sequential training of a series of weak learners, with each learner assigning more weight to the instances that the previous learners predicted poorly. This iterative process enhances the model’s accuracy while pinpointing influential variables. In our study, AdaBoost complements our approach by accentuating the significance of specific features in predicting performance outcomes. Furthermore, its adaptability and its ability to address data imbalances reinforce the reliability of our feature importance analysis [
43].
In summary, our approach integrates RF and MLR models to predict CF performance, while feature importance techniques (RF, XGBoost, and AdaBoost) allow us to uncover the key variables driving performance outcomes. This methodological framework equips us with the tools needed to understand and forecast the multifaceted nature of CF performance effectively.
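The sequential idea behind boosting, and one common way of deriving feature importance from it, can be sketched as follows. This is a deliberately simplified gradient-boosting loop with single-split stumps and squared error; it is neither XGBoost (which adds regularization and second-order gradient information) nor AdaBoost (which reweights instances), and the data and function names are hypothetical. Importance is computed here as each feature's share of the total error reduction, which is only one of several definitions in use.

```python
from statistics import mean

def best_stump(X, r):
    """Return (gain, feature, threshold, left_mean, right_mean) for the best split of r."""
    m = mean(r)
    base = sum((v - m) ** 2 for v in r)
    best = (0.0, 0, float("inf"), m, m)  # fallback: no split, predict the mean
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:
            left = [v for row, v in zip(X, r) if row[j] <= t]
            right = [v for row, v in zip(X, r) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if base - sse > best[0]:
                best = (base - sse, j, t, ml, mr)
    return best

def boost(X, y, n_rounds=20, lr=0.5):
    """Fit stumps to successive residuals and accumulate per-feature gain."""
    pred = [mean(y)] * len(y)
    gains = [0.0] * len(X[0])
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # errors of the current ensemble
        gain, j, t, ml, mr = best_stump(X, resid)
        gains[j] += gain                              # credit the feature that was split on
        pred = [p + lr * (ml if row[j] <= t else mr) for row, p in zip(X, pred)]
    total = sum(gains)
    importance = [g / total if total else 0.0 for g in gains]
    return pred, importance

# Hypothetical data: column 0 tracks the target closely, column 1 is unrelated noise.
X = [[95, 3], [115, 1], [135, 4], [155, 1], [185, 5], [205, 2]]
y = [135, 155, 175, 205, 235, 255]

pred, importance = boost(X, y)
```

In this toy example the informative column receives most of the importance because splits on it remove most of the squared error.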
3. Results
This section describes the results achieved by the machine learning models. Section 3.1 presents the performance predictions, and Section 3.2 is devoted to the analysis of feature importance.
3.1. Prediction Model
In our pursuit of a comprehensive analysis of CF performance prediction, we applied both RF and MLR models to predict the performance outcomes of four key weightlifting exercises: clean & jerk, snatch, back squat and deadlift. Our analysis will emphasize the utilization of R-squared (R2) values and Mean Squared Error (MSE) as key performance metrics for each exercise.
The R2 and MSE values obtained with the RF and MLR approaches are reported in Table 1. For the deadlift, RF yielded the better result on the test data. Specifically, an R2 value of 0.80 indicates that the model explains approximately 80% of the variability in deadlift scores. In essence, the model accounts for the majority of fluctuations in deadlift performance using the selected predictor variables, which supports its suitability for estimating deadlift performance.
In contrast, for clean & jerk, snatch, and back squat, the MLR model exhibited a slightly higher R2 value, most notably for clean & jerk. Here, an R2 value of 0.93 indicates that the model explains 93% of the variability in clean & jerk performance, the highest value among the four exercises examined.
In summation, our research underscores the formidable predictive capabilities of both MLR and RF models within the contours of our dataset, thoughtfully leveraging 11 pertinent features, as elucidated previously.
3.2. Feature Importances
Table 2 summarizes the ranked feature importance from RF, XGBoost, and AdaBoost, as well as their combination. Among a total of 11 features, the top three with the highest importance values are shown. These findings highlight the significance of various features in predicting performance across a range of exercises.
For clean & jerk, all three models indicated that snatch had the highest importance value, followed by back squat and deadlift. For snatch, clean & jerk showed the highest importance value at 0.889, which underscores the substantial impact of clean & jerk on the prediction of snatch performance. Clean & jerk also showed the highest importance value for back squat (0.498). For deadlift, back squat was identified as the primary predictor with the highest importance value of 0.605, followed by clean & jerk and gender.
Collectively, these findings emphasize the intricate interplay among individual exercises and their substantial influence on performance outcomes. They also highlight the importance of understanding how distinct exercises interact and influence one another within the context of performance prediction.
The feature importance bar graph plot based on RF, XGBoost and AdaBoost modeling is shown in
Figure 2. The features are sorted based on their importance. Among a total of 11 features, variables are displayed in the order of highest feature importance.
The RF and XGBoost results indicated that snatch possessed the highest importance value, followed by back squat, deadlift, gender, and grace. Likewise, the AdaBoost results indicated that snatch possessed the highest importance value. By averaging the values obtained from the three models, snatch exhibited the highest importance value, followed by back squat, deadlift, gender, and grace in clean & jerk (
Figure 2a–c).
In the case of snatch, clean & jerk consistently exhibited the highest importance values across all three models, registering values of 0.901, 0.920, and 0.848, signifying a robust correlation between these two exercises (
Figure 2d–f).
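As a quick arithmetic check, and assuming that the combined importance is the plain mean of the three per-model values (the text describes averaging but does not give the exact formula), the three values reported for clean & jerk in the snatch model can be averaged as follows.

```python
from statistics import mean

# Importance of clean & jerk for predicting snatch in the three models, as reported above.
per_model = [0.901, 0.920, 0.848]

# Assumption: the combined importance is the unweighted mean of the three values.
combined = mean(per_model)
```

The mean is about 0.890, which agrees with the combined value of 0.889 reported for snatch to within the rounding of the published per-model values.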
In the back squat, feature importance values were notably lower than in the Olympic weightlifting exercises (clean & jerk and snatch). Moreover, the differences between the highest and second-highest values were marginal. The RF results indicated that clean & jerk had the highest importance value and, unlike the other models, that weight had the third highest. However, the results of the other two models indicate that snatch has the third-highest importance value (
Figure 2g–i).
In the deadlift, the back squat showed the highest importance value in all three models, followed by grace, clean & jerk, and weight in RF, and clean & jerk and gender in the other two models (Figure 2j–l). Overall, although there were 11 different features, all three models showed a similar pattern of feature importance in the four exercises.
To summarize, the CF prediction model, utilizing both RF and MLR, demonstrated outstanding performance across all four weightlifting exercises. Additionally, the significance of each variable was thoroughly confirmed through feature importance analysis employing RF, XGBoost and AdaBoost models.
4. Discussion
This study aimed to address the small sample sizes of previous CF performance research by conducting a comprehensive analysis of CF performance data using machine learning techniques. The importance of this research extends to its potential to offer fresh insights and implications for enhancing training thresholds, developing injury prevention strategies, and optimizing overall CF performance. Our dataset comprises over 400,000 records, further enriching the depth and breadth of our analysis.
Our study employed RF and MLR models to predict performance in four CF weightlifting exercises: clean & jerk, snatch, back squat, and deadlift. Notably, clean & jerk emerged as a consistent predictor across these exercises. Furthermore, the feature importance analysis consistently unveiled specific patterns among different models for each exercise, affirming the relevance of particular features in predicting performance across the weightlifting exercises.
In this study, we employed two widely recognized predictive models, RF and MLR, commonly utilized in the field of sports analytics for their efficacy in predictive modeling [
44,
45]. Our evaluation, utilizing R2 values and MSE as performance metrics for each exercise, aligns with prior studies in sports analytics [
46,
47], indicating the widespread pursuit of effective predictive models in various athletic disciplines [
48,
49]. While our study focuses on weightlifting exercises within CF, it’s important to recognize the broader context of predictive modeling in sports. For instance, RF demonstrated remarkable predictive capabilities in deadlift performance (R2 = 0.80), in line with findings from previous sports analytics studies emphasizing RF’s proficiency in predictive modeling [
50]. Conversely, our analysis revealed that clean & jerk, snatch, and back squat performance prediction benefited from the MLR, with particularly noteworthy results in clean & jerk (R2 = 0.93), signifying an exceptionally high level of accuracy and reliability in predicting this exercise. Notably, an R2 value of 0.93 implies that our model can explain approximately 93% of the variability in clean & jerk performance, underlining its robust predictive capabilities. A study in sports analytics also underscored the significance of the MSE metric, reinforcing the importance of appropriate metrics and models for predictive analysis [
51]. These parallels across different sports highlight the broader relevance of our research in the realm of sports analytics. Notably, our study represents the first attempt at performance prediction using CF big data.
In addition to achieving highly accurate predictions for these performance outcomes, it is imperative to address questions like "Why did the model make this weightlifting exercise prediction?" Thus, the need arises for providing high-quality, interpretable, and comprehensible explanations for a model’s predictions in sports, utilizing feature importance. Indeed, feature importance plays a pivotal role in delivering detailed explanations of machine learning models in sports [
52,
53].
Our feature importance analysis, conducted through RF, XGBoost, and AdaBoost models, provides valuable insights into the determinants of performance across various exercises. Our results consistently highlight clean & jerk as a primary predictor across exercises, aligning with previous findings that reported significant correlations between back barbell squats and clean and jerk one-repetition maximum (1 RM) (r2 = 0.83, 0.84, and 0.83, respectively) [
54,
55]. Additionally, another study observes very strong relationships between maximum strength measured by 1 RM back squats and competitive weightlifting performances [
56]. These findings align with our results and collectively emphasize the significance of specific weightlifting exercises in predicting overall performance.
For clean & jerk, all three models consistently rank snatch as the feature with the highest importance value, followed by back squat. This points to a strong relationship between snatch and clean & jerk, consistent with previous research on the connection between power, speed, and agility in Olympic weightlifting, which suggests shared fundamental mechanisms [
57].
Conversely, in the context of back squat, the disparity between the highest and second-highest values is marginal. While RF ranks clean & jerk highest, other models place snatch in this position.
In the case of deadlift, back squat emerged as the feature with the highest importance value, while gender was notably ranked as the third in importance. This observation aligns with previous research that demonstrated significant differences between genders in terms of deadlift power and work output [
58,
59]. Notably, men exhibited greater average power and performed more work compared to women, corroborating the significance of gender in predicting deadlift performance.
Additionally, unlike clean & jerk and snatch, where a strong correlation was evident, snatch did not hold a high importance value in predicting deadlift performance. This trend is consistent with previous studies that reported no significant correlations between snatch or clean and jerk performance and deadlift performance [
60,
61]. Instead, the variance in snatch and clean and jerk performance was better explained by other factors such as overhead press and front squat performance. However, it’s important to note that speed, agility, and power performance, which are crucial in CF, rely on a combination of factors beyond foundational muscular strength abilities alone [
62].
Overall, our feature importance analysis has illuminated the critical determinants of performance across diverse weightlifting exercises in CF. These insights significantly enhance our comprehension of how different exercises interact and collectively influence overall performance within the CF domain.
This study has several limitations. Firstly, self-entered subjective data in our dataset could introduce bias and inconsistency. Another limitation is the omission of external factors such as dietary habits, psychological factors, injuries and individual training regimens, which could influence CF performance but were not included in our dataset. These limitations may have influenced the results, introducing uncertainty and potential inaccuracies into our predictive models. Future research could benefit from more rigorous data collection methods, including validation of self-entered data and regular updates.
The insights gained from identifying the most crucial features for each exercise have practical applications for CF practitioners, coaches, and researchers. With this knowledge, practitioners can tailor their training programs to focus on the specific aspects that matter most for each exercise. Coaches can design more effective and personalized training regimens, optimizing their athletes’ performance potential. Moreover, understanding these key features can help in injury prevention by highlighting areas that may be more susceptible to strain or overuse.
For example, our study revealed that clean & jerk consistently played a pivotal role across different exercises. CF athletes and coaches can use this information to emphasize clean & jerk in their training routines, recognizing its significance in overall performance. Similarly, the prioritization of back squat in deadlift performance underscores the importance of lower body strength, which can guide targeted strength training exercises.
In conclusion, this research not only advances our understanding of performance prediction in CF through machine learning and data-driven techniques but also provides actionable insights for practitioners actively involved in the sport. By identifying the critical features that drive performance, individuals can fine-tune their training strategies, harnessing the power of data-driven decision-making to achieve enhanced performance outcomes while minimizing the risk of injuries. This study exemplifies the synergy between sports, technology, and data analysis, highlighting the potential for further advancements in the field of sports analytics.