3.1. Data Collection and Preprocessing
This study used NBA game log data from the 2024-25 season, collected as of January 5, 2025. Multiple datasets were collected through an API, including game-level information, individual player performance metrics, player metadata, and seasonal statistics. The study focused on current-season data to provide timely insights, because NBA games are fast-changing and highly time-sensitive.
Data were collected from stats.nba.com using an API. The game information included the result of the game (win vs. loss), which was used as the response variable in this study. The player performance metrics were collected for each game and included field goal percentage, free throw percentage, rebounds, assists, steals, blocks, turnovers, and personal fouls. An initial correlation analysis was conducted to manually remove highly correlated variables; this manual feature selection helped improve model interpretability. For example, total points and total minutes played in a game were excluded because key players typically play longer and score more than bench players, and this pattern is already reflected in other metrics such as field goal percentage and free throw percentage. The player metadata included players' heights and years in the league. Weights were excluded because they were highly correlated with height. Seasonal player data had two parts: the previous season and the current season. The previous-season data included overall field goal percentage, free throw percentage, rebounds, assists, steals, blocks, turnovers, personal fouls, and whether the player was traded in the previous season. The current-season data included the player's age, the player's team, and whether the player was traded in the current season.
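The correlation screening described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the column names and values are hypothetical stand-ins for the real API fields, and the 0.8 threshold is chosen for illustration only.

```python
import pandas as pd

# Hypothetical player-game stats; column names and values are illustrative,
# not the actual stats.nba.com field names.
stats = pd.DataFrame({
    "PTS": [30, 12, 25, 8, 18],
    "MIN": [38, 15, 36, 10, 28],
    "FG_PCT": [0.52, 0.40, 0.48, 0.33, 0.45],
    "AST": [8, 2, 6, 1, 4],
})

# Flag any pair of features whose absolute Pearson correlation exceeds
# a screening threshold (0.8 here, an assumed value for illustration).
corr = stats.corr().abs()
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.8
]
```

Pairs flagged this way (such as points and minutes in this toy data) are candidates for manual removal before modeling.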
Pre-processing included handling missing data, removing outliers, and encoding categorical features to reduce noise in the data. Missing values for previous-season statistics, including steals (STL_LAST_SEASON) and turnovers (TOV_LAST_SEASON), were set to 0; these typically corresponded to new or injured players who did not play in the previous season. This step avoided errors during model training. The traded indicator for the previous season (TRADED_LAST_SEASON) was likewise set to 0 when the player was not traded. The total minutes each player played in each game was calculated from the original minutes-and-seconds variable (for example, 35:14), which was converted to seconds and then to total minutes by dividing by 60. Players who played less than 1 minute in a game were excluded because their scores were unlikely to be relevant to overall game performance. Categorical variables such as team names, game identifiers, and player identifiers were encoded for use in the machine learning models.
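The missing-value fills, minutes parsing, and low-minutes filter above could be implemented along these lines. The column names mirror those mentioned in the text, but the data values are invented for illustration.

```python
import pandas as pd

# Illustrative player-game rows; "MIN" mimics the "minutes:seconds"
# string format described above.
df = pd.DataFrame({
    "PLAYER_ID": [1, 2, 3],
    "MIN": ["35:14", "0:42", "12:30"],
    "STL_LAST_SEASON": [1.2, None, 0.8],
    "TOV_LAST_SEASON": [None, None, 2.1],
})

# Missing previous-season stats (e.g. rookies, injured players) -> 0.
df[["STL_LAST_SEASON", "TOV_LAST_SEASON"]] = (
    df[["STL_LAST_SEASON", "TOV_LAST_SEASON"]].fillna(0)
)

# Convert "MM:SS" to total minutes: minutes plus seconds / 60.
mins = df["MIN"].str.split(":", expand=True).astype(int)
df["MIN_TOTAL"] = mins[0] + mins[1] / 60

# Drop players who were on the court for under one minute.
df = df[df["MIN_TOTAL"] >= 1]
```

In this toy data the second row (42 seconds of playing time) is dropped by the final filter.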
Feature engineering was conducted by creating new features, including height in inches, an overtime indicator, rest days, a home game indicator, and the preceding game win streak, to provide additional context to the model and improve its predictive power.
Appendix A shows the variables used in the final model. A correlation analysis was conducted to examine the relationships among the final variables before the model-building stage.
The traded indicators for the 2024-25 and 2023-24 seasons were created to capture whether a player was traded, which was expected to reflect the impact of trades on the team in the model. Career stats were pulled for each player; a team name of 'TOT' (Total) indicated that the player had been traded mid-season, with stats totalled across the two teams (the previous team and the new team).
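One way to derive such a traded flag from 'TOT' rows is sketched below; the column names and player identifiers are hypothetical.

```python
import pandas as pd

# Season rows per player; a "TOT" team row means the player was traded
# mid-season (stats totalled across teams). Data are illustrative.
season = pd.DataFrame({
    "PLAYER_ID": [10, 10, 10, 11],
    "TEAM": ["TOT", "NYK", "BOS", "LAL"],
})

# Any player with a "TOT" row is flagged as traded that season.
traded_ids = set(season.loc[season["TEAM"] == "TOT", "PLAYER_ID"])
season["TRADED"] = season["PLAYER_ID"].isin(traded_ids).astype(int)
```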
The heights were recorded as feet and inches. Thus, the variable was converted to inches to be used in the machine learning models.
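Assuming heights arrive as "feet-inches" strings (a common convention in NBA data), the conversion is a one-line computation:

```python
# Convert a "feet-inches" height string (e.g. "6-7") to total inches.
def height_to_inches(height: str) -> int:
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)
```

For example, a listed height of "6-7" becomes 6 × 12 + 7 = 79 inches.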
The overtime indicator was created to examine whether overtime games are harder to predict due to narrow scoring margins. Any game in which a team's summed player minutes exceeded 240 was encoded as 1: with 5 players on the court and four 12-minute quarters, each team accumulates exactly 5 × 48 = 240 player-minutes in regulation, so a larger total implies overtime.
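The 240 player-minute threshold can be applied per team and game as follows. For brevity this toy example uses only three players per team rather than a full roster; the column names are illustrative.

```python
import pandas as pd

# Player minutes per game; in regulation each team accumulates
# 5 players * 48 minutes = 240 player-minutes. Data are illustrative.
rows = pd.DataFrame({
    "GAME_ID": ["G1"] * 3 + ["G2"] * 3,
    "TEAM": ["NYK"] * 6,
    "MIN_TOTAL": [90, 85, 65, 95, 90, 70],  # G2 sums to 255 -> overtime
})

# Sum minutes per (game, team); anything above 240 is flagged as overtime.
team_minutes = rows.groupby(["GAME_ID", "TEAM"])["MIN_TOTAL"].sum()
overtime = (team_minutes > 240).astype(int)
```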
The rest days counter was used to estimate team fatigue. It was less precise than wearable data, but it was the best available proxy, assuming all other factors such as training routines are equal. First, the game log was sorted by game date for each team. Then, the difference in days between consecutive games was computed. The first game for each team had no preceding game to compare against, so its rest days were set to 0.
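These steps map directly onto a grouped date difference; the schedule below is invented for illustration.

```python
import pandas as pd

# Illustrative game log for one team.
games = pd.DataFrame({
    "TEAM": ["NYK", "NYK", "NYK"],
    "GAME_DATE": pd.to_datetime(["2024-10-22", "2024-10-25", "2024-10-26"]),
})

# Sort chronologically per team, take the day gap to the previous game,
# and set each team's season opener (no prior game) to 0.
games = games.sort_values(["TEAM", "GAME_DATE"])
games["REST_DAYS"] = (
    games.groupby("TEAM")["GAME_DATE"].diff().dt.days.fillna(0).astype(int)
)
```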
The home game indicator was created to capture whether teams benefit from familiarity with their home stadium and how this affects game results. In the game-related data, the matchup variable showed whether the game was a home game (NYK vs. BOS) or an away game (NYK @ BOS, meaning the New York team at Boston's stadium). Thus, if the variable contained 'vs.' the game was encoded as a home game (1), otherwise as an away game (0).
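The rule described above amounts to a substring check on the matchup string; the column name here is an assumed stand-in for the actual field.

```python
import pandas as pd

# Illustrative matchup strings in the format described above.
games = pd.DataFrame({
    "MATCHUP": ["NYK vs. BOS", "NYK @ BOS"],
})

# "vs." marks a home game (1); "@" marks an away game (0).
games["HOME"] = games["MATCHUP"].str.contains("vs.", regex=False).astype(int)
```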
The preceding game win streak was created to examine whether consistent recent wins affect the current game's outcome. First, the data were sorted by team identifier and game date to obtain chronological order. For each team, a counter tracked consecutive wins: a 'W' incremented the counter by 1, and an 'L' reset it to 0. This counter was appended as a new column for each team and game. To obtain the preceding-game win streak, the counter was then shifted down by one game within each team; the first game of the season was set to 0 because it had no previous result.
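The counter-and-shift procedure can be sketched as follows, using an invented five-game schedule for one team.

```python
import pandas as pd

# Illustrative results for one team in chronological order.
games = pd.DataFrame({
    "TEAM": ["NYK"] * 5,
    "GAME_DATE": pd.to_datetime(
        ["2024-10-22", "2024-10-24", "2024-10-26", "2024-10-28", "2024-10-30"]
    ),
    "WL": ["W", "W", "L", "W", "W"],
})
games = games.sort_values(["TEAM", "GAME_DATE"]).reset_index(drop=True)

def running_streak(results):
    # Consecutive-win counter: +1 on a win, reset to 0 on a loss.
    streaks, count = [], 0
    for r in results:
        count = count + 1 if r == "W" else 0
        streaks.append(count)
    return streaks

games["WIN_STREAK"] = games.groupby("TEAM")["WL"].transform(running_streak)

# Shift by one game so each row carries the streak *entering* that game;
# season openers have no prior result and get 0.
games["PREV_WIN_STREAK"] = (
    games.groupby("TEAM")["WIN_STREAK"].shift(1).fillna(0).astype(int)
)
```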
The game identifier and player identifier were encoded as model inputs to ensure each record represented one player in one game, which helped prevent data leakage. This helped the model link player performance, game conditions, and results during training. In testing and validation, the model could then recognize similar patterns in previously unseen players and games.
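One simple way to encode such identifiers is integer category codes, as sketched below; the identifier values are hypothetical, and the specific encoding scheme used in the study is not stated in the text.

```python
import pandas as pd

# Illustrative records: two games, with one player appearing in both.
df = pd.DataFrame({
    "GAME_ID": ["0022400001", "0022400001", "0022400002"],
    "PLAYER_ID": [201939, 1629029, 201939],
})

# Map each identifier to a stable integer code; the specific integers are
# arbitrary, the point is a consistent numeric input per game/player.
for col in ["GAME_ID", "PLAYER_ID"]:
    df[col + "_CODE"] = df[col].astype("category").cat.codes
```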
The data were split into training, validation, and test sets for the final model. The training set included all teams except the Boston Celtics and New York Knicks; the validation set included only Boston Celtics player and game data; and the test set included New York Knicks player and game data. This split was designed to evaluate the model's ability to generalize across different teams. As a result, the proportions for the training, validation, and test sets were approximately 93%, 3%, and 3%, respectively.
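The team-based split holds out whole teams rather than random rows; a minimal sketch, with invented records:

```python
import pandas as pd

# Illustrative player-game records with team labels.
df = pd.DataFrame({
    "TEAM": ["BOS", "NYK", "LAL", "MIA", "BOS", "CHI"],
    "WIN": [1, 0, 1, 0, 1, 1],
})

# Celtics rows -> validation, Knicks rows -> test, all other teams -> training.
val = df[df["TEAM"] == "BOS"]
test = df[df["TEAM"] == "NYK"]
train = df[~df["TEAM"].isin(["BOS", "NYK"])]
```

Because the held-out teams never appear in training, validation and test performance measures generalization to entirely unseen teams.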
The team names were excluded from the model inputs because the model tended to prioritize higher-performing teams such as the Cleveland Cavaliers and Oklahoma City Thunder. The model was overfitting to the performance patterns of these teams in the training set, which made it less generalizable to the validation and test sets. Excluding team names also made the model less dependent on team strategies, which were difficult to measure with the available data. Thus, the model was set up to learn general patterns from the overall player statistics and game-related data without team indicators, and to predict game outcomes for specific teams.