Data preparation
The team members responsible for model building all started by cleaning their dataset. Most teams removed non-predictive variables such as game_id and shot_id. One team pruned so aggressively that they ran their model with only one variable: action type. While that variable did have the most predictive value (one team estimated it at 31%), by itself it wasn’t sufficient to accurately predict the success of a shot.
Analyzing variables such as shots by action type and shots by opponent
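To illustrate what a model built on action type alone amounts to, here is a minimal sketch in Python. It is not the team's actual code: the function names, the category labels and the toy data are all made up for illustration. It simply predicts, for each shot, the historical success rate of its action type.

```python
from collections import defaultdict


def fit_success_rates(action_types, shot_made):
    """Return the observed success rate per action type and the overall rate."""
    attempts = defaultdict(int)
    makes = defaultdict(int)
    for action, made in zip(action_types, shot_made):
        attempts[action] += 1
        makes[action] += made
    rates = {action: makes[action] / attempts[action] for action in attempts}
    overall = sum(shot_made) / len(shot_made)
    return rates, overall


def predict(action_type, rates, overall):
    """Predicted probability of a made shot; unseen action types get the overall rate."""
    return rates.get(action_type, overall)


# Hypothetical toy data, not taken from the competition dataset.
action_types = ["Dunk", "Dunk", "Jump Shot", "Jump Shot", "Jump Shot", "Jump Shot"]
shot_made = [1, 1, 1, 0, 0, 0]

rates, overall = fit_success_rates(action_types, shot_made)
```

Every shot of the same action type receives the same probability, which is one way to see why a single variable, however informative, cannot separate an easy jump shot from a hard one.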
Others tried to expand their dataset by creating derived variables, such as 'games played in the last 14 days'. At the same time, they reduced the number of categories by aggregation. Unfortunately, this approach proved too ambitious for the short time available.
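A derived variable like 'games played in the last 14 days' could be computed along the following lines. This is only a sketch under assumptions: the function name is hypothetical, the input is assumed to be a list of game dates, and the current game is assumed not to count toward its own total.

```python
from datetime import date, timedelta


def games_in_last_n_days(game_dates, n_days=14):
    """Map each game date to the number of earlier games in the previous n_days.

    Assumption: the game played on the date itself is not counted.
    """
    unique_dates = sorted(set(game_dates))
    counts = {}
    start = 0  # index of the oldest game still inside the window
    for i, current in enumerate(unique_dates):
        window_start = current - timedelta(days=n_days)
        while unique_dates[start] < window_start:
            start += 1
        counts[current] = i - start
    return counts


# Hypothetical schedule, for illustration only.
schedule = [
    date(2016, 1, 1),
    date(2016, 1, 5),
    date(2016, 1, 10),
    date(2016, 1, 20),
    date(2016, 2, 10),
]
recent_games = games_in_last_n_days(schedule)
```

Each shot would then look up the count for its game date, giving a rough proxy for fatigue or rhythm.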
Another team took a very systematic approach, analyzing each variable’s relevance with the caret package in R. This approach resulted in a log loss score of 0.607, good for a very proud second place!
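For readers unfamiliar with the metric, log loss penalizes confident wrong predictions heavily, and lower is better. The sketch below is a plain Python version of the standard binary log loss formula. The function name and the clipping constant are illustrative choices, not the competition's scoring code.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Predicting 0.5 for every shot scores ln(2), whatever the outcomes are.
coin_flip_score = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

A model that predicts 0.5 for every shot scores ln 2, roughly 0.693, so that value is a useful reference point when reading the log loss scores in this post.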
Building a model
When it came to building the model, most teams relied on simple, effective solutions such as linear regression, XGBoost and/or random forest. Whichever approach was chosen, action type always stood out as the most relevant variable, with the dunk being the most successful type of shot.
The winning team achieved an impressive log loss score of 0.6, but many teams were right behind them, with scores differing only in the later decimal places. “If you don’t try, you always miss” was among the most impactful and relevant pieces of advice, and it doesn’t only apply to basketball!
