Hackathon: European Championship Prediction
It had been a while, but we were finally allowed back to the office for a traditional hackathon. After the Dutch team (and therefore the entire nation) had suffered a devastating loss at the European Championship (EC), there was only one way to find solace: by beating our colleagues in a competition. The goal of this competition: make the most accurate prediction of EC match outcomes using machine learning models.
Gathering data and feature engineering
The evening started with a nice dinner while sharing our personal predictions for the tournament, but we quickly moved on to discussing the data. The group was split into teams with varying backgrounds and experience. Each team was given a historical record of international football matches, including which nations played and their final scores. The first objective was to predict a win, draw, or loss for the home team for all matches in the group stage of the current European Championship, regardless of the number of goals scored. The second objective was to predict the course of the knock-out phase of the EC.
With the compute clusters warmed up and all hands on the keyboards, the teams got to work enriching the data. Everyone received a working Python notebook, which included example code that handled the initial data loading and generated the format for our final results. We then combined that data with records of FIFA world rankings, player lineups, and other match and team statistics. In addition, wherever there is money to be made, people put real effort into getting their predictions right. Hence, we also scraped a historical record of bookmakers' betting odds from the web. Feature engineering mostly focused on calculating differences in past performance between the two teams in a match.
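To make the idea of difference features concrete, here is a minimal sketch in plain Python. It is not the notebook code used during the hackathon: the function names, the match records, the ranking positions, and the odds are all hypothetical and made up for illustration. Rescaling the inverse decimal odds so they sum to 1 is one common way to turn odds into probabilities; the original teams may have used the odds differently.

```python
from collections import defaultdict

# Hypothetical match records: (home, away, home_goals, away_goals).
# All values are invented purely for illustration.
history = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("C", "A", 0, 3),
    ("B", "A", 2, 1),
]


def win_rates(matches):
    """Return each team's share of past matches won."""
    played = defaultdict(int)
    won = defaultdict(int)
    for home, away, home_goals, away_goals in matches:
        played[home] += 1
        played[away] += 1
        if home_goals > away_goals:
            won[home] += 1
        elif away_goals > home_goals:
            won[away] += 1
    return {team: won[team] / played[team] for team in played}


def implied_probabilities(odds_home, odds_draw, odds_away):
    """Turn decimal odds into probabilities that sum to 1.

    The raw inverses typically sum to more than 1 because of the
    bookmaker's margin, so they are rescaled by their total.
    """
    raw = [1 / odds_home, 1 / odds_draw, 1 / odds_away]
    total = sum(raw)
    return [p / total for p in raw]


def match_features(home, away, rates, ranking, odds):
    """Build difference features for one fixture (home minus away)."""
    p_home, p_draw, p_away = implied_probabilities(*odds)
    return {
        "win_rate_diff": rates[home] - rates[away],
        "ranking_diff": ranking[home] - ranking[away],
        "odds_p_home": p_home,
        "odds_p_draw": p_draw,
        "odds_p_away": p_away,
    }


rates = win_rates(history)
ranking = {"A": 5, "B": 12, "C": 30}  # hypothetical ranking positions
features = match_features("A", "B", rates, ranking, odds=(1.8, 3.6, 4.5))
print(features)
```

Expressing every feature as "home minus away" keeps one row per match and lets a model learn from the gap between the two teams rather than from their absolute values.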
The predictions
The teams tried different types of machine learning models, including logistic regression, XGBoost, random forests, and the recently introduced Databricks AutoML. The best model for predicting the group stage outcomes was a logistic regression model, achieving 50% accuracy. With three possible outcomes (win, draw, or loss), uniform random guessing would be right about one time in three, so the model did better than chance, though not by a wide margin. The modest score is partly explained by the many unexpected results in the current EC, which are fundamentally difficult to predict.
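To put an accuracy figure like this in context, it helps to compare it with a naive baseline. The sketch below is an illustration only: the labels are invented, the helper names are hypothetical, and the majority-class baseline is our own choice of reference point, not something that was part of the hackathon scoring.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of matches where the predicted outcome equals the actual one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most frequent training outcome for each of n matches."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n


# Invented labels, from the home team's perspective.
y_train = ["win", "win", "loss", "draw", "win", "loss"]
y_true = ["win", "draw", "loss", "win", "draw", "loss"]
y_model = ["win", "win", "loss", "loss", "win", "loss"]

model_accuracy = accuracy(y_true, y_model)
baseline_accuracy = accuracy(y_true, majority_baseline(y_train, len(y_true)))
print(model_accuracy, baseline_accuracy)
```

A model only adds value if it clearly beats such a baseline on matches it has not seen during training.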
Regardless of the model type, we found that draws were especially difficult to predict. Moreover, when we inspected the feature importances reported by the models, many features turned out to have little predictive power. The best predictors were the average win rate of a team and the difference in FIFA ranking between the two teams in a matchup.
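One way to see how hard draws are is to look at accuracy per actual outcome instead of a single overall number. The following is a small sketch with invented labels and a hypothetical helper name, not the evaluation code from the evening.

```python
def recall_per_outcome(y_true, y_pred, outcomes=("win", "draw", "loss")):
    """For each actual outcome, the share of those matches predicted correctly."""
    result = {}
    for outcome in outcomes:
        indices = [i for i, actual in enumerate(y_true) if actual == outcome]
        if not indices:
            result[outcome] = None  # this outcome never occurred
            continue
        hits = sum(y_pred[i] == outcome for i in indices)
        result[outcome] = hits / len(indices)
    return result


# Invented labels, from the home team's perspective.
y_true = ["win", "draw", "loss", "win", "draw", "loss"]
y_pred = ["win", "win", "loss", "loss", "win", "loss"]

recalls = recall_per_outcome(y_true, y_pred)
print(recalls)
```

In this toy example the model never predicts a draw, so the recall for draws is zero even though the overall accuracy looks acceptable.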
As for the second objective, the winning team of the evening predicted that England would go on to win the championship. However, as many teams have learned during this tournament: past success does not guarantee anything.
And this was proven in the final: even though England made it to the last round, Italy turned out stronger and took the cup back to Rome. We also entered the predictions of our data-driven teams (APP & YPP) in the Itility EC prediction pool, and they both ended up in the top 5 (out of 42). In the end, Itilian Jan Kuijken had the most accurate prediction. The data-driven teams would be smart to contact him to see how they can improve for the upcoming World Cup in 2022.