Motivation
Using statistics to predict the outcomes of sporting matches has long been popular with everyone from fans to punters. Machine learning lets us examine large data sets and identify patterns in them.
With the 2018 FIFA World Cup now into the Round of 16, after some surprises in the pool games, we can try to predict the outcome of the finals and see who is most likely to win.
Football Data
Sports analytics has historically been most active in baseball, as the sheer number of metrics collected shows (just look at how many are listed here!). Baseball is simpler to record and analyse given the nature of the game: it is a striking/fielding game made up of isolated pitch-and-bat events. Football, however, is classed as an invasion game, which makes it a lot more difficult to analyse because attack and defence interact continuously and chaotically. The amount of data published for football has been fairly limited, typically including things like team line-ups, goals scored and penalties given.
Data Sets
Fortunately, data sets have been made available:
- Set of almost 40,000 match outcomes, from the first official match in 1872 all the way up to 2018: International results
- FIFA soccer rankings: FIFA world rankings (scraped using fifa_ranking)
- Football team Elo ratings: eloratings
- Details of the squads for all 32 teams participating in the 2018 World Cup: 2018-fifa-world-cup-squads
- FIFA World Cup match data: fifa-world-cup
Team Ratings
There are two primary sources of team ratings: FIFA's official ranking and the Elo rating. The Elo rating system was created by Arpad Elo, a Hungarian-American professor, to rate chess players (and players of other two-player zero-sum games). It has since been adapted for football teams by incorporating additional data such as goal margin and match importance. FIFA uses a modified version of the Elo system to rank women’s international teams. The men’s international teams use a different system; however, FIFA announced on the 10th of June that it is switching to an Elo-based system. These ratings are more of a ranking device than a predictor (they order teams by their strength relative to one another).
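To make the idea of an Elo rating adjusted for goal margin and match importance concrete, here is a minimal sketch of a single rating update. The function names are our own, and the specific numbers (a match-importance weight `k` of 60, and the goal-margin multipliers) are assumptions modelled on commonly published football Elo schemes; they are not taken from the data sets above.

```python
def expected_score(rating_a, rating_b, home_advantage=0.0):
    """Expected result for team A (between 0 and 1) under the Elo model."""
    diff = rating_a + home_advantage - rating_b
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


def goal_margin_multiplier(goal_diff):
    """Scale the update for bigger wins (assumed multipliers, for illustration)."""
    margin = abs(goal_diff)
    if margin <= 1:
        return 1.0
    if margin == 2:
        return 1.5
    return (11 + margin) / 8.0


def update_elo(rating_a, rating_b, goals_a, goals_b, k=60.0, home_advantage=0.0):
    """Return the new (rating_a, rating_b) after one match.

    k is the match-importance weight; 60 is an assumed value for a
    World Cup finals match, with friendlies typically weighted lower.
    """
    if goals_a > goals_b:
        actual = 1.0
    elif goals_a == goals_b:
        actual = 0.5
    else:
        actual = 0.0
    expected = expected_score(rating_a, rating_b, home_advantage)
    delta = k * goal_margin_multiplier(goals_a - goals_b) * (actual - expected)
    # Elo is zero-sum: whatever one team gains, the other loses.
    return rating_a + delta, rating_b - delta
```

Under these assumptions, two evenly rated teams exchange 30 points after a one-goal win, and more after a heavier defeat, which is how the rating rewards both the result and its margin.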
Exploring the data
We load some of these data sets into Power BI to explore them. Click the full-screen arrows to maximise the report and navigate the data.
We can then try to load some of this into a model, to see what we think the results might be of the upcoming finals.
Predicting the Finals
A really good article appeared in the Economist a few weeks ago, highlighting the difficulty of predicting football outcomes. Most models created before the World Cup started predicted that Germany would be the overall winner. Unfortunately, chaos intervened and Germany was knocked out in the pool round (the first time in 80 years!). Likewise, no model could take into account things like the Spanish team's manager being fired early on, or the match fitness of Egypt's star player Mo Salah.
With the pool games over, though, we know who has made it through to the Round of 16.
Team | Place |
Uruguay | Group A winner |
Russia | Group A runner-up |
Spain | Group B winner |
Portugal | Group B runner-up |
France | Group C winner |
Denmark | Group C runner-up |
Croatia | Group D winner |
Argentina | Group D runner-up |
Brazil | Group E winner |
Switzerland | Group E runner-up |
Sweden | Group F winner |
Mexico | Group F runner-up |
Belgium | Group G winner |
England | Group G runner-up |
Colombia | Group H winner |
Japan | Group H runner-up |
Using open source Python libraries, we can train a simple logistic regression model on the prior data, encompassing past performance and FIFA rating data as well as performance to date at this year's World Cup. This gives us an estimated probability of each team winning its Round of 16 match, with which we can simulate the rest of the bracket and determine who might be the winner. We can then present this data using Microsoft Power BI:
As we can see, the model has Brazil as the most likely winner.
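The approach described above (fit a logistic model, then simulate the knockout bracket) can be sketched roughly as follows. This is not the model behind the report: it uses a single hypothetical "strength difference" feature, fits the weight with plain gradient descent instead of a library call, and runs on made-up training pairs and team strengths purely to show the mechanics.

```python
import math
import random
from collections import Counter


def win_probability(strength_diff, weight):
    """Logistic model: P(team A beats team B) given A's strength minus B's."""
    return 1.0 / (1.0 + math.exp(-weight * strength_diff))


def fit_logistic(samples, learning_rate=0.1, epochs=2000):
    """Fit the single weight by batch gradient descent on the log-loss.

    samples: list of (strength_diff, outcome), outcome 1 if A won, else 0.
    """
    weight = 0.0
    for _ in range(epochs):
        gradient = 0.0
        for diff, outcome in samples:
            gradient += (win_probability(diff, weight) - outcome) * diff
        weight -= learning_rate * gradient / len(samples)
    return weight


def simulate_bracket(bracket, strengths, weight, n_sims=10000, seed=0):
    """Monte Carlo simulation of a single-elimination bracket.

    bracket: team names in fixture order (length a power of two);
    adjacent teams meet, and adjacent winners meet in the next round.
    Returns each team's share of simulated tournaments won.
    """
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(n_sims):
        remaining = list(bracket)
        while len(remaining) > 1:
            winners = []
            for a, b in zip(remaining[0::2], remaining[1::2]):
                p_a = win_probability(strengths[a] - strengths[b], weight)
                winners.append(a if rng.random() < p_a else b)
            remaining = winners
        titles[remaining[0]] += 1
    return {team: titles[team] / n_sims for team in bracket}


# Made-up training pairs and team strengths, for illustration only.
toy_samples = [(1.0, 1), (0.5, 1), (0.2, 0), (-0.3, 0),
               (-0.8, 0), (0.9, 1), (-0.4, 1), (0.6, 1)]
weight = fit_logistic(toy_samples)
toy_strengths = {"Team A": 1.2, "Team B": 0.4, "Team C": 0.9, "Team D": 0.1}
title_odds = simulate_bracket(list(toy_strengths), toy_strengths, weight)
```

Simulating the bracket many times, instead of just advancing the favourite in every match, gives each team a share of tournament wins that reflects how hard its path through the draw is.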
Further Information
This is just an example of using data to explore real world problems. Feel free to contact us should you want to discuss how we can help you explore your data.
Further reading on football prediction
If it is of interest, the following is some good reading on football prediction:
- Pettersson, D., & Nyquist, R. (2017). Football Match Prediction using Deep Learning. Department of Electrical Engineering, Chalmers University of Technology.
- Constantinou, A. C., Fenton, N. E., & Neil, M. (2012). pi-football: A Bayesian network model for forecasting Association Football match outcomes. Knowledge-Based Systems, 36, 322–339.
- Arabzad, S. M., Tayebi Araghi, M. E., Sadi-Nezhad, S., & Ghofrani, N. (2014). Football Match Results Prediction Using Artificial Neural Networks; The Case of Iran Pro League. Journal of Applied Research on Industrial Engineering, 1(3), 159–179.
- Leung, C. K., & Joseph, K. W. (2014). Sports Data Mining: Predicting Results for the College Football Games. Procedia Computer Science, 35, 710–719.
- Carpita, M., Sandri, M., Simonetto, A., & Zuccolotto, P. (2014). Chapter 14 - Football Mining with R. In Y. Zhao & Y. Cen (Eds.), Data Mining Applications with R (pp. 397–433). Boston: Academic Press. https://doi.org/10.1016/B978-0-12-411511-8.00015-3