
Statistics, Simians, the Scottish, and Sizing up Soothsayers

A predictive model can be a parametrized mathematical formula or a complex deep learning network, but it can also be a talkative cab driver or a slides-wielding consultant. From a mathematical point of view, they are all trying to do the same thing, to predict what's going to happen, so they can all be evaluated in the same way. Let's look at how to do that by poking a bit into a soccer betting data set and evaluating it as if it were a statistical model we just fitted.

The most basic outcome you'll want to predict in soccer is whether a game goes to the home team, goes to the visitors (the away team), or ends in a draw. A predictive model is anything and anybody willing to give you a probability distribution over those outcomes. Betting markets, by giving you odds, are implicitly doing that: the higher the odds, the less likely they think the outcome is.

The Football-Data.co.uk data set we'll use contains results and odds from various soccer leagues for more than 37,000 games. We'll use the odds for the Pinnacle platform whenever available (those are closing odds, the last ones available before the game).

For example, for the Juventus-Fiorentina game on August 20, 2016, the odds offered were 1.51 for a Juventus win, 4.15 for a draw (ouch), and 8.61 for a Fiorentina victory (double ouch). Odds of 1.51 for Juventus mean that for each dollar you bet on Juventus, you'd get USD 1.51 if Juventus won (your initial bet included) and nothing if it didn't. These numbers aren't probabilities, but they imply probabilities. If platforms gave odds too high relative to the event's probability, they'd go broke; if they gave odds too low, they wouldn't be able to attract bettors. On balance, then, we can read from the odds probabilities slightly lower than the betting market's best guesses but, in a world with multiple competing platforms, not really that far from the mark. This sounds like a very indirect justification for using them as a predictive model, but every predictive model, no matter how abstract, rests on a lot of assumptions: a linear model assumes the relevant phenomenon is linear (almost never true, sometimes true enough), and looking at a betting market as a predictive model assumes the participants know what they are doing, the margins aren't too high, and there isn't anything too shady going on (not always true, sometimes true enough).

We can convert odds to probabilities by asking ourselves: if these odds were absolutely fair, how probable would the event have to be so that neither side of the bet could expect to earn anything? (a reasonable definition of "fair" here, with historical links to the earliest developments of the concept of probability). Calling P the probability and L the odds, we can write this condition as P \cdot L + (1-P) \cdot 0 = 1. The left side of the equation is how much you get on average — L when, with probability P, the event happens, and zero otherwise — and the right side says that on average you should get your dollar back, without winning or losing anything. From there it's obvious that P = \frac{1}{L}. For example, the odds above, if absolutely fair (which they never are, not completely, as people in the industry have to eat), would imply a probability of 66.2% for Juventus to win, and of 11.6% for Fiorentina (for the record, Juventus won, 2-1).
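To make the conversion concrete, here's a minimal sketch in Python; the function name is mine, and the optional margin-removal step is an extra I've added (the post only uses the raw 1/L conversion):

```python
def implied_probabilities(odds, remove_margin=False):
    """Convert decimal odds into the probabilities they imply (P = 1/L)."""
    raw = {outcome: 1.0 / o for outcome, o in odds.items()}
    if not remove_margin:
        return raw
    # Optionally rescale so the probabilities sum to 1, removing the
    # bookmaker's margin (an extra step, not described in the post).
    total = sum(raw.values())
    return {outcome: p / total for outcome, p in raw.items()}

juve_fiorentina = {"H": 1.51, "D": 4.15, "A": 8.61}
print(implied_probabilities(juve_fiorentina))
# {'H': 0.662..., 'D': 0.240..., 'A': 0.116...} -- they add up to slightly
# more than 1, and that excess is the platform's margin.
```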

In this way we can put information into the betting platform (actually, the participants do), and read out probabilities. That's all we need to use it as a predictive model, and there's in fact a small industry dedicated to building betting markets tailored to predict all sorts of events, like political outcomes; when built with this use in mind, they are called prediction or information markets. The question, as with any model, isn't if it's true or not — unlike statistical models, betting markets don't have any misleading aura of mathematical certainty — but rather how good those probabilities are.

One natural way of answering that question is to compare our model with another one. Is this fancy machine learning model better than the spreadsheet we already use? Is this consultant better than that other consultant? Is this cab driver better at predicting games than that analyst on TV? Language gets very confusing very quickly, so mathematical notation becomes necessary here. Using the standard notation P[x | y] for "how likely do I think it is that x will happen if y is true?", we can compare the cab driver and the TV analyst by calculating

 \frac{P[ \textrm{the game results we saw} | \textrm{the cab driver knows what she's talking about}]}{P[\textrm{the game results we saw} | \textrm{the TV analyst knows what he's talking about}]}

If that ratio is higher than one, this means of course that the cab driver is better at predicting games than the TV analyst, as she gave higher probabilities to the things that actually happened, and vice versa. This ratio is called the Bayes factor.

In our case, the factors are easy to calculate, as P[\textrm{home win} | \textrm{odds are good predictors}] is just \textrm{the probability of a home win as implied by the odds}, which we already know how to calculate. And because the probability of a sequence of independent events is the product of the individual probabilities, then

P[\textrm{any sequence of game results}|\textrm{odds are good predictors}] = \prod P[\textrm{probability of each result as implied by the odds}]

In reality, those results aren't independent, but we're assuming that participants in the betting market take into account information from previous games, so the odds for each game already condition on what came before it; that's part of what "knowing what you're talking about" intuitively means.
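As a minimal sketch of how that product is computed in practice, assume we already have, for each game, the probability a model assigned to the result that actually happened; working with sums of logarithms instead of the raw product avoids numerical underflow over tens of thousands of games:

```python
import math

def log_likelihood(probs_of_observed_results):
    """Log of the product of the probabilities a model gave to what actually happened."""
    return sum(math.log(p) for p in probs_of_observed_results)

# The (log) Bayes factor between two models over the same games is then just
# log_likelihood(probs_model_a) - log_likelihood(probs_model_b).
```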

Note how we aren't calculating how likely a model is, just which of two models has more support from the data we're seeing. To calculate the former we'd need more information (e.g., how much you believed the model was right before looking at the data). That's a very useful analysis, particularly when it comes to making decisions, but often the first question is a comparative one.

Using our data set, we'll compare the betting market as a predictive model against a bunch of dart-throwing chimps as a predictive model (dart-throwing chimps are a traditional device in financial analysis). The chimps throw darts against a wall covered with little Hs, Ds, and As, so they always predict each event has a probability of \frac{1}{3}. Running the numbers, we get

 \textrm{odds vs chimps} = \frac{\prod P[\textrm{probability of each result as implied by odds}]}{ \left(\frac{1}{3}\right)^{\textrm{number of games}}} = e^{4312.406}

This is (much) larger than one, so the evidence in the data favors the betting market over the chimps (overwhelmingly so; see the link above for a couple of rules of thumb about interpreting these numbers). That's good, and not something to be taken for granted: many stock traders underperform chimps. Note that if one model is better than another, the Bayes factor comparing them will keep growing as you collect more observations, and you therefore become more certain of it. If you make the above calculation with a smaller data set, the resulting Bayes factor will be lower.
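Continuing the sketch above, and assuming `odds_probs` is the list of odds-implied probabilities of the observed results, the comparison against the chimps is a one-liner:

```python
import math

n_games = len(odds_probs)
log_bf_vs_chimps = log_likelihood(odds_probs) - n_games * math.log(1 / 3)
# exp(log_bf_vs_chimps) is the Bayes factor; the figure in the post
# corresponds to a log Bayes factor of roughly 4312.
```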

Are the odds also better, in this sense, than just using a rule of thumb about how frequent each event is? In this data set, the home team wins about 44.3% of the time and the visitors 29% (leaving 26.7% for draws), so we'll assign those outcome probabilities to every match.

 \textrm{odds vs rule of thumb} = \frac{\prod P[\textrm{probability of each result as implied by odds}]}{ \prod P[\textrm{probability of each result as implied by the rule of thumb}]   } = e^{3342.303}

That's again overwhelming evidence in favor of the betting market, as expected.
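The rule-of-thumb comparison is computed the same way; here `results` is assumed to be the list of observed outcomes coded as 'H', 'D', or 'A', in the same order as `odds_probs`:

```python
rule_of_thumb = {"H": 0.443, "A": 0.29, "D": 1 - 0.443 - 0.29}
log_bf_vs_rule = log_likelihood(odds_probs) - log_likelihood(
    rule_of_thumb[r] for r in results
)
```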

We have statistics, soothsayers, and simians (chimpanzees, being apes, do count as simians, so the alliteration is fair). What about the Scottish?

Let's look at how much better than the chimps the odds are for different countries and leagues or divisions (you could say that the chimps are our null hypothesis, but the concept of a null hypothesis is at best a confusing one and at worst a dangerous one: quoting the Zen of Python, explicit is better than implicit). The calculations will be the same, applied to the subsets of the data corresponding to each division. One difference is that we're going to show the logarithm of the Bayes factor comparing the model implied by the odds with the dart-throwing chimps (otherwise the numbers become impractically large), divided by the number of game results we have for each division. Why that division? As we said above, if one model is better than another, the more observations you accumulate, the more evidence for one over the other you're going to get. It's not that the first model is getting better over time; it's just that you're getting more evidence that it's better. In other words, if model A is slightly better than model B but you have a lot of data, and model C is much better than model D but you only have a bit of data, then the Bayes factor between A and B can be much larger than the one between C and D: the size of an effect isn't the same thing as your certainty about it.

By dividing the logarithm of the Bayes factor by the number of games, we're trying to get a rough idea of how good the odds are, as models, when comparing different divisions with each other. This is something of a cheat — they aren't models of the same thing! — but by asking of each model how quickly it builds evidence that it's better than our chimps, we get a sense of their comparative power (there are other, more mathematically principled ways of doing this, and to a degree the method you choose has to depend on your own criterion of usefulness, which depends on what you'll use the model for, but this will suffice here).
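Here's a rough sketch of that per-division calculation, assuming a pandas DataFrame `games` with a 'Div' column (e.g. 'E0', 'SC3') and an 'odds_prob' column holding the odds-implied probability of each game's observed result; both column names are mine, not the data set's:

```python
import numpy as np

def log_bf_per_game(group):
    # log Bayes factor vs the chimps, divided by the number of games
    ll_odds = np.log(group["odds_prob"]).sum()
    ll_chimps = len(group) * np.log(1 / 3)
    return (ll_odds - ll_chimps) / len(group)

per_division = games.groupby("Div").apply(log_bf_per_game).sort_values()
print(per_division.head(5))   # the five worst-modeled divisions
print(per_division.tail(5))   # the five best-modeled divisions
```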

I'm following here the naming convention for divisions used in the data set: E0 is the English Premier League, E1 is their Championship, etc (the larger the number, the "lower" the league), and the country prefixes are: E for England, SC for Scotland, D for Germany, I for Italy, SP for Spain, F for France, N for the Netherlands, B for Belgium, P for Portugal, T for Turkey, and G for Greece. There's quite a bit of heterogeneity inside each country, but with clear patterns. To make them clearer, let's sort the graph by value instead of division, and keep only the lowest and highest five:

The betting odds generate better models for the top leagues of Greece, Portugal, Spain, Italy, and England, and worse ones for the lower leagues, with the very worst-modeled one being SC3 (properly speaking, the Scottish League Two; and there's our Scottish connection). This makes sense: the larger leagues have a lot of bettors who want in, many of them professionals, so the odds are going to be more informative.

To go back to the beginning: everything that gives you probabilities about the future is a predictive model. Just because one is a betting market and the other is a chimpanzee, or one is a consultant and the other one is a regression model, it doesn't mean they can't and shouldn't be compared to each other in a meaningful way. That's why it's so critical to save the guesses and predictions of every software model and every "human predictor" you work with. It lets you go back over time and ask the first and most basic question in predictive data science:

How much better is this program or this guy than a chimp throwing darts?

When you think about it, is that really a question you would want to leave unanswered about anything or anybody you work with?

The post-Westphalian Hooligan

Last Thursday's unprecedented incidents at one of the world's most famous soccer matches illustrate the dark side of the post- (and pre-) Westphalian world.

The events are well known, and were recorded and broadcast in real time by dozens of cameras: one or more fans of Boca Juniors opened a small hole in the protective plastic tunnel through which River Plate players were leaving the field at the end of the first half, and attacked some of them with, it's believed, both a flare and a chemical similar to mustard gas, causing vision problems and first-degree burns to some of the players.

After this, it took more than an hour for match authorities to decide to suspend the game, and more than another hour for the players to leave the field, as the police feared the players might be injured by the roughly two hundred fans chanting and throwing projectiles from the area of the stands from which they had attacked the River Plate players. And let's not forget the now-mandatory illegal drone flown over the field by a fan in the stands.

The empirical diagnosis is unequivocal: the Argentine state, defined and delimited by its monopoly of force in its territory, has retreated from soccer stadiums. The police force present in the stadium — ten times as numerous as the remaining fans — could neither prevent, stop, nor punish their violence, nor even force them to leave the stadium. What other proof can be required of a de facto independent territory? This isn't, as club and security officials put it, the work of a maladjusted few, or even an irrational act. It's the oldest and most effective form of political statement: here and now, I have the monopoly of force. Here and now, this is mine.

What decision-makers get in exchange for this territorial grant, and what other similar exchanges are taking place, are local details for a separate analysis. This is the darkest and oldest part of the characteristically post-Westphalian development of states relinquishing sovereignty over parts of their territory and functions in exchange for certain services, a partial reversion to older patterns of government. The grant might go to bands of hooligans, special economic zones, prison gangs, or local or foreign militaries. The mechanics and results are the same, even in nominally fully functional states, and there is no reason to expect them to be universally positive or free of violence. When or where has it been otherwise in world history?

This isn't a phenomenon exclusive to the third world, or to ostensibly failed states, particularly in its non-geographical manifestations: many first world countries have effectively lost control of their security forces, and, taxing authority being the other defining characteristic of the Westphalian state, they have also relinquished sovereignty over their biggest companies, which are de facto exempt from taxation.

This is what the weakening of the nation-state looks like: not a dozen new Athens or Florences, but weakened tax bases and fractal gang wars over surrendered state territories and functions, streamed live.

The Premier League: United vs. City championship chances

Using the same model as in previous posts (and, I'd say, not going against any intuition), the leading candidate to win the Premier League is Manchester United, with an approximately 88% chance. Second is Manchester City, with a bit over 11%. The rest of the teams with nonzero chances: Arsenal, Chelsea, Everton, Liverpool, Tottenham, and West Brom (with Chelsea, the best positioned of these dark horses, clocking in at about half a percentage point).

Personally, I'm happy about these very low-probability teams; I don't think any of them is likely to win (that's the point), but on the other hand, they have a mathematical chance of doing so, and it's important for a model never to give zero probability to non-impossible events (modulo whatever precision you are working with, of course).

Barcelona and the Liga, or: Quantitative Support for Obvious Predictions

I've adapted the predictive model to look at the Spanish Liga. Unsurprisingly, it's currently giving Barcelona a 96.7% chance of winning the title, with Atlético a distant second at 3.1%, and Real Madrid at less than 0.2% (I believe the model still underestimates small probabilities, although it has improved in this regard).

Note that around the 9th round or so, the model was giving Atlético a slightly higher chance of winning the tournament than Barcelona, although that window didn't last more than a round.

The Torneo Inicial 2012 in one graph (and 20 subgraphs)

Here's a graph showing how the probability of winning the Argentinean soccer championship changed over time for each team (time goes from left to right, and probability goes from 0 at the bottom to 1 at the top). Click on the graph to enlarge:

Hindsight being 20/20, it's easy to read too much into this, but it's interesting to note that some qualitative features of how journalism narrated the tournament over time are clearly reflected in these graphs: Vélez's stable progression, Newell's likelihood peak mid-tournament, Lanús' quite drastic drop near the end, and Boca's relatively strong beginning and disappointing follow-through.

As an aside, I'm still fairly sure that the model I'm using handles low-probability events incorrectly; e.g., Boca still had a mathematical chance almost until the end of the tournament. That's something I'll have to look into when I have some time.

Soccer, Monte Carlo, and Sandwiches

As Argentina's Torneo Inicial begins its last three rounds, let's try to compute the probabilities of championship for each team. Our tools will be Monte Carlo and sandwiches.

The core modeling issue is, of course, trying to estimate the odds of team A defeating team B, given their recent history in the tournament. Because of the tournament format, teams only face each other once per tournament, and, because of the recent instability of teams and performances, results from past tournaments generally won't be very good guides (this is something that would be interesting to look at in more detail). We'll use the following oversimplifying intuitions to make it possible to compute quantitative probabilities:

  • The probability of a tie between two teams is a constant that doesn't depend on the teams.
  • If team A played and didn't lose against team X, and team X played and didn't lose against team B, this makes it more likely that team A won't lose against team B (hence, a "sandwich" model).

Guided by these two observations, we'll take the results of the games in which a team played against both A and B as samples from a Bernoulli process with unknown parameter, and use this to estimate the probability of any previously unobserved game.

Having a way to simulate a given match that hasn't been played yet, we'll calculate the probability of any given team winning the championship by simulating the rest of the championship a million times, and observing in how many of those simulations each team ends up champion.
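Here's a minimal sketch of what that simulation could look like; the data structures (`points`, `remaining`) and the `p_win` function standing in for the sandwich estimate are illustrative assumptions, and tie-breaking rules like goal difference are ignored:

```python
import random
from collections import Counter

P_DRAW = 0.28          # assumed constant draw probability (illustrative value)
N_SIMULATIONS = 1_000_000

def simulate_game(a, b, p_win):
    """Return 'H', 'D', or 'A' for a single simulated match between a and b."""
    r = random.random()
    if r < P_DRAW:
        return "D"
    # split the remaining probability according to the (assumed) sandwich estimate
    return "H" if r < P_DRAW + (1 - P_DRAW) * p_win(a, b) else "A"

def championship_probabilities(points, remaining, p_win):
    """points: current points per team; remaining: list of (home, away) fixtures."""
    champions = Counter()
    for _ in range(N_SIMULATIONS):
        table = dict(points)
        for a, b in remaining:
            result = simulate_game(a, b, p_win)
            if result == "H":
                table[a] += 3
            elif result == "A":
                table[b] += 3
            else:
                table[a] += 1
                table[b] += 1
        champions[max(table, key=table.get)] += 1
    return {team: wins / N_SIMULATIONS for team, wins in champions.items()}
```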

The results:

Team              Championship probability
Vélez Sarsfield   79.9%
Lanús             20.1%

Clearly our model is overly rigid — it doesn't feel at all realistic to say that those two teams are the only ones with any chance of winning the championship. On the other hand, the balance of probabilities between the two teams seems more or less in agreement with the expectations of observers. Given that the model is very naive, and only uses information from the current tournament, I'm quite happy with the results.

A Case in Stochastic Flow: Bolton vs Manchester City

A few days ago the Manchester City Football Club released a sample of their advanced data set, an XML file giving a quite detailed description of low-level events in last year's August 21 Bolton vs. Manchester City game, which the away team won 3-2. There's an enormous variety of analyses that can be performed with this data, but I wanted to start with one of the basic ones, the ball's stochastic flow field.

The concept underlying this analysis is very simple. Where the ball will be in the next, say, ten seconds depends on where it is now. It's more likely that it'll be near than far, and it's more likely that it'll be in an area of the field where the team with possession is focusing their attack, and so on. Thus, knowing the probabilities for where the ball will be starting from each point in the field — you can think of it as a dynamic heat map for the future — together with information about where it spent the most time, gives us information about how the game developed, and about the teams' tactics and performance.

Sadly, a detailed visualization of this map would require at least a four-dimensional monitor, so I settled for a simplified representation, splitting the soccer field into a 5x5 grid and showing the most likely transitions for the ball from one sector of the field to another. The map is embedded below; do click on it to expand it, as it's not really useful as a thumbnail.

Remember, this map shows where the ball was most likely to go from each area of the field; each circle represents one area, with the circles at the left and right sides representing the areas all the way to the end lines. Bigger circles mean the ball spent more time in that area, so, e.g., you can see that the ball spent quite a bit of time in the midfield, and very little on the sides of Manchester City's defensive line. The arrows describe the most likely movements of the ball from one area to another; the wider the line, the more likely the movement. You can see how the ball circulated side-to-side quite a bit near Bolton's goal, while Manchester City kept the ball moving farther away from their goal.
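For the curious, here's a rough sketch of how such a flow map could be computed, assuming we already have a chronological list of (x, y) ball positions normalized to [0, 1] along each axis; extracting those positions from the XML file isn't shown:

```python
import numpy as np

GRID = 5

def sector(x, y):
    """Map a normalized pitch position to a cell of the 5x5 grid."""
    return min(int(x * GRID), GRID - 1), min(int(y * GRID), GRID - 1)

def transition_matrix(positions):
    """Count how often the ball moved from each sector of the pitch to each other."""
    counts = np.zeros((GRID * GRID, GRID * GRID))
    cells = [sector(x, y) for x, y in positions]
    for (r0, c0), (r1, c1) in zip(cells, cells[1:]):
        counts[r0 * GRID + c0, r1 * GRID + c1] += 1
    # normalize rows to get, for each sector, where the ball tends to go next
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```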

There are many immediate questions that come to mind, even with such a simplified representation. How does this map look depending on which team had possession? How did it change over time? What flow patterns are correlated with good or bad performance on the field? The graph shows the most likely routes for the ball, but which ones were the most effective, that is, most likely to end up in a goal? Because scoring is a rare event in soccer, particularly compared with games like tennis or American football, this kind of analysis is especially challenging, but also potentially very useful. There's probably much that we don't know yet about the sport, and although data is only an adjunct to well-trained expertise, it can be a very powerful one.

How Rooney beats van Persie, or, a first look at Premier League data

I just got one of the data sets from the Manchester City analytics initiative, so of course I started dipping my toe into it. The set gives information aggregated by player and match for the 2011-2012 Premier League, in the form of a number of counters (time played, goals, headers, blocked shots, and so on); it's not the really interesting data set Manchester City is about to release (with, e.g., high-resolution position information for each player), but that doesn't mean there aren't interesting things to be gleaned from it.

The first issue I wanted to look at is probably not the most significant in terms of optimizing the performance of a team, but it's certainly one of the most emotional ones. Attackers: Who's the best? Who's underused? Who sucks?

If you look at total goals scored, the answer is easy: the best attackers are van Persie (30 goals), Rooney (27 goals), and Agüero (23 goals). Controlling for total time played, though, Berbatov and both Cissés have been quite a bit more efficient in goals scored per minute played. They are also, not coincidentally, the most efficient scorers in terms of goals per shot (both on and off target). Van Persie's 30 goals, for example, are more understandable when you see that he took 141 shots at goal, versus Berbatov's 15.

To see how shooting efficiency and shooting volume (number of shots) interact with each other, I made this scatterplot of goals per shot versus shots per minute, restricted to players who shoot regularly, to avoid low-frequency outliers (click to expand).
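As a sketch of the numbers behind that scatterplot, assume a pandas DataFrame `players` with per-player totals in columns named 'goals', 'shots', and 'minutes' (illustrative names, not the data set's):

```python
import matplotlib.pyplot as plt

MIN_SHOTS = 20   # arbitrary cutoff to drop low-frequency outliers

regular = players[players["shots"] >= MIN_SHOTS].copy()
regular["goals_per_shot"] = regular["goals"] / regular["shots"]
regular["shots_per_minute"] = regular["shots"] / regular["minutes"]

regular.plot.scatter(x="shots_per_minute", y="goals_per_shot")
plt.show()
```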

You can see that most players are more or less uniformly distributed in the lower-left quadrant of low shooting volume and low shooting efficiency — people who are regular shooters, so they don't try too often or too seldom. But there are outliers, people who shoot a lot, or who shoot really well (or aren't as closely shadowed by defenders)... and they aren't the same. This suggests a question: Who should shoot less and pass more? And who should shoot more often and/or get more passes?

To answer that question (to a very sketchy first degree approximation), I used the data to estimate a lost goals score that indicates how many more goals per minute could be expected if the player made a successful pass to an average player instead of shooting for a goal (I know, the model is naive, there are game (heh) theoretic considerations, etc; bear with me). Looking at the players through this lens, this is a list of players who definitely should try to pass a bit more often: Andy Carroll, Simon Cox, and Shaun Wright-Phillips.

Players who should be receiving more passes and taking more shots? Why, Berbatov and both Cissés. Even Wayne Rooney, the league's second most prolific shooter, is good enough at turning attempts into goals that he should be fed the ball more often, rather than less.

The second-order question, and the interesting one for intra-game analysis, is how teams react to each other. To say that Manchester United should get the ball to Rooney inside striking distance more often, and that opposing teams should try to prevent this, is as close to a triviality as can be asserted. But whether a specific change to a tactical scheme to guard Rooney more closely will be a net positive or, by opening other spaces, backfire... that will require more data and a vastly less superficial analysis.

And that's going to be so much fun!