Statistics, Simians, the Scottish, and Sizing up Soothsayers

A predictive model can be a parametrized mathematical formula, or a complex deep learning network, but it can also be a talkative cab driver or a slides-wielding consultant. From a mathematical point of view, they are all trying to do the same thing, to predict what's going to happen, so they can all be evaluated in the same way. Let's look at how to do that by poking a little bit into a soccer betting data set, and evaluating it as if it were an statistical model we just fitted.

The most basic outcome you'll want to predict in soccer is whether a game goes to the home team, the visitors or away team, or is a draw. A predictive model is anything and anybody that's willing to give you a probability distribution over those outcomes. Betting markets, by giving you odds, are implicitly doing that: the higher the odds, the less likely they think is the outcome.

The Football-Data.co.uk data set we'll use contains results and odds from various soccer leagues for more than 37,000 games. We'll use the odds for the Pinnacle platform whenever available (those are closing odds, the last ones available before the game).

For example, for the Juventus-Fiorentina game in August 20, 2016, the odds offered were 1.51 for a Juventus win, 4.15 for a draw (ouch), and 8.61 for a Fiorentina victory (double ouch). Odds of 1.51 for Juventus mean that for each dollar you bet on Juventus, you'd get USD 1.51 if Juventus won (your initial bet included) and nothing if it didn't. These numbers aren't probabilities, but they imply probabilities. If platforms gave odds too high relative to the event's probability they'd go broke, while if they gave odds too low they wouldn't be able to attract bettors. On balance, then, we can read from the odds probabilities slightly lower than the the betting market's best guesses, but, in a world with multiple competing platforms, not really that far from the mark. This sounds like a very indirect justification for using them as a predictive model, but every predictive model, no matter how abstract, has a lot of assumptions; a linear model assumes the relevant phenomenon is linear (almost never true, sometimes true enough), and looking at a betting market as a predictive model assumes the participants know what they are doing, the margins aren't too high, and there isn't anything too shady going on (not always true, sometimes true enough).

We can convert odds to probabilities by asking ourselves: if these odds were absolutely fair, how probable would the event have to be so neither side of the bet can expect to earn anything? (a reasonable definition of "fair" here, with historical links to the earliest developments of the concept of probability). Calling P the probability and L the odds, we can write this condition PL + (1-P)*0 = 1. The left side of the equation is how much you get on average — L when, with probability P, the event happens, and zero otherwise — and the right side says that on average you should get you dollar back, without winning or losing anything. From there it's obvious that P = \frac{1}{L}. For example, the odds above, if absolutely fair (which they never are, not completely, as people in the industry have to eat) would imply a probability for Juventus to win of 66.2%, and for Fiorentina of 11.6% (for the record, Juventus won, 2-1).

In this way we can put information into the betting platform (actually, the participants do), and read out probabilities. That's all we need to use it as a predictive model, and there's in fact a small industry dedicated to building betting markets tailored to predict all sorts of events, like political outcomes; when built with this use in mind, they are called prediction or information markets. The question, as with any model, isn't if it's true or not — unlike statistical models, betting markets don't have any misleading aura of mathematical certainty — but rather how good those probabilities are.

One natural way of answering that question is to compare our model with another one. Is this fancy machine learning model better than the spreadsheet we already use? Is this consultant better than this other consultant? Is this cab driver better at predicting games than that analyst on TV? Language gets very confusing very quickly, so mathematical notation becomes necessary here. Using the standard notation  P[x | y] for how likely do I think is that x will happen if y is true?, we can compare the cab driver and the TV analyst by calculating

 \frac{P[ \textrm{the game results we saw} | \textrm{the cab driver knows what she's talking about}]}{P[\textrm{the game results we saw} | \textrm{the TV analyst knows what he's talking about}]}

If that ratio is higher than one, this means of course that the cab driver is better at predicting games than the TV analyst, as she gave higher probabilities to the things that actually happened, and vice versa. This ratio is called the Bayes factor.

In our case, the factors are easy to calculate, as P[\textrm{home win} | \textrm{odds are good predictors}] is just \textrm{probability of a home win as implied by the odds}, which we already know how to calculate. And because the probabilities of independent events are the product of the individual probabilities, then

P[\textrm{any sequence of game results}|\textrm{odds are good predictors}] = \prod P[\textrm{probability of each result as implied by the odds}]

In reality, those events aren't independent, but we're assuming participants in the betting market take into account information from previous games, which is part of what "knowing what you're talking about" intuitively means.

Note how we aren't calculating how likely a model is, just which one of one two models has more support from the data we're seeing. To calculate the former value we'd need more information (e.g., how much you believed the model was right before looking at the data). This is a very useful analysis, particularly when it comes to making decisions, but often the first question is a comparative one.

Using our data set, we'll compare the betting market as a predictive model against a bunch of dart-throwing chimps as a predictive model (dart-throwing chimps are a traditional device in financial analysis). The chimps throw darts against a wall covered with little Hs, Ds, and As, so they always predict each event has a probability of \frac{1}{3}. Running the numbers, we get

 \textrm{odds vs chimps} = \frac{\prod P[\textrm{probability of each result as implied by odds}]}{ \frac{1}{3}^{\textrm{number of games}}} = e^{4312.406}

This is (much) larger than one, so the evidence in the data favors the betting market over the chimps (very; see the link above for a couple of rules of thumb about interpreting those numbers). That's good, and not something to be taken for granted: many stock traders underperform chimps. Note that if one model is better than another, the Bayes factor comparing them will keep growing as you collect more observations and therefore become more certain of it. If you make the above calculation with a smaller data set, the resulting Bayes factor will be lower.

Are odds also better in this sense than just using a rule of thumb about how frequent each event is? In this data set, the home team wins about 44.3% of the time, and the visitors 29%, so we'll assign those outcome probabilities to every match.

 \textrm{odds vs rule of thumb} = \frac{\prod P[\textrm{probability of each result as implied by odds}]}{ \prod P[\textrm{probability of each result as implied by the rule of thumb}]   } = e^{3342.303}

That's again overwhelming evidence in favor of the betting market, as expected.

We have statistics, soothsayers, and simians (chimpanzees aren't simians, but I couldn't resist the alliteration). What about the Scottish?

Lets look at how better than chimps are the odds for different countries and leagues or divisions (you could say that the chimps are our null hypothesis, but the concept of null hypothesis is at best a confusing and at worst a dangerous one: quoting the Zen of Python, explicit is better than implicit). The calculations will be the same, applied to subsets of the data corresponding to each division. A difference is that we're going to show the logarithm of the Bayes factor comparing the model implied by the odds and the model from the dart-throwing chimps (otherwise numbers become impractically large), and this divided by the number of game results we have for each division. Why that division? As we said above, if one model is better than another, the more observations you accumulate, the higher the amount of evidence for one over the other you're going to get. It's not that the first model is getting better over time, it's just that you're getting more evidence that it's better. In other words, if model A is slightly better than model B but you have a lot of data, and model C is much better than model D but you only have a bit of data, then the Bayes factor between A and B can be much larger than the one between C and D: the size of an effect isn't the same thing as your certainty about it.

By dividing the (logarithm of) the Bayes factor by the number of games, we're trying to get a rough idea of how good the odds are, as models, comparing different divisions with each other. This is something of a cheat — they aren't models of the same thing! — but by asking of each model how quickly they build evidence that they are better than our chimps, we get a sense of their comparative power (there are other, more mathematically principled ways of doing this, and to a degree the method you choose has to depend on your own criteria of usefulness, which depends on what you'll use the model for, but this will suffice here).

I'm following here the naming convention for divisions used in the data set: E0 is the English Premier League, E1 is their Championship, etc (the larger the number, the "lower" the league), and the country prefixes are: E for England, SC for Scotland, D for Germany, I for Italy, SP for Spain, F for France, N for the Netherlands, B for Belgium, P for Portugal, T for Turkey, and G for Greece. There's quite a bit of heterogeneity inside each country, but with clear patterns. To make them clearer, let's sort the graph by value instead of division, and keep only the lowest and highest five:

The betting odds generate better models for the top leagues of Greece, Portugal, Spain, Italy, and England, and worse ones for the lower leagues, with the very worst modeled one being SC3 (properly speaking, the Scottish League Two – there are the Scottish). This makes sense: the larger leagues have a lot of bettors who want in, many of them professionals, so the odds are going to be more informative.

To go back to the beginning: everything that gives you probabilities about the future is a predictive model. Just because one is a betting market and the other is a chimpanzee, or one is a consultant and the other one is a regression model, it doesn't mean they can't and shouldn't be compared to each other in a meaningful way. That's why it's so critical to save the guesses and predictions of every software model and every "human predictor" you work with. It lets you go back over time and ask the first and most basic question in predictive data science:

How much better is this program or this guy than a chimp throwing darts?

When you think about it, is that really a question you would want to leave unanswered about anything or anybody you work with?