A few days ago the Manchester City Football Club released a sample of their advanced data set, an xml file giving a quite detailed description of low-level events in last year's August 21 Bolton vs. Manchester City game, which was won by the away team 3-2. There's an enormous variety of analyses that can be performed with this data, but I wanted to start with one of the basic ones, the ball's stochastic flow field.
The concept underlying this analysis is very simple. Where the ball will be in the next, say, ten seconds, depends on where it is now. It's more likely that it'll be near than it is that it'll be far, it's more likely that it'll be on an area of the field where the team with possession is focusing their attack, and so on. Thus, knowing the probabilities for where the ball will be starting from each point in the field — you can think of it as a dynamic heat map for the future — together with information about where it spent the most time, gives us information about how the game developed, and the teams' tactics and performance.
Sadly, a detailed visualization of this map would require at least a four-dimensional monitor, so I settled for a simplified representation, splitting the soccer field in a 5x5 grid, and showing the most likely transitions for the ball from one sector of the field to another. The map is embedded below; do click on it to expand it, as it's not really useful as a thumbnail.
Remember, this map shows where the ball was most likely to go from each area of the field; each circle represents one area, with the circles at the left and right sides representing the area all the way to the end lines. Bigger circles signal that the ball spent more time in that area, so, e.g., you can see that the ball spent quite a bit of time in the midfield, and very little on the sides of Manchester City's defense line. The arrows describe the most likely movements of the ball from one area to another; the wider the line, the most likely the movement. You can see how the ball circulated side-to-side quite a bit near Bolton's goal, while Manchester City kept the ball moving further away from their goal.
There are many immediate questions that come to mind, even with such a simplified representation. How does this map look according to which team had possession? How did it change over time? What flow patterns are correlated with good or bad performance on the field? The graph shows the most likely routes for the ball, but which ones were the most effective, that is, more likely to end up in a goal? Because scoring is a rare event in soccer, particularly compared with games like tennis or american football, this kind of analysis is specially challenging, but also potentially very useful. There's probably much that we don't know yet about the sport, and although data is only an adjunct to well-trained expertise, it can be a very powerful one.