Soccer, messy data, and why I don’t quite believe what this post says

Here’s the open secret of the industry: Big Data isn’t All The Data. It’s not even The Data You Thought You Had. By and large, we have good public data sets about things governments and researchers were already studying, and good private data sets about things that it’s profitable for companies to track. But that covers an astonishingly thin and uneven slice of our world. It’s bigger than it ever was, and it’s growing, but it’s still not nearly as large, or as usable, as most people think.

And because public and private data sets are highly specific side effects from other activities, each of them with its own conventions, languages, and even ontologies (in both the computer science and philosophical senses of the word), coordinating two or more of them is at best a difficult and expensive manual process, and at worst impossible. Not all, but most data analysis case studies and applications end up focused on extracting as much value as possible from a given data set, rather than seeing what new things can be learned by putting that data in the context of the rest of the data we have about the world. Even the larger indexes of open data sets (very useful services that they are) end up being mostly collections of unrelated pieces of information, rather than growing knowledge bases about the world.

There’s a sort of informational version of Metcalfe’s law (maybe “the value of a group of data sets grows as the number of connections you can make between them”) that we are missing out on, and that lies behind the promise of both linked data (still in its early phase) and the big “universal” knowledge bases that aim at offering large, usable, interconnected sets of facts about as many different things as possible. They, or something like them, are a necessary part of the infrastructure needed to give computers the same boost in information access the Internet gave us. The bottleneck of large-scale inference systems like IBM’s Watson isn’t computing power, but rather rich, well-formatted data to work on.
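
(For the back-of-the-envelope version: if you generously assume every pairwise link between data sets can be exploited, the value of a collection grows roughly quadratically with its size. The proportionality below is my sketch of the analogy, not a measured quantity.)

```latex
% Value of a collection of n interoperable data sets, assuming
% (generously) that every pairwise connection is exploitable:
V(n) \propto \binom{n}{2} = \frac{n(n-1)}{2} \sim n^2
```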

To try and test the waters on the state of these knowledge bases, I set out to do a quick, superficial analysis of the careers of Argentine soccer players. There are of course companies that have records not only of players’ careers, but of pretty much every movement they have ever done on a soccer field, as well as fragmented public data sets collected by enthusiasts about specific careers or leagues. I wanted to see how far I could go using a single “universal” data set that I could later correlate with other information in an automated way. (Remember, the point of this exercise wasn’t to get the best data possible about the domain, but to see how good the data is when you restrict yourself to a single resource that can be accessed and processed in a uniform way.)

I went first for the best known “universal” structured data sources: Freebase and Wikidata. They are both well structured (XML and/or JSON) and of significant size (almost 2.9 billion facts and almost 14 million data items, respectively), but after downloading, parsing, and exploring each of them, I had to concede that neither was good enough: there were too many holes in the information to make an analysis, or the structure didn’t hold the information I needed.
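
To give a flavor of what uniform access looks like when it works, here’s a minimal sketch of the kind of query I wanted to build everything on, against Wikidata’s public SPARQL endpoint (the IDs, P27 for country of citizenship, Q414 for Argentina, and P106/Q937857 for the occupation “association football player”, are my assumptions about the vocabulary, not something from my original scripts; the problem wasn’t running queries like this, it was the holes in what they returned):

```python
# Minimal sketch: list some Argentine association football players from
# Wikidata's public SPARQL endpoint. Assumed IDs:
#   P27  = country of citizenship, Q414    = Argentina
#   P106 = occupation,             Q937857 = association football player
import requests

QUERY = """
SELECT ?player ?playerLabel WHERE {
  ?player wdt:P27 wd:Q414 ;
          wdt:P106 wd:Q937857 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "career-analysis-sketch/0.1"},
    timeout=60,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["playerLabel"]["value"])
```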

So it was time for Plan C, which is always the worst idea except when you have nothing else, and sometimes even then: plain old text parsing. It wasn’t nearly as bad as it could have been. Wikipedia pages, like Messi’s, have neat infoboxes that include exactly the simplified career information I wanted, and the page’s source code shows that they are written in what looks like a reasonable mini-language. It’s a sad comment on the state of the industry that even then I wasn’t hopeful.

I downloaded the full dump of Wikipedia; it’s 12GB of compressed XML (not much, considering what’s in there), so it was easy to extract individual pages. And because there is an index page of Argentine soccer players, it was even easy to keep only those, and then look at their infoboxes.
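
A minimal sketch of that extraction step, assuming the standard pages-articles dump and nothing but Python’s standard library (the file name and the set of wanted titles are placeholders):

```python
# Stream pages out of the compressed Wikipedia XML dump without loading
# 12GB into memory. Requires Python 3.8+ for the {*} namespace wildcard.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"
wanted = {"Lionel Messi", "Juan Román Riquelme"}  # titles from the index page

def pages(dump_path):
    """Yield (title, wikitext) pairs one page at a time."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.rsplit("}", 1)[-1] == "page":  # strip XML namespace
                title = elem.findtext("./{*}title")
                text = elem.findtext("./{*}revision/{*}text") or ""
                yield title, text
                elem.clear()  # discard the subtree we just consumed

for title, text in pages(DUMP):
    if title in wanted:
        pass  # hand the wikitext to the infobox parser (sketched below)
```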

Therein lay the rub. The thing to remember about Wikipedia is that it’s written by humans, so even the parts that are supposed to follow strict syntactic and formatting rules don’t (so you can imagine what the free text looks like). Infoboxes should have been trivial to parse, but they have all sorts of quirks that aren’t visible when rendered in a browser: inconsistent field names, erroneous characters, every HTML entity or Unicode character that half-looks like a dash, and so on, so parsing them became an exercise in handling special cases.
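
To make the special-casing concrete, here’s a stripped-down sketch of the kind of cleanup involved (the template name and the field regex are deliberately naive; real infoboxes nest templates and links, which is exactly where the whack-a-mole starts):

```python
# Naive infobox field extraction plus the dash/entity cleanup the data
# forced on me. A sketch, not a robust wikitext parser.
import re
import unicodedata

# Every Unicode character that half-looks like a dash, mapped to '-'.
DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"), "-")

def clean(value: str) -> str:
    value = unicodedata.normalize("NFKC", value).translate(DASHES)
    return value.replace("&ndash;", "-").replace("&nbsp;", " ").strip()

def infobox_fields(wikitext: str) -> dict:
    """Rough extraction of |key = value lines from a football infobox."""
    m = re.search(r"\{\{Infobox football biography(.*?)\n\}\}", wikitext, re.S)
    if not m:
        return {}
    fields = {}
    for line in m.group(1).splitlines():
        key, sep, value = line.lstrip("| ").partition("=")
        if sep:
            fields[clean(key)] = clean(value)
    return fields
```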

I don’t want to seem ungrateful: it’s certainly much, much, much better to spend some time parsing that data than having to assemble and organize it from original sources. Wikipedia is an astounding achievement. But every time you see one of those TV shows where the team nerds smoothly access and correlate hundreds of different public and private data sources in different formats, schemas, and repositories, finding matches between accounting records, newspaper items, TV footage, and so on… they lie. Wrestling matches might arguably be more realistic, if nothing else because they fall within the realm of existing weaponized chair technology.

In any case, after some wrestling of my own with the data, I finally had information about the careers of a bit over 1800 Argentine soccer players whose professional careers in the senior leagues began in 1990 or later. By this point I didn’t care very much about them, but for completeness’ sake I tried to answer a couple of questions: Are players less loyal to their teams than they used to be? And how soon can a player expect to be playing in one of the top teams?

To make a first pass at the questions, I looked at the number of years players spent in each team over time (averaged over players who began their careers in each calendar year).

[Figure: Years per team over time]

The data (at least in such a cursory summary) doesn’t support the idea that newer players are less loyal to their teams: they don’t spend significantly less time with them. Granted, this loyalty might be to their paychecks rather than to the clubs themselves, but they aren’t moving between clubs any faster than they used to.
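
For what it’s worth, the aggregation behind the plot is simple once the parsing pain is over; a sketch, assuming the parsed careers ended up in a table with one row per (player, team) stint (the file and column names are mine, not a standard schema):

```python
# Average stint length per debut cohort. Assumes a CSV produced by the
# parsing step, with columns: player, team, start_year, years_at_team.
import pandas as pd

stints = pd.read_csv("argentine_player_stints.csv")  # hypothetical output file

per_player = stints.groupby("player").agg(
    debut=("start_year", "min"),         # first professional year
    avg_stay=("years_at_team", "mean"),  # average years per team
)
# Average over players who began their careers in each calendar year.
years_per_team = per_player.groupby("debut")["avg_stay"].mean()
years_per_team.plot(title="Years per team over time")
```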

The other question I wanted to look at was how fast players get to top teams. This is actually an interesting question in a general setting; characterizing and improving paths to expertise, and thereby improving how much, how quickly, and how well we all learn, is one of the still unrealized promises of data-driven practices. To take a quick look at this, I plotted the probability of playing for a top-ten team (based on the current FIFA club ratings, so they include Barcelona, Real Madrid, Bayern Munich, etc.) by career year, normalized by the probability of starting your professional career in one of those teams.

[Figure: Probability of being in a top 10 team by career year]

Despite the large margins of error (reasonable, given how few players actually reach those teams), the curve does seem to suggest a large increase in the average probability during the first three or four years, then a stable probability until the ninth or tenth year, at which point it peaks. The data is too noisy to draw any definite conclusions (more on that below), but, with more data, I would want to explore the possibility of there being two paths to the top teams, corresponding to two sub-groups of highly talented players: either explosive young talents who are quickly transferred to the top teams, or solid professionals who accumulate experience and reach those teams at the peak of their maturity and knowledge.
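
The computation behind the curve, under the same hypothetical stints table as above (the top-ten list comes from the FIFA club ratings; counting each stint at the career year in which it starts is a simplification of what I actually did):

```python
# P(playing for a top-10 team | career year), normalized by the year-0
# baseline, i.e., the probability of starting your career in one of
# those teams. Same hypothetical stints table as above.
import pandas as pd

stints = pd.read_csv("argentine_player_stints.csv")
top10 = {"FC Barcelona", "Real Madrid", "FC Bayern Munich"}  # ...and seven more

stints["debut"] = stints.groupby("player")["start_year"].transform("min")
stints["career_year"] = stints["start_year"] - stints["debut"]
stints["in_top10"] = stints["team"].isin(top10)

p = stints.groupby("career_year")["in_top10"].mean()
(p / p.iloc[0]).plot(title="Probability of being in a top 10 team by career year")
```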

It’s a nice story, and the data sort of fits, but when I look at all the contortions I had to make to get the data, I wouldn’t want to put much weight on it. In fact, I stopped myself from doing most of the analysis I wanted to do (e.g., can you predict long-term career paths from their beginnings? There’s an interesting agglomerative algorithm for graph simplification that has come in handy in the analysis of online game play, and I wanted to see how it fares for athletes). I held back not because the data doesn’t support it, but because of the risk of systematic parsing errors, biases due to notability (do all Argentine players have a Wikipedia page? I think so, but how to be sure?), and so on.

Of course, if this were a paid project it wouldn’t be difficult to put together the resources to check the information, compensate for biases, and so on. But anything that needs to be a paid project to be done right is something we can’t consider a ubiquitous resource (imagine building the Internet with pre-Linux software costs for operating systems, compilers, etc., including the hugely higher training costs that would come from losing the generations of sysadmins and programmers who began practicing on their own at a very early age). Although we’re way ahead of where we were a few years ago, we’re still far from where we could, and probably need, to be. Right now you need knowledgeable (and patient!) people to make sure data is clean, understandable, and makes sense, even data you have collected yourself; this makes data analysis a per-project service rather than a universal utility, and one that gets disproportionately expensive as you increase the number of interrelated data sets you need to use. Although the difference in cost is only quantitative, the difference in cumulative impact isn’t.

The frustrating bit is that we aren’t too far from that (on the other hand, we’ve been twenty years away from strong A.I. and commercial nuclear fusion since before I was born): there are tools that automate some of this work, although they have their own issues and can’t really be left on their own. And Google, as always, is trying to jump ahead of everybody else, with its Knowledge Vault project attempting to build a structured database of facts out of the entirety of the web. If they, or somebody else, succeed at this, and if it’s made available at utility prices… Well, that might make those TV shows more realistic — and change our economy and society at least as much as the Internet itself did.