# Using machine learning to read Sherlock Holmes

A while ago I posted about how to use machine learning to understand brand semantics by mining Twitter data — not just to count mentions, but to map the similitudes and differences in how people think about them. But individual tweets are brief snapshots, just a few words written and posted in an instant of time. We can use the same methods to understand the flow of meaning along longer texts, to seek patterns in and between stories.

For a quick first example, I downloaded three books of Sherlock Holmes short stories: The Adventures of Sherlock Holmes, The Memoirs of Sherlock Holmes, and The Return of Sherlock Holmes. The main reason is that I like them. Secondary reasons are that Holmes needs no introduction, and that the relatively stable structure of the plots makes it more likely the algorithms will have something to work with.

After extracting the thirty stories included in the books and splitting each story into its component paragraphs, I ran each paragraph through Facebook Research's InferSet sentence embedding library. Similar to its counterparts from Google and elsewhere, it converts a text fragment into a point in an abstract 4096-dimensional space, in such a way that fragments with statistically similar meanings are mapped to points close to each other, and the geometric relationships between points encode their semantic differences (if this sounds a bit vague, the link above works through a concrete example, although mapping individual words instead of longer fragments).

The first question I wanted to ask the data — and the only one in this hopefully short post — is the most basic of any plot: are we at the beginning or at the end of the story? Even ignoring obvious marks like "Once upon a time..." and "The End", readers of any genre quickly develop a fairly robust sense of how plots work, whether it's a detective story or a romantic one. We have expectations about beginnings, middles, and endings that might be subverted by writers, but only because we have them. At the same time, it's a "soft" cultural concept, of the type not traditionally seen as amenable to statistical analysis.

As Holmes would have suggested, I went to it in a methodical manner. The first order of business was, as always, to define the problem. I had between one and three hundred abstract points (one for each paragraph) for each story. To clarify what counts as beginning, middle, and end, I split each story into five segments — the first fifth of paragraphs, the second fifth of paragraphs, and so on — and took the first, third, and last segments as the story's beginning, middle, and end. As usual when using this sort of embedding technique, I summarized each segment simply by taking the average of all the paragraphs/points in it (that's not always the best-performing technique, but works as a reasonable baseline).

So now I had reduced the entire text of each story into three abstract points, each point a collection of 4096 number. Watson would doubtless have been horrified. Did all this statistical scrambling leave us with enough information to tell, without cheating, which points are beginnings, middles, and ends?

Plots in 4096 aren't always clear, but a two-dimensional isometric embedding wasn't very promising:

There are thirty points of each type, one for each story, but, as you can see, there isn't any clear pattern that would allow us to say, if it weren't for the color coding, wich ones are beginnings, middles, or ends. (Of course, this doesn't prove there isn't an obvious pattern in their original 4096-dimensional space; but we're trying for now to explore human-friendly ways to probe the data.)

There was still hope, though. After all, our intuition for plots aren't based exactly on what's being told at each point, but on how it changes along the story; the same marriage that ends a romantic comedy can be the prologue of an slowly unfolding drama.

Fortunately, we can use the abstract points we extracted to calculate differences between texts. Just as the difference between the vectors for the words "king" and "queen" is similar to the difference between the vectors for "man" and "woman" (and therefore encode, in a way, the semantics of the difference in gender, at least in the culture- and context-specific ways present in the language patterns used to train the algorithm), the difference between the summary vectors for the beginning, middle, and end of stories encode... something, hopefully.

And, indeed, they do:

Let's be careful to understand what the evidence is telling us. Each point encodes mathematically the difference between the summary meanings of two fragments of text. The red points describe statistically the path between the beginning and middle of a story, and the green ones betwee that middle and the end. The fact that red and green points are pretty well grouped, even when plotted in two instead of their native 4096 dimensions, indicates that, even after we reduced Doyle's prose to a few numbers, there's still enough statistical structure to distinguish the path, as it were, between the beginning of a story and its middle, and thence to its resolution. In other words, just as online advertisers and political analysts do "sentiment analysis" as a matter of course, it's also possible to do "plot analysis."

A final observation. Low-dimensional clusters are seldom perfectly clean, but there's an exception in the graph above that merits a closer look:

The points highlighted correspond to the beginning-to-middle and middle-to-end transitions of The Crooked Man. The later transition is reasonably positioned in our graph, but why does the first half of the story look, to our algorithm's cold and unsympathetic eyes, as a second half? At first blush it looks like a serious blunder.

It could be; explaining and verifying the output of any non-trivial model requires careful analysis beyond the scope of this quick post. But if you read the story (I linked to it above), you'll note that the middle of it, which begins with Watson's poetic "A monkey, then?", continues with Holmes explaining a series of inferences he made. That's usually how Doyle closes his plots, and in fact the rest of the story is just Holmes and Watson taking a train and interrogating somebody.

This case suggests that our quick analysis wasn't entirely off the mark: the algorithm picked a real pattern in Doyle's writing, and raised an alarm when the author strayed from it.

The family of algorithms in this post descends from those that revolutionized computer translation years ago; since then, they have continuously helped build tools that help computers understand written and spoken language, as well as images, in ways that complement and extend what humans do. Quantitative approaches to text structure will certainly open new opportunities as well.

# Using Artificial Intelligence to Understand Brands

Artificial Intelligence techniques behind automated translation and other NLP (Natural Language Processing) applications doesn't just work at the level of words and phrases, but give us quantitative data about the meaning people assign to different concepts, including brands. We can then measure relationships between different brands in a space, understanding what distinguishes how people talk about, or rather with, them.

As an example, we'll use data from the GloVe project at Stanford — in particular their Twitter- and Wikipedia-derived sets — to look at the main smartphone brands: Samsung, Apple, Nokia, Sony, LG, HTC, Motorola, and Huawei. It's a dictionary, but a very special one: instead of taking a term in English and mapping it to a Chinese one, it takes terms in English, Chinese, etc, and translates them to points in an abstract space, essentially sets of apparently meaningless numbers. But they are far from arbitrary; the software learns to map terms that are used in similar ways, regardless of how they are written, into points that are close in this abstract space. So bicycle, bicicleta, and 自行车 are translated to points that are close to each other, which is how a system like Google Translate knows how to go from the English term to the Spanish one — it just looks for the closest point that comes from a Spanish term (using neural networks to figure out this map is where the real difficulty lies, but that's for a different post).

We can leverage this into an intuitive but data-driven way of looking at the relationship between multiple brands. Just as words that have similar meanings are closer to each other than those that have different ones, brands that people think of as similar will be talked about in the same way, and so, because they are similar as words, will end up having close points in the abstract space. It sounds a bit... abstract, but here's how it looks for the main smartphone brands, using the mapping based on GloVe's crawling of about two billion tweets:

What this map is telling us is that, based on the way people actually use the brands as words when tweeting — in some senses, the "real" content of the brand — most smartphone brands are pretty much identical, with Apple (as expected), Huawei, and LG the odd ones out. An algorithm that told us that Samsung and Motorola are identical as brands wouldn't be a very perceptive one, and in fact that's not what's happening here. If we zoom into that cluster of brands, we see they are quite separate from each other:

What we're seeing is that, simply put, compared with Apple (and Huawei and LG) Samsung and Motorola are identical as brands; it's just when you zoom into the area of "major Android brands" that you can see that Samsung and Nokia are quite similar — compared to Motorola. It's all relative, but not in an arbitrary way.

So we've used AI to put in perspective the success of Samsung's branding efforts, or, in a more positive way, to highlight how Apple and a couple of other brands are on their own class, each of them separated as words from the pretty much homogeneous (unless you forget the competitors and zoom into them) central core of Android brands. But can we say something about the semantics of that difference?

Surprisingly, we can! Sort of. The biggest surprise of this algorithm — and it caused quite a stir in the AI community when it was first published — is that there's meaning not just in the distance between points, but also in their specific geometric relationships. The best way to explain it is by showing it:

Each of the words king, queen, man, woman, son, daughter gets its own point, as expected. The fascinating thing is that the arrow between king and queen is almost the same in length and direction as the arrow between man and woman, and they are also almost identical to the arrow between son and daughter! Somehow, the algorithm doesn't just learn a way to represent words as points, but also the abstract relationship female version of, which is "translated" into an arrow of specific angle and length; to know what the female version of son is, you just start from that point, do the same jump as you'd do to go from man to woman... and reach the point for daughter.

This is an incredibly powerful capability, because now we can ask not just which brands are comparatively similar or different, but in what way. We took a list of fifty common adjectives in English as our vocabulary, and asked the data

if king is to queen as man is to woman then
the generic Android brand is to Apple as... ? is to ?

The closest metaphors we got?

the generic Android brand is to Apple as black is to white

the generic Android brand is to Apple as international is to national

the generic Android brand is to Apple as special is to great

This doesn't mean people use the word national a lot when talking about Apple. It's subtler and more powerful: the "national-ness" that is the difference between the words international and national is similar to the difference between the way people use the words for the major Android brands, e.g. Samsung, and Apple. The white cases and Apple being an US company (and, arguably, their being great) are part of the difference in meaning between Apple and Samsung.

The most important thing about the above is how unsurprising it is if you pay attention to the brands; it's part of the discourse, yes, but the algorithm automatically teased those differences out and simplified them to the starkest meaningful metaphor. Apple is the white smartphone — you push billions of tweets in one side, process it carefully enough, and out comes a conceptual observation.

Artificial intelligence: it's not just about numbers anymore (and it never was).

Applying the same analysis to compare LG and Huawei against the rest of the Android brands gives us consistently the terms good (for LG) and strong (for Huawei), perhaps an indication of solid if not brilliant conceptual branding (remember, we're quantifying metaphors, not counting mentions — it's not the words in the advertising copy, but the way people use the brand in their own tweets).

But the important aspect of this technique is that we could just as easily have used a completely different vocabulary to query the relationship between brands — feelings instead of adjectives, or terms related to prices, or whatever vocabulary helps answer the specific question we wanted to ask. Brands, like every other word, are deeply multidimensional, and so are their relationships; rather than attempting to oversimplify them through the narrow lens of a specific survey, pulling vast amounts of actual usage data allows us to look at concrete answers to specific questions about the relationship between brands in all the messy complexity of the real world, yet distill that complexity into conceptually usable semantic relationships.

Forget counting retweets and classifying mentions: we now have the tools to look into the living reality of brands as part of our continuously shifting languages, and to apply quantitative methods to elucidate, and eventually shape, their conceptual and emotional overtones. As happened with metrics-driven, quantitatively optimized advertising, leveraging these tools will require expanding the conceptual and strategic toolsets of organizations in ways that to many will feel too alien to attempt, but which will eventually become part of the basic practices of the industry. Marketing, I believe, will become the richer for that, not just through increased transparency and effectiveness, but also by making possible the development of completely new means to achieve the oldest ends.

After all, what has marketing always been, if not the engineering of hidden metaphors?

# What's the healthiest city in the US? (and what does that even mean?)

(Spoiler alert: it's not Detroit, and it'd be a relatively simple question if it weren't for cancer.)

The human body has multiple ways of not working well; at its broadest, the CDC's 500 Cities project data set lists fourteen different health outcomes, from arthritis, to stroke, to the wonderfully phrased mental health not good. But, just as for an individual having a certain condition raises the probability of others (e.g., high blood pressure and stroke), there's also a high degree of correlation between how prevalent different conditions are in each city.

For example, and unsurprisingly, cities where high blood pressure is more frequent tend to see more strokes:

But we can do more than plot specific pairs of conditions; we can use this data to build a "family tree" of diseases, with closely related diseases tending to show up together in the same city:

Just as high blood pressure and strokes are closely related, so are diabetes and chronic kidney disease, or coronary heart disease and chronic obstructive pulmonary disease. Interestingly, pretty much all conditions are correlated with each other, except cancer. Leaving the prevalence of cancer aside for a minute, the other thirteen conditions in the CDC data set are correlated closely enough that a direct PCA process gives us a main component that explains almost 80% of their variance.

In simpler terms: we can build a single "general health index" number that predicts relatively well the prevalence in a given city of different kinds of health outcomes, from strokes to asthma.

Calculating this number using PCA and rescaling it to have zero mean and unit variance, we get a pleasantly well-distributed index:

This allows us to sort cities in order of "general healthiness":

• 1. San Ramon (CA)
• 2. Sunnyvale (CA)
• 3. Mountain View (CA)
• ...
• 498. Gary (IN)
• 499. Flint (IN)
• 500. Detroit (MI)

No big surprises there. Twelve of the top fifteen healthiest cities in the US, by this definition, are in California — habits, demography, income, everything helps, and the "California health nut" stereotype has, unlike most, data to back it up. The mirror image of this situation are cities like Detroit and Flint; one corollary of the fact that we do have a large body of public health knowledge is that, unlike in other historical periods, differences in population-level health outcomes are a function of economics and politics (in their broadest senses) rather than, or more than, biology.

However, recall that to build our elegantly simple "general health index" we had to put aside the not-so-small matter of cancer. It turns out that cancer plays by very different rules, as it becomes obvious when we plot its (similarly normalized) prevalence against our generic health index:

Cities like San Ramon (CA) and Mountain View (CA) have an average prevalence of cancer with respect to the rest of the country, but higher than much less healthy (in the everything *but* cancer sense) cities like Laredo (TX) and Brownsville (TX). Here youth seems to beat experience, income, and technology: Laredo and Brownsville have median ages of 28.2 and 27.7 years respectively, while San Ramon has a median age of 37.6 years, almost matching the US's overall metric of 37.8 years.

The lack of correlation between the prevalence of cancer and that of most everything else shows not only the profound differences in their physiological mechanisms, but also in the state of our medical technology. We know quite a bit about how to prevent things like diabetes, strokes, high blood pressure, etc. — by and large, they are different expressions of the same set of underlying physiological issues, which is part of the explanation of why they are so closely correlated across cities. Cancer is a different matter. It's less a single disease than a bewildering array of cellular insurrections, one on which we've made astounding strides during the last years, but still comparatively poorly understood.

This is an stark example of the difference between technological possibility and political outcomes: given the state of our technology, Flint and Mountain View should be equally healthy, as they differ in things we know how to improve, and are equal on the condition we are most powerless against.

The state of technology, of course, isn't static: inexcusably late but at last, medical researchers are beginning to approach aging as a root disease, and having practical ways of reversing some forms of basic physiological damage — the common mechanisms behind the "everything-but-cancer syndrome" — will lead to improvements in how we treat and prevent most conditions. And if cancer is the one we know the least about, it's also the one where our improving computational capabilities might help the most. But there's little difference between not having a technology and choosing not to use it; we have reasons to hope for significant improvements in technological possibilities during the next few decades, but a hazier plan for their public health impact.

# The normalcy of online learning: the more you study, the better you do

Online learning, after all, is just a form of learning: time spent studying is one of the best predictors of success.

Both the pattern (and the exceptions) can be seen quite clearly on the Open University Learning Analytics dataset, which collects anonymized data about the personal characteristics and, crucially, interactions with the Open University's Virtual Learning Environment (as counts of clicks by date) of 32,593 students registered in 22 courses; see the linked entry on Nature for a detailed description of the data set. For this quick exploratory analysis I chose to focus on students that either passed or failed their courses, ignoring those who withdrew along the way; the latter is a very frequent outcome in this kind of setting (31% of cases in the data set), but one that merits a separate analysis.

Of those students that completed the course, 68.6% passed it (13.5% of them with Distinction), and 31.4% failed. To what degree was this a matter of sheer effort?

Here the data supports what teachers and parents always say. Only about a third of the students who interacted with the learning platform between 10 and 23 days (the second decile of activity) passed the course, while 94% of those who did it between 120 and 155 days (the ninth decile) did. This is perhaps an obvious effect, but it's noteworthy that even among the highest deciles of activity, more activity leads to a better result: moving from the eight to the ninth decile of activity — from, say, 110 days of activity to 140 — raises the probability of passing the test an extra five percent.

There are things we can say about the probability of somebody passing the course before it begins. Most significantly, the probability of passing the course among students who finish it grows strongly with the already achieved educational level of the student (note that this date refers to the UK educational system).

There's nothing mysterious about the mechanics of it. By and large, better-educated students interact more often with the platform, and the extra days explain much of the variability in outcome.

This is the point where reading the data becomes tricky, and domain experience and a healthy dosis of skepticism become useful. There's both a correlation and a reasonable mechanism of influence between studying more days and getting a better outcome, which — as an hypothesis to guide interventions — suggests we should attempt to get students to interact with the platform more often. But understanding why they already don't do it on their own is critical to understanding what would help, and that's not necessarily obvious from this data. For example, one possibility is that students simply underestimate how many days of study they'll need in order to get a reasonable chance of passing the course; if that's the case, then explicit, dynamic guidance on this could be of use (including something like a regular, model-based Estimated Probability of Passing alert).

On the other hand, the data does suggest that more exogenous constraints probably play a role. To their credit, and that's something that every educational system should attempt to replicate, this Open University data set also includes socio-economic information in the form of the student's approximate Index of Multiple Deprivation, an statistical proxy — based on a ranking comparison between places in England — to issues like crime prevalence, unemployment, education, income, etc, of the place where the student lived during the course.

This index is correlated with the outcome of the course, as would be expected (a higher IMD band indicates a more favorable socio-economic context):

But also with, and arguably through, the number of days students interact with the platform:

So there are factors, which could be cultural but might as well be, and we could easily imagine are, related to constraints in resources, time, energy, support networks, etc., of students living in more deprived areas. If or to the degree to which the latter is the cause, "gamification" features like the one described above would at best be useless and at worst a mockery. The point of data-driven analysis is to be able to determine what's going on, in order to guide our intuition about what could help; this data set suggests possibilities, but that's as far as we can get with it.

Of course, that in this post we're playing at reinventing the wheel — poorly. Education experts are deeply familiar with everything we've discussed so far, from the impact of study time on outcomes to the effect of socioeconomic constraints. The point isn't that we have found anything new, but rather to show how already-known things surface very quickly and obviously whenever data is gathered in a sufficiently comprehensive and open way, and the possibilities for personalized diagnostics and scalable assistance that this might offer as a way of assisting and helping educational systems.

On the topic of things already well-known, we've seen that putting in days of interaction with the platform improves students' chances of passing the course, and that better-educated students have a higher a priori chance of doing it. Is the increased time all of it? In other words, do higher educational achievements, besides being correlated with exogenous and endogenous factors related to being able to study more, also enable students to do it better? Do students with different educational backgrounds get different amounts of value from a day of interacting with the system?

The data set only offers indirect clues to this, but as far as we can see, this is true pretty consistently. For each intensity of interaction with the platform, students with a higher level of education will, generally speaking, do better (click on the graph for a larger version):

We can't distinguish with this data, of course, the details of the mechanics of how this happens; "better study habits" can include anything from a larger store of previous knowledge to draw relationships from, to a better physical environment in which to study. The often large correlations between different factors are part of what makes research in social sciences both difficult and important. But we see there's a difference, which means there's also potential for improved outcomes.

Online learning isn't, in many ways, a radical departure from traditional education: we can see how the traditional issues of socio-economic context, educational history, and effort continue to play the roles they always have. However, the increased legibility of the online process, and the enormous flexibility it offers for interventions and experiments, make it not just a powerful teaching mechanism on its own, but also a tool to help us understand and improve learning in general.

# Hegemonía electoral y outliers estadísticos

Hay elecciones que se ganan cómodamente, otras que se ganan por goleada... y está Santiago del Estero.

El Viernes tuve la suerte de participar en el Datatón Electoral organizado por Antonio Milanese, analizando sets de datos de las elecciones pasadas junto con otros analistas de datos, politólogos, etc. El análisis que probé no respaldó mi hipótesis (es el riesgo de trabajar con datos...), pero dio pie a una observación interesante.

A pesar de la aparente polarización electoral en la Argentina, incluso mesa por mesa los resultados tienden a ser relativamente cercanos. Por ejemplo, en las elecciones para diputados nacionales en el 2017, solo en el 47% de las mesas la opción ganadora en esa mesa sacó más de la mitad de los votos:

La asimetría de esta distribución es lógica (es difícil ser la opción ganadora en una mesa con menos del 40%), pero igualmente la cantidad de mesas en las que la opción ganadora sacó un porcentaje muy alto es en si misma muy alta: en el 1% de las mesas el ganador sacó más del 83% de los votos, algo que en un análisis estadístico superficial no debería pasar casi nunca. Esta es una "anomalía" estadística que refleja un patrón social bastante común. La gente que vota en las mismas mesas tiende a ser más homogénea social y políticamente que la que vota en mesas diferentes, y es natural haya más mesas políticamente homogéneas de lo que sería dado esperar si personas y mesas fuesen asignadas al azar.

Pero por otro lado, si miramos donde están esas mesas inesperadamente homogéneas, surge algo que tiene menos que ver con la sociología abstracta. De las 1004 mesas en las que el ganador sacó más del 83% de los votos...

• 46 están en la Ciudad de Buenos Aires
• 74 están en la Provincia de Buenos Aires
• 87 están en Formosa
• 607 están en Santiago del Estero

A nivel nacional, alrededor de una de cada cien mesas fue por goleada; en Santiago del Estero, más de una de cada tres. Las siguientes son las diez provincias con mayor porcentaje de mesas por goleada (haga click en el gráfico para agrandarlo, pero, como puede imaginar, la barra gigante de la izquierda es Santiago del Estero):

Esta no es una observación sorprendente dada la realidad política de Santiago del Estero o Formosa, pero muestra cómo algunos patrones sociales y políticos locales son visibles incluso en el análisis cuantitativo más superficial.

# Big Data, Endless Wars, and Why Gamification (Often) Fails

Militaries and software companies are currently stuck in something of a rut: billions of dollars are spent on the latest technology, including sophisticated and supposedly game-changing data gathering and analysis, and yet for most victory seems a best to be a matter of luck, and at worst perpetually elusive.

As different as those "industries" are, this common failure has a common root; perhaps unsurprisingly so, given the long and complex history of cultural, financial, and technological relationships between them.

Both military action and gamified software (of whatever kind: games, nudge-rich crowdsourcing software, behaviorally intrusive e-commerce shops, etc) are focused on the same thing: changing somebody else's behavior. It's easy to forget, amid the current explosion — pun not intended — of data-driven technologies, that wars are rarely fought until the enemy stops being able to fight back, but rather until they choose not to, and that all the data and smarts behind a game is pointless unless more players do more of what you want them to do. It doesn't matter how big your military stick is, or how sophisticated your gamified carrot algorithm, that's what they exist for.

History, psychology, and personal experience show that carrots and sticks, alone or in combination, do, work. So why do some wars take forever, and some games or apps whimper and die without getting any traction?

The root cause is that, while carrots and sticks work, different people and groups have different concepts of what counts as one. This is partly a matter of cultural and personal differences, and partly a matter of specific situations: as every teacher knows, a gold star only works for children who care about gold stars, and the threat of being sent to detention only deters those for whom it's not an accepted fact of life, if not a badge of honor. Hence the failure of most online reputational systems, the endemic nature of trolls, the hit-and-miss nature of new games not based on an already successful franchise, or, for that matter, the enormous difficulty even major militaries have stopping insurgencies and other similar actors.

But the root problem behind that root problem isn't a feature in the culture and psychology of adversaries and customers (and it's interesting to note that, artillery aside, the technologies applied on both aren't always different), but in the culture and psychology of civilian and military engineers. The fault, so to speak, is not in our five-stars rating systems, but in ourselves.

How so? As obvious as it is that achieving the goals of gamified software and military interventions requires a deep knowledge of the psychology, culture, and political dynamics of targets and/or customer bases, software engineers, product designers, technology CEOs, soldiers, and military strategists don't receive more than token encouragement to develop a strong foundation in those areas, much less are required to do so. Game designers and intelligence analysts, to mention a couple of exceptions, do, but their advice is often given but a half-hearted ear, and, unless they go solo, they lack any sort of authority. Thus we end, by and large, with large and meticulously planned campaigns — of either sort — that fail spectacularly or slowly fizzle out without achieving their goals, not for failures of execution (those are also endemic, but a different issue) but because the link between execution and the end goal was formulated, often implicitly, by people without much training in or inclination for the relevant disciplines.

There's a mythology behind this: they idea that, given enough accumulation of data and analytical power, human behavior can be predicted and simulated, and hence shaped. This might yet be true — the opposite mythology of some ineffable quality of unpredictability in human behavior is, if anything, even less well-supported by facts — but right now we are far from that point, particularly when it comes to very different societies, complex political situations, or customers already under heavy "attack" by competitors. It's not that people can't be understood, and forms of shaping their behavior designed, it's that this takes knowledge that for now lies in the work and brains of people who specialize in studying individual and collective behavior: political analysts, psychologists, anthropologists, and so on.

They are given roles, write briefs, have fun job titles, and sometimes are even paid attention to. The need for their type of expertise is paid lip service to; I'm not describing explicit doctrine, either in the military or in the civilian world, but rather more insidious implicit attitudes (the same attitudes the drive, in an even more ethically, socially, and pragmatically destructive way, sexism and racism in most societies and organizations).

Women and minorities aside (although there's a fair and not accidental degree of overlap), people with a strong professional formation in the humanities are pretty much the people you're least likely to see — honorable and successful exceptions aside — in a C-level position or having authority over military strategy. It's not just that they don't appear there: they are mostly shunned, and implicitly or explicitly, well, let's go with "underappreciated." Both Silicon Valley and the Pentagon, as well as their overseas equivalents, are seen and see themselves at places explicitly away from that sort of "soft" and "vague" thing. Sufficiently advanced carrots and sticks, goes the implicit tale, can replace political understanding and a grasp of psychological nuance.

Sometimes, sure. Not always. Even the most advanced organizations get stuck in quagmires (Google+, anyone?) when they forget that, absent an overwhelming technological advantage, and sometimes even then (Afghanistan, anyone?) successful strategy begins with a correct grasp of politics and psychology, not the other way around, and that we aren't yet at a point where this can be provided solely by data gathering and analysis.

Can that help? Yes. Is an organization that leverages political analysis, anthropology, and psychology together with data analysis and artificial intelligence like to out-think and out-match most competitors regardless of relative size? Again, yes.

Societies and organizations that reject advanced information technology because it's new have, by and large, been left behind, often irreparably so. Societies and organizations that reject humanities because they are traditional (never mind how much they have advanced) risk suffering the same fate.

# Statistics, Simians, the Scottish, and Sizing up Soothsayers

A predictive model can be a parametrized mathematical formula, or a complex deep learning network, but it can also be a talkative cab driver or a slides-wielding consultant. From a mathematical point of view, they are all trying to do the same thing, to predict what's going to happen, so they can all be evaluated in the same way. Let's look at how to do that by poking a little bit into a soccer betting data set, and evaluating it as if it were an statistical model we just fitted.

The most basic outcome you'll want to predict in soccer is whether a game goes to the home team, the visitors or away team, or is a draw. A predictive model is anything and anybody that's willing to give you a probability distribution over those outcomes. Betting markets, by giving you odds, are implicitly doing that: the higher the odds, the less likely they think is the outcome.

The Football-Data.co.uk data set we'll use contains results and odds from various soccer leagues for more than 37,000 games. We'll use the odds for the Pinnacle platform whenever available (those are closing odds, the last ones available before the game).

For example, for the Juventus-Fiorentina game in August 20, 2016, the odds offered were 1.51 for a Juventus win, 4.15 for a draw (ouch), and 8.61 for a Fiorentina victory (double ouch). Odds of 1.51 for Juventus mean that for each dollar you bet on Juventus, you'd get USD 1.51 if Juventus won (your initial bet included) and nothing if it didn't. These numbers aren't probabilities, but they imply probabilities. If platforms gave odds too high relative to the event's probability they'd go broke, while if they gave odds too low they wouldn't be able to attract bettors. On balance, then, we can read from the odds probabilities slightly lower than the the betting market's best guesses, but, in a world with multiple competing platforms, not really that far from the mark. This sounds like a very indirect justification for using them as a predictive model, but every predictive model, no matter how abstract, has a lot of assumptions; a linear model assumes the relevant phenomenon is linear (almost never true, sometimes true enough), and looking at a betting market as a predictive model assumes the participants know what they are doing, the margins aren't too high, and there isn't anything too shady going on (not always true, sometimes true enough).

We can convert odds to probabilities by asking ourselves: if these odds were absolutely fair, how probable would the event have to be so neither side of the bet can expect to earn anything? (a reasonable definition of "fair" here, with historical links to the earliest developments of the concept of probability). Calling $P$ the probability and $L$ the odds, we can write this condition $PL + (1-P)*0 = 1$. The left side of the equation is how much you get on average — $L$ when, with probability $P$, the event happens, and zero otherwise — and the right side says that on average you should get you dollar back, without winning or losing anything. From there it's obvious that $P = \frac{1}{L}$. For example, the odds above, if absolutely fair (which they never are, not completely, as people in the industry have to eat) would imply a probability for Juventus to win of 66.2%, and for Fiorentina of 11.6% (for the record, Juventus won, 2-1).

In this way we can put information into the betting platform (actually, the participants do), and read out probabilities. That's all we need to use it as a predictive model, and there's in fact a small industry dedicated to building betting markets tailored to predict all sorts of events, like political outcomes; when built with this use in mind, they are called prediction or information markets. The question, as with any model, isn't if it's true or not — unlike statistical models, betting markets don't have any misleading aura of mathematical certainty — but rather how good those probabilities are.

One natural way of answering that question is to compare our model with another one. Is this fancy machine learning model better than the spreadsheet we already use? Is this consultant better than this other consultant? Is this cab driver better at predicting games than that analyst on TV? Language gets very confusing very quickly, so mathematical notation becomes necessary here. Using the standard notation $P[x | y]$ for how likely do I think is that x will happen if y is true?, we can compare the cab driver and the TV analyst by calculating

If that ratio is higher than one, this means of course that the cab driver is better at predicting games than the TV analyst, as she gave higher probabilities to the things that actually happened, and vice versa. This ratio is called the Bayes factor.

In our case, the factors are easy to calculate, as $P[\textrm{home win} | \textrm{odds are good predictors}]$ is just $\textrm{probability of a home win as implied by the odds}$, which we already know how to calculate. And because the probabilities of independent events are the product of the individual probabilities, then

In reality, those events aren't independent, but we're assuming participants in the betting market take into account information from previous games, which is part of what "knowing what you're talking about" intuitively means.

Note how we aren't calculating how likely a model is, just which one of one two models has more support from the data we're seeing. To calculate the former value we'd need more information (e.g., how much you believed the model was right before looking at the data). This is a very useful analysis, particularly when it comes to making decisions, but often the first question is a comparative one.

Using our data set, we'll compare the betting market as a predictive model against a bunch of dart-throwing chimps as a predictive model (dart-throwing chimps are a traditional device in financial analysis). The chimps throw darts against a wall covered with little Hs, Ds, and As, so they always predict each event has a probability of $\frac{1}{3}$. Running the numbers, we get

This is (much) larger than one, so the evidence in the data favors the betting market over the chimps (very; see the link above for a couple of rules of thumb about interpreting those numbers). That's good, and not something to be taken for granted: many stock traders underperform chimps. Note that if one model is better than another, the Bayes factor comparing them will keep growing as you collect more observations and therefore become more certain of it. If you make the above calculation with a smaller data set, the resulting Bayes factor will be lower.

Are odds also better in this sense than just using a rule of thumb about how frequent each event is? In this data set, the home team wins about 44.3% of the time, and the visitors 29%, so we'll assign those outcome probabilities to every match.

That's again overwhelming evidence in favor of the betting market, as expected.

We have statistics, soothsayers, and simians (chimpanzees aren't simians, but I couldn't resist the alliteration). What about the Scottish?

Lets look at how better than chimps are the odds for different countries and leagues or divisions (you could say that the chimps are our null hypothesis, but the concept of null hypothesis is at best a confusing and at worst a dangerous one: quoting the Zen of Python, explicit is better than implicit). The calculations will be the same, applied to subsets of the data corresponding to each division. A difference is that we're going to show the logarithm of the Bayes factor comparing the model implied by the odds and the model from the dart-throwing chimps (otherwise numbers become impractically large), and this divided by the number of game results we have for each division. Why that division? As we said above, if one model is better than another, the more observations you accumulate, the higher the amount of evidence for one over the other you're going to get. It's not that the first model is getting better over time, it's just that you're getting more evidence that it's better. In other words, if model A is slightly better than model B but you have a lot of data, and model C is much better than model D but you only have a bit of data, then the Bayes factor between A and B can be much larger than the one between C and D: the size of an effect isn't the same thing as your certainty about it.

By dividing the (logarithm of) the Bayes factor by the number of games, we're trying to get a rough idea of how good the odds are, as models, comparing different divisions with each other. This is something of a cheat — they aren't models of the same thing! — but by asking of each model how quickly they build evidence that they are better than our chimps, we get a sense of their comparative power (there are other, more mathematically principled ways of doing this, and to a degree the method you choose has to depend on your own criteria of usefulness, which depends on what you'll use the model for, but this will suffice here).

I'm following here the naming convention for divisions used in the data set: E0 is the English Premier League, E1 is their Championship, etc (the larger the number, the "lower" the league), and the country prefixes are: E for England, SC for Scotland, D for Germany, I for Italy, SP for Spain, F for France, N for the Netherlands, B for Belgium, P for Portugal, T for Turkey, and G for Greece. There's quite a bit of heterogeneity inside each country, but with clear patterns. To make them clearer, let's sort the graph by value instead of division, and keep only the lowest and highest five:

The betting odds generate better models for the top leagues of Greece, Portugal, Spain, Italy, and England, and worse ones for the lower leagues, with the very worst modeled one being SC3 (properly speaking, the Scottish League Two – there are the Scottish). This makes sense: the larger leagues have a lot of bettors who want in, many of them professionals, so the odds are going to be more informative.

To go back to the beginning: everything that gives you probabilities about the future is a predictive model. Just because one is a betting market and the other is a chimpanzee, or one is a consultant and the other one is a regression model, it doesn't mean they can't and shouldn't be compared to each other in a meaningful way. That's why it's so critical to save the guesses and predictions of every software model and every "human predictor" you work with. It lets you go back over time and ask the first and most basic question in predictive data science:

How much better is this program or this guy than a chimp throwing darts?

When you think about it, is that really a question you would want to leave unanswered about anything or anybody you work with?

# Why the most influential business AIs will look like spellcheckers (and a toy example of how to build one)

Forget voice-controlled assistants. at work, AIs will turn everybody into functional cyborgs through squishy red lines under everything you type. Let's look at a toy example I just built (mostly to play with deep learning along the way).

I chose as a data set Patrick Martinchek's collection of Facebook posts from news organizations. It's a very useful resource, covering more that a dozen organizations and with interesting metadata for each post, but for this toy model I focused exclusively on the headlines of CNN's posts. Let's say you're a journalist/editor/social network specialist working for CNN, and part of your job is to write good headlines. In this context, a good headline could be defined as one having a lot of shares. How would you use an AI to help you with that?

The first step is simply to teach the AI about good and bad headlines. Patrick's data set included 28,300 posts with both the headline and the count of shares (there were some parsing errors for which I chose just to ignore the data; in a production project the number of posts would've been larger). As what counts as a good headline depends on the organization, I defined a good headline as one that got a number of shares in the top 5% for the data set. This simplifies the task from predicting a number (how many shares) to a much simpler classification problem (good vs bad headline)

The script I used to train the network to perform this classification was Denny Britz' classic Implementing a CNN for text classification in TensorFlow example. It's an introductory model, not meant to have production-level performance (also, it was posted on December 2015, and sixteen months in this field is a very long time), but the code is elegant, well-documented, and easy to understand and modify, so it was the obvious choice for this project. The only changes I made were adapting it to train the network without having to load all of the data in memory at the same time and replacing the parser with one of NLTK's.

After an hour of training on my laptop, testing the model against out-of-sample data gives an accuracy of 93% and a precision for the class of good headlines of 9%. The latter is the metric I cared about for this model: it means that 9% of the headlines the model marks as good are, in fact, good. That's about 80% better than random chance, which is... well, it's not that impressive. But that's after an hour of training with a tutorial example, and rather better than what you'd get from that data set using most other modeling approaches.

In any case, the point of the exercise wasn't to get awesome numbers, but to be able to do the next step, which is where this kind of model moves from a tool used by CNN's data scientists into one that turns writers into cyborgs.

Reaching again into NLTK's impressive bag of tricks, I used its part-of-speech tagger to identify the nouns in every bad headline, and then a combination of WordNet's tools for finding synonyms and the pluralizer in CLiPS' Pattern Python module to generate a number of variants for each headline, creating new variations using simple rewrites of the original one.

So for What people across the globe think of Donald Trump, the program suggested What people across the Earth think of Donald Trump and What people across the world think of Donald Trump. What's more, while the original headline was "bad," the model predicts that the last variation will be good. With a 9% precision for the class, it's not a sure thing, but it's almost twice the a priori probability of the original, which isn't something to sneeze at.

In another case, the program took Dog sacrifices life to save infant in fire, and suggested Dog sacrifices life to save baby in fire. The point of the model is to improve on intuition, and I don't have the experience of whoever writes CNN's post headlines, but that does look like it'd work better.

Where things go from a tool for data analysts to something that changes how almost everybody works is that nothing prevents a trained model from working in the background, constantly checking what you're writing — for example, the headline for your post — and suggesting alternatives. To grasp the true power a tool like this could have, don't imagine a web application that suggests changes to your headline, or even as a tool in your CMS or text editor, but something more like your spellchecker. For example, the "headline" field in your web app will have attached a model trained from the specific data from your organization (and/or from open data sets), which will underline it in red if it predicts it won't work well. Right-click on the text, and it'll show you some alternatives.

Or if the response to a customer you're typing might make them angry.

Or if the presentation you're building has the sort of look that works well on SlideShare.

Or if the code you're writing is similar to the kind of code that breaks your application's test suite.

Or if there's something fishy in the spreadsheet you're looking at.

Or... You get the idea. Whenever you have a classification model and a way to generate alternatives, you have a tool that can help knowledge workers do they work better, a tool that gets better over time — not just learning from its experience, as humans do, but from the collective experience of the entire organization — and no reason not to use it.

"Artificial intelligence," or whatever label you want to apply to the current crop of technologies, is something that can, does, and will work invisibly as part of our infrastructure, and it's also at the core of dedicated data analysis, but it'll also change the way everybody works by having domain-specific models look in real time at everything you're seeing and doing, and making suggestions and comments. Microsoft's Clippy might have been the most universally reviled digital character before Jar Jar Binks, but we've come to depend on unobtrusive but superhuman spellcheckers, GPS guides, etc. Even now image editors work in this way, applying lots of domain-specific smarts to assist and subtly guide your work. As building models for human or superhuman performance on very specific tasks becomes accessible to every organization, the same will apply to almost every task.

It's already beginning to. We don't have, yet, the Microsoft Office of domain-specific AIs, and I'm not sure how that would look like, but, unavoidably, the fact that we can teach programs to perform better than humans in a list of "real-world" tasks that grows almost every week means that organizations that routinely do so — companies that don't wait for fully artificial employees, but that also don't neglect to enhance their employees with every better-than-human narrow AI they can build right now — have an increasing advantage over those that don't. The interfaces are still clumsy, there's no explicit business function or fancy LinkedIn position for it, and most workers, including ironically enough knowledge workers and people with leadership and strategic roles, still have to be convinced that cyborgization, ego issues aside, is a better career choice than eventual obsolescence, but the same barriers applied when business software first became available, yet the crushing economic and business advantages made them irrelevant in a very short amount of time.

The bottom line: Even if you won't be replaced by an artificial intelligence, there will be many specific aspects of your work they will be or are already able to do better than you, and if you can't or won't work with them as part of your daily routine, there's somebody who will. Knowing how to train and team up with software in an effective way will be one of the key work skills of the near future, and whether explicit or not, the "AI Resources Department" — a business function focused on constantly building, deploying, and improving programs with business-specific knowledge and skills — will be at the center of any organization's efforts to become and remain competitive.

# How to be data-driven without data...

...and then make better use of the data you get.

The usefulness of data science begins long before you collect the first data point. It can be used to describe very clearly your questions and your assumptions, and to analyze in a consistent manner what they imply. This is neither a simple exercise nor an academic one: informal approaches are notoriously bad at handling the interplay of complex probabilities, yet even the a priori knowledge embedded in personal experience and publicly available research, when properly organized and queried, can answer many questions that mass quantities of data, processed carelessly, wouldn't be able to, as well as suggest what measurements should be attempted first, and what for.

The larger the gap between the complexity of a system and the existing data capture and analysis infrastructure, the more important it is to set up initial data-free (which doesn't mean knowledge-free) formal models as a temporary bridge between both. Toy models are a good way to begin this approach; as the British statistician George E.P. Box wrote, all models are wrong, but some are useful (at least for a while, we might add, but that's as much as we can ask of any tool).

Let's say you're evaluating an idea for a new network-like service for specialized peer-to-peer consulting that will have the possibility of monetizing a certain percentage of the interactions between users. You will, of course, capture all of the relevant information once the network is running — and there's no substitute for real data — but that doesn't mean you have to wait until then to start thinking about it as a data scientist, which in this context means probabilistically.

Note that the following numbers are wrong: it takes research, experience, and time to figure out useful guesses. What matters for the purposes of this post is describing the process, oversimplified as it will be.

You don't know a priori how large the network will be after, say, one year, but you can look at other competitors, the size of the relevant market, and so on, and guess, not a number ("our network in one year will have a hundred thousand users"), but the relative likelihood of different values.

The graph above shows one possible set of guesses. Instead of giving a single number, it "says" that there's a 50% chance that the network will have at least a hundred thousand users, and a 5.4% chance that it'll have at least half a million (although note that decimals points in this context are rather pointless; a guess based on experience and research can be extremely useful, but will rarely be this precise). On the other hand, there's almost a 25% chance that the network will have less than fifty thousand users, and a 10% chance that it'll have less than twenty-eight thousand.

You can use the same process to codify your educated guesses about other key aspects of the application, like the rate at which members of the network will interact, and the average revenue you'll be able to get from each interaction. As always, neither these numbers nor the specific shape of the curves matter for this toy example, but note how different degrees and forms of uncertainty are represented through different types of probability distributions:

Clearly, in this toy model we're sure about some things like the interaction rate (measured, say, in interactions per month), and very unsure about others, like the average revenue per interaction. Thinking about the implications of multiple uncertainties is one of the toughest cognitive challenges, as humans tend to conceptualize specific concrete scenarios: we think in terms of one or at best a couple of states of the world we expect to happen, but when there are multiple interacting variables, even the most likely scenario might have a very low absolute probability.

Simulation software, though, makes this nearly trivial even for the most complex models. Here's, for example, the distribution of probabilities for the monthly revenue, as necessarily implied by our assumptions about the other variables:

There are scenarios where your revenue is more than USD 10M per month, and you're of course free to choose the other variables so this is one of the handful of specific scenarios you describe (perhaps the most common and powerful of the ways in which people pitching a product or idea exploit the biases and limitations in human cognition). But doing this sort of quantitative analysis forces you to be honest at least to yourself: if what you know and don't know is described by the distributions above, then you aren't free to tell yourself that your chance of hitting it big is other than microscopic, no matter how clear the image might be in your mind.

That said, not getting USD 10M a month doesn't mean the idea is worthless; maybe you can break even and then use that time to pivot or sell it, or you just want to create something that works and is useful, and then grow it over time. Either way, let's assume your total costs are expected to be USD 200k per month (if this were a proper analysis and not a toy example, this wouldn't be an specific guess, but another probability distribution based on educated guesses, expert opinions, market surveys, etc). How do probabilities look then?

You can answer this question using the same sort of analysis:

The inescapable consequence of your assumptions is that your chances of breaking even are 1 in 20. Can they be improved? One advantage of fully explicit models is that you can ask not just for the probability of something happening, but also about how things depend on each other.

Here are the relationships between the revenue, according to the model, and each of the main variables, with a linear best fit approximation superimposed:

As you can see, network size has the clearest relationship with revenue. This might look strange – wouldn't, under this kind of simple model, multiplying by ten the number of interactions keeping the monetization rate also multiply by ten the revenue? Yes, but your assumptions say you can't multiply the number of interactions by more than a factor of five, which, together with your other assumptions, isn't enough to move your revenue very far. So it isn't that it's unreasonable to consider the option of increasing interactions significantly, to improve your chances of breaking even (or even getting to USD 10M). But if you plan to increase outside the explicit range encoded your assumptions, you have to explain why they were wrong. Always be careful when you do this: changing your assumptions to make possible something that would be useful if it were possible is one of humankind's favorite ways of driving directly into blind alleys at high speed.

It's key to understand that none of this is really a prediction about the future. Statistical analysis doesn't really deal with predicting the future or even getting information about the present: it's all about clarifying the implications of your observations and assumptions. It's your job to make those observations and assumptions as good and releevant as possible, both not leaving out anything you know, and not pretending you know what you don't, or that your are more certain about something that you should be.

This problem is somewhat mitigated for domains where we have vast amounts of information, including, recently, areas like computer vision and robotics. But we have yet to achieve the same level of data collection in other key areas like business strategy, so there's no way of avoiding using expert knowledge... which doesn't mean, as we saw, that we have to ditch quantitative methods.

Ultimately, successful organizations do the entire spectrum of analysis activities: they build high-level explicit models, encode expert knowledge, collect as much high-quality data as possible, train machine learning models based on that, and exploit all of that for strategic analysis, automatization, predictive modeling, etc. There are no silver bullets, but you probably have more ammunition than you think.

# When the world is the ad

Data-driven algorithms are effective not because of what they know, but as a function of what they don't. From a mathematical point of view, Internet advertising isn't about putting ads on pages or crafting seemingly neutral content. There's just the input — some change to the world you pay somebody or something to make — and the output — a change in somebody's likelihood of purchasing a given product or voting for somebody. The concept of multitouch attribution, the attempt to understand how multiple contacts with different ads influenced some action, is a step in the right direction, but it's still driven by a cosmology that sees ads as little gems of influence embedded in a larger universe that you can't change.

That's no longer true. The Internet isn't primarily a medium in the sense of something that is between. It's a medium in that we live inside it. It's the atmosphere through which the sound waves of information, feelings, and money flow. It's the spacetime through which the gravity waves from some piece of code shifting from data center to data center according to some post-geographical search of efficiency reach your car to suggest a route. And, on the opposite direction, it's how physical measurements of your location, activities — even physiological state — are captured, shared, and reused in ways that are increasingly more difficult to know about, and much less to be aware of during our daily life. Transparency of action often equals, and is used to achieve, opacity to oversight.

Everything we experience impacts our behavior, and each day more of what we experience is controlled, optimized, configured, personalized — pick your word — by companies desperately looking for a business model or methodically searching for their next billion dollars or ten.

Consider as a harbinger of the future that most traditional of companies, Facebook, a space so embedded in our culture that people older than credit cards (1950, Diners) use it without wonder. Among the constant experimentation with the willingly shared content of our lives that is the company, they ran an experiment attempting to deliberately influence the mood of their users by changing the order of what they read. The ethics of that experiment are important to discuss now and irrelevant to what will happen next, because the business implications are too obvious not to be exploited: some products and services are acquired preferentially by people in a certain mood, and it might be easier to change the mood of an already promising or tested customer than to find another new one.

If nostalgia makes you buy music, why wait until you feel nostalgic to show you an ad, when I can make sure you encounter mentions of places and activities from your childhood? A weapons company (or a law-and-order political candidate) will pay to place their ad next to a crime story, but if they pay more they can also make sure the articles you read before that, just their titles as you scroll down, are also scary ones, regardless of topic. Scary, that is, specifically for you. And knowledge can work just as well, and just as subtly: tracking everything you read, and adapting the text here and there, seemingly separate sources of information will give you "A" and "B," close enough for you to remember them when a third one offers to sell you "C." It's not a new trick, but with ubiquitous transparent personalization and a pervasive infrastructure allowing companies to bid for the right to change pretty much all you read and see, it will be even more effective.

It won't be (just) ads, and it won't be (just) content marketing. The main business model of the consumer-facing internet is to change what they consume, and when it comes down to what can and will be leveraged to do it, the answer is of course all of it.

Along the way, advertising will once again drag into widespread commercial application, as well as public awareness, areas of mathematics and technology currently used in more specialized areas. Advertisers mostly see us — because their data systems have been built to see us — as black boxes with tagged attributes (age, searches, location). Collect enough black boxes and enough attributes, and blind machine learning can find a lot of patterns. What they have barely begun to do is to open up those black boxes to model the underlying process, the illogical logic by which we process our social and physical environment so we can figure out what to do, where to go, what to buy. Complete understanding is something best left to lovers and mystics, but every qualitative change in our scalable, algorithmic understanding of human behavior under complex patterns of stimuli will be worth billions in the next iteration of this arms race.

Business practices will change as well, if only as a deepening of current tendencies. Where advertisers now bid for space on a page or a video slot, they will be bidding for the reader-specific emotional resonance of an article somebody just clicked on, the presence of a given item in a background picture, or the location and value of an item in an Augmented Reality game ("how much to put a difficult-to-catch Pokemon just next to my Starbucks for this person, whom I know has been out in this cold day enough for me to believe it'd like a hot beverage?"). Everything that's controlled by software can be bid upon by other software for a third party's commercial purposes. Not much isn't, and very little won't be.

The cumulative logic of technological development, one in which printed flyers co-exist with personalized online ads, promises the survival of what we might call by then overt algorithmic advertising. It won't be a world with no ads, but one in which a lot of what you perceive is tweaked and optimized so it's collective effect, whether perceived or not, is intended to work as one.

We can hypothesize a subliminally but significantly more coherent phenomenological experience of the world — our cities, friendships, jobs, art — a more encompassing and dynamic version of the "opinion bubbles" social networks often build (in their defense, only magnifying algorithmically the bubbles we had already built with our own choices of friends and activities). On the other hand, happy people aren't always the best customers, so transforming the world into a subliminal marketing platform might end up not being very pleasant, even before considering the impact on our societies of leveraging this kind of ubiquitous, personalized, largely subliminal button-pushing for political purposes.

In any case, it's a race in and for the background, and once that already started.

# (Over)Simplifying Calgary too

One of the good side effects of scripting multi-stage pipelines to build a visualization like my over-simplified map of Buenos Aires is that to process a data source in a completely different format only requires you to write a pre-processing script — everything else remains the same.

While I had used CSV data for the Buenos Aires map, I got KML files for the equivalent land use data for the City of Calgary. The pipeline I had written expected use types tied to single points mapped into a fixed grid, so I wrote a small Python script to extract the polygons defined in the KML file, overlay a grid over them, and assign to each grid point the land use value of the polygon that contained id.

After that the analysis was straightforward. Here's the detailed map of land uses (with less resolution than the original data, as the polygons have been projected on the point grid):

Here's the smoothed-out map:

This is how we split it into a puzzle of more-or-less single-use sectors:

And here's how it looks when you forget the geometry and only care about labels and relative (click to read the labels):

Unlike Buenos Aires, I've never been to Calgary, but a quick look at online maps seem to support the above as a first approximation to the city geography. I'd love to hear how from somebody who actually knows the city whether and how it matches their subjective map of the city.

# (Over)Simplifying Buenos Aires

This is a very rough sketch of the city of Buenos Aires:

As the sketch shows, it's a big blob of homes (VIVIENDAs), with an office-ridden downtown to the East (OFICINAS) and a handful of satellite areas.

The sketch, of course, lies. Here's a map that's slightly less of a lie:

Both maps are based on the 2011 land usage survey made available by the Open Data initiative of the Buenos Aires city government, more than 555k records assigning each spot to one of about 85 different use regimes. It's still a gross approximation — you could spend a lifetime mapping Buenos Aires, rewrite Ulysses for a porteño Leopold Bloom, and still not really know it — but already one so complex that I didn't add the color key to the map. I doubt anybody will want to track the distribution of points for each of the 85 colors.

Ridiculous as it sounds at first, I'd suggest we are using too much of the second type of graph, and not enough of the first. It's already a commonplace that data visualizations shouldn't be too complex, but I suspect we are overestimating what people wants from a first look at a data set. Sometimes "big blob of homes with a smaller downtown blob due East" is exactly the level of detail somebody needs — the actual shape of the blobs being irrelevant.

The first graph, needless to say, was created programmatically from the same data set from which I graphed the second. It's not a difficult process, and the intermediate steps are useful on their own.

Beginning with the original graph above, you apply something like an smoothing brush to the data points (or a kernel, if you want to sound more mathematical); essentially, you replace the land use tag associated to each point with the majority of the uses in its immediate area, smoothing away the minor exceptions. As you'd expect, it's not that there aren't any businesses in Buenos Aires, it's just that, plot by plot, there are more homes, and when you smooth everything out, it looks more like a blob of homes. This leads to an already much simplified map:

Now, one interesting thing about most peoples' sense of space is that it's more topological than metrical, that is, we are generally better at knowing what's next to what than their absolute sizes and positions. Data visualizations should go with the grain of human perceptual and cognitive instincts instead of against them, so one fun next step is to separate the blobs — contiguous blocks of points of the same (smoothed out) land use type — from each other, and show explicitly what's next to what. It looks like this:

Nodes are scaled non-linearly, and we've filtered out the smaller ones, but we've already done programmatically something that we usually leave to the human looking at a map. We've done a napkin sketch of the city, much as somebody would draw North America as a set of rectangles with the right shared frontiers, but not necessarily much precision in the details. It wouldn't do for a geographical survey, but if you were an extraterrestrial planning to invade Canada, it would provide a solid first understanding of the strategic relevance of Mexico to your plans. From that last map to the first one, it's only a matter of remembering that you don't really care, at this stage, about the exact shape of each blob, just where they stand in relationship to each other. So you replace the blogs with the appropriate land use label, and keep the edges between them. And presto, you have a napkin map.

Yes, on the whole the example is rather pointless. Cities are actually the most over-mapped territories on the planet, at both the formal and informal level. Manhattan is an island, Vatican City is inside Rome, the Thames goes through London... In fact, the London Tube Map has become a cliche example about how to display information about a city in terms of connections instead of physical distance. Not to mention that a simplification process that leaves most of the city as a big blob of homes is certainly ignoring more information that you can afford to, even in an sketch.

Not that we usually do this kind of sketching, at least in our formal work with data. We are almost always cartographers when it comes to new data sets, whether geographical, spatial in a general sense, or just mathematically space-like. We change resolution, simplify colors, resist the temptation of over-using 3D, but keep it a "proper" map. Which is good; the world is complex enough for us not to do the best mapping we can.

However, once you automate the process of creating multiple levels of simplification and sketching as above, you'll probably find yourself at least glancing at the simplest (over)simplifications of your data sets. Probably not for presentations to internal or external clients, but for understanding a complex spatial data set, particularly if it's high-dimensional, beginning with an over-simplified summary and then increasing the complexity is in fact what you're already going to do in your own mind, so why not use the computer to help you out?

ETA: I just posted a similar map of Calgary.

# The job of the future isn't creating artificial intelligences, but keeping them sane

Once upon a time, we thought there was such a thing as bug-free programming. Some organizations still do — and woe betide their customers — but after a few decades hitting that particular wall, the profession has by and large accepted that writing software is such an extremely complex intellectual endeavor that errors and unfounded assumptions are unavoidable. Even the most mathematically solid of formal methods has, if nothing else, to interact with a world of unstable platforms and unreliable humans, and what worked today will fail tomorrow.

So we spend time and resources maintaining what we already "finished," fixing bugs as they are found, and adapting programs to new realities as they develop. We have to, because when we don't, as when physical infrastructure isn't maintained, we save resources in the short term, but only in our way towards protracted ruin.

It's no surprise that this also happens with our most sofisticated data-driven algorithms. CVs and scrum boards are filled with references to the maintenance of this or that prediction or optimization algorithm.

But there's a subtle, not universal but still very prevalent, problem: those aren't software bugs. This isn't to say that implementations don't have bugs; being software, they do. But they are computer programs implementing inference algorithms, which work at a higher level of abstraction, and those have their own kinds of bugs, and those don't leave stack traces behind.

A clear example is the experience of Google. PageRank was, without a doubt, among the most influential algorithms in the history of the internet, not to mention the most profitable, but as Google took the internet over by storm, gaming PageRank became such an important business activity that "SEO" became a commonplace word.

From an algorithmic point of view this simply a maintenance problem: PageRank assumed a certain relationship between link structure and relevance, based on the assumption that website creators weren't trying to fool it. Once this assumption became untenable, the algorithm had to be modified to cope with a world of link farms and text written with no human reader in mind.

In (very loosely equivalent) software terms, there was a new threat model, so Google had to figure out and apply a security patch. This is, for any organization facing a simular issue, a continual business-critical process, and one that could make or break a company's profitability (just ask anybody working on high-frequency trading). But not all companies deploy the same sort of detailed, continuous instrumentalization, and development and testing methodologies that they use to monitor and fix their software systems to their data driven algorithms independently of their implementations. The same data scientist who developed an algorithm is often in charge of monitoring its performance on a more or less regular basis; or, even worse, it's only a hit to business metrics what makes companies reassingn their scarce human resources towards figuring out what's going wrong. Either monitoring and maintenance strategy would amount to criminal malpractice if we were talking about software, yet there are companies for which is this is the norm.

Even more prevalent is the lack of automatic instrumentalization for algorithms mirroring that for servers. Any organization with a nontrivial infrastructure is well aware of, and has analysis tools and alarms for, things like server load or application errors. There are equivalent concepts for data-driven algorithms — quantitative statistical assumptions, wildly erroneous predictions — that should, also, be monitored in real time, and not collected (when the data is there) by a data scientist only after the situation has become bad enough to be noticed.

None of this is news to anybody working with big data, particularly in large organizations centered around this technology, but we have still to settle on a common set of technologies and practices, and even just on an universal agreement on its need.

These days nobody would dare deploy a web application trusting only server logs at the operating system level. Applications have their own semantics, after all, and everything in the operating system working perfectly is no guarantee that the app is working at all.

Large-scale prediction and optimization algorithms are just the same; they are often an abstraction running over the application software that implements them. They can be failing wildly, statistical assumptions unmet and parameters converging to implausible values, with nothing in the application layer logging even a warning of any kind.

Most users forgive a software bug much more easily than unintelligent behavior in avowedly intelligent software. As a culture, we're getting used to the fact that software fails, but many still buy the premise that artificial intelligence doesn't (this is contradictory, but so are all myths). Catching these errors as early as possible can only be done while algorithms are running in the real world, where the weird edge cases and the malicious users are, and this requires metrics, logs, and alarms that speak of what's going on in the world of mathematics, not software.

We haven't converged yet on a standard set of tools and practices for this, but I know many people who'll sleep easier once we have.

# The future of machine learning lies in its (human) past

Superficially different in goals and approach, two recent algorithmic advances, Bayesian Program Learning and Galileo, are examples of one of the most interesting and powerful new trends in data analysis. It also happens to be the oldest one.

Bayesian Program Learning (BPL) is deservedly one of the most discussed modeling strategies of recent times, matching or outperforming both humans and deep learning models in one-shot handwritten character classification. Unlike many recent competitors, it's not a deep learning architecture. Rather (and very roughly) it understands handwritten characters as the output of stochastic programs that join together different graphical parts or concepts to generate versions of each character, and seeks to synthesize them by searching through the space of possible programs.

Galileo is, at first blush, a different beast. It's a system designed to extract physical information about the objects in an image or video (e.g., their movements), coupling a deep layer module with a 3D physics engine which acts as a generative model.

Although their domains and inferential algorithms are dissimilar, the common trait I want to emphasize is that they both have at their core domain-specific generative models that encode sophisticated a priori knowledge about the world. The BPL example knows implicitly, through the syntax and semantics of the language of its programs, that handwritten characters are drawn using one or more continuous strokes, often joined; an standard deep learning engine, beginning from scratch, would have to learn this. And Galileo leverages a proper, if simplified, 3D physics engine! It's not surprising that, together with superb design and engineering, these models show the performance they do.

This is how all cognitive processing tends to work in the wider world. We are fascinated, and of course how could we not be?, by how much our algorithms can learn from just raw data. To be able to obtain practical results in multiple domains is impressive, and adds to the (recent, and, like all such things, ephemeral) mystique of the data science industry. But the fact is that no successful cognitive entity starts from scratch: there is a lot about the world that's encoded in our physiology (we don't need to learn to pump our blood faster when we are scared; to say that evolution is a highly efficient massively parallel genetic algorithm is a bit of a joke, but also true, and what it has learned is encoded in whatever is alive, or it wouldn't be).

Going to the other end of the abstraction scale, for all of the fantastically powerful large-scale data analysis tools physicists use and in many cases depend on, the way even basic observations are understood is based on centuries of accumulated (or rather constantly refined) prior knowledge, encoded in specific notations, theories, and even theories about how theories can look like. Unlike most, although not all, industrial applications, data analysis in science isn't a replacement of explicitly codified abstract knowledge, but rather stands on its gigantic shoulders.

In parallel to continuous improvement in hardware, software engineering, and algorithms, we are going to see more and more often the deployment of prior domain knowledge as part of data science implementations. The logic is almost trivial: we have so much knowledge accumulated about so many things, that any implementation that doesn't leverage whatever is known in its domain is just not going to be competitive.

Just to be clear, this isn't a new thing, or a conceptual breakthrough. If anything, it predates the take the data and model it approach that's most popularly seen as "data science," and almost every practitioner, many of them coming from backgrounds in scientific research, is aware of it. It's simply that now our data analysis tools have become flexible and powerful for us to apply it with increasingly powerful results.

The difference in performance when this can be done, as I've seen in my own projects and is obvious in work like BPL and Galileo, has always been so decisive that doing things in any other way soon becomes indefensible except on grounds of expediency (unless of course you're working in a domain that lacks any meaningful theoretical knowledge... a possibility that usually leads to interesting conversations with the domain experts).

The cost is that it does shift significantly the way in which data scientists have to work. There are already plenty of challenges in dealing with the noise and complexities of raw data, before you start considering the ambiguities and difficulties of encoding and leveraging sometimes badly misspecified abstract theories. Teams become heterogeneous at a deeper level, with domain experts — many of them with no experience in this kind of task — not only validating the results and providing feedback, but participating actively as sources of knowledge from day one. Projects take longer. Theoretical assumptions in the domain become explicit, and therefore design discussions take much longer.

And so on and so forth.

That said, the results are very worth it. If data science is about leveraging the scientific method for data-driven decision-making, it behooves us to always remember that step zero of the scientific method is to get up to date, with some skepticism but with no less dedication, on everything your predecessors figured out.

# What places do we know the most about?

Datasthesia just posted online photos of a 3D printed map of geotagged Wikipedia articles, based on map from the Oxford Internet Institute.

# The Telemarketer Singularity

The future isn't a robot boot stamping on a human face forever. It's a world where everything you see has a little telemarketer inside them, one that knows everything about you and never, ever, stops selling things to you.

In all fairness, this might be an slight oversimplification. Besides telemarketers, objects will also be possessed by shop attendants, customer support representatives, and conmen.

What these much-maligned but ubiquitous occupations (and I'm not talking here about their personal qualities or motivations; by and large, they are among the worst exploited and personally blameless workers in the service economy) have in common is that they operate under strict and explicitly codified guidelines that simulate social interaction in order to optimize a business metric.

When a telemarketer and a prospect are talking, of course, both parties are human. But the prospect is, however unconsciously, guided by a certain set of rules about how conversations develop. For example, if somebody offers you something and you say no, thanks, the expected response is for that party to continue the conversation under the assumption that you don't want it, and perhaps try to change your mind, but not to say ok, I'll add it to your order and we can take it out later. The syntax of each expression is correct, but the grammar of the conversation as a whole is broken, always in ways specifically designed to manipulate the prospect's decision-making process. Every time you have found yourself talking on the phone with a telemarketer, or interacting with a salesperson, far longer than you wanted to, this was because you grew up with certain unconscious rules about the patterns in which conversations can end — and until they make the sell, they will neither initiate nor acknowledge any of them. The power isn't in their sales pitch, but in the way they are taking advantage of your social operating system, and the fact that they are working with a much more flexible one.

Some people, generally described by the not always precise term sociopath, are just naturally able to ignore, simulate, or subvert these underlying social rules. Others, non-sociopathic professional conmen, have trained themselves to be able to do this, to speak and behave in ways that bypass or break our common expectations about what words and actions mean.

And then there are telemarketers, who these days work with statistically optimized scripts that tell them what to say in each possible context during a sales conversation, always tailored according to extensive databases of personal information. They don't need to train themselves beyond being able to convey the right emotional tone with their voices: they are, functionally, the voice interface of a program that encodes the actual sales process, and that, logically, has no need to conform to any societal expectation of human interaction.

It's tempting to call telemarketers and their more modern cousins, the computer-assisted (or rather computer-guided) sales assistants, the first deliberately engineered cybernetic sociopaths, but this would miss the point that what matters, what we are interacting with, isn't a sales person, but the scripts behind them. The person is just the interface, selected and trained to maximize the chances that we will want to follow the conversational patterns that will make us vulnerable to the program behind.

Philosophers have long toyed with a mental experiment called the Chinese Room: There is a person inside a room who doesn't know Mandarin, but has a huge set of instructions that tells her what characters to write in response to any combination of characters, for any sequence of interactions. The person inside doesn't know Mandarin, but anybody outside who does can have an engaging conversation by slipping messages under the door. The philosophical question is, who is the person outside dialoging with? Does the woman inside the room know Mandarin in some sense? Does the room know?

Telemarketers are Chinese Rooms turned inside-out. The person is outside, and the room is hidden from us, and we aren't interacting socially with either. We only think we do, or rather, we subconsciously act as if we do, and that's what makes cons and sales much more effective than, rationally, they should be.

We rarely interact with salespeople, but we interact with things all the time. Not because we are socially isolated, but because, well, we are surrounded by things. We interact with our cars, our kitchens, our phones, our websites, our bikes, our clothes, our homes, our workplaces, and our cities. Some of them, like Apple's Siri or the Sims, want us to interact with them as if they were people, or at least consider them valid targets of emotional empathy, but what they are is telemarketers. They are designed, and very carefully, to take advantage of our cultural and psychological biases and constraints, whether it's Siri's cheerful personality or a Sim's personal victories and tragedies.

Not every thing offer us the possibility of interacting with them as if they were human, but that doesn't stop them from selling to us. Every day we see the release of more smart objects, whether it's consumer products or would-be invisible pieces of infrastructure. Connected to each other and to user profiling databases, they see us, know us, and talk to each and to their creators (and to their creators' "trusted partners," who aren't necessarily anybody you have even heard of) about us.

And then they try to sell us things, because that's how the information economy seems to work in practice.

In some sense, this isn't new. Expensive shoes try to look cool so other people will buy them. Expensive cars are in a partnership with you to make sure everybody knows how awesome they make you look. Restaurants hope that some sweet spot of service, ambiance, food, and prices will make you a regular. They are selling themselves, as well as complementary products and services.

But smart objects are a qualitatively different breed, because, being essentially computers with some other stuff attached to them, what their main function is might not be what you bought them for.

Consider an internet-connected scale that not only keeps track of your weight, but also sends you through a social network congratulatory messages when you reach a weight goal. From your point of view, it's just a scale that has acquired a cheerful personality, like a singing piece of furniture in a Disney movie, but from the point of view of the company that built and still controls it, it's both a sensor giving them information about you, and a way to tell you things you believe are coming from something – somebody who knows you, in some ways, better than friends and family. Do you believe advertisers won't know whether to sell you diet products or a discount coupon in the bakery around the corner from your office? Or, even more powerfully, that your scale won't tell you You have earned yourself a nice piece of chocolate cake ;) if the bakery chain is the one who purchased that particular "pageview?"

Maybe they won't be reporting everything you say verbatim, that will depend on how much external scrutiny there is on the industry, but your mood (did you yell at your car today, or sang aloud as you drove?), your movements, the time of the day you wake up, which days you cook and which days you order takeout? Everybody trying to sell things to you will know all of this, and more.

If, in defense of old-school human interaction, you go inside some store to talk with an actual human being instead of an online shop, a computer will be telling each sales person, through a very discrete earbud, how you're feeling today, and how to treat you so you'll feel you want to buy whatever they are selling, the functional equivalent of almost telepathic cold reading skills (except that it won't be so cold; the sales person doesn't know you, but the sales program... the sales program knows you, in many ways, better than you do yourself). In a rush? The sales program will direct the sales person to be quick and efficient. Had a lousy day? Warmth and sympathy. Or rather simulations thereof; you're being sold to by a sales program, after all, or an Internet of Sales Programs, all operating through salespeople, the stuff in your home and pockets, and pretty much everything in the world with an internet connection, which will be almost everything you see and most of what you don't.

Those methods work, and have probably worked since before recorded history, and knowing about them doesn't make them any less effective. They might not make you spend more in aggregate; generally speaking, advertising just shifts around how much you spend on different things. From the point of view of companies, it'll just be the next stage in the arms race for ever more integrated and multi-layered sensor and actuator networks, the same kind of precisely targeted network-of-networks military planners dream of.

For us as consumers, it might mean a world that'll feel more interested in you, with unseen patterns of knowledge and behavior swirling around you, trying to entice or disturb or scare or seduce you, and you specifically, into buying or doing something. It will be a somewhat enchanted world, for better and for worse.

# Soccer, messy data, and why I don't quite believe what this post says

Here's the open secret of the industry: Big Data isn't All The Data. It's not even The Data You Thought You Had. By and large, we have good public data sets about things governments and researchers were already studying, and good private data sets about things that it's profitable for companies to track. But that covers an astonishingly thin and uneven slice of our world. It's bigger than it ever was, and it's growing, but it's still not nearly as large, or as usable, as most people think.

And because public and private data sets are highly specific side effects from other activities, each of them with its own conventions, languages, and even ontologies (in both the computer science and philosophical senses of the word), coordinating two or more of them together is at best a difficult and expensive manual process, and at worst impossible. Not all, but most data analysis case studies and applications end up focused on extracting as much value as possible from a given data set, rather than seeing what new things can be learned by putting that data in the context of the rest of the data we have about the world. Even the larger indexes of open data sets (very useful services that they are) end up being mostly collections of unrelated pieces of information, rather than growing knowledge bases about the world.

There's a sort of informational version of Metcalfe's law (maybe "the value of a group of data sets grows as the number of connections you can make between them") that we are missing on, and that lies behind the promise of both linked data sets (still far in its early phase) and the big "universal" knowledge bases that aim at offering large, usable, interconnected sets of facts about as many different things as possible. They, or something like they, are a necessary part of the infrastructure to give computers the same boost in information access the Internet gave us. The bottleneck of large-scale inference systems like IBM's Watson isn't computer power, but rather rich, well-formatted data to work on.

To try and test the waters on the state of these knowledge bases, I set out to do a quick, superficial analysis of the careers of Argentine soccer players. There are of course companies that have records not only of players' careers, but of pretty much every movement they have ever done on a soccer field, as well as fragmented public data sets collected by enthusiasts about specific careers or leagues. I wanted to see how far I could go using a single "universal" data set that I could later correlate with other information in an automated way. (Remember, the point of this exercise wasn't to get the best data possible about the domain, but to see how good the data is when you restrict yourself to a single resource that can be accessed and processed in a uniform way.)

I went first for the best known "universal" structured data sources: Freebase and Wikidata. They are both well structured (XML and/or JSON) and of significant size (almost 2.9 billion facts and almost 14 million data items, respectively), but after downloading, parsing, and exploring each of them, I had to concede that neither was good enough: there were too many holes in the information to make an analysis, or the structure didn't hold the information I needed.

So it was time for Plan C, which is always the worst idea except when you have nothing else, and even then it could still be: plain old text parsing. It wasn't nearly as bad as it could have been. Wikipedia pages, like Messi's have neat infoboxes that include exactly the simplified career information I wanted, and the page's source code shows that they are written in what looks like a reasonable mini-language. It's a sad comment on the state of the industry that even then I wasn't hopeful.

I downloaded the full dump of Wikipedia; it's 12GB of compressed XML (not much, considering what's in there), so it was easy to extract individual pages. And because there is an index page of Argentine soccer players, it was even easy to keep only those, and then look at their infoboxes.

Therein lay the rub. The thing to remember about Wikipedia is that it's written by humans, so even the parts that are supposed to have strict syntactic and formatting rules, don't (so you can imagine what free text looks like). Infoboxes should have been trivial to parse, but they have all sorts of quirks that aren't visible when rendered in a browser: inconsistent names, erroneous characters, every HTML entity or Unicode character that half-looks like a dash, etc, so parsing it became an exercise on handling special cases.

I don't want to seem ungrateful: it's certainly much, much, much better to spend some time parsing that data than having to assemble and organize it from original sources. Wikipedia is an astounding achievement. But every time you see one of those TV shows where the team nerds smoothly access and correlate hundreds of different public and private data sources in different formats, schemas, and repositories, finding matches between accounting records, newspaper items, TV footage, and so on... they lie. Wrestling matches might arguably be more realistic, if nothing else because they fall within the realm of existing weaponized chair technology.

In any case, after some wrestling of my own with the data, I finally had information about the careers of a bit over 1800 Argentine soccer players whose professional careers in the senior leagues began in 1990 or later. By this point I didn't care very much about them, but for completeness' sake I tried to answer a couple of questions: Are players less loyal to their teams than they used to be? And how soon can a player expect to be playing in one of the top teams?

To make a first pass at the questions, I looked at the number of years players spent in each team over time (averaged over players that began their careers on each calendar year).

The data (at least in such a cursory summary) doesn't support the idea that newer players are less loyal to their teams, as they don't spend significantly less amount of time in them. Granted, this loyalty might be to their paychecks rather than to the clubs themselves, but they aren't moving between clubs any faster than they used to.

The other question I wanted to look at was how fast players get to top teams. This is actually an interesting question in a general setting; characterizing and improving paths to expertise, and thereby improving how much, quickly, and well we all learn, is one of the still unrealized promises of data-driven practices. To take a quick look at this, I plotted the probability of playing for a top ten team (based on the current FIFA club ratings, so they include Barcelona, Real Madrid, Bayern Munich, etc) by career year, normalized by the probability of starting your professional career in one of those teams.

Despite the large margins of error (reasonable given how few players actually reach those teams), the curve does seem to suggest a large increase in the average probability during the first three or four years, and then an stable probability until the ninth or tenth year, at which it peaks. The data is too noisy to make any definite conclusions (more on that below), but, with more data, I would want to explore the possibility of there being two paths to the top teams, corresponding to two sub-groups of highly talented players: either explosive young talents that are quickly transferred to the top teams, and solid professionals that accumulate experience and reach those teams at the peak of their maturity and knowledge.

It's a nice story, and the data sort of fits, but when I look at all the contortions I had to make to get the data, I wouldn't want to put much weight on it. In fact, I stopped myself from doing most of the analysis I wanted to do (e.g., Can you predict long-term career paths from their beginning? There's an interesting agglomerative algorithm for graph simplification that has come handy in the analysis of online game play, and I wanted to see how it fares for athletes). I didn't not because the data doesn't support it, but because the risk of systematic parsing errors, biases due to notability (do all Argentine players have a Wikipedia page? I think so, but how to be sure?), etc.

Of course, if this were a paid project it wouldn't be difficult to put together the resources to check the information, compensate for biases, and so on. But every thing that needs to be a paid project to be done right is something that we can't consider an ubiquitous resource (imagine building the Internet with pre-Linux software costs for operating systems, compilers, etc, including the hugely higher training costs that would come from losing generations of sysadmins and programmers that began practicing on their own at a very early age). Although we're way ahead of where we were a few years ago, we're still far from where we could, and probably need, to be. Right now you need knowledgeable (and patient!) people to make sure data is clean, understandable, and makes sense, even data that you have collected yourself; this makes data analysis a per-project service, rather than a universal utility, and one relatively very expensive as you increase the number of interrelated data sets you need to use. Although the difference of cost is only quantitative, the difference in cumulative impact isn't.

The frustrating bit is that we aren't too far from that (on the other hand, we've been twenty years away from strong A.I. and commercial nuclear fusion since before I was born): there are tools that automate some of this work, although they have their own issues and can't really be left on their own. And Google, as always, is trying to jump ahead of everybody else, with its Knowledge Vault project attempting to build an structured facts database out of the entirety of the web. If they, or somebody else, succeeds at this, and if this is made available at utility prices... Well, that might make those TV shows more realistic — and change our economy and society at least as much as the Internet itself did.

# Quantitatively understanding your (and others') programming style

I'm not, in general, a fan of code metrics in the context of project management, but there's something to be said for looking quantitatively at the patterns in your code, specially if by comparing them with those of better programmers, you can get some hopefully useful ideas on how to improve.

(As an aside, the real possibilities in computer-assisted learning won't come from lower costs, but rather by a level of adaptability that so far not even one-on-one tutoring has allowed; if the current theories about expertise are more or less right, data-driven adaptive learning, if implemented at the right granularity level and with the right semantics model behind, could change the speed and depth the way we learn in a dramatic way... but I digress.)

Focusing on my ongoing learning of Hy, I haven't used it in any paid project so far, but I've been able to play a bit with it now and then, and this has generated a very small code base, which I was curious to compare with code written by people who actually know the language. To do that, I downloaded the source code of a few Hy projects on GitHub (hyway, hygdrop, and adderall), and wrote some code (of course, in Hy) to extract code statistics.

Hy being a Lisp, its syntax is beautifully regular, so you can start by focusing on basic but powerful questions. The first one I wanted to know was: which functions am I using the most? And how does this distribution compares with that of the (let's call it) canon Hy code?

My top five functions, in decreasing frequency: setv, defn, get, len, for.

Canon's top five functions, in decreasing frequency: ≡, if, unquote, get, defn_alias.

Yikes! Just from this, it's obvious that there are some serious stylistic differences, which probably reflect my still un-lispy understanding of the language (for example, I'm not using aliases, for should probably be replaced by more functional patterns, and the way I use setv, well, it definitely points out to the same). None of this is a "sin", nor points clearly to how I could improve (which a sufficiently good learning assistant would have), but the overall trust of the data is a good indicator of where I still have a lot of learning to do. Fun times ahead!

For another angle at the quantitative differences between my newbie-to-lisp coding style and more accomplished programmers, here are the histograms of the log mean size of subexpressions for each function (click to expand):

"Canonical" code shows a longer right tail, which shows that experienced programmers are not afraid of occasionally using quite large S-expressions... something I still clearly I'm still working my way up to (alternatively, which I might need to reconsider my aversion to).

In summary: no earth-shattering discoveries, but some data points that suggests specific ways in which my coding practice in Hy differs from that of more experienced programmers, which should be helpful as general guidelines as I (hopefully) improve over the long term. Of course, all metrics are projections (in the mathematical sense) — they hide more information than they preserve. I could make my own code statistically indistinguishable from the canon for any particular metric, and still have it be awful. Except for well-analyzed domains where known metrics are sufficient statistics for the relevant performance (and programming is very much not one of those domains, despite decades of attempts), this kind of analysis will always be about suggesting changes, rather than guaranteeing success.

# Why we should always keep Shannon in mind

Sometimes there's no school like old school. A couple of weeks ago I spent some time working with data from GitHub Archive, trying to come up with a toy model to predict repo behavior based on previous actions (will it be forked? will there be a commit? etc). My first attempt was to do a sort of brute-force Hidden Markov Model, synthesizing states from the last k actions such that the graph of state-to-state transition was as nice as possible (ideally, low entropy of belonging to a state, high entropy for the next state conditional on knowing the current one). The idea was to do everything by hand, as a way to get more experience with Hy in a work-like project.

All of this was fun (and had me dealing, weirdly enough, with memory issues in Python, although those might have been indirectly caused by Hy), but was ultimately the wrong approach, because, as I realized way, way too late, what I really wanted to do was just to predict the next action given a sequence of actions, which is the classical problem of modeling non-random string sequences (just consider each action a character in a fixed alphabet).

So I facepalmed and repeated to myself one of those elegant bits of early 20th-century mathematics we use almost every day and forget even more often: modeling is prediction is compression is modeling. It's all, from the point of view of information theory, just a matter of perspective.

If you haven't been exposed to the relationship of compression and prediction before, here's a fun thought experiment: if you had a perfect/good enough predictive model of how something behaves, you would just need to show the initial state and say "and then it goes as predicted for the next 10 GB of data", and that would be that. Instant compression! Having a predictive model lets you compress, and inside every compression scheme there's a hidden predictive model (for true enlightenment, go to Shannon's paper, which is still worthy of being read almost 70 years later).

As a complementary example, what the venerable Lempel-Ziv-Welch ("zip") compression algorithm does is, handwaving away bookkeeping details, to build incrementally a dictionary of the most frequent substrings, making sure sure that those are assigned the shorter names in the "translated" version. By the obvious counting arguments, this means infrequent strings will get names that are longer than they are, but on average you gain space (how much? entropy much!). But this also lets you build a barebones predictive model: given the dictionary of frequent substrings that the algorithm has built so far, look at your past history, see which frequent substrings extend your recent past, and assume one of them is going to happen — essentially, your prediction is "whatever would make for a shorted compressed version", which you know is a good strategy in general, because compressed versions do tend to be shorter.

So I implemented the core of a zip encoder in Hy, and then used it to predict github behavior. It's primitive, of course, and the performance was nothing to write a post about (which is why this post isn't called A predictive model of github behavior), but on the other hand, it's an extremely fast streaming predictive algorithm that requires zero configuration. Nothing I would use in a job &mdahs; you can get much better performance with more complex models, which are also the kind you get paid for — but it was educative to encounter a forceful reminder of the underlying mathematical unity of information theory.

In a world of multi-warehouse-scale computers and mind-bendingly complex inferential algorithms, it's good to remember where it all comes from.

# Posted elsewhere: another look at traffic accidents on La Nación Data, and a short story

Made a post on La Nación Data, elaborating a bit on an interesting post they made before about traffic accident death statistics in Argentina.

Also, I just posted The Long Game, a perhaps depressing story about chess, AI, and what it means to just be the best human player in the world.

# The nominalist trap in Big Data analysis

Nominalism, formerly the novelty of a few, wrote Jorge Luis Borges, today embraces all people; its victory is so vast and fundamental that its name is useless. Nobody declares himself nominalist because there is nobody who is anything else. He didn't go on to write This is why even successful Big Data projects often fail to have an impact (except in some volumes kept in the Library of Babel), but his understandable omission doesn't make the diagnosis any less true.

Nominalism, to oversimplify the concept enough for the case at hand, is simply the assumption that just because there are many things in our world which we call chairs, that doesn't imply that the concept itself of a chair is real in a concrete sense, that there is an Ultimate, Really-Real Chair, perhaps standing in front of an Ultimate Table. We have things we call chairs, and we have the word "chair", and those are enough to furnish our houses and our minds, even if some carpenters still toss around at night, haunted by half-glimpses of an ideal one.

It has become a commonplace, quite successful way of thinking, so it's natural for it to be the basis of what's perhaps the "standard" approach to Big Data analysis. Names, numbers, and symbols are loaded into computers (account identifiers, action counters, times, dates, coordinates, prices, numbers, labels of all kinds), and then they are obsessively processed in an almost cabalistic way, organizing and re-organizing them in order to find and clarify whatever mathematical structure, and perhaps explanatory or even predictive power, they might have — and all of this data manipulation, by and large, takes place as if nothing were real but the relationships between the symbols, the data schemas and statistical correlations. Let's not blame the computers for it: they do work in Platonic caves filled with bits, with further bits being the only way in which they can receive news from the outside world.

This works quite well; well enough, in fact, to make Big Data a huge industry with widespread economic and, increasingly, political impact, but it can also fail in very drastic yet dangerously understated ways. Because, you see, from the point of view of algorithms, there *are* such things as Platonic ideals — us. Account 3788 is a reference to a real person (or a real dog, or a real corporation, or a real piece of land, or a real virus) and although we cannot right now put all of the relevant information about that person in a file, and associate it with the account number, that information, the fact of its being a person represented by a data vector, rather than a data vector, makes all the difference between the merely mathematically sophisticated analyst and the effective one. Properly performed, data analysis is the application of inferential mathematics to abstract data, together with the constant awareness and suspicion of the reality the data describes, and what this gap, all the Unrecorded bits, might mean for the problem at hand.

Massive multi-user games have failed because their strategic analysis confused the player-in-the-computer (who sought, say, silver) with the player-in-the-real-world (who sought fun, and cared for silver only insofar as that was fun). Technically flawless recommendation engines sometimes have no effect on user behavior, because even the best items were just boring to begin with. Once, I spent an hour trying to understand a sudden drop in the usage of a certain application in some countries but not in others, until I realized that it was Ramadan, and those countries were busy celebrating it.

Software programmers have to be nominalists — it's the pleasure and the privilege of coders to work, generally and as much as possible, in symbolic universes of self-contained elegance — and mathematicians are basically dedicated to the game of finding out how much truth can be gotten just from the symbols themselves. Being a bit of both, data analysts are very prone to lose themselves in the game of numbers, algorithms, and code. The trick is to be able to do so while also remembering that it's a lie — we might aim at having in our models as much of the complexity of the world as possible, but there's always (so far?) much more left outside, and it's part of the work of the analyst, perhaps her primary epistemological duty, to be alert to this, to understand how the Unrecorded might be the most important part of what she's trying to understand, and to be always open and eager to expand the model to embrace yet another aspect of the world.

The consequences of not doing this can be more than technical or economic. Contemporary civilization is impossible without the use of abstract data to understand and organize people, but the most terrible forms of contemporary barbarism, at the most demencial scales, would be impossible without the deliberate forgetfulness of the reality behind the data.

# Going Postal (in a self-quantified way)

Taking advantage of my regular gmvault backups of my Gmail account (which has been my main email account since mid-2007) I just made the following graph, which indicates the number of new email contacts (emails sent to people I had never emailed before) during each day, ignoring outliers, smoothing out trends, etc.

The graph as such looks relatively uninteresting, but armed with context about my last few years' of personal history (context which doesn't really belong in this space) the way the smoothed-out trends follow my life events is quite impressive (e.g., new jobs, periods of being relatively off-line, etc). Not much of a finding in these increasingly instrumentalized days, but it's a reminder, mostly to myself, of how much usefulness there can be in even the simplest time series, as long as you're measuring the right thing, and have the right context to evaluate it. We don't really have yet what technologists call the ecosystem (and might more properly be called, in a sociological sense, the institutions, or even the culture) for taking advantage of this kind of information and the feedback loops that it makes o possible; some of the largest companies in the world are fighting for this space, ostensibly to improve the efficiency of advertising, but that's the same as saying that the main effect of universal literacy was to facilitate the use of technical manuals.

Regarding the quantifiable part of our lives, we are as uninformed as any pre-literate people, and the growth (and, sometimes, redundancies) of the Quantified Self movement indicate both the presence of a very strong untapped demand for this information, and the fact that we haven't figured out yet how to use and consume it massively. Maybe we both want and don't want to know (psychological resistance to the concept of mortality as a key bottleneck for the success of personal health data vaults - there's a thought; some people shy away from even a superficial understanding of their financial situation, and that's a data model much much simpler than anything related to our bodies).

# Another movie space: Iron Man 3 and Stoker

Here's a redo of my previous analysis of a movie space based on The Aliens and the Unbearable Lightness of Being based on the logical itemset mining algorithm. I used the same technique, but this time leveraging the MovieTweetings data set maintained by Simon Dooms.

This movie space is sparser than the previous one, as the data set is smaller, but the examples seem to make sense (although I do wonder about where the algorithm puts About Time).

# The changing clusters of terrorism

I've been looking at the data set from the Global Terrorism Database, an impressively detailed register of terrorism events worldwide since 1970. Before delving into the more finer-grained data, the first questions I wanted to ask for my own edification where

• Is the frequency of terrorism events in different countries correlated?
• If so, does this correlation changes over time?

What I did was summarize event counts by country and month, segment the data set by decade, and build correlation clusters for the countries with the most events each decade depending on co-occurring event counts.

The '70s looks more or less how you'd expect them to:

The correlation between El Salvador and Guatemala, starting to pick up in the 1980's, is both expected and clear in the data. Colombia and Sri Lanka's correlation is probably acausal, although you could argue for some structural similarities in both conflicts:

I don't understand the 1990's, I confess (on the other hand, I didn't understand them as they happened, either):

The 2000's make more sense (loosely speaking): Afghanistan and Iraq are close, and so are India and Pakistan.

Finally, the 2010's are still ongoing, but the pattern in this graph could be used to organize the international terrorism-related section of a news site:

I find most interesting how the India-Pakistan link of the 2000's has shifted to a Pakistan-Afghanistan-Iraq one. Needless to say, caveat emptor: shallow correlations between small groups of short time series is only one step above throwing bones into the ground and reading the resulting patterns, in terms of analytic reliability and power.

That said, it's possible in principle to use a more detailed data set (ideally, including more than visible, successful events) to understand and talk about international relationships of this kind. In fact, there's quite sophisticated modeling work being done in this area, both academically and in less open venues. It's a fascinating field, and if it might not lead to less violence in any direct way, anything that enhances our understanding of, and our public discourse about, these matters is a good thing.

# The Aliens/The Unbearable Lightness of Being classification space of movies

Still playing with the Group Lens movies data set, I implemented a couple of ideas from Shailesh Kumar, one of the Google researchers that came up with the logical itemset mining algorithm. That improved the clustering of movies quite a bit, and gave me the idea to "choose a basis," so to speak, and project these clusters into a more familiar Euclidean representation (although networks and clusters are fast becoming part of our culture's vernacular, interestingly).

This is what I did: I chose two movies from the data set, Aliens and The Unbearable Lightness of Being as the "basis vectors" of the "movie space." For every other movie in the data set, I found the shortest path between the movie and each basis vector on the weighted graph in the logical itemset mining algorithm that underlies the final selection of clusters. That gave me a couple of coordinates for each movie (its "distance from Aliens" and "distance from The Unbearable..."). Rounding coordinates to integers and choosing an small sample that covers the space well, here's a selected map of "movie space" (you will want to click on it to see it at full size):

Agreeably enough, this map has a number of features you'd expect from something like this, as well as some interesting (to me) quirks:

• There is no movie that is close to both basis movies (although if anybody wants to produce The Unbearable Lightness of Chestbursters, I'd love to write that script).
• The least-The Unbearable... of the similar-to-Aliens movies in this sub-sample is Raiders of the Lost Ark, which makes sense (it's campy, but it's still an adventure movie).
• Dangerous Liaisons isn't that far from The Unbearable.., but as far away as you can get from Aliens.
• Wayne's World is way out there.

It's fun to imagine the use of geometrical analogies to use this kind of mapping for practical applications. For example, movie night negotiation between two or more people could be approached as finding the movie vector with the lowest euclidean norm among the available options, where the basis is the set of each person's personal choice or favorite movie, and so on.

# Latent mini-clusters of movies

Still playing with logical itemset mining, I downloaded one of the data sets from Group Lens that records movie ratings from MovieLens. The basic idea is the same as with clustering drug side effects: movies that are consistently ranked similarly by users are linked, and clusters in this graph suggest "micro-genres" of homogeneous (from a ratings POV) movies.

Here are a few of the clusters I got, practically with no fine-tuning of parameters:

• Parts II and III of the Godfather trilogy
• Ben-Hur and Spartacus
• The first three Indiana Jones movies
• Dick Tracy, Batman Forever, and Batman Returns.
• The Devil's Advocate and The Game.
• The first two Karate Kid movies.
• Analyze This and Analyze That.
• The 60's Lord of the Flies, the 1990 remake, and 1998's Apt Pupil

As movie clusters go, these are not particularly controversial; I found it interesting how originals and sequels or remakes seemed to be co-clustered, at least superficially. And thinking about it, clustering Apt Pupil with both Lord of the Flies movies is reasonable...

Media recommendation is by now a relatively mature field, and no single, untuned algorithm is going to be competitive against what's already deployed. However, given the simplicity and computational manageability of basic clustering and recommendation algorithms, I expect they'll become even more ubiquitous over time (pretty much as how autocomplete in input boxes did).

# Finding latent clusters of side effects

One of the interesting things about logical itemset mining, besides its conceptual simplicity, is the scope of potential applications. Besides the usual applications finding useful common sets of purchased goods or descriptive tags, the underlying idea of mixture-of, projections-of, latent [subsets] is a very powerful one (arguably, the reason why experiment design is so important and difficult is that most observations in the real world involve partial data from more than one simultaneous process or effect).

To play with this idea, I developed a quick-and-dirty implementation of the paper's algorithm, and applied it to the data set of the paper Predicting drug side-effect profiles: a chemical fragment-based approach. The data set includes 1385 different types of side effects potentially caused by 888 different drugs. The logical itemset mining algorithm quickly found the following latent groups of side effects:

• hyponatremia, hyperkalemia, hypokalemia
• impotence, decreased libido, gynecomastia
• nightmares, psychosis, ataxia, hallucinations
• neck rigidity, amblyopia, neck pain
• visual field defect, eye pain, photophobia
• rhinitis, pharyngitis, sinusitis, influenza, bronchitis

The groups seem reasonable enough (although hyperkalemia and hypokalemia being present in the same cluster is somewhat weird to my medically untrained eyes). Note the small size of the clusters and the specificity of the symptoms; most drugs induce fairly generic side effects, but the algorithm filters those out in a parametrically controlled way.

# A first look at phrase length distribution

Here's a sentence length vs. frequency distribution graph for Chesterton, Poe, and Swift, plus Time of Punishment.

A few observations:

• Take everything with a grain of salt. There are features here that might be artifacts of parsing and so on.
• That said, it's interesting that Poe seems to fancy short interjections more than Chesterton does (not as much as I do, though).
• Swift seems to have a more heterogeneous style in terms of phrase lengths, compared with Chesterton's more marked preference for relatively shorter phrases.
• Swift's average sentence length is about 31 words, almost twice Chesterton's 18 (Poe's is 21, and mine is 14.5). I'm not sure how reasonable that looks.
• Time of Punishment's choppy distribution is just an artifact of the low number of samples.

# The Premier League: United vs. City championship chances

Using the same model as previous posts (and, I'd say, not going against any intuition), the leading candidate to winning the Premier League is Manchester United, with approx. 88% chances. Second is Manchester City, with a bit over 11%. The rest of the teams with nonzero chances: Arsenal, Chelsea, Everton, Liverpool, Tottenham, and West Brom (with Chelsea, the best-positioned of these dark horses, clocking in at about half of a percentage point).

Personally, I'm happy about these very low-odds teams; I don't think any of them is likely to win (that's the point), but on the other hand, they have mathematical chances of doing so, and it's important for a model never to give zero probability to non-impossible events (modulo whatever precision you are working with, of course).

# Chesterton's magic word squares

Here are the magic word squares for a few of Chesterton's books. Whether and how they reflect characteristics that differentiate them from each other is left as an exercise to the reader.

### Orthodoxy

 the same way of this world was to it has and not think would always i have been indeed believed am no one thing which

### The Man Who Was Thursday

 the man of this agreement professor was his own you had the great president are been marquis started up as broken is not to be

### The Innocence of Father Brown

 the other side lay like priest in that it one of his is all right this head not have you agreement into an been are

### The Wisdom of Father Brown

 the priest in this time other was an agreement for side not be seen him explained to say you and father brown he had then

# Barcelona and the Liga, or: Quantitative Support for Obvious Predictions

I've adapted the predictive model to look at the Spanish Liga. Unsurprisingly, it's currently giving Barcelona a 96.7% chance of winning the title, with Atlético a far second place with 3.1%, and Real Madrid less than 0.2% (I believe the model still underestimates small probabilities, although it has improved in this regard.)

Note that around the 9th round or so, the model was giving Atlético an slightly higher chance of winning the tournament than Barcelona's, although that window didn't last more than a round.

# Magic Squares of (probabilistically chosen) Words

Thinking about magic squares, I had the idea of doing something roughly similar with words, but using usage patterns rather than arithmetic equations. I'm pasting below an example, using statistical data from Poe's texts:

### Poe

 the same manner as if most moment in this we intense and his head were excitement which i have no greatly he could not one

The word on the top-left cell in the grid is the most frequently used in Poe's writing, "the" — unsurprisingly so, as it's the most frequently used word in the English language. Now, the word immediately to its right, "same," is there because "same" is one of the words that follows "the" most often in the texts we're looking at. The word below "the" is "most" because it also follows "the" very often. "Moment" is set to the right of "most" and below "same" because it's the word that most frequently follows both.

The same pattern is used to fill the entire 5-by-5 square. If you start at the topmost left square and then move down and/or to the right, although you won't necessarily be constructing syntactically correct phrases, the consecutive word pairs will be frequent ones in Poe's writing.

Although there are no ravens or barely sublimated necrophilia in the matrix, the texture of the matrix is rather appropriate, if not to Poe, at least to Romanticism. To convince you of that, here are the equivalent 5-by-5 matrices for Swift and Chesterton.

### Swift

 the world and then he same in his majesty would manner a little that it of certain to have is their own make no more

### Chesterton

 the man who had been other with that no one and his it said syme then own is i could there are only think be

At least compared against each other, it wouldn't be too far fetched to say that Poe's matrix is more Poe's than Chesterton's, and vice versa!

PS: Because I had a sudden attack of curiosity, here's the 5-by-5 matrix for my newest collection of short stories, Time of Punishment (pdf link).

### Time of Punishment

 the school whole and even first dance both then four charge rants resistance they think of a hundred found leads punishment new astronauts month sleep

# The Torneo Inicial 2012 in one graph (and 20 subgraphs)

Here's a graph showing how the probability of winning the Argentinean soccer championship changed over time for each team (time goes from left to right, and probability goes from 0 at the bottom to 1 at the top). Click on the graph to enlarge:

Hindsight being 20/20, it's easy to read too much into this, but it's interesting to note that some qualitative features of how journalism narrated the tournament over time are clearly reflected in these graphs: Velez' stable progression, Newell's likelihood peak mid-tournament, Lanús quite drastic drop near the end, and Boca's relatively strong beginning and disappointing follow-through.

As an aside, I'm still sure that the model I'm using handles low-probability events wrong; e.g., Boca still had mathematical chances almost until the end of the tournament. That's something I'll have to look into when I have some time.

# Update to championship probabilities

 Team Championship probability Vélez Sarfield 85.4% Lanús 14.6%

# Soccer, Monte Carlo, and Sandwiches

As Argentina's Torneo Inicial begins its last three rounds, let's try to compute the probabilities of championship for each team. Our tools will be Monte Carlo and sandwiches.

The core modeling issue is, of course, trying to estimate the odds of team A defeating team B, given their recent history in the tournament. Because of the tournament format, teams only face each other one per tournament, and, because of the recent instability of teams and performance, generally speaking, performances in past tournaments won't be very good guides (this is something that would be interesting to look at in more detail). We'll use the following oversimplifications intuitions to make it possible to compute quantitative probabilities:

• The probability of a tie between two teams is a constant that doesn't depend on the teams.
• If team A played and didn't lose against team X, and team X played and didn't lose against team B, this makes it more likely than team A won't lose against team B (e.g., a "sandwich model").

Guided by these two observations, we'll take the results of the games in which a team played against both A and B as samples from a Bernoulli process with unknown parameter, and use this to estimate the probability of any previously unobserved game.

Having a way to simulate a given match that hasn't been played yet, we'll calculate the probability of any given team wining the championship by simulating the rest of the championship a million times, and observing in how many of these simulations each team wins the tournament.

The results:

 Team Championship probability Vélez Sarfield 79.9% Lanús 20.1%

Clearly our model is overly rigid — it doesn't feel at all realistic to say that those two teams are the only with any change of winning the champsionship. On the other hand, the balance of probabilities between both teams seems more or less in agreement with the expectations of observers. Given that the model we used is very naive, and only uses information from the current tournament, I'm quite happy with the results.

# A Case in Stochastic Flow: Bolton vs Manchester City

A few days ago the Manchester City Football Club released a sample of their advanced data set, an xml file giving a quite detailed description of low-level events in last year's August 21 Bolton vs. Manchester City game, which was won by the away team 3-2. There's an enormous variety of analyses that can be performed with this data, but I wanted to start with one of the basic ones, the ball's stochastic flow field.

The concept underlying this analysis is very simple. Where the ball will be in the next, say, ten seconds, depends on where it is now. It's more likely that it'll be near than it is that it'll be far, it's more likely that it'll be on an area of the field where the team with possession is focusing their attack, and so on. Thus, knowing the probabilities for where the ball will be starting from each point in the field — you can think of it as a dynamic heat map for the future — together with information about where it spent the most time, gives us information about how the game developed, and the teams' tactics and performance.

Sadly, a detailed visualization of this map would require at least a four-dimensional monitor, so I settled for a simplified representation, splitting the soccer field in a 5x5 grid, and showing the most likely transitions for the ball from one sector of the field to another. The map is embedded below; do click on it to expand it, as it's not really useful as a thumbnail.

Remember, this map shows where the ball was most likely to go from each area of the field; each circle represents one area, with the circles at the left and right sides representing the area all the way to the end lines. Bigger circles signal that the ball spent more time in that area, so, e.g., you can see that the ball spent quite a bit of time in the midfield, and very little on the sides of Manchester City's defense line. The arrows describe the most likely movements of the ball from one area to another; the wider the line, the most likely the movement. You can see how the ball circulated side-to-side quite a bit near Bolton's goal, while Manchester City kept the ball moving further away from their goal.

There are many immediate questions that come to mind, even with such a simplified representation. How does this map look according to which team had possession? How did it change over time? What flow patterns are correlated with good or bad performance on the field? The graph shows the most likely routes for the ball, but which ones were the most effective, that is, more likely to end up in a goal? Because scoring is a rare event in soccer, particularly compared with games like tennis or american football, this kind of analysis is specially challenging, but also potentially very useful. There's probably much that we don't know yet about the sport, and although data is only an adjunct to well-trained expertise, it can be a very powerful one.

# Washington DC and the murderer's work ethic

Continuing what has turned out to be a fun hobby of looking at crime data for different cities (probably among the most harmless of crime-related hobbies, as long as you aren't taking important decisions based on naive interpretations of badly understood data), I went to data.dc.gov and downloaded Crime Incident data for the District of Columbia for the year of 2011.

Mapping it was the obvious move, but I already did that for Chicago (and Seattle, although there were issues with the data, so I haven't posted anything yet), so I looked at an even more basic dimension: the time series of different types of crime.

To begin with, here's the week-by-week normalized count of thefts (not including burglary, and thefts from cars) in Washington DC (click to enlarge):

I normalized this series by shifting it to its mean and scaling it by its standard deviation — not because the data is normally distributed (it actually shows a thick left tail), but because I wanted to compare it with another data series. After all, the form of the data, partial as it is, suggests seasonality, and as the data covers a year, it wants to be checked against, say, local temperatures.

Thankfully NOAA offers just this kind of data (through about half a dozen confusingly overlapping interfaces), so I was able to add to the plot the mean daily temperature for DC (normalized in the same way as the theft count):

The correlation looks pretty good! (0.7 adj. R squared, if you must know it). Not that this proves any sort of direct causal chain, that's what controlled experiments are for, but we can postulate, e.g., a naive story where higher temperatures mean more foot traffic (I've been in DC in winter, and the neoclassical architecture is not a good match for the latitude), and more foot traffic leads to richer pickings for thieves (an interesting economics aside: would this mean that the risk-adjusted return to crime is high enough that crime is, as it were, constrained by the victims supply?)

Now let's look at murder.

The homicide time series is quite irregular, thanks to a relatively low (for, say Latin American values of 'low') average homicide count, but it's clear enough that there isn't a seasonal pattern to homicides, and no correlation with temperature (a linear fitting model confirms this, not that it was necessary in this case). This makes sense if we imagine that homicide isn't primarily an outdoors activity, or anyway that your likelihood of being killed doesn't increase as you spend more time on the street (most likely, whoever wants to kill you is motivated by reasons other than, say, an argument over street littering). Murder happens come rain or snow (well, I haven't checked that; is there an specific murder weather?)

Another point of interest is the spike of (weather-normalized) theft near the end of the year. It coincides roughly with Thanksgiving, but if that's the causal link, I'd be interested in knowing exactly what's going on.

# How Rooney beats van Persie, or, a first look at Premier League data

I just got one of the data sets from the Manchester City analytics initiative, so of course I started dipping my toe in it. The set gives information aggregated by player and match for the 2011-2012 Premier League, in the form of a number of counters (e.g. time played, goals, headers, blocked shots, etc); it's not the really interesting data set Manchester City is about to release (with, e.g., high-resolution position information for each player), but that doesn't mean there aren't interesting things to be gleaned from it.

The first issue I wanted to look at is probably not the most significant in terms of optimizing the performance of a team, but it's certainly one of the most emotional ones. Attackers: Who's the best? Who's underused? Who sucks?

If you look at total goals scored, the answer is easy: the best attackers are van Persie (30 goals), Rooney (27 goals), and Agüero (23 goals). Controlling by total time played, though, Berbatov and both Cissés have been quite more efficient in goals scored by minute played. They are also, not coincidentally, the most efficient scorers in terms of goals per shoot (both on and off target). The 30 goals of van Persie, for example, are more understandable when you see that he shot 141 times for a goal, versus Berbatov's 15.

To see how shooting efficiency and shooting volume (number of shoots) interact with each other, I made this scatterplot of goals per shoot versus shoots per minute, restricted to players who regularly shoot to avoid low-frequency outliers (click to expand).

You can see that most players are more or less uniformly distributed in the lower-left quadrant of low shooting volume and low shooting efficiency — people who are regular shooters, so they don't try too often or too seldom. But there are outliers, people who shoot a lot, or who shoot really well (or aren't as closely shadowed by defenders)... and they aren't the same. This suggests a question: Who should shoot less and pass more? And who should shoot more often and/or get more passes?

To answer that question (to a very sketchy first degree approximation), I used the data to estimate a lost goals score that indicates how many more goals per minute could be expected if the player made a successful pass to an average player instead of shooting for a goal (I know, the model is naive, there are game (heh) theoretic considerations, etc; bear with me). Looking at the players through this lens, this is a list of players who definitely should try to pass a bit more often: Andy Carroll, Simon Cox, and Shaun Wright-Phillips.

Players who should be receiving more passes and making more shots? Why, Berbatov and both Cissés. Even Wayne Rooney, the league's second most prolific shooter, is good enough turning attempts into goals that he should be fed the ball more often, rather than less.

The second-order question, and the interesting one for intra-game analysis, is how teams react to each other. To say that Manchester United should get the ball to Rooney inside strike distance more often, and that opposing teams should try to prevent this, is as close to a triviality as can be asserted. But whether or not an specific change to a tactical scheme to guard Rooney more closely will be a net positive or, by opening other spaces, backfire... that will require more data and a vastly less superficial analysis.

And that's going to be so much fun!

# Crime in Argentina

As a follow-up to my post on crime patterns in Chicago, I wanted to do something similar for Argentina. I couldn't find data at the same level of detail, but the people of Junar, who develop and run an Open Data platform, were kind enough to point me to a few data sets of theirs, including one that lists crime reports by type across Argentinean provinces for the year 2007.

The first issue I wanted to see was the relationship between different types of crime. Of course, properly speaking you need far more data, and a far more sophisticated and domain-specific analysis, to even begin to address the question, but you can at least see what types of crime tend to happen (or to be reported) in the same provinces. Here's a dendogram showing the relationships between crimes (click to expand it):

As you can see, crimes against property and against the state tend to happen in the same provinces, while more violent crimes (homicide, manslaughter, and kidnapping) are more highly correlated with each other. Drugs, which may or may not surprise you, are more correlated with property crimes than with violent crimes. Sexual crimes are not correlated, at least at the province level, with either cluster or crimes.

This observation suggests that we can plot provinces on the property crimes/sexual crimes space, as they seem to be relatively independent types of crime (at least at the province level). I added the line that marks a best fit linear relationship between both types of crime (mostly related, we'd expect, through their populations).

A few observations from this graph:

• The bulk of provinces (the relatively small ones) are on the lower left corner of the graph, mostly below the linear relationship line. The ones above the line, with a higher rate of sexual crimes as expected from the number of property crimes, are provinces on the North.
• Salta has, unsurprisingly but distressingly, almost four times the number of sexual crimes than expected by the linear relationship. Córdoba, the Buenos Aires province, and, to a lesser degree, Santa Fé, have also higher-than-expected numbers.
• Despite ranking fourth in terms of absolute number of sexual crimes, the City of Buenos Aires has much fewer than the number of property crimes would imply (or, equivalently, has a much higher number of property crimes than expected).

Needlessly to say, this is but a first shallow view, using old data with poor resolution, of an immensely complex field. But looking at data, through never the only or last step when trying to understand something, it's almost always a necessary one, and it never fails to interest me.

# Chicago and the Tree of Crime

After playing with a toy model of surveillance and surveillance evasion, I found the City of Chicago's Data Portal, a fantastic resource with public data including the salaries of city employees, budget data, the location of different service centers, public health data, and quite detailed crime data since 2001, including the relatively precise location of each reported crime. How could I resist playing with it?

To simplify further analysis, let's quantize the map into a 100x100 grid. Here's, then, the overall crime density of Chicago (click to enlarge):

This map shows data for all crime types. One first interesting question is whether different crime types are correlated. E.g., do homicides tend to happen close to drug-related crimes? To look a this, I calculated the correlation between the different types of crimes at the same point of the grid, and from that I built a "tree of crime." Technically called a dendogram, this kind of plot is akin to a phylogenetic tree, and in fact it's often used to show evolutionary relationships. In this case, the tree shows the closeness or not, in terms of geographical correlation, between types of crimes: the closer two types of crime are in the tree, the more likely they are to happen in the same geographical area (click to enlarge).

A few observations:

• I didn't clean up the data before analysis, as I was as interested in the encoding details as in the semantics. The fact that two different codes for offenses involving children are closely related in the dendogram is good news in terms of trusting the overall process.
• The same goes for assault and battery; as expected, they tend to happen in the same places.
• I didn't expect homicide and gambling to be so closely related. I'm sure there's something interesting (for laypeople like me) going on there.
• Other sets of closely related crimes that aren't that surprising: sex offenses and stalking, criminal trespass and intimidation, and prostitution-liquor-theft.
• I expected narcotics and weapons to be closely related, but what's arson doing in there with them? Do street-level drug sellers tend to work in the same areas where arson is profitable?

For law enforcement — as for everything else — data analysis is not a silver bullet, and pretending it is can lead to shooting yourself in the face with it (the mixed metaphor, I hope, is warranted by the topic). But it can serve as a quick and powerful way to pose questions and fight our own preconceptions, and, perhaps specially in highly emotional issues like crime, that can be a very powerful weapon.

# A quick look at Elance statistics

I collected data from Elance feeds, in order to find what employers are looking for on the site. It's not pretty: by far the most requested skills in terms of aggregated USD demand are article writing (generally "SEO optimized"), content, logos, blog posting, etc. In other words, mostly AdSense baiting with some smattering of design. It's not everything requested on Elance, of course, but it's a big part of the pie.

Not unexpected, but disappointing. Paying low wages to people in order to fool algorithms to get other people to pay a bit more might be a symbolically representative business model in an increasingly algorithm-routed and economically unequal world, but it feels like a colossal misuse of brains and bits.