
Probability-as-logic vs probability-as-strategy vs probability-as-measure-theory

Attention conservation notice: Elementary (and possibly not-even-right) if you have the relevant mathematical background, pointless if you don't. Written to help me clarify to myself a moment of categorical (pun not intended) confusion.

What's a possible way to understand the relationship between probability as the extension (via Cox) of classical logic, probability as an optimal way to make decisions, and probability in the frequentist usage? Not in any deep philosophical sense, just in terms of pragmatics.

I like to begin from the Bayes/Jaynes/Cox view: if you take classical logic as valid (which I do in daily life) and want to extend it in a consistent way to continuous truth values (which I also do), then you end up with continuous logic/certainty values that we unfortunately call probability for historical reasons.

Perhaps surprisingly, its relationship with frequentist probability isn't necessarily contentious. You can take the Kolmogorov axioms as, roughly speaking, helping you define a sort of functor (identified, awfully, only through shared notation and vocabulary, an observation that made me shudder a bit; it's almost magical thinking) between the place where you do probability-as-logic and a place where you can exploit the machinery of measure theory. This is a nice place to be when you have to deal with an asymptotically large number of propositions; possibly the Probability Wars were driven mostly by doing this so implicitly that we aren't clear about what we're putting *into* this machinery, and then, because the notation is similar, forgetting to explicitly go back to the world of propositions, which is where we want to be once we're done with the calculations.

What made me stare a bit at the wall is the other correspondence: Let's say that for some proposition A, P[A] > P[\neg A] in the Bayesian sense (we're assuming the law of excluded middle, etc.; this is about a different kind of confusion). Why should I bet on A? In other words, why the relationship between probability-as-certainty and probability-as-strategy? You can define probability from a decision-theoretic point of view (and historically speaking, that's how it was first thought of), but why the correspondence between those two conceptually different formulations?

It's a silly non-question with a silly non-answer, but I want to record it because it wasn't the first thing I thought of. I began by thinking about P[\text{win} | (P[A] > P[\neg A]) \wedge \text{bet on } A], but that leads to a lot of circularity. It turns out that the forehead-smacking way to do it is simply to observe that the best strategy is to bet on A iff A is true, and this isn't circular if we haven't yet assumed that probability-as-strategy is the same as probability-as-logic; rather, it's a non-tautological consequence of the assumed psychology and sociology of what bet on means: I should've done whatever ended up working, regardless of what the numbers told me (I'll try to feel less upset the next time somebody tells me that).

But then, in the sense of probability-as-logic, P[\text{the best strategy is to bet on } A] = P[A] by substituting propositions (and hence without resorting to any frequentist assumption about repeated trials and the long term), so, generally speaking, you end up with probability-as-strategy being part of probability-as-logic. I'm likely counting angels dancing on infinitesimals here, but it's something that felt less clear to me earlier today: probability-as-strategy is probability-as-logic; you're just thinking about propositions about strategies, which, confusingly, in the simplest cases end up having the same numerical certainty values as the propositions the strategies are about. But those aren't the same propositions, although I'm not entirely sure that, in practice, given the fundamentally intuitive nature of bet on (insert here a very handwavy argument from evolutionary psychology about how we all descend from organisms who got this well enough not to die before reproducing), you get into trouble by not taking this into account.
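Spelled out, with S standing for the proposition about the strategy, the substitution above is just

 S := \textrm{the best strategy is to bet on } A, \qquad (S \Leftrightarrow A) \;\Rightarrow\; P[S] = P[A]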

Regularization, continuity, and the mystery of generalization in Deep Learning

A light and short note on a dense subset of a large space...

There's increasing interest in the very happy problem of why Deep Learning methods generalize so well in real-world usage. After all,

  • Successful networks have ridiculous numbers of parameters. By all rights, they should be overfitting the training data and doing awfully with new data.
  • In fact, they are large enough to learn entire data sets even when the labels are random.
  • And yet, they generalize very well.
  • On the other hand, they are vulnerable to adversarial attacks with weird and entirely unnatural-looking inputs.

One possible very informal way to think about this — I'm not claiming it's an explanation, just a mental model I'm using until the community reaches a consensus as to what's going on — is the following:

  • If the target functions we're trying to learn are (roughly speaking) nicely continuous (a non-tautological but often true property of the real world, where, e.g., changing a few pixels of a cat's picture rarely makes it cease to be one)...
  • and regularization methods steer networks toward that sort of function (partly as a side effect of trying to avoid nasty gradient blowups)...
  • and your data set is more or less dense in whatever subset of all possible inputs is realistic...
  • ... then, by a frankly metaphorical appeal to a property of continuous functions into Hausdorff spaces (stated more precisely below), learning the target function well on the training set implies learning it well on the entire subset.
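For reference, the property being (metaphorically) appealed to is that continuous functions into a Hausdorff space are determined by their values on a dense subset:

 f, g : X \to Y \textrm{ continuous}, \quad Y \textrm{ Hausdorff}, \quad f|_D = g|_D \textrm{ with } D \textrm{ dense in } X \;\Longrightarrow\; f = g \textrm{ on } X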

This is so vague that I'm having trouble keeping myself from making a political joke, but I've found it a wrong but useful model to think about how Deep Learning works (together with an, I think, essentially accurate model of Deep Learning as test-driven development) and how it doesn't.

As a bonus, this gives a nice intuition about why networks are vulnerable to weird adversarial inputs: if you only train the network with realistic data, no matter how large your data set, the most you can hope for is for it to be dense on the realistic subset of all possible inputs. Insofar as the mathematical analogy holds, you only get a guarantee of your network approximating the target function wherever you're dense; outside that subset — in this case, for unrealistic, weird inputs — all bets are off.

If this is true, protecting against adversarial examples might require some sort of specialized "realistic picture of the world" filters, as better training methods or more data won't help (in theory, you could add nonsense inputs to the data set so the network can learn to recognize and reject them, but you'd need to pretty much cover the entire input space with a dense set of samples, and if you're going to do that, then you might as well set up a lookup table, because at that point you aren't really generalizing anymore).

Statistics, Simians, the Scottish, and Sizing up Soothsayers

A predictive model can be a parametrized mathematical formula, or a complex deep learning network, but it can also be a talkative cab driver or a slides-wielding consultant. From a mathematical point of view, they are all trying to do the same thing, predict what's going to happen, so they can all be evaluated in the same way. Let's look at how to do that by poking a little bit into a soccer betting data set, and evaluating it as if it were a statistical model we just fitted.

The most basic outcome you'll want to predict in soccer is whether a game goes to the home team, goes to the visitors (the away team), or ends in a draw. A predictive model is anything and anybody that's willing to give you a probability distribution over those outcomes. Betting markets, by giving you odds, are implicitly doing that: the higher the odds, the less likely they think the outcome is.

The Football-Data.co.uk data set we'll use contains results and odds from various soccer leagues for more than 37,000 games. We'll use the odds for the Pinnacle platform whenever available (those are closing odds, the last ones available before the game).

For example, for the Juventus-Fiorentina game on August 20, 2016, the odds offered were 1.51 for a Juventus win, 4.15 for a draw (ouch), and 8.61 for a Fiorentina victory (double ouch). Odds of 1.51 for Juventus mean that for each dollar you bet on Juventus, you'd get USD 1.51 if Juventus won (your initial bet included) and nothing if it didn't. These numbers aren't probabilities, but they imply probabilities. If platforms gave odds too high relative to the event's probability they'd go broke, while if they gave odds too low they wouldn't be able to attract bettors. On balance, then, we can read from the odds probabilities slightly higher than the betting market's best guesses (that excess is where the platform's margin lives), but, in a world with multiple competing platforms, not really that far from the mark. This sounds like a very indirect justification for using them as a predictive model, but every predictive model, no matter how abstract, has a lot of assumptions; a linear model assumes the relevant phenomenon is linear (almost never true, sometimes true enough), and looking at a betting market as a predictive model assumes the participants know what they are doing, the margins aren't too high, and there isn't anything too shady going on (not always true, sometimes true enough).

We can convert odds to probabilities by asking ourselves: if these odds were absolutely fair, how probable would the event have to be so that neither side of the bet can expect to earn anything? (a reasonable definition of "fair" here, with historical links to the earliest developments of the concept of probability). Calling P the probability and L the odds, we can write this condition as P L + (1-P) \cdot 0 = 1. The left side of the equation is how much you get on average (L when, with probability P, the event happens, and zero otherwise), and the right side says that on average you should get your dollar back, without winning or losing anything. From there it's obvious that P = \frac{1}{L}. For example, the odds above, if absolutely fair (which they never are, not completely, as people in the industry have to eat), would imply a probability for Juventus to win of 66.2%, and for Fiorentina of 11.6% (for the record, Juventus won, 2-1).
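As a quick arithmetic sketch of the example above (the numbers are the ones quoted for the Juventus-Fiorentina game):

```python
# Convert betting odds to the probabilities they imply (P = 1/L).
odds = {"home (Juventus)": 1.51, "draw": 4.15, "away (Fiorentina)": 8.61}

implied = {outcome: 1 / L for outcome, L in odds.items()}
for outcome, p in implied.items():
    print(f"{outcome}: {p:.1%}")

# The implied probabilities sum to slightly more than 1 (about 1.02 here);
# that excess is the platform's margin, and it's why the raw 1/L values
# slightly overstate the market's best guesses.
print(f"total: {sum(implied.values()):.3f}")
```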

In this way we can put information into the betting platform (actually, the participants do), and read out probabilities. That's all we need to use it as a predictive model, and there's in fact a small industry dedicated to building betting markets tailored to predict all sorts of events, like political outcomes; when built with this use in mind, they are called prediction or information markets. The question, as with any model, isn't if it's true or not — unlike statistical models, betting markets don't have any misleading aura of mathematical certainty — but rather how good those probabilities are.

One natural way of answering that question is to compare our model with another one. Is this fancy machine learning model better than the spreadsheet we already use? Is this consultant better than that other consultant? Is this cab driver better at predicting games than that analyst on TV? Language gets very confusing very quickly, so mathematical notation becomes necessary here. Using the standard notation P[x | y] for how likely do I think it is that x will happen if y is true?, we can compare the cab driver and the TV analyst by calculating

 \frac{P[ \textrm{the game results we saw} | \textrm{the cab driver knows what she's talking about}]}{P[\textrm{the game results we saw} | \textrm{the TV analyst knows what he's talking about}]}

If that ratio is higher than one, this means of course that the cab driver is better at predicting games than the TV analyst, as she gave higher probabilities to the things that actually happened, and vice versa. This ratio is called the Bayes factor.

In our case, the factors are easy to calculate, as P[\textrm{home win} | \textrm{odds are good predictors}] is just \textrm{probability of a home win as implied by the odds}, which we already know how to calculate. And because the probabilities of independent events are the product of the individual probabilities, then

P[\textrm{any sequence of game results} | \textrm{odds are good predictors}] = \prod \textrm{(probability of each result as implied by the odds)}

In reality, those results aren't independent, but we're assuming the participants in the betting market already take into account information from previous games when setting the odds (that's part of what "knowing what you're talking about" intuitively means), so conditional on the odds we can treat the results as independent.

Note how we aren't calculating how likely a model is, just which one of two models has more support from the data we're seeing. To calculate the former we'd need more information (e.g., how much you believed the model was right before looking at the data). That is a very useful analysis, particularly when it comes to making decisions, but often the first question is a comparative one.

Using our data set, we'll compare the betting market as a predictive model against a bunch of dart-throwing chimps as a predictive model (dart-throwing chimps are a traditional device in financial analysis). The chimps throw darts against a wall covered with little Hs, Ds, and As, so they always predict each event has a probability of \frac{1}{3}. Running the numbers, we get

 \textrm{odds vs chimps} = \frac{\prod \textrm{(probability of each result as implied by the odds)}}{\left(\frac{1}{3}\right)^{\textrm{number of games}}} = e^{4312.406}

This is (much) larger than one, so the evidence in the data favors the betting market over the chimps (by a lot; see the link above for a couple of rules of thumb about interpreting those numbers). That's good, and not something to be taken for granted: many stock traders underperform chimps. Note that if one model is better than another, the Bayes factor comparing them will keep growing as you collect more observations and become more certain of it. If you made the same calculation with a smaller data set, the resulting Bayes factor would be lower.

Are the odds also better, in this sense, than just using a rule of thumb about how frequent each outcome is? In this data set the home team wins about 44.3% of the time and the visitors 29% (leaving about 26.7% for a draw), so we'll assign those outcome probabilities to every match.

 \textrm{odds vs rule of thumb} = \frac{\prod \textrm{(probability of each result as implied by the odds)}}{\prod \textrm{(probability of each result as implied by the rule of thumb)}} = e^{3342.303}

That's again overwhelming evidence in favor of the betting market, as expected.
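For concreteness, here's a minimal sketch of both comparisons, assuming a hypothetical CSV with one row per game, a result column with values H, D, and A, and one odds column per outcome; the actual Football-Data.co.uk files use different column names, so treat this as a template rather than code for the real data set:

```python
# Log Bayes factors for the odds-implied model vs. two baselines:
# dart-throwing chimps and the base-rate rule of thumb.
# Column names and file are hypothetical placeholders.
import numpy as np
import pandas as pd

games = pd.read_csv("games.csv")  # hypothetical file

# Probability implied by the odds for the result that actually happened.
implied = np.array([1 / row[f"odds_{row['result']}"] for _, row in games.iterrows()])

n = len(games)

# Odds-implied model vs. chimps (probability 1/3 for every outcome).
log_bf_chimps = np.sum(np.log(implied)) - n * np.log(1 / 3)

# Odds-implied model vs. the empirical base rates of H, D, and A.
base_rates = games["result"].value_counts(normalize=True)
rule_of_thumb = games["result"].map(base_rates).to_numpy()
log_bf_rule = np.sum(np.log(implied)) - np.sum(np.log(rule_of_thumb))

print(f"log Bayes factor vs. chimps:        {log_bf_chimps:.1f}")
print(f"log Bayes factor vs. rule of thumb: {log_bf_rule:.1f}")
```

One refinement would be to renormalize each game's implied probabilities so they sum to one, removing the platform's margin before making the comparison.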

We have statistics, soothsayers, and simians (chimpanzees are apes rather than monkeys, but apes are simians too, so the alliteration survives). What about the Scottish?

Let's look at how much better than the chimps the odds are for different countries and leagues or divisions (you could say that the chimps are our null hypothesis, but the concept of a null hypothesis is at best a confusing one and at worst a dangerous one: quoting the Zen of Python, explicit is better than implicit). The calculations will be the same, applied to the subsets of the data corresponding to each division. One difference is that we're going to show the logarithm of the Bayes factor comparing the model implied by the odds and the model from the dart-throwing chimps (otherwise the numbers become impractically large), divided by the number of game results we have for each division. Why that division? As we said above, if one model is better than another, the more observations you accumulate, the more evidence for one over the other you're going to get. It's not that the first model is getting better over time, it's just that you're getting more evidence that it's better. In other words, if model A is slightly better than model B but you have a lot of data, and model C is much better than model D but you only have a bit of data, then the Bayes factor between A and B can be much larger than the one between C and D: the size of an effect isn't the same thing as your certainty about it.

By dividing the (logarithm of the) Bayes factor by the number of games, we're trying to get a rough idea of how good the odds are, as models, when comparing different divisions with each other. This is something of a cheat (they aren't models of the same thing!), but by asking of each model how quickly it builds evidence that it's better than our chimps, we get a sense of their comparative power (there are other, more mathematically principled ways of doing this, and to a degree the method you choose has to depend on your own criteria of usefulness, which depends on what you'll use the model for, but this will suffice here).
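The per-division comparison is then a small extension of the same sketch (same hypothetical column names as above, plus a division column):

```python
# Log Bayes factor (odds model vs. chimps) per game, computed per division.
# Column names and file are hypothetical placeholders.
import numpy as np
import pandas as pd

games = pd.read_csv("games.csv")  # hypothetical file

def log_bf_vs_chimps_per_game(group: pd.DataFrame) -> float:
    """Log Bayes factor of the odds model vs. chimps, normalized by game count."""
    implied = np.array([1 / row[f"odds_{row['result']}"] for _, row in group.iterrows()])
    return (np.sum(np.log(implied)) - len(group) * np.log(1 / 3)) / len(group)

per_division = games.groupby("division").apply(log_bf_vs_chimps_per_game)
print(per_division.sort_values())
```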

I'm following here the naming convention for divisions used in the data set: E0 is the English Premier League, E1 is their Championship, etc (the larger the number, the "lower" the league), and the country prefixes are: E for England, SC for Scotland, D for Germany, I for Italy, SP for Spain, F for France, N for the Netherlands, B for Belgium, P for Portugal, T for Turkey, and G for Greece. There's quite a bit of heterogeneity inside each country, but with clear patterns. To make them clearer, let's sort the graph by value instead of division, and keep only the lowest and highest five:

The betting odds generate better models for the top leagues of Greece, Portugal, Spain, Italy, and England, and worse ones for the lower leagues, with the very worst modeled one being SC3 (properly speaking, the Scottish League Two; there we have the Scottish). This makes sense: the larger leagues have a lot of bettors who want in, many of them professionals, so the odds are going to be more informative.

To go back to the beginning: everything that gives you probabilities about the future is a predictive model. Just because one is a betting market and the other is a chimpanzee, or one is a consultant and the other a regression model, doesn't mean they can't or shouldn't be compared to each other in a meaningful way. That's why it's so critical to save the guesses and predictions of every software model and every "human predictor" you work with. It lets you go back over time and ask the first and most basic question in predictive data science:

How much better is this program or this guy than a chimp throwing darts?

When you think about it, is that really a question you would want to leave unanswered about anything or anybody you work with?

Deep Learning as the apotheosis of Test-Driven Development

Even if you aren't interested in data science, Deep Learning is an interesting programming paradigm; you can see it as "doing test-driven development with a ludicrously large number of tests, an IDE that writes most of the code, and a forgiving client." No wonder everybody's pouring so much money and brains into it! Here's a way of thinking about Deep Learning not as an application you're asked to code, but as a language to code with.

Deep Learning applies test-driven development as we're all taught to (and don't always do): first you write the tests, and then you move from code that fails all of them to code that passes them all. One difference from the usual way of doing it, and the most obvious one, is that you'll usually have anything from hundreds of thousands to Google-scale numbers of test cases in the form of pairs (picture of a cat, type of cute thing the cat is doing), or even a potentially infinite number that look like pairs (anything you try, how badly Donkey Kong kills you). This gives you a good chance that, if you selected or generated them intelligently, the test cases represent the problem well enough that a program that passes them will work in the wild, even if the test cases are all you know about the problem. It definitely helps that for most applications the client doesn't expect perfect performance. In a way, this lets you sidestep the problem of having to acquire and document domain knowledge, at least for reasonable-but-not-state-of-the-art levels of performance, which is especially hard to do for things like understanding cat pictures, because we just don't know how we do it.

The second difference between test-driven development with the usual tools and test-driven development with Deep Learning languages and runtimes is that the latter are differentiable. Forget the mathematical side of that: the code monkey aspect of it is that when a test case fails, the compiler can fix the code on its own.

Yep.

Once you stop thinking about neural networks as "artificial brains" or data science-y stuff, and look at them as a relatively unfamiliar form of bytecode — but, as bytecode goes, also a fantastically simple one — then all that hoopla about backpropagation algorithms is justified, because they do pretty much what we do: look at how a test failed and then work backwards through the call stack, tweaking things here and there, and then running the test suite again to see if you fixed more tests than you broke. But they do it automatically and very quickly, so you can dedicate yourself to collecting the tests and figuring out the large scale structure of your program (e.g. the number and types of layers in your network, and their topology) and the best compiler settings (e.g., optimizing hyperparameters and setting up TensorFlow or whatever other framework you're using; they are labeled as libraries and frameworks, but they can also be seen as compilers or code generators that go from data-shaped tests to network-shaped bytecode).
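To make the metaphor concrete, here's a toy numpy sketch (not how any real framework is implemented): the data set plays the role of the test suite, the loss measures how badly the tests are failing, and gradient descent is the compiler fixing the "bytecode", i.e., the weights.

```python
# A toy "test suite": inputs and the outputs the client expects.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))             # test inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # expected outputs

w, b = np.zeros(2), 0.0                    # the "bytecode" we're generating

for step in range(500):
    # Run the whole test suite: how wrong are we on each case?
    logits = X @ w + b
    preds = 1 / (1 + np.exp(-logits))
    errors = preds - y

    # "Work backwards through the call stack": gradients tell us which tweak
    # to each piece of the bytecode makes fewer tests fail.
    grad_w = X.T @ errors / len(y)
    grad_b = errors.mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print("fraction of tests passing:", np.mean((preds > 0.5) == y))
```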

One currently confusing fact is that this is all rather new, so very often the same people who are writing a program are also improving the compiler or coming up with new runtimes, so it looks like that's what programming with Deep Learning is about. But that's just a side effect of being in the early "half of writing the program is improving gcc so it can compile it" days of the technology, where things improve by leaps and bounds (we have both a fantastic new compiler and the new Internet-scale computers to run it), but are also rather messy and very fun.

To go back to the point: from a programmer's point of view, Deep Learning isn't just a type of application you might be asked to implement. It's also a language to write things with, one with its own set of limitations and weak spots, sure, but also with the kind of automated code generation and bug-fixing capabilities that programmers have always dreamed of, but by and large avoid, because doing it with our usual languages involves a lot of maths and the kind of development timelines that make PMs either laugh or cry.

Well, it still does, but with the right language the compiler takes care of that, and you can focus on high-level features and getting the test cases right. It isn't the most intuitive way of working for programmers trained as we were, and it's not going to fully replace the other languages and methods in our toolset, but it's solving problems that we thought were impossible. How can a code monkey not be fascinated by that?

How to be data-driven without data...

...and then make better use of the data you get.

The usefulness of data science begins long before you collect the first data point. It can be used to describe your questions and your assumptions very clearly, and to analyze in a consistent manner what they imply. This is neither a simple exercise nor an academic one: informal approaches are notoriously bad at handling the interplay of complex probabilities, yet even the a priori knowledge embedded in personal experience and publicly available research, when properly organized and queried, can answer many questions that mass quantities of data, processed carelessly, wouldn't be able to, as well as suggest which measurements should be attempted first, and what for.

The larger the gap between the complexity of a system and the existing data capture and analysis infrastructure, the more important it is to set up initial data-free (which doesn't mean knowledge-free) formal models as a temporary bridge between both. Toy models are a good way to begin this approach; as the British statistician George E.P. Box wrote, all models are wrong, but some are useful (at least for a while, we might add, but that's as much as we can ask of any tool).

Let's say you're evaluating an idea for a new network-like service for specialized peer-to-peer consulting that will have the possibility of monetizing a certain percentage of the interactions between users. You will, of course, capture all of the relevant information once the network is running — and there's no substitute for real data — but that doesn't mean you have to wait until then to start thinking about it as a data scientist, which in this context means probabilistically.

Note that the following numbers are wrong: it takes research, experience, and time to figure out useful guesses. What matters for the purposes of this post is describing the process, oversimplified as it will be.

You don't know a priori how large the network will be after, say, one year, but you can look at other competitors, the size of the relevant market, and so on, and guess, not a number ("our network in one year will have a hundred thousand users"), but the relative likelihood of different values.

The graph above shows one possible set of guesses. Instead of giving a single number, it "says" that there's a 50% chance that the network will have at least a hundred thousand users, and a 5.4% chance that it'll have at least half a million (although note that decimal points in this context are rather pointless; a guess based on experience and research can be extremely useful, but will rarely be this precise). On the other hand, there's almost a 25% chance that the network will have fewer than fifty thousand users, and a 10% chance that it'll have fewer than twenty-eight thousand.

How do you build such a graph, or rather, how do you assemble the information represented on it? The answer will probably look surprisingly old-fashioned: by learning as much as you can about the topic, talking with people who know about it, exercising your judgment, and then using formal mathematics to force yourself to write your best guess in a way that's explicitly clear about what it says and what it doesn't. The first steps are things you were already doing to help you with your problem, but the last one is what will allow you to coordinate knowledge and experience from different sources to give you the best possible answer to your question, given whatever you know at that moment.

You can use the same process to codify your educated guesses about other key aspects of the application, like the rate at which members of the network will interact, and the average revenue you'll be able to get from each interaction. As always, neither these numbers nor the specific shape of the curves matter for this toy example, but note how different degrees and forms of uncertainty are represented through different types of probability distributions:

Clearly, in this toy model we're sure about some things like the interaction rate (measured, say, in interactions per month), and very unsure about others, like the average revenue per interaction. Thinking about the implications of multiple uncertainties is one of the toughest cognitive challenges, as humans tend to conceptualize specific concrete scenarios: we think in terms of one or at best a couple of states of the world we expect to happen, but when there are multiple interacting variables, even the most likely scenario might have a very low absolute probability.

Simulation software, though, makes this nearly trivial even for the most complex models. Here's, for example, the distribution of probabilities for the monthly revenue, as necessarily implied by our assumptions about the other variables:
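For concreteness, here's a minimal sketch of that kind of Monte Carlo simulation; the distributions and parameters below are made-up placeholders, not the ones behind the graphs in this post:

```python
# Monte Carlo propagation of guessed distributions into a revenue distribution.
# All distributional choices and parameters here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # number of simulated scenarios

users = rng.lognormal(mean=np.log(100_000), sigma=0.6, size=n)    # network size after a year
interactions_per_user = rng.uniform(1.0, 5.0, size=n)             # interactions per month
revenue_per_interaction = rng.lognormal(mean=np.log(0.5), sigma=1.0, size=n)  # USD

monthly_revenue = users * interactions_per_user * revenue_per_interaction

monthly_costs = 200_000  # USD, the single-number guess used in the post
print("P(revenue > USD 10M per month):", np.mean(monthly_revenue > 10_000_000))
print("P(break even):                 ", np.mean(monthly_revenue > monthly_costs))
```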

There are scenarios where your revenue is more than USD 10M per month, and you're of course free to choose the other variables so this is one of the handful of specific scenarios you describe (perhaps the most common and powerful of the ways in which people pitching a product or idea exploit the biases and limitations of human cognition). But doing this sort of quantitative analysis forces you to be honest, at least with yourself: if what you know and don't know is described by the distributions above, then you aren't free to tell yourself that your chance of hitting it big is anything other than microscopic, no matter how clear the image might be in your mind.

That said, not getting USD 10M a month doesn't mean the idea is worthless; maybe you can break even and then use that time to pivot or sell it, or maybe you just want to create something that works and is useful, and then grow it over time. Either way, let's assume your total costs are expected to be USD 200k per month (if this were a proper analysis and not a toy example, this wouldn't be a specific guess, but another probability distribution based on educated guesses, expert opinions, market surveys, etc.). How do the probabilities look then?

You can answer this question using the same sort of analysis:

The inescapable consequence of your assumptions is that your chances of breaking even are 1 in 20. Can they be improved? One advantage of fully explicit models is that you can ask not just for the probability of something happening, but also about how things depend on each other.

Here are the relationships between the revenue, according to the model, and each of the main variables, with a linear best fit approximation superimposed:

As you can see, network size has the clearest relationship with revenue. This might look strange: wouldn't, under this kind of simple model, multiplying the number of interactions by ten while keeping the monetization rate constant also multiply the revenue by ten? Yes, but your assumptions say you can't multiply the number of interactions by more than a factor of five, which, together with your other assumptions, isn't enough to move your revenue very far. So it isn't that it's unreasonable to consider the option of increasing interactions significantly to improve your chances of breaking even (or even of getting to USD 10M), but if you plan to increase them outside the explicit range encoded in your assumptions, you have to explain why those assumptions were wrong. Always be careful when you do this: changing your assumptions to make possible something that would be useful if it were possible is one of humankind's favorite ways of driving directly into blind alleys at high speed.

It's key to understand that none of this is really a prediction about the future. Statistical analysis doesn't really deal with predicting the future, or even with getting information about the present: it's all about clarifying the implications of your observations and assumptions. It's your job to make those observations and assumptions as good and relevant as possible, both not leaving out anything you know, and not pretending you know what you don't, or that you are more certain about something than you should be.

This problem is somewhat mitigated for domains where we have vast amounts of information, including, recently, areas like computer vision and robotics. But we have yet to achieve the same level of data collection in other key areas like business strategy, so there's no way of avoiding using expert knowledge... which doesn't mean, as we saw, that we have to ditch quantitative methods.

Ultimately, successful organizations do the entire spectrum of analysis activities: they build high-level explicit models, encode expert knowledge, collect as much high-quality data as possible, train machine learning models based on that data, and exploit all of it for strategic analysis, automation, predictive modeling, etc. There are no silver bullets, but you probably have more ammunition than you think.

The job of the future isn't creating artificial intelligences, but keeping them sane

Once upon a time, we thought there was such a thing as bug-free programming. Some organizations still do — and woe betide their customers — but after a few decades hitting that particular wall, the profession has by and large accepted that writing software is such an extremely complex intellectual endeavor that errors and unfounded assumptions are unavoidable. Even the most mathematically solid of formal methods has, if nothing else, to interact with a world of unstable platforms and unreliable humans, and what worked today will fail tomorrow.

So we spend time and resources maintaining what we already "finished," fixing bugs as they are found, and adapting programs to new realities as they develop. We have to, because when we don't, as when physical infrastructure isn't maintained, we save resources in the short term, but only on our way towards protracted ruin.

It's no surprise that this also happens with our most sophisticated data-driven algorithms. CVs and scrum boards are filled with references to the maintenance of this or that prediction or optimization algorithm.

But there's a subtle, not universal but still very prevalent, problem: those aren't software bugs. This isn't to say that the implementations don't have bugs; being software, they do. But they are computer programs implementing inference algorithms, which work at a higher level of abstraction, and those algorithms have their own kinds of bugs, ones that don't leave stack traces behind.

A clear example is the experience of Google. PageRank was, without a doubt, among the most influential algorithms in the history of the internet, not to mention the most profitable, but as Google took the internet by storm, gaming PageRank became such an important business activity that "SEO" became a commonplace word.

From an algorithmic point of view this is simply a maintenance problem: PageRank assumed a certain relationship between link structure and relevance, based on the assumption that website creators weren't trying to fool it. Once this assumption became untenable, the algorithm had to be modified to cope with a world of link farms and text written with no human reader in mind.

In (very loosely equivalent) software terms, there was a new threat model, so Google had to figure out and apply a security patch. This is, for any organization facing a similar issue, a continual business-critical process, and one that can make or break a company's profitability (just ask anybody working on high-frequency trading). But not all companies deploy, for their data-driven algorithms, the same sort of detailed, continuous instrumentation and the development and testing methodologies that they use to monitor and fix their software systems, independently of their implementations. The same data scientist who developed an algorithm is often in charge of monitoring its performance on a more or less regular basis; or, even worse, it's only a hit to business metrics that makes companies reassign their scarce human resources towards figuring out what's going wrong. Either monitoring and maintenance strategy would amount to criminal malpractice if we were talking about software, yet there are companies for which this is the norm.

Even more prevalent is the lack of automatic instrumentation for algorithms mirroring that for servers. Any organization with a nontrivial infrastructure is well aware of, and has analysis tools and alarms for, things like server load or application errors. There are equivalent concepts for data-driven algorithms (violated statistical assumptions, wildly erroneous predictions) that should also be monitored in real time, and not collected (when the data is there) by a data scientist only after the situation has become bad enough to be noticed.
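As a minimal sketch of what such instrumentation could look like (the checks and thresholds below are illustrative placeholders, not a standard practice or a real library):

```python
# Toy real-time checks for a deployed prediction model: alarm when the live
# error rate or the input distribution drifts away from what was assumed at
# training time. All thresholds here are illustrative placeholders.
import numpy as np

def check_prediction_health(y_true, y_pred, inputs,
                            max_mae=5.0, train_mean=0.0, train_std=1.0,
                            max_drift_sigmas=3.0):
    alarms = []

    # Wildly erroneous predictions: live error much worse than expected.
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    if mae > max_mae:
        alarms.append(f"mean absolute error {mae:.2f} exceeds {max_mae}")

    # Statistical assumption check: inputs drifting from the training distribution.
    drift = abs(np.mean(inputs) - train_mean) / train_std
    if drift > max_drift_sigmas:
        alarms.append(f"input mean drifted {drift:.1f} sigmas from training data")

    return alarms  # feed these into the same alerting system used for servers

# Example: would be called on each batch of live predictions.
print(check_prediction_health(y_true=[1.0, 2.0], y_pred=[1.1, 8.0],
                              inputs=[0.2, -0.1]))
```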

None of this is news to anybody working with big data, particularly in large organizations centered around this technology, but we have yet to settle on a common set of technologies and practices, or even just on a universal agreement that they're needed.

These days nobody would dare deploy a web application trusting only server logs at the operating system level. Applications have their own semantics, after all, and everything in the operating system working perfectly is no guarantee that the app is working at all.

Large-scale prediction and optimization algorithms are just the same: they are, in effect, an abstraction running over the application software that implements them. They can be failing wildly, with statistical assumptions unmet and parameters converging to implausible values, while nothing in the application layer logs even a warning of any kind.

Most users forgive a software bug much more easily than unintelligent behavior in avowedly intelligent software. As a culture, we're getting used to the fact that software fails, but many still buy the premise that artificial intelligence doesn't (this is contradictory, but so are all myths). Catching these errors as early as possible can only be done while algorithms are running in the real world, where the weird edge cases and the malicious users are, and this requires metrics, logs, and alarms that speak of what's going on in the world of mathematics, not software.

We haven't converged yet on a standard set of tools and practices for this, but I know many people who'll sleep easier once we have.

The future of machine learning lies in its (human) past

Superficially different in goals and approach, two recent algorithmic advances, Bayesian Program Learning and Galileo, are examples of one of the most interesting and powerful new trends in data analysis. It also happens to be the oldest one.

Bayesian Program Learning (BPL) is deservedly one of the most discussed modeling strategies of recent times, matching or outperforming both humans and deep learning models in one-shot handwritten character classification. Unlike many recent competitors, it's not a deep learning architecture. Rather (and very roughly) it understands handwritten characters as the output of stochastic programs that join together different graphical parts or concepts to generate versions of each character, and seeks to synthesize them by searching through the space of possible programs.

Galileo is, at first blush, a different beast. It's a system designed to extract physical information about the objects in an image or video (e.g., their movements), coupling a deep learning module with a 3D physics engine which acts as a generative model.

Although their domains and inferential algorithms are dissimilar, the common trait I want to emphasize is that they both have at their core domain-specific generative models that encode sophisticated a priori knowledge about the world. The BPL example knows implicitly, through the syntax and semantics of the language of its programs, that handwritten characters are drawn using one or more continuous strokes, often joined; a standard deep learning engine, beginning from scratch, would have to learn this. And Galileo leverages a proper, if simplified, 3D physics engine! It's not surprising that, together with superb design and engineering, these models show the performance they do.

This is how all cognitive processing tends to work in the wider world. We are fascinated (and how could we not be?) by how much our algorithms can learn from just raw data. To be able to obtain practical results in multiple domains this way is impressive, and adds to the (recent, and, like all such things, ephemeral) mystique of the data science industry. But the fact is that no successful cognitive entity starts from scratch: there is a lot about the world that's encoded in our physiology (we don't need to learn to pump our blood faster when we are scared; to say that evolution is a highly efficient massively parallel genetic algorithm is a bit of a joke, but also true, and what it has learned is encoded in whatever is alive, or it wouldn't be).

Going to the other end of the abstraction scale, for all of the fantastically powerful large-scale data analysis tools physicists use and in many cases depend on, the way even basic observations are understood is based on centuries of accumulated (or rather constantly refined) prior knowledge, encoded in specific notations, theories, and even theories about what theories can look like. Unlike most, although not all, industrial applications, data analysis in science isn't a replacement of explicitly codified abstract knowledge, but rather stands on its gigantic shoulders.

In parallel to continuous improvement in hardware, software engineering, and algorithms, we are going to see more and more often the deployment of prior domain knowledge as part of data science implementations. The logic is almost trivial: we have so much knowledge accumulated about so many things, that any implementation that doesn't leverage whatever is known in its domain is just not going to be competitive.

Just to be clear, this isn't a new thing, or a conceptual breakthrough. If anything, it predates the take the data and model it approach that's most popularly seen as "data science," and almost every practitioner, many of them coming from backgrounds in scientific research, is aware of it. It's simply that now our data analysis tools have become flexible and powerful enough for us to apply it with increasingly good results.

The difference in performance when this can be done, as I've seen in my own projects and is obvious in work like BPL and Galileo, has always been so decisive that doing things in any other way soon becomes indefensible except on grounds of expediency (unless of course you're working in a domain that lacks any meaningful theoretical knowledge... a possibility that usually leads to interesting conversations with the domain experts).

The cost is that it does shift significantly the way in which data scientists have to work. There are already plenty of challenges in dealing with the noise and complexities of raw data, before you start considering the ambiguities and difficulties of encoding and leveraging sometimes badly misspecified abstract theories. Teams become heterogeneous at a deeper level, with domain experts — many of them with no experience in this kind of task — not only validating the results and providing feedback, but participating actively as sources of knowledge from day one. Projects take longer. Theoretical assumptions in the domain become explicit, and therefore design discussions take much longer.

And so on and so forth.

That said, the results are very worth it. If data science is about leveraging the scientific method for data-driven decision-making, it behooves us to always remember that step zero of the scientific method is to get up to date, with some skepticism but with no less dedication, on everything your predecessors figured out.

Finding latent clusters of side effects

One of the interesting things about logical itemset mining, besides its conceptual simplicity, is the scope of its potential applications. Beyond the usual ones, finding useful common sets of purchased goods or descriptive tags, the underlying idea of mixtures of projections of latent [subsets] is a very powerful one (arguably, the reason why experiment design is so important and difficult is that most observations in the real world involve partial data from more than one simultaneous process or effect).

To play with this idea, I developed a quick-and-dirty implementation of the paper's algorithm, and applied it to the data set of the paper Predicting drug side-effect profiles: a chemical fragment-based approach. The data set includes 1385 different types of side effects potentially caused by 888 different drugs. The logical itemset mining algorithm quickly found the following latent groups of side effects:

  • hyponatremia, hyperkalemia, hypokalemia
  • impotence, decreased libido, gynecomastia
  • nightmares, psychosis, ataxia, hallucinations
  • neck rigidity, amblyopia, neck pain
  • visual field defect, eye pain, photophobia
  • rhinitis, pharyngitis, sinusitis, influenza, bronchitis

The groups seem reasonable enough (although hyperkalemia and hypokalemia being present in the same cluster is somewhat weird to my medically untrained eyes). Note the small size of the clusters and the specificity of the symptoms; most drugs induce fairly generic side effects, but the algorithm filters those out in a parametrically controlled way.
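For flavor, here's a heavily simplified sketch of the general approach (pairwise co-occurrence scores thresholded into a graph, with cliques read off as candidate latent groups); it captures the general idea only, not the details of the paper's actual algorithm, and the data below is a tiny made-up example:

```python
# A rough, generic sketch of latent itemset discovery via pairwise co-occurrence:
# score item pairs with pointwise mutual information, keep the strong pairs as
# graph edges, and read off cliques as candidate latent groups.
from collections import Counter
from itertools import combinations
from math import log
import networkx as nx

transactions = [
    {"hyponatremia", "hyperkalemia", "nausea"},
    {"hyponatremia", "hypokalemia", "headache"},
    {"impotence", "decreased libido", "nausea"},
    {"impotence", "gynecomastia", "headache"},
]  # toy data; the real input would be one set of side effects per drug

n = len(transactions)
item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))

G = nx.Graph()
for pair, c in pair_counts.items():
    a, b = tuple(pair)
    pmi = log((c / n) / ((item_counts[a] / n) * (item_counts[b] / n)))
    if pmi > 0:  # the threshold is a tunable parameter
        G.add_edge(a, b)

latent_groups = [clique for clique in nx.find_cliques(G) if len(clique) >= 2]
print(latent_groups)
```

Note how the generic, frequently co-occurring items ("nausea", "headache" in the toy data) end up with low pairwise scores and drop out, which is the parametrically controlled filtering mentioned above.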

Bad guys, White Hat networks, and the Nuclear Switch

Welcome to Graph City (a random, connected, undirected graph), home of the Nuclear Switch (a distinguished node). Each of Graph City's lawful citizens belongs to one of ten groups, each characterized by its own stochastic movement pattern over the city. What they all have in common is that they never walk into the Nuclear Switch node.

This is because they are lawful, of course, and also because there's a White Hat network of government cameras monitoring some of the nodes in Graph City. The cameras can't read citizens' thoughts (yet), but they know whether a citizen observed on a node is the same citizen that was observed on a different node a while ago, and with this information Graph City's government can build a statistical model of the movement of lawful citizens (as observed through the specific network of cameras).
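A minimal sketch of this kind of toy setup (random graph, random walkers, and a movement model built only from what the cameras see); every parameter below is an arbitrary placeholder, and the code behind the actual experiments is certainly different:

```python
# Toy version of Graph City: a random connected graph, a camera network on a
# subset of nodes, and a movement model estimated only from camera sightings.
import random
from collections import defaultdict

import networkx as nx

random.seed(0)

G = nx.connected_watts_strogatz_graph(n=100, k=4, p=0.1, seed=0)  # Graph City
cameras = set(random.sample(list(G.nodes), 30))                   # White Hat Network

def random_walk(G, steps=200):
    node = random.choice(list(G.nodes))
    path = [node]
    for _ in range(steps):
        node = random.choice(list(G.neighbors(node)))
        path.append(node)
    return path

# Build the White Hat model: frequencies of consecutive camera sightings
# among lawful citizens (pairs of camera nodes seen one after another).
sightings = defaultdict(int)
for _ in range(500):  # 500 lawful citizens
    seen = [n for n in random_walk(G) if n in cameras]
    for a, b in zip(seen, seen[1:]):
        sightings[(a, b)] += 1

# A trajectory whose camera-to-camera transitions were never seen among
# lawful citizens would raise an alarm.
def suspicion(path):
    seen = [n for n in path if n in cameras]
    return sum(1 for a, b in zip(seen, seen[1:]) if sightings[(a, b)] == 0)

print("suspicion score of a random intruder:", suspicion(random_walk(G)))
```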

This is what happens when random-walking, untrained bad guys (you know they are bad guys because they are capable of entering the Nuclear Switch node) start roaming the city:

Attempts by untrained bad guys

Between twenty and fifty percent of the intrusion attempts succeed, depending on the total coverage of the White Hat Network (a coverage of 1.0 meaning that every node in the city has a camera linked to the system). This wouldn't be acceptable performance in any real-life application, but this being a toy model with unrealistically small and simplified parameters, absolute performance numbers are rather meaningless.

Let's switch sides for a moment now and advise the bad guys (after all, one person's Nuclear Switch is another's High-Value Target, Market Influential, etc.). An interesting first approach for the bad guys would be to build a Black Hat Network of their own, create their own model of lawful citizens' movements, and then use that model to systematically look for routes to the Nuclear Switch that won't trigger an alarm. The idea being that any person who looks innocent to the Black Hat Network's statistical model will also pass unnoticed under the White Hat's.

This is what happens when bad guys trained using Black Hat Networks of different sizes are sent after the Nuclear Switch:

Attempts by bad guys trained on the BHN

Ouch. Some of the bad guys get to the Nuclear Switch on every try, but most of them are captured. A good metaphor for what's going on here could be that the White Hat Network's and the Black Hat Network's filters are projections on orthogonal planes of a very high dimensional set of features. The set of possible behaviors for good and bad guys is very complex, so, unless your training set is comprehensive (something generally unfeasible), you can not only have a filter that works very well on your training data and very poorly on new observations (the bane of every overenthusiastic data analyst with a computer), but you can also train two filters to detect the same subset of observations using the same training set and have them be practically uncorrelated when it comes to new observations.

In our case, this is good news for Graph City's defenders, as even a huge Black Hat Network, and very well trained bad guys, are still vulnerable to the White Hat Network's statistical filter. It goes without saying, of course, that if the bad guys get even read-only access to the White Hat Network, Graph City is doomed.

Attempts by bad guys trained on the WHN

At one level, this is a trivial observation: if you have a good enough simulation of the target system, you can throw brute force at the simulation until you crack it, and then apply the solution to the real system with near total impunity (a caveat, though: in the real world, "good enough" simulations seldom are).

But, and this is something defenders tend to forget, bad guys don't need to hack into the White Hat Network. They can use Graph City as a model of itself (that's what the code I used above does), send dummy attackers, observe where they are captured, and keep refining their strategy. This is something already known to security analysts; cf., e.g., Bruce Schneier: mass profiling doesn't work against a rational adversary, because it's too easy to adapt against. A White Hat Network could be (for the sake of argument) hack-proof, but it will still leak all critical information simply through the pattern of alarms it raises. Security Through Alarms is hard!

As an aside, "Graph City" and the "Nuclear Switch" are merely narratively convenient labels. Consider graphs of financial transactions, drug traffic paths, information leakage channels, etc, and consider how many of our current enforcement strategies (or even laws) are predicated on the effectiveness of passive interdiction filters against rational coordinated adversaries...

The perfectly rational conspiracy theorist

Conspiracy theorists don't have a rationality problem, they have a priors problem, which is a different beast. Consider a rational person who believes in the existence of a powerful conspiracy, and then reads an arbitrary online article; we'll denote by C the propositions describing the conspiracy, and by a the propositions describing the article's content. By Bayes' theorem,

P(C|a) = \frac{P(a|C) P(C)}{P(a)}

Now, the key here is that the conspiracy is supposed to be powerful. A powerful enough conspiracy can make anything happen or look like it happened, and therefore it'll generally be the case that P(a|C) \geq P(a) (and usually P(a|C) > P(a) for low-probability a, of which there are many these days, as Stanislaw Lem predicted in The Chain of Chance). But that means that in general P(C|a) \geq P(C), and often P(C|a) > P(C)! In other words, the rational evaluation of new evidence will seldom disprove a conspiracy theory, and will often reinforce its likelihood, and this isn't a rationality problem: even a perfect Bayesian reasoner will be trapped once you get C into its priors (this is a well-known phenomenon in Bayesian inference; I like to think of these as black hole priors).
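A made-up numerical illustration: suppose P(C) = 0.1, P(a|C) = 0.3, and P(a|\neg C) = 0.05 for some suitably unlikely article a. Then

 P(a) = 0.3 \cdot 0.1 + 0.05 \cdot 0.9 = 0.075, \qquad P(C|a) = \frac{0.3 \cdot 0.1}{0.075} = 0.4

so a single arbitrary article moves the reasoner from a 10% to a 40% belief in the conspiracy, with nothing irrational anywhere in the update.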

Keep an eye open, then, for those black holes. If you have a prior that no amount of evidence can weaken, then that's probably cause for concern, which is but another form of saying that you need to demand falsifiability in empirical statements. From non-refutable priors you can do mathematics or theology (both of which segue into poetry when you are doing them right), but not much else.