
How to be data-driven without data...

...and then make better use of the data you get.

The usefulness of data science begins long before you collect the first data point. It can be used to describe your questions and your assumptions very clearly, and to analyze in a consistent manner what they imply. This is neither a simple exercise nor an academic one: informal approaches are notoriously bad at handling the interplay of complex probabilities. Yet the a priori knowledge embedded in personal experience and publicly available research, when properly organized and queried, can answer many questions that mass quantities of carelessly processed data cannot, and can suggest which measurements should be attempted first, and to what end.

The larger the gap between the complexity of a system and the existing data capture and analysis infrastructure, the more important it is to set up initial data-free (which doesn't mean knowledge-free) formal models as a temporary bridge between the two. Toy models are a good way to begin; as the British statistician George E. P. Box wrote, all models are wrong, but some are useful (at least for a while, we might add, but that's as much as we can ask of any tool).

Let's say you're evaluating an idea for a new network-like service for specialized peer-to-peer consulting, one that will be able to monetize some percentage of the interactions between users. You will, of course, capture all of the relevant information once the network is running — and there's no substitute for real data — but that doesn't mean you have to wait until then to start thinking about it as a data scientist, which in this context means probabilistically.

Note that the following numbers are wrong: it takes research, experience, and time to figure out useful guesses. What matters for the purposes of this post is describing the process, oversimplified as it will be.

You don't know a priori how large the network will be after, say, one year, but you can look at competitors, the size of the relevant market, and so on, and guess, not a single number ("our network in one year will have a hundred thousand users"), but the relative likelihood of different values.

The graph above shows one possible set of guesses. Instead of giving a single number, it "says" that there's a 50% chance that the network will have at least a hundred thousand users, and a 5.4% chance that it'll have at least half a million (although note that decimal points in this context are rather pointless; a guess based on experience and research can be extremely useful, but will rarely be that precise). On the other hand, there's almost a 25% chance that the network will have fewer than fifty thousand users, and a 10% chance that it'll have fewer than twenty-eight thousand.
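One way to encode a guess like that is to pick a distribution family and fit it to the few quantiles you're willing to commit to. Here is a minimal sketch; the lognormal family and the specific spread are assumptions chosen only to roughly reproduce the percentages above:

```python
# A minimal sketch: a lognormal guess for the one-year network size.
# The family and the sigma value are assumptions picked to roughly
# match the figures quoted in the text.
from scipy import stats

network_size = stats.lognorm(s=1.0, scale=100_000)  # median: a hundred thousand users

print(network_size.sf(100_000))   # P(at least 100k users) ~ 0.50
print(network_size.sf(500_000))   # P(at least 500k users) ~ 0.05
print(network_size.cdf(50_000))   # P(fewer than 50k users) ~ 0.24
print(network_size.cdf(28_000))   # P(fewer than 28k users) ~ 0.10
```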

How do you build such a graph, or rather, how do you assemble the information represented on it? The answer will probably look surprisingly old-fashioned: by learning as much as you can about the topic, talking with people who know about it, exercising your judgment, and then using formal mathematics to force yourself to write your best guess in a way that's explicitly clear about what it says and what it doesn't. The first steps are things you were already doing to help you with your problem, but the last one is what will allow you to coordinate knowledge and experience from different sources to give you the best possible answer to your question, given whatever you know at that moment.

You can use the same process to codify your educated guesses about other key aspects of the application, like the rate at which members of the network will interact, and the average revenue you'll be able to get from each interaction. As always, neither these numbers nor the specific shape of the curves matter for this toy example, but note how different degrees and forms of uncertainty are represented through different types of probability distributions:

Clearly, in this toy model we're sure about some things like the interaction rate (measured, say, in interactions per month), and very unsure about others, like the average revenue per interaction. Thinking about the implications of multiple uncertainties is one of the toughest cognitive challenges, as humans tend to conceptualize specific concrete scenarios: we think in terms of one or at best a couple of states of the world we expect to happen, but when there are multiple interacting variables, even the most likely scenario might have a very low absolute probability.

Simulation software, though, makes this nearly trivial even for the most complex models. Here's, for example, the distribution of probabilities for the monthly revenue, as necessarily implied by our assumptions about the other variables:

There are scenarios where your revenue is more than USD 10M per month, and you're of course free to pick one of them as one of the handful of specific scenarios you describe (perhaps the most common and powerful way in which people pitching a product or idea exploit the biases and limitations of human cognition). But doing this sort of quantitative analysis forces you to be honest, at least with yourself: if what you know and don't know is described by the distributions above, then you aren't free to tell yourself that your chance of hitting it big is anything other than microscopic, no matter how clear the image might be in your mind.

That said, not getting USD 10M a month doesn't mean the idea is worthless; maybe you can break even and then use that time to pivot or sell it, or maybe you just want to create something that works and is useful, and then grow it over time. Either way, let's assume your total costs are expected to be USD 200k per month (if this were a proper analysis and not a toy example, this wouldn't be a specific guess, but another probability distribution based on educated guesses, expert opinions, market surveys, etc.). How do the probabilities look then?

You can answer this question using the same sort of analysis:
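With the simulated samples from the sketch above (same hypothetical variable names), the answer is a one-line check:

```python
monthly_costs = 200_000  # USD, the toy point estimate from above
print("P(break even):", (monthly_revenue > monthly_costs).mean())
# with the placeholder guesses above, this lands in the neighborhood of 1 in 20
```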

The inescapable consequence of your assumptions is that your chances of breaking even are 1 in 20. Can they be improved? One advantage of fully explicit models is that you can ask not just for the probability of something happening, but also about how things depend on each other.

Here are the relationships between the revenue, according to the model, and each of the main variables, with a linear best fit approximation superimposed:

As you can see, network size has the clearest relationship with revenue. This might look strange: under this kind of simple model, wouldn't multiplying the number of interactions by ten, keeping the monetization rate constant, also multiply the revenue by ten? Yes, but your assumptions say you can't multiply the number of interactions by more than a factor of five, which, together with your other assumptions, isn't enough to move your revenue very far. This doesn't mean it's unreasonable to consider increasing interactions significantly to improve your chances of breaking even (or even of getting to USD 10M). But if you plan to go outside the explicit range encoded in your assumptions, you have to explain why those assumptions were wrong. Always be careful when you do this: changing your assumptions to make possible something that would be useful if it were possible is one of humankind's favorite ways of driving into blind alleys at high speed.
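A sketch of how this kind of dependence check can be run on the same simulated samples (on log scales, since the toy model is multiplicative; variable names are the hypothetical ones from the earlier sketch):

```python
import numpy as np

inputs = {"network size": users,
          "interactions per user": interactions_per_user,
          "revenue per interaction": revenue_per_interaction}
for name, samples in inputs.items():
    r = np.corrcoef(np.log(samples), np.log(monthly_revenue))[0, 1]
    print(f"{name:>25}: correlation with log revenue = {r:.2f}")
```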

It's key to understand that none of this is really a prediction about the future. Statistical analysis doesn't really deal with predicting the future, or even with getting information about the present: it's all about clarifying the implications of your observations and assumptions. It's your job to make those observations and assumptions as good and relevant as possible, neither leaving out anything you know, nor pretending you know what you don't, nor being more certain about something than you should be.

This problem is somewhat mitigated in domains where we have vast amounts of information, including, recently, areas like computer vision and robotics. But we have yet to achieve the same level of data collection in other key areas like business strategy, so there's no way around using expert knowledge... which doesn't mean, as we saw, that we have to ditch quantitative methods.

Ultimately, successful organizations engage in the entire spectrum of analysis activities: they build high-level explicit models, encode expert knowledge, collect as much high-quality data as possible, train machine learning models on that data, and exploit all of it for strategic analysis, automation, predictive modeling, etc. There are no silver bullets, but you probably have more ammunition than you think.

The job of the future isn't creating artificial intelligences, but keeping them sane

Once upon a time, we thought there was such a thing as bug-free programming. Some organizations still do — and woe betide their customers — but after a few decades hitting that particular wall, the profession has by and large accepted that writing software is such an extremely complex intellectual endeavor that errors and unfounded assumptions are unavoidable. Even the most mathematically solid of formal methods has, if nothing else, to interact with a world of unstable platforms and unreliable humans, and what worked today will fail tomorrow.

So we spend time and resources maintaining what we already "finished," fixing bugs as they are found, and adapting programs to new realities as they develop. We have to, because when we don't, as when physical infrastructure isn't maintained, we save resources in the short term, but only on our way towards protracted ruin.

It's no surprise that this also happens with our most sophisticated data-driven algorithms. CVs and scrum boards are filled with references to the maintenance of this or that prediction or optimization algorithm.

But there's a subtle problem, not universal but still very prevalent: those aren't software bugs. This isn't to say that the implementations don't have bugs; being software, they do. But they are computer programs implementing inference algorithms, which work at a higher level of abstraction; those algorithms have their own kinds of bugs, and they don't leave stack traces behind.

A clear example is the experience of Google. PageRank was, without a doubt, among the most influential algorithms in the history of the internet, not to mention the most profitable, but as Google took the internet by storm, gaming PageRank became such an important business activity that "SEO" became a commonplace word.

From an algorithmic point of view this is simply a maintenance problem: PageRank assumed a certain relationship between link structure and relevance, based on the assumption that website creators weren't trying to fool it. Once this assumption became untenable, the algorithm had to be modified to cope with a world of link farms and text written with no human reader in mind.
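For reference, the original idea fits in a few lines. The sketch below is a minimal random-surfer power iteration, not Google's production system, and it has none of the later anti-spam machinery, which is exactly the point: nothing in it defends against a link farm built to feed it.

```python
# A minimal sketch of the original PageRank idea: a random surfer following
# links, with occasional random jumps. Toy code, not Google's algorithm.
import numpy as np

def pagerank(links, damping=0.85, iters=100):
    """links[i] is the list of pages that page i links to."""
    n = len(links)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - damping) / n)
        for i, outgoing in enumerate(links):
            if outgoing:
                share = damping * rank[i] / len(outgoing)
                for j in outgoing:
                    new[j] += share
            else:                       # dangling page: spread its rank evenly
                new += damping * rank[i] / n
        rank = new
    return rank

# Pages 3 and 4 link only to each other and to page 2: a tiny "link farm"
# that inflates page 2's rank without any human-relevant content behind it.
print(pagerank([[1], [0, 2], [0], [2, 4], [2, 3]]))
```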

In (very loosely equivalent) software terms, there was a new threat model, so Google had to figure out and apply a security patch. This is, for any organization facing a similar issue, a continual business-critical process, and one that can make or break a company's profitability (just ask anybody working on high-frequency trading). But not all companies apply to their data-driven algorithms, independently of their implementations, the same sort of detailed, continuous instrumentation and the same development and testing methodologies that they use to monitor and fix their software systems. The same data scientist who developed an algorithm is often in charge of monitoring its performance on a more or less regular basis; or, even worse, it's only a hit to business metrics that makes companies reassign their scarce human resources towards figuring out what's going wrong. Either monitoring and maintenance strategy would amount to criminal malpractice if we were talking about software, yet there are companies for which this is the norm.

Even more prevalent is the lack of automatic instrumentation for algorithms mirroring that for servers. Any organization with a nontrivial infrastructure is well aware of, and has analysis tools and alarms for, things like server load or application errors. There are equivalent concepts for data-driven algorithms — quantitative statistical assumptions, wildly erroneous predictions — that should also be monitored in real time, and not collected (when the data is even there) by a data scientist only after the situation has become bad enough to be noticed.
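As a concrete (and deliberately simple) example of the kind of alarm that's usually missing, one could compare the live distribution of prediction errors against what was seen at validation time and page somebody when it drifts. The test, threshold, and function names below are illustrative assumptions, not a standard:

```python
# A sketch of a model-level alarm: has the live error distribution drifted
# away from what we saw at validation time?
import numpy as np
from scipy import stats

def check_prediction_drift(validation_errors, live_errors, alpha=0.01):
    statistic, p_value = stats.ks_2samp(validation_errors, live_errors)
    if p_value < alpha:
        print(f"ALARM: prediction errors have drifted (KS={statistic:.3f}, p={p_value:.2g})")
    return p_value

# Simulated example: residuals quietly grew and shifted after deployment.
rng = np.random.default_rng(1)
check_prediction_drift(rng.normal(0.0, 1.0, 5000), rng.normal(0.5, 1.5, 5000))
```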

None of this is news to anybody working with big data, particularly in large organizations centered around this technology, but we have yet to settle on a common set of technologies and practices, or even on universal agreement that they're needed.

These days nobody would dare deploy a web application trusting only server logs at the operating system level. Applications have their own semantics, after all, and everything in the operating system working perfectly is no guarantee that the app is working at all.

Large-scale prediction and optimization algorithms are just the same; they are often an abstraction running over the application software that implements them. They can be failing wildly, statistical assumptions unmet and parameters converging to implausible values, with nothing in the application layer logging even a warning of any kind.

Most users forgive a software bug much more easily than unintelligent behavior in avowedly intelligent software. As a culture, we're getting used to the fact that software fails, but many still buy the premise that artificial intelligence doesn't (this is contradictory, but so are all myths). Catching these errors as early as possible can only be done while algorithms are running in the real world, where the weird edge cases and the malicious users are, and this requires metrics, logs, and alarms that speak of what's going on in the world of mathematics, not software.

We haven't converged yet on a standard set of tools and practices for this, but I know many people who'll sleep easier once we have.

The future of machine learning lies in its (human) past

Superficially different in goals and approach, two recent algorithmic advances, Bayesian Program Learning and Galileo, are examples of one of the most interesting and powerful new trends in data analysis. It also happens to be the oldest one.

Bayesian Program Learning (BPL) is deservedly one of the most discussed modeling strategies of recent times, matching or outperforming both humans and deep learning models in one-shot handwritten character classification. Unlike many recent competitors, it's not a deep learning architecture. Rather (and very roughly), it understands handwritten characters as the output of stochastic programs that join together different graphical parts or concepts to generate versions of each character, and it seeks to synthesize such programs by searching through the space of possible ones.

Galileo is, at first blush, a different beast. It's a system designed to extract physical information about the objects in an image or video (e.g., their movements), coupling a deep learning module with a 3D physics engine that acts as a generative model.

Although their domains and inferential algorithms are dissimilar, the common trait I want to emphasize is that they both have at their core domain-specific generative models that encode sophisticated a priori knowledge about the world. The BPL example knows implicitly, through the syntax and semantics of the language of its programs, that handwritten characters are drawn using one or more continuous strokes, often joined; a standard deep learning engine, beginning from scratch, would have to learn this. And Galileo leverages a proper, if simplified, 3D physics engine! It's not surprising that, together with superb design and engineering, these models show the performance they do.

This is how all cognitive processing tends to work in the wider world. We are fascinated (and how could we not be?) by how much our algorithms can learn from raw data alone. Being able to obtain practical results in multiple domains is impressive, and adds to the (recent, and, like all such things, ephemeral) mystique of the data science industry. But the fact is that no successful cognitive entity starts from scratch: there is a lot about the world that's encoded in our physiology (we don't need to learn to pump our blood faster when we are scared; to say that evolution is a highly efficient massively parallel genetic algorithm is a bit of a joke, but also true, and what it has learned is encoded in whatever is alive, or it wouldn't be).

Going to the other end of the abstraction scale, for all of the fantastically powerful large-scale data analysis tools physicists use and in many cases depend on, the way even basic observations are understood is based on centuries of accumulated (or rather constantly refined) prior knowledge, encoded in specific notations, theories, and even theories about what theories can look like. Unlike most, although not all, industrial applications, data analysis in science isn't a replacement for explicitly codified abstract knowledge, but rather stands on its gigantic shoulders.

In parallel to continuous improvement in hardware, software engineering, and algorithms, we are going to see more and more often the deployment of prior domain knowledge as part of data science implementations. The logic is almost trivial: we have so much knowledge accumulated about so many things, that any implementation that doesn't leverage whatever is known in its domain is just not going to be competitive.

Just to be clear, this isn't a new thing, or a conceptual breakthrough. If anything, it predates the "take the data and model it" approach that's most popularly seen as "data science," and almost every practitioner, many of them coming from backgrounds in scientific research, is aware of it. It's simply that our data analysis tools have now become flexible and powerful enough for us to apply it with increasingly powerful results.

The difference in performance when this can be done, as I've seen in my own projects and is obvious in work like BPL and Galileo, has always been so decisive that doing things in any other way soon becomes indefensible except on grounds of expediency (unless of course you're working in a domain that lacks any meaningful theoretical knowledge... a possibility that usually leads to interesting conversations with the domain experts).

The cost is that it does shift significantly the way in which data scientists have to work. There are already plenty of challenges in dealing with the noise and complexities of raw data, before you start considering the ambiguities and difficulties of encoding and leveraging sometimes badly misspecified abstract theories. Teams become heterogeneous at a deeper level, with domain experts — many of them with no experience in this kind of task — not only validating the results and providing feedback, but participating actively as sources of knowledge from day one. Projects take longer. Theoretical assumptions in the domain become explicit, and therefore design discussions take much longer.

And so on and so forth.

That said, the results are well worth it. If data science is about leveraging the scientific method for data-driven decision-making, it behooves us to always remember that step zero of the scientific method is to get up to date, with some skepticism but with no less dedication, on everything your predecessors figured out.

Finding latent clusters of side effects

One of the interesting things about logical itemset mining, besides its conceptual simplicity, is the scope of its potential applications. Beyond the usual applications of finding common sets of purchased goods or descriptive tags, the underlying idea of mixtures of projections of latent [subsets] is a very powerful one (arguably, the reason why experiment design is so important and difficult is that most observations in the real world involve partial data from more than one simultaneous process or effect).

To play with this idea, I developed a quick-and-dirty implementation of the paper's algorithm, and applied it to the data set of the paper Predicting drug side-effect profiles: a chemical fragment-based approach. The data set includes 1385 different types of side effects potentially caused by 888 different drugs. The logical itemset mining algorithm quickly found the following latent groups of side effects:

  • hyponatremia, hyperkalemia, hypokalemia
  • impotence, decreased libido, gynecomastia
  • nightmares, psychosis, ataxia, hallucinations
  • neck rigidity, amblyopia, neck pain
  • visual field defect, eye pain, photophobia
  • rhinitis, pharyngitis, sinusitis, influenza, bronchitis

The groups seem reasonable enough (although hyperkalemia and hypokalemia being present in the same cluster is somewhat weird to my medically untrained eyes). Note the small size of the clusters and the specificity of the symptoms; most drugs induce fairly generic side effects, but the algorithm filters those out in a parametrically controlled way.
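For the curious, the spirit of the method (not the paper's exact algorithm, and with arbitrary thresholds) can be sketched as: keep item pairs that co-occur much more often than chance would predict, and read candidate latent groups off the cliques of the resulting graph.

```python
# A rough sketch in the spirit of logical itemset mining: a pointwise mutual
# information graph over items, with cliques as candidate latent groups.
# This is a simplification, not the paper's exact algorithm.
import math
from collections import Counter
from itertools import combinations

import networkx as nx

def latent_groups(transactions, min_pmi=0.5, min_pair_count=3, min_size=3):
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(p for t in transactions for p in combinations(sorted(set(t)), 2))
    g = nx.Graph()
    for (a, b), c in pair_counts.items():
        if c < min_pair_count:
            continue
        pmi = math.log((c / n) / ((item_counts[a] / n) * (item_counts[b] / n)))
        if pmi >= min_pmi:
            g.add_edge(a, b)
    return [clique for clique in nx.find_cliques(g) if len(clique) >= min_size]

# Usage: each transaction would be the set of side effects reported for one drug.
toy = [{"a", "b", "c"}, {"a", "b", "c", "x"}, {"a", "b", "c"},
       {"d", "e", "f"}, {"d", "e", "f", "x"}, {"d", "e", "f"}]
print(latent_groups(toy))
```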

Bad guys, White Hat networks, and the Nuclear Switch

Welcome to Graph City (a random, connected, undirected graph), home of the Nuclear Switch (a distinguished node). Each of Graph City's lawful citizens belongs to one of ten groups, each characterized by its own stochastic movement pattern around the city. What they all have in common is that they never walk into the Nuclear Switch node.

This is because they are lawful, of course, and also because there's a White Hat network of government cameras monitoring some of the nodes in Graph City. The cameras can't read citizens' thoughts (yet), but they can tell whether a citizen observed at a node is the same citizen that was observed at a different node a while ago, and with this information Graph City's government can build a statistical model of the movement of lawful citizens (as observed through the specific network of cameras).
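A schematic sketch of this kind of setup (heavily simplified: lawful citizens here are plain random walkers rather than ten distinct groups, and the graph size and camera coverage are arbitrary choices):

```python
# A toy sketch of the Graph City setup: random walkers on a random connected
# graph, a subset of "camera" nodes, and a model of lawful movement as seen
# through consecutive camera sightings.
import random
from collections import Counter

import networkx as nx

random.seed(0)
city = nx.connected_watts_strogatz_graph(n=200, k=4, p=0.1, seed=0)
cameras = set(random.sample(list(city.nodes), k=60))   # ~30% coverage

def camera_sightings(graph, cameras, steps=500):
    """The camera-to-camera trail left by one random walker."""
    node = random.choice(list(graph.nodes))
    trail = []
    for _ in range(steps):
        node = random.choice(list(graph.neighbors(node)))
        if node in cameras:
            trail.append(node)
    return trail

# The White Hat model: counts of consecutive camera sightings among lawful citizens.
lawful_transitions = Counter()
for _ in range(200):
    trail = camera_sightings(city, cameras)
    lawful_transitions.update(zip(trail, trail[1:]))

def looks_suspicious(hop, min_count=1):
    """Flag a camera-to-camera hop that lawful citizens (almost) never make."""
    return lawful_transitions[hop] < min_count
```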

This is what happens when random-walking, untrained bad guys (you know they are bad guys because they are capable of entering the Nuclear Switch node) start roaming the city:

Attempts by untrained bad guys

Between twenty percent and half of the intrusion attempts succeed, depending on the total coverage of the White Hat Network (a coverage of 1.0 meaning that every node in the city has a camera linked to the system). This isn't acceptable performance for any real-life application, but this being a toy model with unrealistically small and simplified parameters, absolute performance numbers are rather meaningless.

Let's switch sides for a moment now, and advise the bad guys (after all, one person's Nuclear Switch is another's High-Value Target, Market Influential, etc.). An interesting first approach for the bad guys would be to build a Black Hat Network, create their own model of lawful citizens' movements, and then use it to systematically look for routes to the Nuclear Switch that won't trigger an alarm. The idea being that any person who looks innocent to the Black Hat Network's statistical model will also pass unnoticed under the White Hat Network's.

This is what happens when bad guys trained using Black Hat Networks of different sizes are sent after the Nuclear Switch:

Attempts by bad guys trained on the BHN

Ouch. Some of the bad guys get to the Nuclear Switch on every try, but most of them are captured. A good metaphor for what's going on here is that the White Hat Network's and the Black Hat Network's filters are projections onto orthogonal planes of a very high-dimensional set of features. The set of possible behaviors for good and bad guys is very complex, so, unless your training set is comprehensive (something generally unfeasible), you can easily have a filter that works very well on your training data and very poorly on new observations (this is the bane of every overenthusiastic data analyst with a computer), but you can also train two filters to detect the same subset of observations using the same training set, and have them be practically uncorrelated when it comes to new observations.

In our case, this is good news for Graph City's defenders, as even a huge Black Hat Network, and very well trained bad guys, are still vulnerable to the White Hat Network's statistical filter. It goes without saying, of course, that if the bad guys get even read-only access to the White Hat Network, Graph City is doomed.

Attempts by bad guys trained on the WHN

At one level, this is a trivial observation: if you have a good enough simulation of the target system, you can throw brute force at the simulation until you crack it, and then apply the solution to the real system with near total impunity (a caveat, though: in the real world, "good enough" simulations seldom are).

But, and this is something defenders tend to forget, bad guys don't need to hack into the White Hat Network. They can use Graph City as a model of itself (that's what the code I used above does), send dummy attackers, observe where they are captured, and keep refining their strategy. This is something already known to security analysts; cf., e.g., Bruce Schneier: mass profiling doesn't work against a rational adversary, because it's too easy to adapt against. A White Hat Network could be (for the sake of argument) hack-proof, but it will still leak all critical information simply through the pattern of alarms it raises. Security Through Alarms is hard!

As an aside, "Graph City" and the "Nuclear Switch" are merely narratively convenient labels. Consider graphs of financial transactions, drug traffic paths, information leakage channels, etc, and consider how many of our current enforcement strategies (or even laws) are predicated on the effectiveness of passive interdiction filters against rational coordinated adversaries...

The perfectly rational conspiracy theorist

Conspiracy theorists don't have a rationality problem; they have a priors problem, which is a different beast. Consider a rational person who believes in the existence of a powerful conspiracy, and then reads an arbitrary online article; we'll denote by C the propositions describing the conspiracy, and by a the propositions describing the article's content. By Bayes' theorem,

P(C|a) = \frac{P(a|C) P(C)}{P(a)}

Now, the key here is that the conspiracy is supposed to be powerful. A powerful enough conspiracy can make anything happen or look like it happened, and therefore it'll generally be the case that P(a|C) \geq P(a) (and usually P(a|C) > P(a) for low-probability a, of which there are many these days, as Stanislaw Lem predicted in The Chain of Chance). But that means that in general P(C|a) \geq P(C), and often P(C|a) > P(C)! In other words, the rational evaluation of new evidence will seldom disprove a conspiracy theory, and will often reinforce its likelihood, and this isn't a rationality problem — even a perfect Bayesian reasoner will be trapped, once you get C into its priors (this is a well-known phenomenon in Bayesian inference; I like to think of these as black hole priors).
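To make the trap concrete with made-up numbers (purely for illustration): take a prior P(C) = 0.01, an article with overall probability P(a) = 0.001, and a conspiracy powerful enough that P(a|C) = 0.002. Then

P(C|a) = \frac{0.002 \times 0.01}{0.001} = 0.02

A single unremarkable article has doubled the believer's confidence, and each further low-probability article nudges it up again.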

Keep an eye open, then, for those black holes. If you have a prior that no amount of evidence can weaken, then that's probably cause for concern, which is but another form of saying that you need to demand falsifiability in empirical statements. From non-refutable priors you can do mathematics or theology (both of which segue into poetry when you are doing them right), but not much else.