The future of machine learning lies in its (human) past

2015-12-17

Superficially different in goals and approach, two recent algorithmic advances, Bayesian Program Learning and Galileo, are examples of one of the most interesting and powerful new trends in data analysis. It also happens to be the oldest one.

Bayesian Program Learning (BPL) is deservedly one of the most discussed modeling strategies of recent times, matching or outperforming both humans and deep learning models in one-shot handwritten character classification. Unlike many recent competitors, it's not a deep learning architecture. Rather (and very roughly) it understands handwritten characters as the output of stochastic programs that join together different graphical parts or concepts to generate versions of each character, and seeks to synthesize them by searching through the space of possible programs.
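To make the flavor of this concrete, here is a deliberately toy sketch of the generative idea, not the actual BPL model: the primitive names, probabilities, and noise levels below are all invented for illustration. A "concept" is a small stochastic program that composes stroke parts; rendering it repeatedly yields the kind of token-level variation you see across handwritten samples of the same character.

```python
import random

# Toy sketch of the generative idea behind BPL (not the actual model):
# a character concept is a stochastic program composing stroke primitives.
# Primitive names, probabilities, and noise levels are invented.

PRIMITIVES = ["line", "arc", "hook", "loop"]

def sample_character_program():
    """Sample a 'concept': a few strokes, each joining one or two sub-parts."""
    n_strokes = random.choice([1, 2, 3])   # prior: characters use few continuous strokes
    return [[{"primitive": random.choice(PRIMITIVES),
              "scale": random.uniform(0.5, 1.5),
              "rotation": random.uniform(0, 360)}
             for _ in range(random.choice([1, 2]))]
            for _ in range(n_strokes)]

def render(program, noise=0.05):
    """Stand-in for the motor/rendering model: perturb parameters with noise."""
    return [[{**part, "scale": part["scale"] * (1 + random.gauss(0, noise))}
             for part in stroke]
            for stroke in program]

# One concept generates many slightly different tokens of the "same" character.
concept = sample_character_program()
tokens = [render(concept) for _ in range(5)]
```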

Galileo is, at first blush, a different beast. It's a system designed to extract physical information about the objects in an image or video (e.g., their movements), coupling a deep learning module with a 3D physics engine that acts as a generative model.
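The underlying analysis-by-synthesis loop is easy to caricature. The sketch below is not the actual Galileo system (which couples learned perception with a real 3D engine); a one-dimensional sliding-block simulator and a random search over a single latent parameter stand in for both, just to show the shape of "propose physics, simulate, compare to observations".

```python
import random

# Caricature of analysis-by-synthesis (not the actual Galileo system):
# guess a latent physical parameter, run a forward simulation, and score
# the guess against the observed trajectory.

def simulate(friction, steps=50, dt=0.1, v0=2.0):
    """Toy 1D 'physics engine': a block sliding under constant friction."""
    x, v, positions = 0.0, v0, []
    for _ in range(steps):
        v = max(0.0, v - friction * dt)
        x += v * dt
        positions.append(x)
    return positions

def score(simulated, observed):
    """Negative squared error between simulated and observed positions."""
    return -sum((s - o) ** 2 for s, o in zip(simulated, observed))

# Pretend these positions were extracted from a video by a perception module.
observed = simulate(friction=0.6)

# Search over the latent parameter: propose, simulate, keep the best match.
best = max((random.uniform(0.0, 2.0) for _ in range(200)),
           key=lambda f: score(simulate(f), observed))
print(f"estimated friction: {best:.2f}")   # should land near 0.6
```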

Although their domains and inferential algorithms are dissimilar, the common trait I want to emphasize is that both have at their core domain-specific generative models that encode sophisticated a priori knowledge about the world. The BPL example knows implicitly, through the syntax and semantics of the language of its programs, that handwritten characters are drawn using one or more continuous strokes, often joined; a standard deep learning engine, beginning from scratch, would have to learn this. And Galileo leverages a proper, if simplified, 3D physics engine! It's not surprising that, together with superb design and engineering, these models show the performance they do.

This is how all cognitive processing tends to work in the wider world. We are fascinated (and how could we not be?) by how much our algorithms can learn from raw data alone. Obtaining practical results in multiple domains this way is impressive, and adds to the (recent, and, like all such things, ephemeral) mystique of the data science industry. But the fact is that no successful cognitive entity starts from scratch: a lot about the world is encoded in our physiology (we don't need to learn to pump our blood faster when we are scared). To say that evolution is a highly efficient, massively parallel genetic algorithm is a bit of a joke, but it's also true, and what it has learned is encoded in whatever is alive, or it wouldn't be.

At the other end of the abstraction scale: for all the fantastically powerful large-scale data analysis tools physicists use, and in many cases depend on, the way even basic observations are understood rests on centuries of accumulated (or rather constantly refined) prior knowledge, encoded in specific notations, theories, and even theories about what theories can look like. Unlike most, although not all, industrial applications, data analysis in science isn't a replacement for explicitly codified abstract knowledge; it stands on its gigantic shoulders.

In parallel with continuous improvements in hardware, software engineering, and algorithms, we are going to see prior domain knowledge deployed more and more often as part of data science implementations. The logic is almost trivial: we have so much accumulated knowledge about so many things that any implementation that doesn't leverage what's already known in its domain simply isn't going to be competitive.
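Even the simplest Bayesian toy example shows the mechanics of this. The numbers below are invented: with only ten observations, an estimate that starts from a flat prior and one that encodes domain knowledge (say, that rates in this business hover around 8%) can end up far apart, and the informed one is usually the more defensible.

```python
# Toy illustration (invented numbers): estimating a conversion rate with a
# Beta-Binomial model, with and without an informative prior.

def posterior_mean(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta(prior_a, prior_b) prior after Binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

successes, trials = 3, 10   # a small early sample

flat = posterior_mean(successes, trials)                              # "from scratch"
informed = posterior_mean(successes, trials, prior_a=8, prior_b=92)   # encodes ~8% prior rate

print(f"flat prior: {flat:.3f}   informed prior: {informed:.3f}")
# flat prior: 0.333   informed prior: 0.100
```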

Just to be clear, this isn't a new thing, or a conceptual breakthrough. If anything, it predates the "take the data and model it" approach that's most popularly seen as "data science," and almost every practitioner, many of them coming from backgrounds in scientific research, is aware of it. It's simply that our data analysis tools have now become flexible and powerful enough for us to apply it with increasingly impressive results.

The difference in performance when this can be done, as I've seen in my own projects and is obvious in work like BPL and Galileo, has always been so decisive that doing things in any other way soon becomes indefensible except on grounds of expediency (unless of course you're working in a domain that lacks any meaningful theoretical knowledge... a possibility that usually leads to interesting conversations with the domain experts).

The cost is that it significantly shifts the way data scientists have to work. There are already plenty of challenges in dealing with the noise and complexities of raw data before you start considering the ambiguities and difficulties of encoding and leveraging sometimes badly misspecified abstract theories. Teams become heterogeneous at a deeper level, with domain experts, many of them with no experience in this kind of task, not only validating results and providing feedback but participating actively as sources of knowledge from day one. Projects take longer. Theoretical assumptions in the domain become explicit, and therefore design discussions take much longer.

And so on and so forth.

That said, the results are well worth it. If data science is about leveraging the scientific method for data-driven decision-making, it behooves us to always remember that step zero of the scientific method is to get up to date, with some skepticism but no less dedication, on everything our predecessors figured out.