Why we should always keep Shannon in mind

Sometimes there’s no school like old school. A couple of weeks ago I spent some time working with data from GitHub Archive, trying to come up with a toy model to predict repo behavior based on previous actions (will it be forked? will there be a commit? etc). My first attempt was to do a sort of brute-force Hidden Markov Model, synthesizing states from the last k actions such that the graph of state-to-state transition was as nice as possible (ideally, low entropy of belonging to a state, high entropy for the next state conditional on knowing the current one). The idea was to do everything by hand, as a way to get more experience with Hy in a work-like project.

All of this was fun (and had me dealing, weirdly enough, with memory issues in Python, although those might have been indirectly caused by Hy), but was ultimately the wrong approach, because, as I realized way, way too late, what I really wanted to do was just to predict the next action given a sequence of actions, which is the classical problem of modeling non-random string sequences (just consider each action a character in a fixed alphabet).

So I facepalmed and repeated to myself one of those elegant bits of early 20th-century mathematics we use almost every day and forget even more often: modeling is prediction is compression is modeling. It’s all, from the point of view of information theory, just a matter of perspective.

If you haven’t been exposed to the relationship of compression and prediction before, here’s a fun thought experiment: if you had a perfect/good enough predictive model of how something behaves, you would just need to show the initial state and say “and then it goes as predicted for the next 10 GB of data”, and that would be that. Instant compression! Having a predictive model lets you compress, and inside every compression scheme there’s a hidden predictive model (for true enlightenment, go to Shannon’s paper, which is still worthy of being read almost 70 years later).

As a complementary example, what the venerable Lempel-Ziv-Welch (“zip”) compression algorithm does is, handwaving away bookkeeping details, to build incrementally a dictionary of the most frequent substrings, making sure sure that those are assigned the shorter names in the “translated” version. By the obvious counting arguments, this means infrequent strings will get names that are longer than they are, but on average you gain space (how much? entropy much!). But this also lets you build a barebones predictive model: given the dictionary of frequent substrings that the algorithm has built so far, look at your past history, see which frequent substrings extend your recent past, and assume one of them is going to happen — essentially, your prediction is “whatever would make for a shorted compressed version”, which you know is a good strategy in general, because compressed versions do tend to be shorter.

So I implemented the core of a zip encoder in Hy, and then used it to predict github behavior. It’s primitive, of course, and the performance was nothing to write a post about (which is why this post isn’t called A predictive model of github behavior), but on the other hand, it’s an extremely fast streaming predictive algorithm that requires zero configuration. Nothing I would use in a job &mdahs; you can get much better performance with more complex models, which are also the kind you get paid for — but it was educative to encounter a forceful reminder of the underlying mathematical unity of information theory.

In a world of multi-warehouse-scale computers and mind-bendingly complex inferential algorithms, it’s good to remember where it all comes from.