Category Archives: Math

Why we should always keep Shannon in mind

Sometimes there's no school like old school. A couple of weeks ago I spent some time working with data from GitHub Archive, trying to come up with a toy model to predict repo behavior based on previous actions (will it be forked? will there be a commit? etc.). My first attempt was a sort of brute-force Hidden Markov Model, synthesizing states from the last k actions so that the graph of state-to-state transitions was as nice as possible (ideally, low entropy of belonging to a state, high entropy for the next state conditional on knowing the current one). The idea was to do everything by hand, as a way to get more experience with Hy in a work-like project.

All of this was fun (and had me dealing, weirdly enough, with memory issues in Python, although those might have been indirectly caused by Hy), but it was ultimately the wrong approach because, as I realized way, way too late, what I really wanted to do was just to predict the next action given a sequence of actions, which is the classical problem of modeling non-random symbol sequences (just consider each action a character in a fixed alphabet).

So I facepalmed and repeated to myself one of those elegant bits of early 20th-century mathematics we use almost every day and forget even more often: modeling is prediction is compression is modeling. It's all, from the point of view of information theory, just a matter of perspective.

If you haven't been exposed to the relationship between compression and prediction before, here's a fun thought experiment: if you had a perfect (or good enough) predictive model of how something behaves, you would just need to show the initial state and say "and then it goes as predicted for the next 10 GB of data", and that would be that. Instant compression! Having a predictive model lets you compress, and inside every compression scheme there's a hidden predictive model (for true enlightenment, go to Shannon's paper, which is still worth reading almost 70 years later).

As a complementary example, what the venerable Lempel-Ziv-Welch compression algorithm (a close cousin of what "zip" and "gzip" use) does is, handwaving away bookkeeping details, to incrementally build a dictionary of the most frequent substrings, making sure those are assigned the shortest names in the "translated" version. By the obvious counting arguments, this means infrequent strings will get names that are longer than they are, but on average you gain space (how much? entropy much!). But this also lets you build a barebones predictive model: given the dictionary of frequent substrings that the algorithm has built so far, look at your past history, see which frequent substrings extend your recent past, and assume one of them is going to happen. Essentially, your prediction is "whatever would make for a shorter compressed version", which you know is a good strategy in general, because compressed versions do tend to be shorter.
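To make the trick concrete, here is a minimal Python sketch along these lines (a hypothetical toy, not the Hy code described below, and the single-character encoding of actions is made up for illustration): build the LZW phrase dictionary over the history, then predict whichever symbol extends the longest suffix of the recent past into an already-learned phrase.

    def lzw_phrases(history):
        """Build the set of phrases an LZW encoder would learn from history."""
        phrases = set(history)          # the dictionary starts with the single symbols
        w = ""
        for c in history:
            if w + c in phrases:
                w += c                  # keep extending the current phrase
            else:
                phrases.add(w + c)      # learn a new phrase, then start over
                w = c
        return phrases

    def predict_next(history):
        """Predict the symbol that extends the longest suffix of history
        into a phrase the encoder has already learned."""
        phrases = lzw_phrases(history)
        alphabet = set(history)
        for start in range(len(history)):              # longest suffix first
            suffix = history[start:]
            candidates = [c for c in alphabet if suffix + c in phrases]
            if candidates:
                return candidates[0]                   # crude tie-breaking
        return next(iter(alphabet))                    # nothing matches: guess anything

    # Toy usage: actions encoded as characters, e.g. "f" = fork, "c" = commit.
    print(predict_next("fcfcfcfcf"))                   # prints "c"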

So I implemented the core of an LZW encoder in Hy, and then used it to predict GitHub behavior. It's primitive, of course, and the performance was nothing to write a post about (which is why this post isn't called A predictive model of GitHub behavior), but on the other hand, it's an extremely fast streaming predictive algorithm that requires zero configuration. Nothing I would use in a job (you can get much better performance with more complex models, which are also the kind you get paid for), but it was an instructive, forceful reminder of the underlying mathematical unity of information theory.

In a world of multi-warehouse-scale computers and mind-bendingly complex inferential algorithms, it's good to remember where it all comes from.

The nominalist trap in Big Data analysis

Nominalism, formerly the novelty of a few, wrote Jorge Luis Borges, today embraces all people; its victory is so vast and fundamental that its name is useless. Nobody declares himself nominalist because there is nobody who is anything else. He didn't go on to write This is why even successful Big Data projects often fail to have an impact (except in some volumes kept in the Library of Babel), but his understandable omission doesn't make the diagnosis any less true.

Nominalism, to oversimplify the concept enough for the case at hand, is simply the assumption that just because there are many things in our world which we call chairs, that doesn't imply that the concept itself of a chair is real in a concrete sense, that there is an Ultimate, Really-Real Chair, perhaps standing in front of an Ultimate Table. We have things we call chairs, and we have the word "chair", and those are enough to furnish our houses and our minds, even if some carpenters still toss and turn at night, haunted by half-glimpses of an ideal one.

It has become a commonplace and quite successful way of thinking, so it's natural for it to be the basis of what's perhaps the "standard" approach to Big Data analysis. Names, numbers, and symbols are loaded into computers (account identifiers, action counters, times, dates, coordinates, prices, numbers, labels of all kinds), and then they are obsessively processed in an almost cabalistic way, organizing and re-organizing them in order to find and clarify whatever mathematical structure, and perhaps explanatory or even predictive power, they might have — and all of this data manipulation, by and large, takes place as if nothing were real but the relationships between the symbols, the data schemas and statistical correlations. Let's not blame the computers for it: they do work in Platonic caves filled with bits, with further bits being the only way in which they can receive news from the outside world.

This works quite well; well enough, in fact, to make Big Data a huge industry with widespread economic and, increasingly, political impact, but it can also fail in very drastic yet dangerously understated ways. Because, you see, from the point of view of algorithms, there *are* such things as Platonic ideals: us. Account 3788 is a reference to a real person (or a real dog, or a real corporation, or a real piece of land, or a real virus), and although we cannot right now put all of the relevant information about that person in a file and associate it with the account number, the fact that it is a person represented by a data vector, rather than just a data vector, makes all the difference between the merely mathematically sophisticated analyst and the effective one. Properly performed, data analysis is the application of inferential mathematics to abstract data, together with the constant awareness and suspicion of the reality the data describes, and of what the gap between the two, all the Unrecorded bits, might mean for the problem at hand.

Massive multi-user games have failed because their strategic analysis confused the player-in-the-computer (who sought, say, silver) with the player-in-the-real-world (who sought fun, and cared for silver only insofar as that was fun). Technically flawless recommendation engines sometimes have no effect on user behavior, because even the best items were just boring to begin with. Once, I spent an hour trying to understand a sudden drop in the usage of a certain application in some countries but not in others, until I realized that it was Ramadan, and those countries were busy celebrating it.

Software programmers have to be nominalists — it's the pleasure and the privilege of coders to work, generally and as much as possible, in symbolic universes of self-contained elegance — and mathematicians are basically dedicated to the game of finding out how much truth can be gotten just from the symbols themselves. Being a bit of both, data analysts are very prone to lose themselves in the game of numbers, algorithms, and code. The trick is to be able to do so while also remembering that it's a lie — we might aim at having in our models as much of the complexity of the world as possible, but there's always (so far?) much more left outside, and it's part of the work of the analyst, perhaps her primary epistemological duty, to be alert to this, to understand how the Unrecorded might be the most important part of what she's trying to understand, and to be always open and eager to expand the model to embrace yet another aspect of the world.

The consequences of not doing this can be more than technical or economic. Contemporary civilization is impossible without the use of abstract data to understand and organize people, but the most terrible forms of contemporary barbarism, at the most demented scales, would be impossible without the deliberate forgetfulness of the reality behind the data.

A short note to myself on Propp-Wilson sampling

Most of the explanations I've read of Propp-Wilson sampling describe the method in terms of "coupling from the past", in order to make sense of the fact that you get your random numbers before attempting to obtain a sample from the target distribution, and you don't re-draw them until you succeed (hence the way the Markov chain is grown from t_{-k} to t_0).

I find it more intuitive to think of this in terms of "sampling from deterministic universes." The basic hand-waving intuition is that instead of sampling from a non-deterministic system, you are sampling from a probabilistic ensemble of fully deterministic systems, so you first a) select the deterministic system (that is, the infinite series of random numbers you'll use to walk through the Markov chain), and then b) run it until its story doesn't depend on the choice of original state. The result of this procedure is a sample from the exact equilibrium distribution (because you have sampled away, or "burned off", the two sources of distortion from that equilibrium: the non-deterministic nature of the system and the dependence on the original state).
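Here's a minimal Python sketch of that reading of the algorithm, for a tiny made-up three-state Markov chain: draw the random numbers once, push every possible starting state through the same deterministic universe, and extend that universe further into the past until the choice of starting state stops mattering.

    import random

    # A small ergodic Markov chain (made-up transition matrix, rows sum to 1).
    P = [[0.5, 0.3, 0.2],
         [0.2, 0.6, 0.2],
         [0.3, 0.3, 0.4]]
    STATES = range(len(P))

    def step(state, u):
        """The deterministic update rule: given a number u in [0, 1), pick the
        next state by inverting the cumulative distribution of row P[state]."""
        acc = 0.0
        for nxt, p in enumerate(P[state]):
            acc += p
            if u < acc:
                return nxt
        return len(P[state]) - 1

    def propp_wilson():
        """Return one exact sample from the stationary distribution of P."""
        us = []                              # random numbers for times -1, -2, -3, ...
        T = 1
        while True:
            while len(us) < T:               # reach further into the past,
                us.append(random.random())   # but never re-draw the old numbers
            finals = set()
            for s0 in STATES:                # run every starting state from -T to 0
                s = s0
                for t in range(T, 0, -1):
                    s = step(s, us[t - 1])   # us[t - 1] belongs to time -t
                finals.add(s)
            if len(finals) == 1:             # every trajectory ends in the same state
                return finals.pop()
            T *= 2                           # not yet: start from an earlier time

    # The empirical distribution of repeated samples approximates the stationary one.
    samples = [propp_wilson() for _ in range(10000)]
    print([samples.count(s) / len(samples) for s in STATES])

The important detail is that the numbers already drawn for the times closest to zero are reused on every attempt; only the newly added, earlier times get fresh randomness, which is exactly the sense in which a single deterministic universe is being extended backwards rather than re-sampled.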

As I said, I think this is mathematically equivalent to Propp-Wilson sampling, although you'd have to tweak the proofs a bit. But it feels more understandable to me than other arguments I've read, so at least it has that benefit (assuming, of course, it's true).

PS: On the other hand, "coupling from the past" is too fascinating a turn of phrase not to use, so I can see the temptation.

The perfectly rational conspiracy theorist

Conspiracy theorists don't have a rationality problem; they have a priors problem, which is a different beast. Consider a rational person who believes in the existence of a powerful conspiracy, and then reads an arbitrary online article; we'll denote by C the propositions describing the conspiracy, and by a the propositions describing the article's content. By Bayes' theorem,

P(C|a) = \frac{P(a|C) P(C)}{P(a)}

Now, the key here is that the conspiracy is supposed to be powerful. A powerful enough conspiracy can make anything happen, or look like it happened, and therefore it'll generally be the case that P(a|C) \geq P(a) (and usually P(a|C) > P(a) for low-probability a, of which there are many these days, as Stanislaw Lem predicted in The Chain of Chance). But that means that in general P(C|a) \geq P(C), and often P(C|a) > P(C)! In other words, the rational evaluation of new evidence will seldom disprove a conspiracy theory, and will often reinforce its likelihood. This isn't a rationality problem — even a perfect Bayesian reasoner will be trapped once you get C into its priors (this is a well-known phenomenon in Bayesian inference; I like to think of these as black hole priors).
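A quick numerical sketch of the trap, with made-up probabilities chosen only to show the mechanics:

    # All numbers are invented for illustration.
    p_c = 0.10              # P(C): prior belief in the conspiracy
    p_a_given_c = 0.90      # P(a|C): a powerful conspiracy can make anything appear
    p_a_given_not_c = 0.30  # P(a|not C)

    p_a = p_a_given_c * p_c + p_a_given_not_c * (1 - p_c)   # P(a) = 0.36
    p_c_given_a = p_a_given_c * p_c / p_a                   # Bayes: 0.25 > 0.10

    # Reading ten such articles, each processed with the same rational update:
    p = p_c
    for _ in range(10):
        p = p_a_given_c * p / (p_a_given_c * p + p_a_given_not_c * (1 - p))
    print(p_c_given_a, p)   # one article: 0.25; ten articles: about 0.9998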

Keep an eye open, then, for those black holes. If you have a prior that no amount of evidence can weaken, then that's probably cause for concern, which is but another form of saying that you need to demand falsifiability in empirical statements. From non-refutable priors you can do mathematics or theology (both of which segue into poetry when you are doing them right), but not much else.

Fractals are unfair

Let's say you want to identify an arbitrary point in a segment I (chosen with a uniform probability distribution). A more or less canonical way to do this is to split the segment into two equal halves, and write down a bit identifying which half; now the size of the set where the point is hidden is one half of what it was. Because the half-segment we chose is affinely equivalent to the original one, we can repeat this as much as we want, gaining one bit of precision (halving the size of the "it's somewhere around here" set) for each bit of description. Seems fair.

It's easy to do the same on a square I in R^2. Split the square into four equal squares, write down two bits to identify the one where the point is, and repeat at will. Because each sub-square has one fourth the size of the enclosing one, you gain two bits of precision for each two bits of description. Still fair (and we cannot do better).

Now try to do this on a fractal, say the Koch curve, and things get as weird as you'd expect. You can always split it into four affinely equivalent pieces, but each of them is one-third the size of the original one, which means that you gain less than two bits of precision for each two bits of description. Now this is unfair. A fractal street would be a very good way of putting an infinite number of houses inside a finite downtown, but (even approximate) driving directions will be longer than you'd think they should be.

Of course, this is merely a paraphrasing of the usual definition of a fractal, which is an object whose fractal dimension exceeds its topological dimension (very hand-wavingly speaking, the number of bits of description in each step is higher than the number of bits of precision that you gain). But I do enjoy looking at things through an information theory lens.
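The bookkeeping fits in a few lines of Python: the ratio of description bits to bits of linear precision per refinement step is the similarity dimension, and a shape is unfair exactly when that number exceeds its topological dimension.

    import math

    def similarity_dimension(pieces, linear_scale):
        """Bits of description per step (log2 of the number of pieces) divided by
        bits of linear precision gained per step (log2 of the shrinking factor)."""
        return math.log2(pieces) / math.log2(1 / linear_scale)

    print(similarity_dimension(2, 1 / 2))   # segment: 1.0, matching its dimension (fair)
    print(similarity_dimension(4, 1 / 2))   # square: 2.0, matching its dimension (fair)
    print(similarity_dimension(4, 1 / 3))   # Koch curve: ~1.26 > 1 (unfair)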

Besides, there's a (still handwavy but clear) connection here with chaos, through the idea of trying to pin down the future of a system in phase space by pinning down its present. In conservative systems this is fair: one bit of precision about the present gives you one bit of precision about the future (after all, volumes are preserved). But when chaos is involved this is no longer the case! For any fixed horizon, you need to put in more bits of information about the present in order to get the same number of bits about the future.
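A textbook illustration of the chaotic case is the doubling map, which sends x to the fractional part of 2x: each step shifts the binary expansion of x one position to the left, so every additional step of forecast costs one additional bit of knowledge about the present. A small sketch (the initial conditions are arbitrary):

    def doubling_map(x):
        return (2.0 * x) % 1.0

    x, y = 0.3, 0.3 + 2.0 ** -30        # two "presents" agreeing to about 30 bits
    n = 0
    while True:
        d = abs(x - y)
        d = min(d, 1.0 - d)             # distance on the circle
        if d > 0.2:                     # the futures have visibly diverged
            break
        x, y = doubling_map(x), doubling_map(y)
        n += 1
    print(n)                            # about 28: one bit of agreement lost per step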