Probability-as-logic vs probability-as-strategy vs probability-as-measure-theory

Attention conservation notice: Elementary (and possibly not-even-right) if you have the relevant mathematical background, pointless if you don't. Written to help me clarify to myself a moment of categorical (pun not intended) confusion.

What's a possible way to understand the relationship between probability as a the (by Cox) extension of classical logic, probability as an optimal way to make decisions, and probability in the frequentist usage? Not in any deep philosophical sense, just in terms of pragmatics.

I like to begin from the Bayes/Jaynes/Cox view: if you take classical logic as valid (which I do in daily life) and want to extend it in a consistent way to continuous logic values (which I also do), then you end up with continuous logic/certainty values we unfortunately call probability due to historical reasons.

Perhaps surprisingly, its relationship with frequentist probability isn't necessarily contentious. You can take the Kolmogorov axioms as, roughly speaking, helping you define a sort of functor (awfully, based on shared notation and vocabulary, an observation that made me shudder a bit — it's almost magical thinking) between the place where you do probability-as-logic and a place where you can exploit the machinery of measure theory. This is a nice place to be when you have to deal with an asymptotically large number of propositions; possibly the Probability Wars were driven mostly by doing this so implicitly that we aren't clear about what we're putting *into* this machinery, and then, because the notation is similar, forgetting to explicitly go back to the world of propositions, which is where we want to be once we're done with the calculations.

What made me stare a bit at the wall is the other correspondence: Let's say that for some proposition $A$, $P[A] > P[\neg A]$ in the Bayesian sense (we're assuming the law of excluded middle, etc; this is about a different kind of confusing). Why should I bet that $A$? In other words, why the relationship between probability-as-certainty and probability-as-strategy? You can define probability based on a decision theoretic point of view (and historically speaking, that's how it was first thought of), but why the correspondence between those two conceptually different formulations?

It's a silly non-question with a silly non-answer, but I want to record it because it wasn't the first thing I thought of. I began by thinking about $P[\text{win} | (P(A) > P(\neg A)) \wedge \text{bet on } A]$, but that leads to a lot of circularity. It turns out that the forehead-smacking way to do it is simply to observe that the best strategy is to bet on $A$ is true iff $A$, and this isn't circular if we haven't yet assumed that probability-as-strategy is the same as probability-as-logic, but rather it's a non-tautological consequence of the assumed psychology and sociology of what bet on means: I should've done whatever ended up working, regardless of what the numbers told me (I'll try to feel less upset the next time somebody tells me that).

But then, in the sense of probability-as-logic, $P[\text{the best strategy is to bet on A}] = P[A]$ by substituting propositions (and hence without resorting to any frequentist assumption about repeated trials and the long term) so, generally speaking, you end up with probability-as-strategy being part of probability-as-logic. I'm likely counting angels dancing on infinitesimals here, but it's something it felt less clear to me earlier today: probability-as-strategy is probability-as-logic, you're just thinking about propositions about strategies, which, confusingly, in the simplest cases end up having the same numerical certainty values as the propositions the strategies are about. But those aren't the same propositions, although I'm not entirely sure that in practice, given the fundamentally intuitive nature of bet on (insert here very handwavy argument from evolutionary psychology about how we all descend from organisms who got this well enough not to die before reproducing), you get in trouble by not taking this into account.

Regularization, continuity, and the mystery of generalization in Deep Learning

A light and short note on a dense subset of a large space...

There's increasing interest in the very happy problem of why Deep Learning methods generalize so well in real-world usage. After all,

• Successful networks have ridiculous amounts of parameters. By all rights, they should be overfitting training data and doing awfully with new data.
• In fact, they are large enough to learn the classification of entire data sets even with random labels.
• And yet, they generalize very well.
• On the other hand, they are vulnerable to adversarial attacks with weird and entirely unnatural-looking inputs.

One possible very informal way to think about this — I'm not claiming it's an explanation, just a mental model I'm using until the community reaches a consensus as to what's going on — is the following:

• If the target functions we're trying to learn are (roughly speaking) nicely continuous (a non-tautological but often true property of the real world, where, e.g., changing a few pixels of a cat's picture rarely makes it cease to be one)...
• and regularization methods steer networks toward that sort of functions (partly as a side effect of trying to avoid nasty gradient blowups)...
• and your data set is more or less dense in whatever subset of all possible inputs is realistic...
• ... then, by a frankly metaphorical appeal to a property of continuous functions in Hausdorff spaces, learning well the target function on the training set implies learning well the function on the entire subset.

This is so vague that I'm having trouble keeping myself from making a political joke, but I've found it a wrong but useful model to think about how Deep Learning works (together with an, I think, essentially accurate model of Deep Learning as test-driven development) and how it doesn't.

As a bonus, this gives a nice intuition about why networks are vulnerable to weird adversarial inputs: if you only train the network with realistic data, no matter how large your data set, the most you can hope for is for it to be dense on the realistic subset of all possible inputs. Insofar as the mathematical analogy holds, you only get a guarantee of your network approximating the target function wherever you're dense; outside that subset — in this case, for unrealistic, weird inputs — all bets are off.

If this is true, protecting against adversarial examples might require some sort of specialized "realistic picture of the world" filters, as better training methods or more data won't help (in theory, you could add nonsense inputs to the data set so it can learn to recognize and reject it, but you'd need to pretty much cover the entire input subset with a dense set of samples, and if you're going to do that, then you might as well set up a lookup table, because you aren't anymore).

Why we should always keep Shannon in mind

Sometimes there's no school like old school. A couple of weeks ago I spent some time working with data from GitHub Archive, trying to come up with a toy model to predict repo behavior based on previous actions (will it be forked? will there be a commit? etc). My first attempt was to do a sort of brute-force Hidden Markov Model, synthesizing states from the last k actions such that the graph of state-to-state transition was as nice as possible (ideally, low entropy of belonging to a state, high entropy for the next state conditional on knowing the current one). The idea was to do everything by hand, as a way to get more experience with Hy in a work-like project.

All of this was fun (and had me dealing, weirdly enough, with memory issues in Python, although those might have been indirectly caused by Hy), but was ultimately the wrong approach, because, as I realized way, way too late, what I really wanted to do was just to predict the next action given a sequence of actions, which is the classical problem of modeling non-random string sequences (just consider each action a character in a fixed alphabet).

So I facepalmed and repeated to myself one of those elegant bits of early 20th-century mathematics we use almost every day and forget even more often: modeling is prediction is compression is modeling. It's all, from the point of view of information theory, just a matter of perspective.

If you haven't been exposed to the relationship of compression and prediction before, here's a fun thought experiment: if you had a perfect/good enough predictive model of how something behaves, you would just need to show the initial state and say "and then it goes as predicted for the next 10 GB of data", and that would be that. Instant compression! Having a predictive model lets you compress, and inside every compression scheme there's a hidden predictive model (for true enlightenment, go to Shannon's paper, which is still worthy of being read almost 70 years later).

As a complementary example, what the venerable Lempel-Ziv-Welch ("zip") compression algorithm does is, handwaving away bookkeeping details, to build incrementally a dictionary of the most frequent substrings, making sure sure that those are assigned the shorter names in the "translated" version. By the obvious counting arguments, this means infrequent strings will get names that are longer than they are, but on average you gain space (how much? entropy much!). But this also lets you build a barebones predictive model: given the dictionary of frequent substrings that the algorithm has built so far, look at your past history, see which frequent substrings extend your recent past, and assume one of them is going to happen — essentially, your prediction is "whatever would make for a shorted compressed version", which you know is a good strategy in general, because compressed versions do tend to be shorter.

So I implemented the core of a zip encoder in Hy, and then used it to predict github behavior. It's primitive, of course, and the performance was nothing to write a post about (which is why this post isn't called A predictive model of github behavior), but on the other hand, it's an extremely fast streaming predictive algorithm that requires zero configuration. Nothing I would use in a job &mdahs; you can get much better performance with more complex models, which are also the kind you get paid for — but it was educative to encounter a forceful reminder of the underlying mathematical unity of information theory.

In a world of multi-warehouse-scale computers and mind-bendingly complex inferential algorithms, it's good to remember where it all comes from.

The nominalist trap in Big Data analysis

Nominalism, formerly the novelty of a few, wrote Jorge Luis Borges, today embraces all people; its victory is so vast and fundamental that its name is useless. Nobody declares himself nominalist because there is nobody who is anything else. He didn't go on to write This is why even successful Big Data projects often fail to have an impact (except in some volumes kept in the Library of Babel), but his understandable omission doesn't make the diagnosis any less true.

Nominalism, to oversimplify the concept enough for the case at hand, is simply the assumption that just because there are many things in our world which we call chairs, that doesn't imply that the concept itself of a chair is real in a concrete sense, that there is an Ultimate, Really-Real Chair, perhaps standing in front of an Ultimate Table. We have things we call chairs, and we have the word "chair", and those are enough to furnish our houses and our minds, even if some carpenters still toss around at night, haunted by half-glimpses of an ideal one.

It has become a commonplace, quite successful way of thinking, so it's natural for it to be the basis of what's perhaps the "standard" approach to Big Data analysis. Names, numbers, and symbols are loaded into computers (account identifiers, action counters, times, dates, coordinates, prices, numbers, labels of all kinds), and then they are obsessively processed in an almost cabalistic way, organizing and re-organizing them in order to find and clarify whatever mathematical structure, and perhaps explanatory or even predictive power, they might have — and all of this data manipulation, by and large, takes place as if nothing were real but the relationships between the symbols, the data schemas and statistical correlations. Let's not blame the computers for it: they do work in Platonic caves filled with bits, with further bits being the only way in which they can receive news from the outside world.

This works quite well; well enough, in fact, to make Big Data a huge industry with widespread economic and, increasingly, political impact, but it can also fail in very drastic yet dangerously understated ways. Because, you see, from the point of view of algorithms, there *are* such things as Platonic ideals — us. Account 3788 is a reference to a real person (or a real dog, or a real corporation, or a real piece of land, or a real virus) and although we cannot right now put all of the relevant information about that person in a file, and associate it with the account number, that information, the fact of its being a person represented by a data vector, rather than a data vector, makes all the difference between the merely mathematically sophisticated analyst and the effective one. Properly performed, data analysis is the application of inferential mathematics to abstract data, together with the constant awareness and suspicion of the reality the data describes, and what this gap, all the Unrecorded bits, might mean for the problem at hand.

Massive multi-user games have failed because their strategic analysis confused the player-in-the-computer (who sought, say, silver) with the player-in-the-real-world (who sought fun, and cared for silver only insofar as that was fun). Technically flawless recommendation engines sometimes have no effect on user behavior, because even the best items were just boring to begin with. Once, I spent an hour trying to understand a sudden drop in the usage of a certain application in some countries but not in others, until I realized that it was Ramadan, and those countries were busy celebrating it.

Software programmers have to be nominalists — it's the pleasure and the privilege of coders to work, generally and as much as possible, in symbolic universes of self-contained elegance — and mathematicians are basically dedicated to the game of finding out how much truth can be gotten just from the symbols themselves. Being a bit of both, data analysts are very prone to lose themselves in the game of numbers, algorithms, and code. The trick is to be able to do so while also remembering that it's a lie — we might aim at having in our models as much of the complexity of the world as possible, but there's always (so far?) much more left outside, and it's part of the work of the analyst, perhaps her primary epistemological duty, to be alert to this, to understand how the Unrecorded might be the most important part of what she's trying to understand, and to be always open and eager to expand the model to embrace yet another aspect of the world.

The consequences of not doing this can be more than technical or economic. Contemporary civilization is impossible without the use of abstract data to understand and organize people, but the most terrible forms of contemporary barbarism, at the most demencial scales, would be impossible without the deliberate forgetfulness of the reality behind the data.

A short note to myself on Propp-Wilson sampling

Most of the explanations I've read of Propp-Wilson sampling describe the method in terms of "sampling from the past," in order to make sense of the fact that you get your random numbers before attempting to obtain a sample from the target distribution, and don't re-sample them until you succeed (hence the way the Markov chain is grown from $t_{-k}$ to $t_0$).

I find it more intuitive to think of this in terms of "sampling from deterministic universes." The basic hand-waving intuition is that instead of a non-deterministic system, you are sampling from a probabilistic ensemble of fully deterministic systems, so you first a) select the deterministic system (that is, the infinite series of random numbers you'll use to walk through the Markov chain), and b) run it until its story doesn't depend on the choice of original state. The result of this procedure will be a sample from the exact equilibrium distribution (because you have sampled from or "burned off" the two sources of distortion from this equilibrium distribution, the non-deterministic nature of the system and the dependence on the original state).

As I said, I think this is mathematically equivalent to Propp-Wilson sampling, although you'd have to tweak a bit the proofs. But it feels more understandable to me than other arguments I've read, so at least it has that benefit (assuming, of course, it's true).

PS: On the other hand "sampling from the past" is too fascinating a turn of phrase not to use, so I can see the temptation.

The perfectly rational conspiracy theorist

Conspiracy theorists don't have a rationality problem, they have a priors problem, which is a different beast. Consider a rational person who believes in the existence of a powerful conspiracy, and then reads an arbitrary online article; we'll denote by $C$ the propositions describing the conspiracy, and by $a$ the propositions describing the article's content. By Bayes' theorem,

$P(C|a) = \frac{P(a|C) P(C)}{P(a)}$

Now, the key here is that the conspiracy is supposed to be powerful. A powerful enough conspiracy can make anything happen or look like it happened, and therefore it'll generally be the case that $P(a|C) \geq P(a)$ (and usually $P(a|C) > P(a)$ for low-probability $a$, of which there are many in these days, as Stanislaw Lem predicted in The Chain of Chance). But that means that in general $P(C|a) \geq P(C)$, and often $P(C|a) > P(C)$! In other words, the rational evaluation of new evidence will seldom disprove a conspiracy theory, and will often reinforce its likelihood, and this isn't a rationality problem — even a perfect Bayesian reasoner will be trapped, once you get $C$ into its priors (this is a well-known phenomenon in Bayesian inference; I like to think of these as black hole priors).

Keep an eye open, then, for those black holes. If you have a prior that no amount of evidence can weaken, then that's probably cause for concern, which is but another form of saying that you need to demand falsifiability in empirical statements. From non-refutable priors you can do mathematics or theology (both of which segue into poetry when you are doing them right), but not much else.

Fractals are unfair

Let's say you want to identify an arbitrary point in a segment $I$ (chosen with an uniform probability distribution). A more or less canonical way to do this is to split the segment in two equal halves, and write down a bit identifying which half; now the size of the set where the point is hidden is one half of what it was. Because the half-segment we chose is affinely equivalent to the original one, we can repeat this as much as we want, gaining one bit of precision (halving the size of the "it's somewhere around here" set) for each bit of description. Seems fair.

It's easy to do the same on a square $I$ in $R^2$. Split the square in four squares, write down two bits to identify the one where the point is, repeat at will. Because each square has one fourth the size of the enclosing one, you gain two bits of precision for each two bits of description. Still fair (and we cannot do better).

Now try to do this on a fractal, say Koch's curve, and things go as weird as you'd think. You can always split it in four affinely equivalent pieces, but each of them is one-third the size of the original one, which means that you gain less than two bits of precision for each two bits of description. Now this is unfair. A fractal street would be a very good way of putting in an infinite number of houses inside a finite downtown, but (even approximate) driving directions will be longer than you'd think they should.

Of course, this is merely a paraphrasing of the usual definition of a fractal, which is an object whose fractal dimension exceeds its topological dimension (very hand-wavingly speaking, the number of bits of description in each step is higher than the number of bits of precision that you gain). But I do enjoy looking at things through an information theory lens.

Besides, there's a (still handwavy but clear) connection here with chaos, through the idea of trying to pin down the future of a system in phase space by pinning down its present. In conservative systems this is fair: one bit of precision in the present, gives you one bit of precision for the future (after all, volumes are preserved). But when chaos is involved this is no longer the case! For any fixed horizon, you need to put in a more bits of information about the present in order to get the same number of bits about the future.