There's a paper doing the rounds online (again) about how you can now "read minds with Stable Diffusion". As is often the case, it's very impressive but not what the hype (or criti-hype) headlines (and often the full articles...) say, so I thought it'd be good to give a very short and informal description of what they did and didn't do, and which parts of it are the really interesting ones. This is a quickly drafted first impression, not a formal analysis, so caveat reader.
Here's the paper. Of course, if you can read and follow a paper that's always better than reading a tweet about a post about an article about it.
Anyway, this is the short version of what they did (although I'm fibbing about some of it — they didn't actually run the experiments themselves, they used an existing data set):
- Pick four (4) people.
- Put them inside an fMRI machine.
- Show them a couple of tens of thousands of images and record their brain activity on each.
- Take an existing image generation model. Remember: an image generation model is a program trained to turn a text description into a latent representation (a big numeric vector) and that vector into an image.
- Train a (relatively) very simple model to predict, from the brain activity recorded while watching an image (and reading its description), the latent vector the image generation model would assign to that image and description.
Then they showed them other images, used their brain activity and the small model to get a latent vector, and used the large model to generate an image from it. And those generated images are (sometimes) quite close to the original ones!
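To make the pipeline concrete, here's a minimal sketch of that bridge step in Python. Everything in it is a synthetic stand-in: the voxel counts, the latent size, and the choice of ridge regression as the "very simple model" are my assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of the decoding setup; all data and sizes here are synthetic
# placeholders. The real work uses recorded fMRI and the latents of an actual
# image generation model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_voxels, latent_dim = 8000, 5000, 4096  # hypothetical sizes

fmri = rng.standard_normal((n_images, n_voxels))       # one activity pattern per viewed image
latents = rng.standard_normal((n_images, latent_dim))  # latent the generator assigns to that image

X_train, X_test, z_train, z_test = train_test_split(
    fmri, latents, test_size=0.1, random_state=0)

# The "(relatively) very simple model": a regularized linear map from voxels to latents.
brain_to_latent = Ridge(alpha=1.0)
brain_to_latent.fit(X_train, z_train)

z_pred = brain_to_latent.predict(X_test)  # predicted latents for unseen images' brain activity
# In the real pipeline, each predicted latent is handed to the frozen image
# generation model, which renders it back into a picture.
```

The thing the sketch tries to convey is how little machinery sits between the brain data and the generator: everything hard about images lives inside the pretrained model.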
That's cool. Very cool. But what's going on?
What they aren't doing is what you would call "mind reading": they aren't showing an arbitrary random image to a random person, reading their brain activity, and then figuring out the image just from that. However, for those four people, for that sort of image, in that context... it's sort of that?
Now, the fact that you can map brain activity to latent vectors in a good image generation model and reconstruct images that way isn't, by itself, huge. An image generation model is by construction a sort of index of pretty much every reasonable image (or at least every reasonable image on the internet) plus many of their semantically reasonable interpolations, so it stands to reason that you can use it to encode pretty much every reasonable image you'd show.
What is really interesting is that you can build that map in a relatively simple way, with (comparatively) few data points and (comparatively; still huge in absolute terms) few degrees of freedom in the model. What's interesting is not that it can be done (with enough parameters you can map pretty much anything to anything), but that it can be done in a relatively simple way.
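To put rough numbers on "few degrees of freedom": the counts below are made-up placeholders, chosen only to show the orders of magnitude involved, not taken from the paper.

```python
# Back-of-the-envelope parameter counts; every number here is a hypothetical
# placeholder meant only to convey scale.
n_voxels = 5_000                   # hypothetical usable fMRI voxels
latent_dim = 4_096                 # hypothetical latent size of the image generator
generator_params = 1_000_000_000   # order of magnitude of a large image generation model

linear_map_params = n_voxels * latent_dim  # a plain linear map from voxels to latents
print(f"brain-to-latent map: ~{linear_map_params:.1e} parameters")
print(f"image generator:     ~{generator_params:.1e} parameters")
# The map that does the "mind reading" is a rounding error next to the model
# that already indexes the space of reasonable images.
```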
As an analogy: what Brahe/Kepler/Newton did wasn't predicting astronomical observations (that was already in our bag of tricks); it was doing it in a simple way. They showed that if you used the right mathematics astronomy was simple, and in fact you could put it together with Earthbound physics and the whole thing was still simple. Compiling large data sets isn't trivial — it takes organization, culture, engineering, resources — but building simple, consistent, and generalizable explanations out of them is when things get really powerful.
So what can we suspect from what they did? Consider first the fact that they did it for four people. Four isn't one, but it's not a lot; the conservative assumption would be that the model is really four individual models mashed together, and that if you tried to do the same with more people and the same number of parameters you'd just get poorer results. The maximalist assumption, that a model trained on these four would work for any random person, is, I think, much more unlikely.
Even in the worst case, the fact that a lower-dimensional mapping works suggests the following:
- We already knew, because we can build image generation models, that the semantic space of Internet images is relatively low dimensional.
- We now see that a relatively low-dimensional projection of fMRI measurements also seems to be good enough, at least for a nontrivial range of images. That this would be possible from very precise measurements was already to be expected (because of the point above); that the fMRI measurements we have are good enough to do it sort-of-well is interesting as a matter of the biology of neural coding — it looks like the brain isn't being particularly subtle about encoding the most important dimensions needed to reconstruct an image, which is good engineering after all.
- That this projection is relatively simple to compute is very interesting. "Neurons" in neural networks are at best the simplest of toy models of how real neurons work (which is to say, not really models). Yet it looks like the way the (their?) brain encodes images and the way a trained image generation model encodes images (both shaped to maximize semantic similarity) are close enough to be mathematically transformable into each other with less trouble than you might expect. Does this mean that the visual semantics of the world as we see and describe it are simple enough that, within a broad set of encoding mechanisms that includes both whatever brains do and whatever our image generation models do, you end up with broadly similar encodings?
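One way to poke at the low-dimensionality claim empirically would be something like the sketch below: compress the voxel space before fitting the map and watch how quickly performance saturates. Everything in it (the data, the component counts, the use of PCA plus ridge regression) is a hypothetical stand-in, not the paper's analysis.

```python
# Sketch: how many dimensions of the voxel space does the latent prediction
# actually need? Data here is synthetic, so the printed scores are meaningless;
# the shape of the experiment is the point.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
fmri = rng.standard_normal((2000, 5000))     # activity patterns
latents = rng.standard_normal((2000, 512))   # target latents

for k in (50, 200, 1000):
    model = make_pipeline(PCA(n_components=k), Ridge(alpha=1.0))
    model.fit(fmri[:1800], latents[:1800])
    score = model.score(fmri[1800:], latents[1800:])  # R^2 on held-out patterns
    print(f"{k:5d} components -> R^2 = {score:.3f}")
```

On real data, the curve flattening out after a modest number of components would be the quantitative version of "the brain isn't being subtle about the important dimensions".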
On the other hand, I'm throwing around ideas like "simple" and "few" while actually talking about thousands of dimensions. This might be insanity on my part: maybe the fact that our image spaces live in potentially absurdly high-dimensional spaces but are all encodable in much smaller ones is making us feel that we're doing something special with that encoding, while in truth that "much smaller" space is large enough that we are actually still absurdly wasteful. In the analogy above, maybe we are still pre-Newtonian.
In technological terms... Well, the first thing to look at would be how the same model performs with random people; there's a difference between learning to reconstruct images for a single person (or a few) and doing it for random or nearly-random ones. It'd also be a nice thing for neuroscientists to know! The second thing to note is that fMRI machines aren't the most portable of BCI platforms, but that doesn't worry me much: the fact that you can get a low-dimensional encoding should give us hope that it'd be possible to do the same with less-than-awesome BCI sensors, and that is awesome. We might be closer to arbitrary image reconstruction from brain data than we thought we were!
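The cross-person question can at least be framed precisely. Here's a hedged sketch of the obvious test: train the voxel-to-latent map on some subjects and score it on a held-out one. The data, the sizes, and the assumption that voxels are already aligned across brains (a hard problem in its own right) are all mine, not the paper's.

```python
# Leave-one-subject-out check on synthetic data; in reality the voxel spaces of
# different brains would first need to be aligned to a common space.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
subjects = {s: (rng.standard_normal((1000, 5000)),   # fMRI patterns
                rng.standard_normal((1000, 512)))    # target latents
            for s in "ABCD"}

for held_out in subjects:
    X_train = np.vstack([x for s, (x, _) in subjects.items() if s != held_out])
    z_train = np.vstack([z for s, (_, z) in subjects.items() if s != held_out])
    X_test, z_test = subjects[held_out]
    model = Ridge(alpha=1.0).fit(X_train, z_train)
    print(f"subject {held_out} held out: R^2 = {model.score(X_test, z_test):.3f}")
# Scores that collapse for the held-out subject favor the "four individual models
# mashed together" reading; scores that hold up would mean the mapping says
# something about brains in general, not just these four.
```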
All in all, this is a cool experiment with more implications, perhaps, than just the technical proof of concept. We are ultimately learning things about the mathematical structure of our perception and language, and that's going to have a large impact not just on the applications side but also on our understanding of neuroscience and on more than one branch of philosophy.
Interesting times: the more so the less you look at the headlines and the more you focus on what the details imply.