At its best, OpenAI's DALL-E 2 can feel like God's image search engine: You give it a prompt and it conjures images from the vast database of the potential (neither this nor anything that follows is necessarily OpenAI's position or opinion on anything at all; I just got access for a while to the web interface for DALL-E 2 for external testing purposes). Or, depending on the metaphors you bring to the interaction, it can feel like a very quick — if sometimes quirky — freelancer.
Both metaphors are wrong, of course; the reality is much stranger. But there's much work and skill invested in eliding the conceptual weirdness of the process: much like with humans (or with Google, another wrapped-weirdness technological success), the intuitive experience is that language comes in, images come out, and some sort of creation happens in the middle. Much of the coverage of DALL-E 2 and similar technologies focuses on this intermediate step: journalists and commentators discuss what DALL-E 2 "is doing," what it does or doesn't "understand," and so on.
It's a valid metaphor, but metaphors aren't facts; you can always pick new ones to play with.
Another way to look at DALL-E 2, not technically accurate either but perhaps conceptually closer, is as a pair of maps: one going from the language of your prompts into… something, and another from out of this something into images.
The usual term for that something in the middle is the beautifully evocative latent space. It's a space of potential images: as you move your finger, as it were, along the world of language, DALL-E 2 moves through this latent space, and on the other side the images change as well.
It's important to remark that there's nothing unique about DALL-E 2's latent space. Train it again with a different random seed, never mind different parameters or architectural details, and you'll get an entirely different space with entirely different maps between that space and the worlds of language and images. On its own, it's also one of infinitely many possible maps of something even more abstract, something that's common to language and images and the maps we use to bridge between them.
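A toy sketch can make this concrete. It assumes, purely for illustration, that "retraining with a different seed" amounts to a random rotation of the same underlying points; real retrained models are not related in such a clean way, and every name below is hypothetical. The coordinates of the two spaces disagree completely, while the distances between concepts, the structure they have in common, stay the same.

```python
import numpy as np

def random_rotation(dim, seed):
    """Random orthogonal matrix: a stand-in (assumption) for 'training with another seed'."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

# Five hypothetical concepts living in an 8-dimensional "abstract" space.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(5, 8))

# Two different "latent spaces": the same concepts seen through two different maps.
space_a = concepts @ random_rotation(8, seed=1)
space_b = concepts @ random_rotation(8, seed=2)

# The coordinates differ, but the relationships between concepts agree.
coords_differ = not np.allclose(space_a, space_b)
structure_agrees = np.allclose(pairwise_distances(space_a), pairwise_distances(space_b))
print(coords_differ, structure_agrees)
```

Neither set of coordinates is the "right" one here; what survives the change of map is the pattern of distances.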
The mathematical concept is direct enough: mathematicians are trained to think like this right from the beginning, seeking not just maps between things but also maps between maps, and seeing all of them as examples of an underlying abstract thing in common that is often the real point of interest.
Using a tool like DALL-E 2 can be such a fascinating experience of language and images that it's easy to forget that we're exploring something far more abstract than either, something that might have (must have?) information about how we talk about the world and about our images of it. Not about the world itself necessarily, as DALL-E 2 isn't designed in that way, but we humans are as fascinated by how we see, talk, and think about things as we are by the things themselves.
This isn't something researchers and developers are ignoring: besides the intellectual interest of understanding structures in latent spaces, this understanding is used to design tools that can work as "volume knobs" for things like "happiness" or "price": find a path along the latent space that correlates with a path of words ("happy," "content," "calm," "sad," etc.) and you can change them just as easily as you can change contrast or brightness.
It's not, however, something that's often remarked on in public accounts. That's to be expected: it's neither visible in the experience of using the tools nor as viscerally fascinating as seeing words become images.
Yet the long-term intellectual impact of looking at these latent spaces as objects of intrinsic interest can be just as deep as the practical impact of the tools themselves. People tend to think that the main use of mathematics in physics is for measurement and calculation, and there's some historical truth to that, but it was also necessary to allow us to design new languages to talk about the same aspects of the world. Modern physics measures and models things that have no direct correlation with our intuitive experience: maps of the world that are used and switched depending on what will make understanding clearer and computation easier in a given context.
We still have much further to go in other areas, with the arts perhaps the least advanced in that direction. If the object of the arts is endlessly creative (you can make up art, and you shouldn't make up facts), their language, the coordinate systems we use to talk about images, books, and poems, is conservative and contingent. Building new descriptive frameworks, new maps of what images and texts can exist, is often something of a creative act itself, fascinating and often enlightening, but it lacks the intellectual toolset that lets us make up and explore maps of anything we can map with numbers.
Tools like DALL-E 2 and its fast-growing cohort of AIs, dealing with anything from text, images, and music to chemical structures and even game strategies, do not just point at practical applications; they are built around abstract quantitative mappings of aspects of the world we hadn't yet mapped in this way. None of these maps is deeply significant; none of them is the "true" shape of language or images. But by exploring them with the care and curiosity with which we would analyze the index of a well-known library made by somebody from a wholly different culture (or even species!), we will gain new ways to look at what we read, see, and listen to, what we create, and what we haven't yet thought of creating because our maps didn't show those blanks.
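As a rough illustration of the "volume knob" idea, here is a minimal sketch with made-up data. It assumes a hypothetical encoder (toy_embed) in which happiness happens to lie along a single hidden direction, plus invented happiness levels for four words. This is not how DALL-E 2 or any particular tool works internally, and real latent spaces are far messier. The knob is simply the direction from "sad" to "happy," and turning it means sliding a latent point along that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Assumption: a hidden "happiness" axis exists in this toy latent space.
happiness_axis = rng.normal(size=dim)
happiness_axis /= np.linalg.norm(happiness_axis)

def toy_embed(level):
    """Hypothetical text encoder: a point along the hidden axis plus a little noise."""
    return level * happiness_axis + 0.05 * rng.normal(size=dim)

# Invented happiness levels for a path of words, from sad to happy.
word_levels = {"sad": -1.0, "calm": 0.0, "content": 0.5, "happy": 1.0}
latents = {word: toy_embed(level) for word, level in word_levels.items()}

# Estimate the knob direction from the two ends of the word path.
direction = latents["happy"] - latents["sad"]
direction /= np.linalg.norm(direction)

def turn_knob(latent, amount):
    """Slide a latent point along the estimated direction."""
    return latent + amount * direction

def reading(latent):
    """Where a latent point sits along the knob direction."""
    return float(latent @ direction)

start = latents["calm"]
happier = turn_knob(start, 0.8)
sadder = turn_knob(start, -0.8)
print(reading(sadder), reading(start), reading(happier))
```

In a real system the shifted latent would then be passed through the second map, from latent space to images, to see the change; that step is left out here because it depends entirely on the model.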
The AI illustrator is now here, and, depending on your definitions and philosophy, the AI artist may be here or close.
Few people are waiting for the AI critic – but it will also, on its way, change the world.