Using machine learning to read Sherlock Holmes

A while ago I posted about how to use machine learning to understand brand semantics by mining Twitter data: not just to count mentions of brands, but to map the similarities and differences in how people think about them. But individual tweets are brief snapshots, just a few words written and posted in an instant. We can use the same methods to follow the flow of meaning through longer texts, and to look for patterns within and between stories.

For a quick first example, I downloaded three books of Sherlock Holmes short stories: The Adventures of Sherlock Holmes, The Memoirs of Sherlock Holmes, and The Return of Sherlock Holmes. The main reason is that I like them. Secondary reasons are that Holmes needs no introduction, and that the relatively stable structure of the plots makes it more likely the algorithms will have something to work with.

After extracting the thirty stories included in the books and splitting each story into its component paragraphs, I ran each paragraph through Facebook Research’s InferSent sentence embedding library. Like its counterparts from Google and elsewhere, it converts a text fragment into a point in an abstract 4096-dimensional space, in such a way that fragments with statistically similar meanings are mapped to points close to each other, and the geometric relationships between points encode their semantic differences (if this sounds a bit vague, the link above works through a concrete example, although mapping individual words instead of longer fragments).
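The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: it assumes each story is plain text with paragraphs separated by blank lines, and `toy_encode` is a hypothetical placeholder standing in for the real sentence encoder (it returns two crude numbers, not a semantic embedding).

```python
import re
from typing import Callable, List, Sequence


def split_paragraphs(story_text: str) -> List[str]:
    """Split plain text on blank lines and collapse line breaks inside paragraphs."""
    chunks = re.split(r"\n\s*\n", story_text)
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if p]  # drop empty leftovers


def embed_paragraphs(
    paragraphs: Sequence[str],
    encode: Callable[[str], Sequence[float]],
) -> List[List[float]]:
    """Map each paragraph to one vector using whatever encoder is passed in."""
    return [list(encode(p)) for p in paragraphs]


def toy_encode(text: str) -> List[float]:
    # Placeholder only (assumption): word count and character count.
    # A real run would call the sentence embedding model here instead.
    return [float(len(text.split())), float(len(text))]


sample = (
    "To Sherlock Holmes she is always\nthe woman.\n\n"
    "I had seen little of Holmes lately.\n"
)
paragraphs = split_paragraphs(sample)
vectors = embed_paragraphs(paragraphs, toy_encode)
```

Keeping the encoder as a parameter means the rest of the pipeline does not care which embedding library produced the vectors.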

The first question I wanted to ask the data — and the only one in this hopefully short post — is the most basic of any plot: are we at the beginning or at the end of the story? Even ignoring obvious marks like “Once upon a time…” and “The End”, readers of any genre quickly develop a fairly robust sense of how plots work, whether it’s a detective story or a romantic one. We have expectations about beginnings, middles, and endings that might be subverted by writers, but only because we have them. At the same time, it’s a “soft” cultural concept, of the type not traditionally seen as amenable to statistical analysis.

As Holmes would have suggested, I went to it in a methodical manner. The first order of business was, as always, to define the problem. I had between one and three hundred abstract points (one for each paragraph) for each story. To clarify what counts as beginning, middle, and end, I split each story into five segments — the first fifth of paragraphs, the second fifth of paragraphs, and so on — and took the first, third, and last segments as the story’s beginning, middle, and end. As usual when using this sort of embedding technique, I summarized each segment simply by taking the average of all the paragraphs/points in it (that’s not always the best-performing technique, but works as a reasonable baseline).
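The segmenting and averaging described above might look something like this in NumPy. It is a sketch under assumptions: the function names are made up for illustration, the random array stands in for one story's real paragraph embeddings, and letting `np.array_split` handle paragraph counts that are not a multiple of five is my choice rather than something stated in the post.

```python
import numpy as np


def segment_means(paragraph_vectors, n_segments=5):
    """Split a story's paragraph vectors into contiguous segments and average each one."""
    vectors = np.asarray(paragraph_vectors, dtype=float)
    # array_split allows uneven segments: the earlier ones get the extra paragraphs.
    segments = np.array_split(vectors, n_segments, axis=0)
    return np.stack([segment.mean(axis=0) for segment in segments])


def beginning_middle_end(paragraph_vectors):
    """Return the mean vectors of the first, third, and fifth fifths of a story."""
    means = segment_means(paragraph_vectors, n_segments=5)
    return means[0], means[2], means[4]


# Random stand-in for one story: 103 paragraphs, 4096 numbers each (not real data).
rng = np.random.default_rng(0)
story = rng.normal(size=(103, 4096))
beginning, middle, end = beginning_middle_end(story)
```

Each of the three results is a single 4096-number vector, so a whole story collapses to a 3 x 4096 array.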

So now I had reduced the entire text of each story into three abstract points, each point a collection of 4096 numbers. Watson would doubtless have been horrified. Did all this statistical scrambling leave us with enough information to tell, without cheating, which points are beginnings, middles, and ends?

Plots in 4096 dimensions aren’t always clear, but a two-dimensional isometric embedding wasn’t very promising either:

There are thirty points of each type, one for each story, but, as you can see, there isn’t any clear pattern that would allow us to say, if it weren’t for the color coding, which ones are beginnings, middles, or ends. (Of course, this doesn’t prove there isn’t an obvious pattern in their original 4096-dimensional space; but we’re trying for now to explore human-friendly ways to probe the data.)
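For readers who want to try a similar picture, here is one simple way to squeeze high-dimensional points into two dimensions. Note the substitution: this sketch uses a plain PCA projection computed with NumPy's SVD, which is not necessarily the embedding method behind the plot above, and the input is random stand-in data with an assumed layout of 30 stories times 3 segments.

```python
import numpy as np


def project_2d(points):
    """Project points onto their two directions of largest variance (PCA via SVD)."""
    X = np.asarray(points, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Random stand-in for 90 summary vectors (30 stories x 3 segments), not real data.
rng = np.random.default_rng(1)
summaries = rng.normal(size=(90, 4096))
coords = project_2d(summaries)  # one (x, y) pair per summary vector
```

The resulting `coords` array can be handed to any scatter-plot routine, colored by segment type.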

There was still hope, though. After all, our intuitions about plots aren’t based exactly on what’s being told at each point, but on how it changes over the course of the story; the same marriage that ends a romantic comedy can be the prologue of a slowly unfolding drama.

Fortunately, we can use the abstract points we extracted to calculate differences between texts. Just as the difference between the vectors for the words “king” and “queen” is similar to the difference between the vectors for “man” and “woman” (and therefore encodes, in a way, the semantics of the difference in gender, at least in the culture- and context-specific ways present in the language patterns used to train the algorithm), the differences between the summary vectors for the beginning, middle, and end of a story encode… something, hopefully.
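In code, the difference vectors are plain subtractions. The sketch below assumes the direction "later segment minus earlier segment" (the reverse convention would work equally well as long as it is applied consistently), and the arrays are random stand-ins for the thirty stories' summary vectors.

```python
import numpy as np


def transitions(beginning, middle, end):
    """Return (beginning-to-middle, middle-to-end) difference vectors.

    Accepts single vectors or arrays with one row per story.
    """
    b, m, e = (np.asarray(v, dtype=float) for v in (beginning, middle, end))
    return m - b, e - m


# Random stand-ins: one row per story, 4096 numbers per summary (not real data).
rng = np.random.default_rng(2)
n_stories, dim = 30, 4096
beginnings = rng.normal(size=(n_stories, dim))
middles = rng.normal(size=(n_stories, dim))
ends = rng.normal(size=(n_stories, dim))

first_half, second_half = transitions(beginnings, middles, ends)
```

Each story now contributes two points, one per transition, instead of three.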

And, indeed, they do:

Let’s be careful to understand what the evidence is telling us. Each point encodes mathematically the difference between the summary meanings of two fragments of text. The red points describe statistically the path between the beginning and middle of a story, and the green ones between that middle and the end. The fact that red and green points are pretty well grouped, even when plotted in two instead of their native 4096 dimensions, indicates that, even after we reduced Doyle’s prose to a few numbers, there’s still enough statistical structure to distinguish the path, as it were, between the beginning of a story and its middle, and thence to its resolution. In other words, just as online advertisers and political analysts do “sentiment analysis” as a matter of course, it’s also possible to do “plot analysis.”
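One hypothetical way to put a number on "pretty well grouped" is to check how often a transition vector lies closer to the average of its own group than to the average of the other group, leaving the vector itself out of the averages. This check is not part of the original analysis, and the toy data below are synthetic clusters built to be separable, so the score says nothing about the actual stories.

```python
import numpy as np


def loo_nearest_centroid_accuracy(first_half, second_half):
    """Leave-one-out accuracy of a nearest-centroid rule on two groups of vectors."""
    X = np.vstack([first_half, second_half]).astype(float)
    y = np.array([0] * len(first_half) + [1] * len(second_half))
    correct = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i  # hold out point i
        centroids = [X[keep & (y == label)].mean(axis=0) for label in (0, 1)]
        distances = [np.linalg.norm(X[i] - c) for c in centroids]
        correct += int(np.argmin(distances) == y[i])
    return correct / len(X)


# Synthetic stand-in: two clearly separated clouds of 30 points each (not real data).
rng = np.random.default_rng(3)
toy_first = rng.normal(loc=-2.0, size=(30, 50))
toy_second = rng.normal(loc=2.0, size=(30, 50))
accuracy = loo_nearest_centroid_accuracy(toy_first, toy_second)
```

A score near 0.5 would mean the two kinds of transition are indistinguishable by this rule; a score near 1.0 would mean they fall into separate groups.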

A final observation. Low-dimensional clusters are seldom perfectly clean, but there’s an exception in the graph above that merits a closer look:

The points highlighted correspond to the beginning-to-middle and middle-to-end transitions of The Crooked Man. The latter transition is reasonably positioned in our graph, but why does the first half of the story look, to our algorithm’s cold and unsympathetic eyes, like a second half? At first blush it looks like a serious blunder.

It could be; explaining and verifying the output of any non-trivial model requires careful analysis beyond the scope of this quick post. But if you read the story (I linked to it above), you’ll note that the middle of it, which begins with Watson’s poetic “A monkey, then?”, continues with Holmes explaining a series of inferences he made. That’s usually how Doyle closes his plots, and in fact the rest of the story is just Holmes and Watson taking a train and interrogating somebody.

This case suggests that our quick analysis wasn’t entirely off the mark: the algorithm picked up a real pattern in Doyle’s writing, and raised an alarm when the author strayed from it.

The family of algorithms in this post descends from those that revolutionized computer translation years ago; since then, they have continuously helped build tools that let computers understand written and spoken language, as well as images, in ways that complement and extend what humans do. Quantitative approaches to text structure will certainly open new opportunities as well.