
Using machine learning to read Sherlock Holmes

A while ago I posted about how to use machine learning to understand brand semantics by mining Twitter data — not just to count mentions, but to map the similitudes and differences in how people think about them. But individual tweets are brief snapshots, just a few words written and posted in an instant of time. We can use the same methods to understand the flow of meaning along longer texts, to seek patterns in and between stories.

For a quick first example, I downloaded three books of Sherlock Holmes short stories: The Adventures of Sherlock Holmes, The Memoirs of Sherlock Holmes, and The Return of Sherlock Holmes. The main reason is that I like them. Secondary reasons are that Holmes needs no introduction, and that the relatively stable structure of the plots makes it more likely the algorithms will have something to work with.

After extracting the thirty stories included in the books and splitting each story into its component paragraphs, I ran each paragraph through Facebook Research's InferSent sentence embedding library. Similar to its counterparts from Google and elsewhere, it converts a text fragment into a point in an abstract 4096-dimensional space, in such a way that fragments with statistically similar meanings are mapped to points close to each other, and the geometric relationships between points encode their semantic differences (if this sounds a bit vague, the link above works through a concrete example, although mapping individual words instead of longer fragments).

The first question I wanted to ask the data — and the only one in this hopefully short post — is the most basic of any plot: are we at the beginning or at the end of the story? Even ignoring obvious marks like "Once upon a time..." and "The End", readers of any genre quickly develop a fairly robust sense of how plots work, whether it's a detective story or a romantic one. We have expectations about beginnings, middles, and endings that might be subverted by writers, but only because we have them. At the same time, it's a "soft" cultural concept, of the type not traditionally seen as amenable to statistical analysis.

As Holmes would have suggested, I went to it in a methodical manner. The first order of business was, as always, to define the problem. I had between one and three hundred abstract points (one for each paragraph) for each story. To clarify what counts as beginning, middle, and end, I split each story into five segments — the first fifth of paragraphs, the second fifth of paragraphs, and so on — and took the first, third, and last segments as the story's beginning, middle, and end. As usual when using this sort of embedding technique, I summarized each segment simply by taking the average of all the paragraphs/points in it (that's not always the best-performing technique, but works as a reasonable baseline).
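A minimal sketch of that segmentation-and-averaging step, using only the Python standard library. The function names are hypothetical, the toy vectors have 2 components instead of 4096, and the rounding rule used to place the segment boundaries is an assumption (the post doesn't say how uneven splits were handled).

```python
from statistics import fmean


def split_into_fifths(points):
    """Split a list into five contiguous segments of near-equal size.

    Assumption: boundaries are placed by rounding i * n / 5.
    """
    n = len(points)
    bounds = [round(i * n / 5) for i in range(6)]
    return [points[bounds[i]:bounds[i + 1]] for i in range(5)]


def mean_vector(points):
    """Component-wise average of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*points)]


def summarize_story(points):
    """Return (beginning, middle, end): means of the 1st, 3rd, and 5th fifths.

    Assumes the story has at least five paragraphs, so no segment is empty.
    """
    segments = split_into_fifths(points)
    return tuple(mean_vector(segments[i]) for i in (0, 2, 4))


# Toy data: ten "paragraphs" embedded in 2 dimensions instead of 4096.
paragraphs = [[float(i), float(10 - i)] for i in range(10)]
beginning, middle, end = summarize_story(paragraphs)
print(beginning, middle, end)
```

The second and fourth fifths are simply discarded, which leaves a gap between the three summaries so that "beginning", "middle", and "end" don't share any paragraphs.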

So now I had reduced the entire text of each story into three abstract points, each point a collection of 4096 numbers. Watson would doubtless have been horrified. Did all this statistical scrambling leave us with enough information to tell, without cheating, which points are beginnings, middles, and ends?

Plots in 4096 dimensions aren't exactly readable, and a two-dimensional isometric embedding wasn't very promising either:

There are thirty points of each type, one for each story, but, as you can see, there isn't any clear pattern that would allow us to say, if it weren't for the color coding, which ones are beginnings, middles, or ends. (Of course, this doesn't prove there isn't an obvious pattern in their original 4096-dimensional space; but we're trying for now to explore human-friendly ways to probe the data.)

There was still hope, though. After all, our intuition for plots isn't based exactly on what's being told at each point, but on how it changes along the story; the same marriage that ends a romantic comedy can be the prologue of a slowly unfolding drama.

Fortunately, we can use the abstract points we extracted to calculate differences between texts. Just as the difference between the vectors for the words "king" and "queen" is similar to the difference between the vectors for "man" and "woman" (and therefore encodes, in a way, the semantics of the difference in gender, at least in the culture- and context-specific ways present in the language patterns used to train the algorithm), the differences between the summary vectors for the beginning, middle, and end of stories encode... something, hopefully.
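To make the idea of a "difference between summary vectors" concrete, here is a small sketch under stated assumptions: the function names are hypothetical, the two toy stories are made up, and cosine similarity is my own choice of a quick way to compare transitions (the post itself only compares them visually).

```python
import math


def subtract(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def transitions(beginning, middle, end):
    """Return the beginning-to-middle and middle-to-end difference vectors."""
    return subtract(middle, beginning), subtract(end, middle)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two made-up stories, each given as (beginning, middle, end) summaries in 2-D.
story_one = ([0.0, 0.0], [1.0, 0.0], [1.0, 1.0])
story_two = ([2.0, 2.0], [4.0, 2.0], [4.0, 5.0])

first_half_one, second_half_one = transitions(*story_one)
first_half_two, second_half_two = transitions(*story_two)

# Same kind of transition in both stories: the vectors point the same way.
print(cosine(first_half_one, first_half_two))
# Different kinds of transition: the vectors are unrelated.
print(cosine(first_half_one, second_half_two))
```

Note that the two toy stories sit in different places (their beginnings are far apart), yet their transitions line up. That is the point of working with differences rather than with the summary points themselves.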

And, indeed, they do:

Let's be careful to understand what the evidence is telling us. Each point encodes mathematically the difference between the summary meanings of two fragments of text. The red points describe statistically the path between the beginning and middle of a story, and the green ones between that middle and the end. The fact that red and green points are pretty well grouped, even when plotted in two instead of their native 4096 dimensions, indicates that, even after we reduced Doyle's prose to a few numbers, there's still enough statistical structure to distinguish the path, as it were, between the beginning of a story and its middle, and thence to its resolution. In other words, just as online advertisers and political analysts do "sentiment analysis" as a matter of course, it's also possible to do "plot analysis."

A final observation. Low-dimensional clusters are seldom perfectly clean, but there's an exception in the graph above that merits a closer look:

The points highlighted correspond to the beginning-to-middle and middle-to-end transitions of The Crooked Man. The latter transition is reasonably positioned in our graph, but why does the first half of the story look, to our algorithm's cold and unsympathetic eyes, like a second half? At first blush it looks like a serious blunder.

It could be; explaining and verifying the output of any non-trivial model requires careful analysis beyond the scope of this quick post. But if you read the story (I linked to it above), you'll note that the middle of it, which begins with Watson's poetic "A monkey, then?", continues with Holmes explaining a series of inferences he made. That's usually how Doyle closes his plots, and in fact the rest of the story is just Holmes and Watson taking a train and interrogating somebody.

This case suggests that our quick analysis wasn't entirely off the mark: the algorithm picked up a real pattern in Doyle's writing, and raised an alarm when the author strayed from it.

The family of algorithms in this post descends from those that revolutionized computer translation years ago; since then, they have continuously helped build tools that help computers understand written and spoken language, as well as images, in ways that complement and extend what humans do. Quantitative approaches to text structure will certainly open new opportunities as well.

A first look at phrase length distribution

Here's a sentence length vs. frequency distribution graph for Chesterton, Poe, and Swift, plus Time of Punishment.

Phrase length distribution

A few observations:

  • Take everything with a grain of salt. There are features here that might be artifacts of parsing and so on.
  • That said, it's interesting that Poe seems to fancy short interjections more than Chesterton does (not as much as I do, though).
  • Swift seems to have a more heterogeneous style in terms of phrase lengths, compared with Chesterton's more marked preference for relatively shorter phrases.
  • Swift's average sentence length is about 31 words, almost twice Chesterton's 18 (Poe's is 21, and mine is 14.5). I'm not sure how reasonable that looks.
  • Time of Punishment's choppy distribution is just an artifact of the low number of samples.
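For reference, a distribution like this can be computed in a few lines. The sketch below is not the parser behind the graph: it assumes a naive rule (a sentence ends at ".", "!" or "?" followed by whitespace), which is exactly the kind of shortcut that produces the parsing artifacts mentioned in the first observation. The function names and sample text are made up for illustration.

```python
import re
from collections import Counter


def sentence_lengths(text):
    """Word counts per sentence, splitting naively after '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]


def length_distribution(text):
    """Map each sentence length to the fraction of sentences with that length."""
    lengths = sentence_lengths(text)
    counts = Counter(lengths)
    total = len(lengths)
    return {n: c / total for n, c in sorted(counts.items())}


sample = "Elementary. You see, but you do not observe! Why is that?"
lengths = sentence_lengths(sample)
print(lengths)
print(sum(lengths) / len(lengths))  # average sentence length in words
print(length_distribution(sample))
```

Abbreviations ("Mr."), quoted dialogue, and ellipses will all confuse a splitter this simple, so averages computed this way should be read as rough.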

Chesterton's magic word squares

Here are the magic word squares for a few of Chesterton's books. Whether and how they reflect characteristics that differentiate them from each other is left as an exercise to the reader.


the same way of this
world was to it has
and not think would always
i have been indeed believed
am no one thing which

The Man Who Was Thursday

the man of this agreement
professor was his own you
had the great president are
been marquis started up as
broken is not to be

The Innocence of Father Brown

the other side lay like
priest in that it one
of his is all right
this head not have you
agreement into an been are

The Wisdom of Father Brown

the priest in this time
other was an agreement for
side not be seen him
explained to say you and
father brown he had then

Magic Squares of (probabilistically chosen) Words

Thinking about magic squares, I had the idea of doing something roughly similar with words, but using usage patterns rather than arithmetic equations. I'm pasting below an example, using statistical data from Poe's texts:


the same manner as if
most moment in this we
intense and his head were
excitement which i have no
greatly he could not one

The word on the top-left cell in the grid is the most frequently used in Poe's writing, "the" — unsurprisingly so, as it's the most frequently used word in the English language. Now, the word immediately to its right, "same," is there because "same" is one of the words that follows "the" most often in the texts we're looking at. The word below "the" is "most" because it also follows "the" very often. "Moment" is set to the right of "most" and below "same" because it's the word that most frequently follows both.

The same pattern is used to fill the entire 5-by-5 square. If you start at the topmost left square and then move down and/or to the right, although you won't necessarily be constructing syntactically correct phrases, the consecutive word pairs will be frequent ones in Poe's writing.
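The filling rule can be sketched as follows. This is a deterministic, greedy approximation, not the code that produced the squares in this post (which, as the title says, chose words probabilistically). The assumptions are mine: each word is used at most once, a candidate is scored by the smaller of its follow-counts after the left and upper neighbours, and ties are broken alphabetically. The function name and toy text are hypothetical.

```python
import re
from collections import Counter, defaultdict


def build_square(text, size=5):
    """Greedily fill a size-by-size grid so adjacent words form frequent pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    vocabulary = sorted(set(words))
    if len(vocabulary) < size * size:
        raise ValueError("not enough distinct words to fill the square")

    frequency = Counter(words)
    # followers[a][b] = how many times word b comes right after word a.
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1

    grid = [[None] * size for _ in range(size)]
    used = set()
    for r in range(size):
        for c in range(size):
            if r == 0 and c == 0:
                # Top-left cell: the most frequent word overall.
                choice = max(vocabulary, key=lambda w: frequency[w])
            else:
                neighbours = []
                if c > 0:
                    neighbours.append(grid[r][c - 1])  # word to the left
                if r > 0:
                    neighbours.append(grid[r - 1][c])  # word above
                candidates = [w for w in vocabulary if w not in used]
                # Score: the weakest of the follow-counts, so a word must
                # follow *both* neighbours to score above zero.
                choice = max(
                    candidates,
                    key=lambda w: min(followers[n][w] for n in neighbours),
                )
            grid[r][c] = choice
            used.add(choice)
    return grid


toy_text = "the cat sat. the cat ran. the dog sat on the cat."
square = build_square(toy_text, size=2)
for row in square:
    print(" ".join(row))
```

On a real corpus many cells will have no unused word that follows both neighbours; in that case this sketch silently falls back to the alphabetically first unused word, which is one place where a probabilistic choice would likely behave better.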

Although there are no ravens or barely sublimated necrophilia in the matrix, the texture of the matrix is rather appropriate, if not to Poe, at least to Romanticism. To convince you of that, here are the equivalent 5-by-5 matrices for Swift and Chesterton.


the world and then he
same in his majesty would
manner a little that it
of certain to have is
their own make no more


the man who had been
other with that no one
and his it said syme
then own is i could
there are only think be

At least compared against each other, it wouldn't be too far-fetched to say that Poe's matrix is more Poe's than Chesterton's, and vice versa!

PS: Because I had a sudden attack of curiosity, here's the 5-by-5 matrix for my newest collection of short stories, Time of Punishment (pdf link).

Time of Punishment

the school whole and even
first dance both then four
charge rants resistance they think
of a hundred found leads
punishment new astronauts month sleep