Tom Sawyer, Bilingual

Following a friend’s suggestion, here’s a comparison of phrase length distributions between the English and German versions of The Adventures of Tom Sawyer:

[Figure: Tom Sawyer Phrase Lengths]
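For anyone who wants to try something similar, here is a minimal sketch of how a phrase length distribution might be computed. It assumes a "phrase" is any run of words between punctuation marks, which may not be the definition behind the plot above. The function names and the two sample strings are made up for illustration and are not quotes from either edition.

```python
import re
from collections import Counter

# Assumption: phrases are delimited by common punctuation marks.
# The original comparison may have used a different definition.
PHRASE_BREAK = re.compile(r'[.,;:!?()"“”„«»]+')


def phrase_lengths(text):
    """Return the word count of each non-empty phrase in text."""
    lengths = []
    for chunk in PHRASE_BREAK.split(text):
        n = len(chunk.split())  # words are whitespace-separated tokens
        if n:
            lengths.append(n)
    return lengths


def length_distribution(text):
    """Map each phrase length to its relative frequency."""
    lengths = phrase_lengths(text)
    counts = Counter(lengths)
    total = len(lengths)
    return {k: counts[k] / total for k in sorted(counts)}


# Made-up sample strings standing in for the two full texts.
english = "Tom did not answer. He ran away, laughing!"
german = "Tom antwortete nicht. Er lief davon, lachend!"

print(length_distribution(english))
print(length_distribution(german))
```

Running the same function over the full English and German texts would give two histograms that can be plotted side by side.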

It could be interesting to parametrize these distributions and try to characterize languages in terms of some sort of encoding mechanism. For example, assume phrase semantics are drawn at random from a language-independent distribution, treat each language's rendering as a mapping from that distribution to sequences of words, and handwave about what cost metric the mapping is trying to minimize.
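As one possible starting point for the parametrization, the sketch below summarizes a list of phrase lengths with two numbers by fitting a log-normal shape: the mean and standard deviation of the log lengths. The log-normal is purely an assumption chosen for simplicity, not something the data above is known to follow, and `fit_lognormal` and the sample lengths are hypothetical.

```python
import math
import statistics


def fit_lognormal(lengths):
    """Estimate (mu, sigma) of a log-normal from positive phrase lengths.

    Assumption: phrase lengths are roughly log-normal. This is an
    illustrative choice, not a result from the comparison above.
    """
    logs = [math.log(n) for n in lengths]
    mu = statistics.fmean(logs)      # mean of the log lengths
    sigma = statistics.pstdev(logs)  # population std. dev. of the log lengths
    return mu, sigma


# Hypothetical phrase lengths (in words) for two languages.
english_lengths = [4, 3, 1, 7, 5, 2]
german_lengths = [3, 3, 1, 6, 4, 2]

print(fit_lognormal(english_lengths))
print(fit_lognormal(german_lengths))
```

Comparing the fitted pairs across languages would give a crude, two-parameter characterization of each language's phrase lengths; a different family could be swapped in if it fits better.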