Tom Sawyer, Bilingual

2012-12-18

Following a friend's suggestion, here's a comparison of phrase length distributions between the English and German versions of The Adventures of Tom Sawyer:
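
A comparison like this could be computed along the following lines. This is a minimal sketch, not the code behind the original plot. It assumes a "phrase" means a run of words between punctuation marks, which the post does not specify, and the names phrase_lengths and length_distribution are made up for illustration.

```python
import re
from collections import Counter

# Assumption: phrases are delimited by common punctuation (including
# German-style quotation marks) and by dashes written as "--" or U+2014.
PHRASE_DELIMITERS = re.compile(r'[.,;:!?()"“”„«»\u2014]+|--+')


def phrase_lengths(text):
    """Return the number of words in each non-empty phrase of `text`."""
    lengths = []
    for chunk in PHRASE_DELIMITERS.split(text):
        n_words = len(chunk.split())  # words are whitespace-separated tokens
        if n_words:
            lengths.append(n_words)
    return lengths


def length_distribution(text):
    """Map each phrase length to its relative frequency in `text`."""
    lengths = phrase_lengths(text)
    counts = Counter(lengths)
    total = len(lengths)
    return {length: counts[length] / total for length in sorted(counts)}
```

Running length_distribution on the full English and German texts would give two dictionaries that can be plotted against each other. Any systematic differences will depend heavily on how phrases are delimited.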

It could be interesting to parametrize these distributions and try to characterize languages in terms of some sort of encoding mechanism. For example, assume phrase semantics are drawn randomly from a language-independent distribution, treat renderings in specific languages as mappings from that distribution to sequences of words, and handwave about what cost metric each mapping is trying to minimize.
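
As one possible starting point for the parametrization, a one-parameter family can be fitted to a list of phrase lengths and the fitted parameter compared across languages. The sketch below uses a geometric distribution on lengths 1, 2, 3, ... purely as an illustrative assumption. Nothing in the data is claimed to follow it, and the function names are hypothetical.

```python
import math


def fit_geometric(lengths):
    """Maximum-likelihood p for P(k) = (1 - p)**(k - 1) * p, k = 1, 2, ...

    For this family the estimate is simply 1 / (mean length).
    """
    if not lengths:
        raise ValueError("need at least one phrase length")
    return len(lengths) / sum(lengths)


def geometric_pmf(k, p):
    """Probability of a phrase of exactly k words under the fitted model."""
    return (1 - p) ** (k - 1) * p


def mean_log_likelihood(lengths, p):
    """Average log-probability per phrase; higher means a better fit."""
    return sum(math.log(geometric_pmf(k, p)) for k in lengths) / len(lengths)
```

Comparing the fitted p and the mean log-likelihood for each language gives a crude first characterization. A poor fit would itself be informative and would suggest trying a richer family.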