Following a friend's suggestion, here's a comparison of phrase length distributions between the English and German versions of The Adventures of Tom Sawyer:
It could be interesting to parametrize these distributions and try to characterize languages in terms of some sort of encoding mechanism. For example, assume phrase semantics are drawn randomly from a language-independent distribution, treat renderings in specific languages as mappings from that distribution to sequences of words, and handwave about what cost metric each mapping is trying to minimize.
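
For anyone who wants to try this on their own texts, here is a minimal sketch of how such a distribution might be computed and given a crude one-parameter summary. It rests on assumptions of mine, not on the method behind the comparison above: a "phrase" is taken to be a run of text between punctuation marks, its length is counted in whitespace-separated words, and the geometric fit is just one arbitrary choice of parametrization. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: phrases are delimited by punctuation. The character class
# includes curly quotes and dashes (as escapes), which appear in both
# English and German typesetting.
PHRASE_BREAK = re.compile(r'[.,;:!?()"\u201c\u201d\u201e\u2014\u2013]+')


def phrase_lengths(text):
    """Return the word count of every non-empty phrase in text."""
    lengths = []
    for phrase in PHRASE_BREAK.split(text):
        n_words = len(phrase.split())
        if n_words > 0:
            lengths.append(n_words)
    return lengths


def length_distribution(lengths):
    """Map each observed phrase length to its relative frequency."""
    counts = Counter(lengths)
    total = len(lengths)
    return {n: c / total for n, c in sorted(counts.items())}


def geometric_mle(lengths):
    """Maximum-likelihood p for a geometric distribution on {1, 2, ...}.

    For that distribution the estimate is simply 1 / (mean length).
    """
    return len(lengths) / sum(lengths)


sample = "Tom! No answer. What's gone with that boy, I wonder?"
lengths = phrase_lengths(sample)
print(lengths)
print(length_distribution(lengths))
print(geometric_mle(lengths))
```

Running the same functions over the full English and German texts would give two distributions to compare. Whether a geometric form (or any other) fits well is exactly the open question, so the fitted parameter should be read as a rough descriptor, not a result.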