Quantitatively understanding your (and others’) programming style

I’m not, in general, a fan of code metrics in the context of project management, but there’s something to be said for looking quantitatively at the patterns in your code, especially if, by comparing them with those of better programmers, you can get some useful ideas on how to improve.

(As an aside, the real possibilities in computer-assisted learning won’t come from lower costs, but rather from a level of adaptability that so far not even one-on-one tutoring has allowed; if the current theories about expertise are more or less right, data-driven adaptive learning, if implemented at the right level of granularity and with the right semantic model behind it, could dramatically change the speed and depth at which we learn… but I digress.)

Focusing on my ongoing learning of Hy, I haven’t used it in any paid project so far, but I’ve been able to play a bit with it now and then, and this has generated a very small code base, which I was curious to compare with code written by people who actually know the language. To do that, I downloaded the source code of a few Hy projects on GitHub (hyway, hygdrop, and adderall), and wrote some code (of course, in Hy) to extract code statistics.

Hy being a Lisp, its syntax is beautifully regular, so you can start by focusing on basic but powerful questions. The first thing I wanted to know was: which functions am I using the most? And how does this distribution compare with that of the (let’s call it) canon Hy code?
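To make the idea concrete, here is a minimal sketch of that kind of count. It is written in Python rather than Hy, and it is not the code used for this post: the tokenizer is deliberately simplified (it does not expand reader macros such as `~x` into `(unquote x)`, which a real Hy reader would do), and the names `parse` and `head_counts` are made up for the illustration. It treats the first symbol of every parenthesized form as "the function" and tallies those.

```python
import re
from collections import Counter

# Simplified tokenizer: string literals, ;-comments, single brackets, bare atoms.
TOKEN = re.compile(r'''"(?:\\.|[^"\\])*"|;[^\n]*|[()\[\]{}]|[^\s()\[\]{}";]+''')
OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}

def parse(source):
    """Return the top-level forms; a form is a str atom or a (bracket, children) pair."""
    root = ("", [])
    stack = [root]
    for tok in TOKEN.findall(source):
        if tok.startswith(";"):
            continue  # skip comments
        if tok in OPENERS:
            node = (tok, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif tok in CLOSERS:
            if len(stack) > 1:  # tolerate stray closers
                stack.pop()
        else:
            stack[-1][1].append(tok)
    return root[1]

def head_counts(forms, counts=None):
    """Count the leading symbol of every parenthesized form, at any depth."""
    counts = Counter() if counts is None else counts
    for form in forms:
        if isinstance(form, tuple):
            bracket, children = form
            if bracket == "(" and children and isinstance(children[0], str):
                counts[children[0]] += 1
            head_counts(children, counts)
    return counts

sample = '''
(defn total [xs]
  (setv acc 0)
  (for [x xs] (setv acc (+ acc x)))  ; imperative style
  acc)
'''
print(head_counts(parse(sample)).most_common(1))  # [('setv', 2)]
```

Running the same tally over two code bases and comparing the top entries gives lists like the ones below.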

My top five functions, in decreasing frequency: setv, defn, get, len, for.

Canon’s top five functions, in decreasing frequency: ≡, if, unquote, get, defn_alias.

Yikes! Just from this, it’s obvious that there are some serious stylistic differences, which probably reflect my still un-lispy understanding of the language (for example, I’m not using aliases, for should probably be replaced by more functional patterns, and the way I use setv, well, it definitely points to the same thing). None of this is a “sin”, nor does it point clearly to how I could improve (which a sufficiently good learning assistant would do), but the overall thrust of the data is a good indicator of where I still have a lot of learning to do. Fun times ahead!

For another angle on the quantitative differences between my newbie-to-Lisp coding style and that of more accomplished programmers, here are the histograms of the log mean size of subexpressions for each function:

[Figure: histograms of log (mean subexpression size)]

“Canonical” code shows a longer right tail, which suggests that experienced programmers are not afraid of occasionally using quite large S-expressions… something I’m clearly still working my way up to (or, alternatively, something I might need to reconsider my aversion to).
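For the curious, here is one way such a statistic could be computed, again as a hedged Python sketch rather than the Hy code behind the histograms. It rests on several assumptions of mine: the code has already been read into nested Python lists (one list per parenthesized form, tuples for bracketed vectors), the "size" of a form is its number of atoms, and each function gets the log of the mean size of the forms it heads. The helper names are hypothetical.

```python
import math
from collections import defaultdict

def size(form):
    """Number of atoms in a form (one assumed definition of 'size')."""
    if isinstance(form, (list, tuple)):
        return sum(size(child) for child in form)
    return 1

def sizes_by_head(form, acc):
    """Record the size of every list form under the symbol at its head."""
    if isinstance(form, (list, tuple)):
        # Only lists stand for calls; tuples are vectors such as [x xs].
        if isinstance(form, list) and form and isinstance(form[0], str):
            acc[form[0]].append(size(form))
        for child in form:
            sizes_by_head(child, acc)

def log_mean_sizes(forms):
    """Map each head symbol to log(mean size of the forms it heads)."""
    acc = defaultdict(list)
    for form in forms:
        sizes_by_head(form, acc)
    return {head: math.log(sum(s) / len(s)) for head, s in acc.items()}

# Nested-list stand-in for: (setv acc 0) (setv acc (+ acc x))
forms = [
    ["setv", "acc", "0"],
    ["setv", "acc", ["+", "acc", "x"]],
]
result = log_mean_sizes(forms)
print(sorted(result))  # ['+', 'setv']
```

Here setv heads forms of 3 and 5 atoms, so its value is log(4); the resulting per-function values are what would then be binned into a histogram for each code base.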

In summary: no earth-shattering discoveries, but some data points that suggest specific ways in which my coding practice in Hy differs from that of more experienced programmers, which should be helpful as general guidelines as I (hopefully) improve over the long term. Of course, all metrics are projections (in the mathematical sense): they hide more information than they preserve. I could make my own code statistically indistinguishable from the canon for any particular metric, and still have it be awful. Except for well-analyzed domains where known metrics are sufficient statistics for the relevant performance (and programming is very much not one of those domains, despite decades of attempts), this kind of analysis will always be about suggesting changes, rather than guaranteeing success.