Category Archives: Programming

Deep Learning as the apotheosis of Test-Driven Development

Even if you aren't interested in data science, Deep Learning is an interesting programming paradigm; you can see it as "doing test-driven development with a ludicrously large number of tests, an IDE that writes most of the code, and a forgiving client." No wonder everybody's pouring so much money and brains into it! Here's a way of thinking about Deep Learning not as an application you're asked to code, but as a language to code with.

Deep Learning applies test-driven development as we're all taught to do (and don't always): first you write the tests, and then you move from code that fails all of them to one that passes them all. The first and most obvious difference from the usual way of doing it is that you'll usually have anything from hundreds of thousands to Google-scale numbers of test cases in the form of pairs (picture of a cat, type of cute thing the cat is doing), or even a potentially infinite number that look like pairs (anything you try, how badly Donkey Kong kills you). This gives you a good chance that, if you selected or generated them intelligently, the test cases represent the problem well enough that a program that passes them will work in the wild, even if the test cases are all you know about the problem. It definitely helps that for most applications the client doesn't expect perfect performance. In a way, this lets you get around the problem of having to gather and document domain knowledge, at least for reasonable-but-not-state-of-the-art levels of performance, which is especially hard to do for things like understanding cat pictures, because we just don't know how we do it.

The second difference between test-driven development with the usual tools and test-driven development with Deep Learning languages and runtimes is that the latter are differentiable. Forget the mathematical side of that: the code monkey aspect of it is that when a test case fails, the compiler can fix the code on its own.

Yep.

Once you stop thinking about neural networks as "artificial brains" or data science-y stuff, and look at them as a relatively unfamiliar form of bytecode — but, as bytecode goes, also a fantastically simple one — then all that hoopla about backpropagation algorithms is justified, because they do pretty much what we do: look at how a test failed, work backwards through the call stack, tweaking things here and there, and then run the test suite again to see if you fixed more tests than you broke. But they do it automatically and very quickly, so you can dedicate yourself to collecting the tests and figuring out the large-scale structure of your program (e.g. the number and types of layers in your network, and their topology) and the best compiler settings (e.g., optimizing hyperparameters and setting up TensorFlow or whatever other framework you're using; they are labeled as libraries and frameworks, but they can also be seen as compilers or code generators that go from data-shaped tests to network-shaped bytecode).
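
To make the analogy concrete, here's a minimal sketch of that loop (assuming nothing beyond numpy, and with the XOR truth table standing in for the cat pictures): the "test suite" is a handful of (input, expected output) pairs, the randomly initialized network fails all of them, and the backward pass is the part that tweaks the "bytecode" until they pass.

# A toy, self-contained sketch (numpy only): the "tests" are the XOR truth
# table, the randomly initialized network fails them all, and plain gradient
# descent plays the role of the compiler that patches the code after each run.
import numpy as np

rng = np.random.default_rng(0)

# The test suite: (input, expected output) pairs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# The "bytecode": two layers of weights, initially random.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for step in range(5000):
    # Run the test suite (forward pass).
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Work backwards from the failures (backward pass), tweaking things.
    grad_logits = (p - y) / len(X)
    grad_W2, grad_b2 = h.T @ grad_logits, grad_logits.sum(axis=0)
    grad_h = (grad_logits @ W2.T) * (1 - h ** 2)
    grad_W1, grad_b1 = X.T @ grad_h, grad_h.sum(axis=0)

    # Apply the patch and go around again.
    W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2

print("tests passing:", int(((p > 0.5) == y).sum()), "out of", len(X))

Swap the four XOR rows for a few million (cat picture, label) pairs, and the hand-written gradients for a framework's automatic ones, and that's the general shape of the thing.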

One currently confusing fact is that this is all rather new, so very often the same people who are writing a program are also improving the compiler or coming up with new runtimes, so it looks like that's what programming with Deep Learning is about. But that's just a side effect of being in the early "half of writing the program is improving gcc so it can compile it" days of the technology, where things improve by leaps and bounds (we have both a fantastic new compiler and the new Internet-scale computers to run it), but are also rather messy and very fun.

To go back to the point: from a programmer's point of view, Deep Learning isn't just a type of application you might be asked to implement. It's also a language to write things in, one with its own set of limitations and weak spots, sure, but also with the kind of automated code generation and bug-fixing capabilities that programmers have always dreamed of but by and large avoid, because doing it with our usual languages involves a lot of maths and the kind of development timelines that make PMs either laugh or cry.

Well, it still does, but with the right language the compiler takes care of that, and you can focus on high-level features and getting the test cases right. It isn't the most intuitive way of working for programmers trained as we were, and it's not going to fully replace the other languages and methods in our toolset, but it's solving problems that we thought were impossible. How can a code monkey not be fascinated by that?

The job of the future isn't creating artificial intelligences, but keeping them sane

Once upon a time, we thought there was such a thing as bug-free programming. Some organizations still do — and woe betide their customers — but after a few decades hitting that particular wall, the profession has by and large accepted that writing software is such an extremely complex intellectual endeavor that errors and unfounded assumptions are unavoidable. Even the most mathematically solid of formal methods has, if nothing else, to interact with a world of unstable platforms and unreliable humans, and what worked today will fail tomorrow.

So we spend time and resources maintaining what we already "finished," fixing bugs as they are found, and adapting programs to new realities as they develop. We have to, because when we don't, as when physical infrastructure isn't maintained, we save resources in the short term, but only on our way towards protracted ruin.

It's no surprise that this also happens with our most sophisticated data-driven algorithms. CVs and scrum boards are filled with references to the maintenance of this or that prediction or optimization algorithm.

But there's a subtle problem, not universal but still very prevalent: those aren't software bugs. This isn't to say that the implementations don't have bugs; being software, they do. But they are computer programs implementing inference algorithms, which work at a higher level of abstraction, and those have their own kinds of bugs, bugs that don't leave stack traces behind.

A clear example is the experience of Google. PageRank was, without a doubt, among the most influential algorithms in the history of the internet, not to mention the most profitable, but as Google took the internet by storm, gaming PageRank became such an important business activity that "SEO" became a commonplace word.

From an algorithmic point of view this is simply a maintenance problem: PageRank assumed a certain relationship between link structure and relevance, one that held only as long as website creators weren't trying to fool it. Once this assumption became untenable, the algorithm had to be modified to cope with a world of link farms and text written with no human reader in mind.

In (very loosely equivalent) software terms, there was a new threat model, so Google had to figure out and apply a security patch. This is, for any organization facing a similar issue, a continual, business-critical process, and one that can make or break a company's profitability (just ask anybody working on high-frequency trading). But not all companies apply to their data-driven algorithms, independently of their implementations, the same sort of detailed, continuous instrumentation and the same development and testing methodologies that they use to monitor and fix their software systems. The same data scientist who developed an algorithm is often in charge of monitoring its performance on a more or less regular basis; or, even worse, it's only a hit to business metrics that makes companies reassign their scarce human resources towards figuring out what's going wrong. Either monitoring and maintenance strategy would amount to criminal malpractice if we were talking about software, yet for some companies this is the norm.

Even more prevalent is the lack of automatic instrumentation for algorithms mirroring what we have for servers. Any organization with a nontrivial infrastructure is well aware of, and has analysis tools and alarms for, things like server load or application errors. There are equivalent concepts for data-driven algorithms — violated statistical assumptions, wildly erroneous predictions — that should also be monitored in real time, not collected (when the data is even there) by a data scientist only after the situation has become bad enough to be noticed.
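
To make that concrete, here's a purely illustrative sketch (the metric, the threshold, and the alert hook are stand-ins, not a recommendation of any particular stack): compare the live distribution of a model's predictions against a reference sample and complain when they drift apart.

# Illustrative only: compare today's prediction scores against a reference
# sample (e.g. what the model produced at validation time) and raise an alarm
# when the distributions drift apart. Thresholds and the alert hook are
# placeholders; wire them into whatever monitoring stack you already use.
import numpy as np

def population_stability_index(reference, live, bins=10):
    """PSI between two samples: ~0 means 'same population'; values above
    roughly 0.25 are conventionally read as 'something has shifted'."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref_counts = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0]
    live_counts = np.histogram(np.clip(live, edges[0], edges[-1]), edges)[0]
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    live_frac = np.clip(live_counts / len(live), 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

def check_prediction_drift(reference_scores, live_scores, alert, threshold=0.25):
    psi = population_stability_index(reference_scores, live_scores)
    if psi > threshold:
        alert("prediction drift: PSI=%.3f" % psi)
    return psi

# Toy usage: the model's scores used to look like one distribution,
# and today they look like another; the alarm should fire.
rng = np.random.default_rng(1)
check_prediction_drift(rng.beta(2, 5, 10000), rng.beta(5, 2, 10000), alert=print)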

None of this is news to anybody working with big data, particularly in large organizations centered around this technology, but we have yet to settle on a common set of technologies and practices, or even just on a universal agreement that they are needed.

These days nobody would dare deploy a web application trusting only server logs at the operating system level. Applications have their own semantics, after all, and everything in the operating system working perfectly is no guarantee that the app is working at all.

Large-scale prediction and optimization algorithms are just the same; they are often an abstraction running over the application software that implements them. They can be failing wildly, statistical assumptions unmet and parameters converging to implausible values, with nothing in the application layer logging even a warning of any kind.

Most users forgive a software bug much more easily than unintelligent behavior in avowedly intelligent software. As a culture, we're getting used to the fact that software fails, but many still buy the premise that artificial intelligence doesn't (this is contradictory, but so are all myths). Catching these errors as early as possible can only be done while algorithms are running in the real world, where the weird edge cases and the malicious users are, and this requires metrics, logs, and alarms that speak of what's going on in the world of mathematics, not software.

We haven't converged yet on a standard set of tools and practices for this, but I know many people who'll sleep easier once we have.

Quantitatively understanding your (and others') programming style

I'm not, in general, a fan of code metrics in the context of project management, but there's something to be said for looking quantitatively at the patterns in your code, especially if, by comparing them with those of better programmers, you can get some hopefully useful ideas on how to improve.

(As an aside, the real possibilities in computer-assisted learning won't come from lower costs, but rather from a level of adaptability that so far not even one-on-one tutoring has allowed; if the current theories about expertise are more or less right, data-driven adaptive learning, implemented at the right level of granularity and with the right semantic model behind it, could dramatically change the speed and depth of the way we learn... but I digress.)

Back to my ongoing learning of Hy: I haven't used it in any paid project so far, but I've been able to play a bit with it now and then, and this has generated a very small code base, which I was curious to compare with code written by people who actually know the language. To do that, I downloaded the source code of a few Hy projects on GitHub (hyway, hygdrop, and adderall), and wrote some code (of course, in Hy) to extract code statistics.

Hy being a Lisp, its syntax is beautifully regular, so you can start by focusing on basic but powerful questions. The first thing I wanted to know was: which functions am I using the most? And how does this distribution compare with that of the (let's call it) canon Hy code?
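
The counting itself is simple; here's a simplified re-creation of the idea (in Python rather than the original Hy, and with placeholder paths in the usage lines): treat whatever symbol immediately follows an opening paren as the operator being called, and tally.

# Simplified re-creation (Python here, not the original Hy script): every
# symbol that immediately follows an opening paren is counted as the operator
# being called. The paths below are placeholders for wherever the code lives.
import re
from collections import Counter
from pathlib import Path

HEAD = re.compile(r"\(\s*([^\s()\[\]{}]+)")               # symbol right after a "("

def strip_comments_and_strings(source):
    source = re.sub(r'"(?:\\.|[^"\\])*"', '""', source)   # empty out string literals
    return re.sub(r";[^\n]*", "", source)                 # drop ; comments

def operator_counts(paths):
    counts = Counter()
    for path in paths:
        counts.update(HEAD.findall(strip_comments_and_strings(Path(path).read_text())))
    return counts

# mine = operator_counts(Path("my-hy-experiments").rglob("*.hy"))
# canon = operator_counts(Path("adderall").rglob("*.hy"))
# print(mine.most_common(5), canon.most_common(5))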

My top five functions, in decreasing frequency: setv, defn, get, len, for.

Canon's top five functions, in decreasing frequency: ≡, if, unquote, get, defn_alias.

Yikes! Just from this, it's obvious that there are some serious stylistic differences, which probably reflect my still un-lispy understanding of the language (for example, I'm not using aliases, for should probably be replaced by more functional patterns, and the way I use setv, well, it definitely points to the same). None of this is a "sin", nor does it point clearly to how I could improve (which a sufficiently good learning assistant would), but the overall thrust of the data is a good indicator of where I still have a lot of learning to do. Fun times ahead!

For another angle on the quantitative differences between my newbie-to-lisp coding style and that of more accomplished programmers, here are the histograms of the log mean size of subexpressions for each function:

[Histograms of log(mean subexpression size): my code vs. "canonical" code]

"Canonical" code shows a longer right tail, which shows that experienced programmers are not afraid of occasionally using quite large S-expressions... something I still clearly I'm still working my way up to (alternatively, which I might need to reconsider my aversion to).

In summary: no earth-shattering discoveries, but some data points that suggest specific ways in which my coding practice in Hy differs from that of more experienced programmers, which should be helpful as general guidelines as I (hopefully) improve over the long term. Of course, all metrics are projections (in the mathematical sense) — they hide more information than they preserve. I could make my own code statistically indistinguishable from the canon for any particular metric, and still have it be awful. Except in well-analyzed domains where known metrics are sufficient statistics for the relevant performance (and programming is very much not one of those domains, despite decades of attempts), this kind of analysis will always be about suggesting changes, rather than guaranteeing success.

Hi, Hy!

Currently trying Hy as a drop-in replacement for Python in a toy project. It's interesting how much of the learning curve for Lisp goes away once you have access to an underlying runtime you're familiar with; the small stuff generates more friction than the large differences (which makes sense, as we do the small stuff more often).

A flow control structure that never makes mistakes (sorta)

I've been experimenting with Lisp-style ad-hoc flow control structures. Nothing terribly useful, but nonetheless amusing. E.g., here's a dobest() function that always does the best thing (and only the best thing) among the alternatives given to it — think of the mutant in Philip K. Dick's The Golden Man, or Nicolas Cage in the awful "adaptation" Next.

Here's how you use it:

if __name__ == '__main__':
 
    def measure_x():
        "Metric function: the value of x"
        global x
        return x
 
    def increment_x():
        "A good strategy: increment x"
        global x
        x += 1
 
    def decrement_x():
        "A bad strategy: decrement x"
        global x
        x -= 1
 
    def fail():
        "An even worse strategy"
        global x
        x = x / 0
 
    x = 1
    # assert(x == 1)
    dobest(measure_x, increment_x, decrement_x, fail)
    # assert(x == 2)

You give it a metric, a function that returns how good you think the current world is, and one or more functions that operate on the environment. Perhaps disappointingly, dobest() doesn't actually see the future; rather, it executes each function on a copy of the current environment, and only transfers back to the "real" one the environment with the highest value of metric().

Here's the ugly detail (do point out errors, but please don't mock too much; I haven't played much with Python scopes):

import copy
import inspect
import sys

def dobest(metric, *branches):
    "Apply every function in *branches to a copy of the caller's environment; only do 'for real' the best one according to the result of running metric()."
 
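    # Snapshot the caller's variables (at module level, its locals double as its globals).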
    world = copy.copy(dict(inspect.getargvalues(sys._getframe(1)).locals))
    alts = []
 
    for branchfunction in branches:
        try:
            # Run branchfunction in a copy of the world
            ns = copy.copy(world)
            exec branchfunction.__code__ in ns, {}
            alts.append(ns)
        except: # We ignore worlds where things explode
            pass
 
    # Sort worlds according to metric()
    alts.sort(key=lambda ns: eval(metric.__code__, ns, {}), reverse=True)
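    # Copy the winning world back into the caller's (global) environment.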
    for key in alts[0]:
        sys._getframe(1).f_globals[key] = alts[0][key]

One usability point is that the functions you give to dobest() have to explicitly access variables in the environment as global; I'm sure there are cleaner ways to do it.

Note that this can also work a bit like a try-except with undo, à la

dobest(bool, function_that_does_something, function_that_reports_an_error)

This would work like try-except, because dobest ignores functions that raise exceptions, but with the added benefit that dobest would clean up everything done by function_that_does_something.

Of course, and here's the catch, "everything" is kind of limited — I haven't precisely gone out of my way to track and catch all side effects, not that it'd even be possible without some VM or even OS support. Point is, the more I get my ass saved by git, the more I miss it in my code, or even when doing interactive data analysis with R. As the Doctor would say, working on just one timeline can be so... linear.

A mostly blank slate

The combination of a tablet and a good Javascript framework makes it very easy to deploy algorithms to places where so far they have been scarce, like meetings, notetaking, and so on. The problem lies in figuring out what those algorithms should be; just as we had to have PCs for a few years before we started coming up with things to do with them (not that we have even scratched the surface), we still don't have much of a clue about how to use handheld thinking machines outside "traditional thinking machine fields."

Think about it this way: computers have surpassed humans in the (Western Civilization's) proverbial game of strategy and conflict, chess, and are useful enough in games of chance that casinos and tournament organizers are prone to use anything from violence to lawyers to keep you from using them. So the fact that we aren't using a computer when negotiating or coordinating says something about us.

The bottleneck, Cassius would say nowadays, is not in our tech, but in our imaginations.

Men's bathrooms are (quantum-like) universal computers

As is well documented, men choose urinals to maximize their distance from already occupied urinals. Letting $u_i$ be 1 or 0 depending on whether urinal $i$ is occupied, and setting $\sigma_{i,j}$ to the distance between urinals $i$ and $j$, male urinal-choosing behavior can be seen as an annealing heuristic maximizing

$\sum u_i u_j \sigma_{i,j}$

(summing over repeated indices as per Einstein's convention). And this is obviously equivalent to the computational model implemented by D-Wave's quantum computers! Hardware implementations of urinal-based computers might be less compact than quantum ones (and you might need bathrooms embedded in non-Euclidean spaces in order to implement certain programs), but they are likely to be no more error-prone, and they are certain to scale better.
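
And, purely for illustration (a greedy choice standing in for the annealing, toy distances, and no claim to hardware fidelity), here's what the heuristic looks like in code:

# Purely illustrative: each arrival greedily takes the free urinal that
# maximizes sum_{i,j} u_i u_j sigma_{i,j}, with sigma the pairwise distance
# matrix and u marking occupied urinals. Greedy choice stands in for annealing.
import numpy as np

def objective(u, sigma):
    return u @ sigma @ u

def next_choice(u, sigma):
    "Free urinal whose occupation maximizes the objective."
    best, best_value = None, -np.inf
    for i in np.flatnonzero(u == 0):
        trial = u.copy()
        trial[i] = 1
        value = objective(trial, sigma)
        if value > best_value:
            best, best_value = i, value
    return best

# Five urinals in a row, unit spacing.
positions = np.arange(5.0)
sigma = np.abs(positions[:, None] - positions[None, :])

u = np.zeros(5, dtype=int)
for _ in range(3):        # three arrivals
    u[next_choice(u, sigma)] = 1
print(u)                  # the two ends fill first; the summed objective then ties the middle spots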