For the unexpected innovations, look where you'd rather not

Before Bill Gates was a billionaire, before the power, the cultural cachet, and the Robert Downey Jr. portrayals, computers were for losers who would never get laid. Their potential was of course independent of these considerations, but Steve Jobs could become one of the richest people on Earth because he was fascinated with, and dedicated time to, something that cool kids — especially those from the wealthy families who could most easily afford access to computers — wouldn't have been caught dead playing with, or at least admitting to loving.

Geek, once upon a time, was an unambiguous insult. It was meant to humiliate. Dedicating yourself to certain things meant you'd pay a certain social price. Now, of course, things are better for that particular group; if nothing else, an entire area of intellectual curiosity is no longer stigmatized.

But because our innovation-driven society is locked into computer geeks as the source of change, it's going to be completely blindsided by whatever comes next.

Consider J. K. Rowling. Stephenie Meyer. E. L. James. It's significant that you might not recognize the last two names: Meyer wrote Twilight and James Fifty Shades of Grey. Those three women (and it's also significant that they are women) are among the best-selling and most widely influential writers of our time, and pretty much nobody in the publishing industry was even aware that there was a market for what they were doing. Theirs aren't just the standard stories of talented artists struggling to be published. By the standards of the (mostly male) people who ran and by and large still run the publishing industry, the stories they wrote were, to put it kindly, pointless and low-brow. A school for wizards where people died during a multi-volume malignant coup d'état? The love story of a teenager torn between her possessive werewolf friend and a teenage-looking, centuries-old vampire struggling to maintain self-control? Romantic sadomasochism from a female point of view?

Who'd read that?

Millions upon millions did. And then they watched the movies, and read the books again. Many of them were already writing the things they wanted to read — James' story was originally fan fiction in the Twilight universe — and wanted more. The publishing industry, supposedly in the business of figuring that out, had ignored them because they weren't a prestigious market (they were women, to be blunt, including very young women who "weren't supposed" to read long books, and older women who "weren't supposed" to care about boy wizards), and those weren't prestigious stories. When it comes to choosing where to go next, industries are as driven by the search for reputation as by the search for profit (except finance, where the search for profit regardless of everything else is the basis of reputation). Rowling and Meyer had to convince editors, and James' first surge of sales came through self-published Kindle books. The next literary phenomenon might very well bypass publishers, and if that becomes the norm then the question will be what the publishing industry is for.

Going briefly back to the IT industry, gender and race stereotypes are still awfully prevalent. The next J. K. Rowling of software — and there will be one — will face a much harder path than she should have. On the other hand, a whole string of potential early investors will have painful almost-did-it stories they'll never tell anyone.

This isn't a modern development, but rather a well-established historical pattern. It's the underdogs — the sidelined, the less reputable — who most often come up with revolutionary practices. The "mechanical arts" that we now call engineering were once a disreputable occupation, and no land-owning aristocrat would have guessed that one day they would sell their bankrupt ancestral homes to industrialists. Rich, powerful Venice began, or so its own legend tells, as a refugee camp. And there's no need to recount the many and ultimately fruitful ways in which the Jewish diaspora adapted to and leveraged the restrictions imposed everywhere upon them.

Today geographical distances have greatly diminished, and are practically zero when it comes to communication and information. The remaining gap is social — who's paid attention to, and what about.

To put it in terms of a litmus test: if you wouldn't be somewhat ashamed of putting it in a pitch deck, it might be innovative, brilliant, and a future unicorn times ten, but it's something people already sort of see coming. And a candidate every one of your competitors would consider hiring is one who will most likely go to the biggest or best-paying of them, and will give them the kind of advantage they already have. To steal a march on them — to borrow a tactic most famously used by Napoleon, somebody no king would have appointed as a general until he had won enough wars to appoint kings himself — you need to hire not only the best of the obvious candidates, but also to look at the ones nobody is looking at, precisely because nobody is looking at them. They are the point from which new futures branch.

The next all-caps NEW thing, the kind of new that truly shifts markets and industries, is right now being dreamed and honed by people you probably don't talk to about this kind of thing (or at all), people doing weird things they'd rather not tell most people about, or that they love discussing but have to go online to find like-minded souls who won't make fun of them, or worse.

Diversity isn't just a matter of simple human decency, although it's certainly that as well, and that should be enough. In a world of increasingly AI-driven hyper-corporations that can acquire or reproduce any technological, operational, or logistical innovation anybody but their peer competitors might come up with, it's the only reliable strategy to compete against them. "Black swans" only surprise you if you never bothered looking at the "uncool" side of the pond.

The Differentiable Organization

Neural networks aren't just at the fast-advancing forefront of AI research and applications; they are also a good metaphor for the structures of the organizations leveraging them.

DeepMind's description of their latest deep learning architecture, the Differentiable Neural Computer, highlights one of the core properties of neural networks: they are differentiable systems that perform computations. Generalizing from the mathematical definition, for a system to be differentiable means that it's possible to work backwards quantitatively from its current behavior to figure out the changes that should be made to the system to improve it. Very roughly speaking — I'm ignoring most of the interesting details — that's a key component of how neural networks are usually trained, and part of how they can quickly learn to match or outperform humans in complex activities beginning from a completely random "program." Each training round provides not only a performance measurement, but also information about how to tweak the system so it'll perform better the next time.
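As a toy illustration of that backwards-working step (plain gradient descent on a one-parameter model, not DeepMind's architecture; the data points and learning rate are made up for the example), here is a minimal sketch:

```python
# A minimal sketch of "differentiability as a feedback loop": a one-parameter
# model y = w * x, a measurement of how wrong its current behavior is, and a
# gradient that says in which direction to change w to do better next time.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # invented (input, observed output) pairs

w = 0.0             # start from an arbitrary "program"
learning_rate = 0.05

for step in range(200):
    # Performance measurement: mean squared error of the current behavior.
    error = sum((w * x - y) ** 2 for x, y in data) / len(data)
    # Working backwards: derivative of the error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * gradient  # tweak the system so it performs better next round

print(round(w, 2))  # converges to roughly 2, the slope implicit in the data
```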

Learning from errors and adjusting processes accordingly is also how organizations are supposed to work, through project postmortems, mission debriefings, and similar mechanisms. However, for the majority of traditional organizations this is in practice highly inefficient, when at all possible.

  • Most of the details of how they work aren't explicit, but encoded in the organizational culture, workflow, individual habits, etc.
  • They have at best a vague informal model — encoded in the often mutually contradictory experience and instincts of personnel — of how changes to those details will impact performance.
  • Because most of the "code" of the organization is encoded in documents, culture, training, the idiosyncratic habits of key personnel, etc., they change only partially, slowly, and with far less control than implied in organizational improvement plans.

Taken together, these limitations — which are unavoidable in any system where operational control is left to humans — make learning organizations almost chimerical. Even after extensive data collection, without a quantitative model of how the details of its activities impact performance and a fast and effective way of changing them, learning remains a very difficult proposition.

By contrast, organizations that have automated low-level operational decisions and, most importantly, have implemented quick and automated feedback loops between their performance and their operational patterns are, in a sense, the first truly learning organizations in history. As long as their operations are "differentiable" in the metaphorical sense of having even limited quantitative models that allow them to work backwards from observed performance to desirable changes — you'll note that the kind of problems the most advanced organizations have chosen to tackle are usually of this kind, beginning in fact relatively long ago with automated manufacturing — then simply by continuing their activities, even if inefficiently at first, they will be improving quickly and relentlessly.

Compare this pattern with an organization where learning only happens in quarterly cycles of feedback, performed by humans with a necessarily incomplete, or at least heavily summarized, view of low-level operations and of the impact each possible low-level change has on overall performance. Feedback delivered to humans who, with the best intentions and professionalism, will struggle to change individual and group behavior patterns that in any case will probably not be the ones with the most impact on downstream metrics.

It's the same structural difference observed between manually written software and trained and constantly re-trained neural networks; the former can perform better at first, but the latter's improvement rate is orders of magnitude higher, and sooner or later leaves the former in the dust. The last few years in AI have shown the magnitude of this gap, with software routinely learning from scratch, in hours or weeks, to play games, identify images, and perform other complex tasks, going from poor or absolutely null performance to, in some cases, surpassing human capabilities.

Structural analogies between organizations and technologies are always tempting and usually misleading, but I believe the underlying point is generic enough to apply: "non-differentiable" organizations aren't, and cannot be, learning organizations at the operational level, and sooner or later they aren't competitive with others that set up automation, information capture, and the appropriate automated feedback loops from the beginning.

While the first two steps are at the core of "big data" organizational initiatives, the third is still a somewhat unappreciated feature of the most effective organizations. Rare enough, for the moment, to be a competitive advantage.

When the world is the ad

Data-driven algorithms are effective not because of what they know, but as a function of what they don't. From a mathematical point of view, Internet advertising isn't about putting ads on pages or crafting seemingly neutral content. There's just the input — some change to the world you pay somebody or something to make — and the output — a change in somebody's likelihood of purchasing a given product or voting for somebody. The concept of multitouch attribution, the attempt to understand how multiple contacts with different ads influenced some action, is a step in the right direction, but it's still driven by a cosmology that sees ads as little gems of influence embedded in a larger universe that you can't change.

That's no longer true. The Internet isn't primarily a medium in the sense of something that is between. It's a medium in the sense that we live inside it. It's the atmosphere through which the sound waves of information, feelings, and money flow. It's the spacetime through which the gravity waves from some piece of code, shifting from data center to data center in some post-geographical search for efficiency, reach your car to suggest a route. And, in the opposite direction, it's how physical measurements of your location, activities — even physiological state — are captured, shared, and reused in ways that are increasingly difficult to know about, much less be aware of during our daily life. Transparency of action often equals, and is used to achieve, opacity to oversight.

Everything we experience impacts our behavior, and each day more of what we experience is controlled, optimized, configured, personalized — pick your word — by companies desperately looking for a business model or methodically searching for their next billion dollars or ten.

Consider as a harbinger of the future that most traditional of companies, Facebook, a space so embedded in our culture that people older than credit cards (1950, Diners Club) use it without wonder. Amid the constant experimentation with the willingly shared content of our lives that is the company, they ran an experiment attempting to deliberately influence the mood of their users by changing the order of what they read. The ethics of that experiment are important to discuss now and irrelevant to what will happen next, because the business implications are too obvious not to be exploited: some products and services are acquired preferentially by people in a certain mood, and it might be easier to change the mood of an already promising or tested customer than to find a new one.

If nostalgia makes you buy music, why wait until you feel nostalgic to show you an ad, when I can make sure you encounter mentions of places and activities from your childhood? A weapons company (or a law-and-order political candidate) will pay to place their ad next to a crime story, but if they pay more they can also make sure the articles you read before it, just their titles as you scroll down, are scary ones as well, regardless of topic. Scary, that is, specifically for you. And knowledge can work just as well, and just as subtly: tracking everything you read, and adapting the text here and there, seemingly separate sources of information will give you "A" and "B," close enough for you to remember them when a third one offers to sell you "C." It's not a new trick, but with ubiquitous transparent personalization and a pervasive infrastructure allowing companies to bid for the right to change pretty much all you read and see, it will be even more effective.

It won't be (just) ads, and it won't be (just) content marketing. The main business model of the consumer-facing internet is to change what consumers consume, and when it comes to what can and will be leveraged to do it, the answer is of course all of it.

Along the way, advertising will once again drag into widespread commercial application, as well as public awareness, areas of mathematics and technology currently used in more specialized fields. Advertisers mostly see us — because their data systems have been built to see us — as black boxes with tagged attributes (age, searches, location). Collect enough black boxes and enough attributes, and blind machine learning can find a lot of patterns. What they have barely begun to do is to open up those black boxes to model the underlying process, the illogical logic by which we process our social and physical environment so we can figure out what to do, where to go, what to buy. Complete understanding is something best left to lovers and mystics, but every qualitative change in our scalable, algorithmic understanding of human behavior under complex patterns of stimuli will be worth billions in the next iteration of this arms race.

Business practices will change as well, if only as a deepening of current tendencies. Where advertisers now bid for space on a page or a video slot, they will be bidding for the reader-specific emotional resonance of an article somebody just clicked on, the presence of a given item in a background picture, or the location and value of an item in an Augmented Reality game ("how much to put a difficult-to-catch Pokémon just next to my Starbucks for this person, who I know has been out in the cold today long enough for me to believe they'd like a hot beverage?"). Everything that's controlled by software can be bid upon by other software for a third party's commercial purposes. Not much isn't, and very little won't be.

The cumulative logic of technological development, one in which printed flyers co-exist with personalized online ads, promises the survival of what we might by then call overt algorithmic advertising. It won't be a world with no ads, but one in which a lot of what you perceive is tweaked and optimized so that its collective effect, whether perceived or not, is intended to work as one.

We can hypothesize a subliminally but significantly more coherent phenomenological experience of the world — our cities, friendships, jobs, art — a more encompassing and dynamic version of the "opinion bubbles" social networks often build (in their defense, only magnifying algorithmically the bubbles we had already built with our own choices of friends and activities). On the other hand, happy people aren't always the best customers, so transforming the world into a subliminal marketing platform might end up not being very pleasant, even before considering the impact on our societies of leveraging this kind of ubiquitous, personalized, largely subliminal button-pushing for political purposes.

In any case, it's a race in and for the background, and one that has already started.

(Over)Simplifying Calgary too

One of the good side effects of scripting multi-stage pipelines to build a visualization like my over-simplified map of Buenos Aires is that processing a data source in a completely different format only requires writing a new pre-processing script — everything else remains the same.

While I had used CSV data for the Buenos Aires map, I got KML files for the equivalent land use data for the City of Calgary. The pipeline I had written expected use types tied to single points mapped onto a fixed grid, so I wrote a small Python script to extract the polygons defined in the KML file, overlay a grid on them, and assign to each grid point the land use value of the polygon that contained it.
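A minimal sketch of that pre-processing step (not the original script), assuming shapely for the point-in-polygon tests and assuming each Placemark stores its land use code in a SimpleData field hypothetically named "LAND_USE"; the actual field name, file name, and grid step will differ:

```python
# Sketch: read polygons from a KML file, lay a regular grid over their bounding
# box, and tag each grid point with the land use of the polygon containing it.
import xml.etree.ElementTree as ET
from shapely.geometry import Point, Polygon

KML = {"kml": "http://www.opengis.net/kml/2.2"}

def read_polygons(path):
    """Yield (land_use, Polygon) pairs from a KML file."""
    root = ET.parse(path).getroot()
    for placemark in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        coords_el = placemark.find(".//kml:coordinates", KML)
        if coords_el is None:
            continue
        coords = [tuple(map(float, c.split(",")[:2]))
                  for c in coords_el.text.split()]
        # Hypothetical field name; adjust to whatever the file actually uses.
        use_el = placemark.find(".//kml:SimpleData[@name='LAND_USE']", KML)
        yield (use_el.text if use_el is not None else "UNKNOWN"), Polygon(coords)

def grid_land_use(polygons, step=0.002):
    """Assign to each grid point the land use of the polygon that contains it."""
    min_x = min(p.bounds[0] for _, p in polygons)
    min_y = min(p.bounds[1] for _, p in polygons)
    max_x = max(p.bounds[2] for _, p in polygons)
    max_y = max(p.bounds[3] for _, p in polygons)
    points = []
    x = min_x
    while x <= max_x:
        y = min_y
        while y <= max_y:
            pt = Point(x, y)
            for use, poly in polygons:
                if poly.contains(pt):   # brute force, but fine for a sketch
                    points.append((x, y, use))
                    break
            y += step
        x += step
    return points

polygons = list(read_polygons("calgary_land_use.kml"))  # hypothetical file name
grid = grid_land_use(polygons)
```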

After that the analysis was straightforward. Here's the detailed map of land uses (with less resolution than the original data, as the polygons have been projected on the point grid):


Here's the smoothed-out map:


This is how we split it into a puzzle of more-or-less single-use sectors:


And here's how it looks when you forget the geometry and only care about labels and relative positions (click to read the labels):


Unlike Buenos Aires, I've never been to Calgary, but a quick look at online maps seems to support the above as a first approximation to the city's geography. I'd love to hear from somebody who actually knows the city whether and how it matches their subjective map of it.

(Over)Simplifying Buenos Aires

This is a very rough sketch of the city of Buenos Aires:

Label sketch of Buenos Aires

As the sketch shows, it's a big blob of homes (VIVIENDAs), with an office-ridden downtown to the East (OFICINAS) and a handful of satellite areas.

The sketch, of course, lies. Here's a map that's slightly less of a lie:

Land usage in Buenos Aires

Both maps are based on the 2011 land usage survey made available by the Open Data initiative of the Buenos Aires city government, more than 555k records assigning each spot to one of about 85 different use regimes. It's still a gross approximation — you could spend a lifetime mapping Buenos Aires, rewrite Ulysses for a porteño Leopold Bloom, and still not really know it — but already one so complex that I didn't add the color key to the map. I doubt anybody will want to track the distribution of points for each of the 85 colors.

Ridiculous as it sounds at first, I'd suggest we are using too much of the second type of graph, and not enough of the first. It's already a commonplace that data visualizations shouldn't be too complex, but I suspect we are overestimating what people want from a first look at a data set. Sometimes "big blob of homes with a smaller downtown blob due East" is exactly the level of detail somebody needs — the actual shape of the blobs being irrelevant.

The first graph, needless to say, was created programmatically from the same data set from which I graphed the second. It's not a difficult process, and the intermediate steps are useful on their own.

Beginning with the original graph above, you apply something like a smoothing brush to the data points (or a kernel, if you want to sound more mathematical); essentially, you replace the land use tag associated with each point with the majority use in its immediate area, smoothing away the minor exceptions. As you'd expect, it's not that there aren't any businesses in Buenos Aires; it's just that, plot by plot, there are more homes, and when you smooth everything out, it looks more like a blob of homes. This leads to an already much simplified map:

Simplified map of Buenos Aires
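For the record, the smoothing step needs nothing exotic. A minimal sketch (mine, not the original pipeline), assuming the tags have already been mapped onto a regular grid represented as a 2D list of strings, with None for empty cells:

```python
# Replace each cell's land use tag with the most common tag in the
# (2*radius + 1) x (2*radius + 1) window around it: a majority filter.
from collections import Counter

def smooth(grid, radius=2):
    rows, cols = len(grid), len(grid[0])
    out = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [grid[a][b]
                      for a in range(max(0, i - radius), min(rows, i + radius + 1))
                      for b in range(max(0, j - radius), min(cols, j + radius + 1))
                      if grid[a][b] is not None]
            if window:
                out[i][j] = Counter(window).most_common(1)[0][0]
    return out
```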

Now, one interesting thing about most people's sense of space is that it's more topological than metrical; that is, we are generally better at knowing what's next to what than at judging absolute sizes and positions. Data visualizations should go with the grain of human perceptual and cognitive instincts instead of against them, so one fun next step is to separate the blobs — contiguous blocks of points of the same (smoothed-out) land use type — from each other, and show explicitly what's next to what. It looks like this:

Simple nodes

Nodes are scaled non-linearly, and we've filtered out the smaller ones, but we've already done programmatically something that we usually leave to the human looking at a map. We've made a napkin sketch of the city, much as somebody would draw North America as a set of rectangles with the right shared frontiers, but not necessarily much precision in the details. It wouldn't do for a geographical survey, but if you were an extraterrestrial planning to invade Canada, it would provide a solid first understanding of the strategic relevance of Mexico to your plans. From that last map to the first one, it's only a matter of remembering that you don't really care, at this stage, about the exact shape of each blob, just where they stand in relation to each other. So you replace the blobs with the appropriate land use labels, and keep the edges between them. And presto, you have a napkin map.
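The blob-splitting step is equally simple in outline: group contiguous cells of the same (smoothed) use into connected components and record which components touch, which is all the topological information the napkin map keeps. A sketch (again mine, not the original script) over the same grid representation as above:

```python
# Flood-fill connected components of identical land use ("blobs") and collect
# the adjacency edges between blobs; labels plus edges are the napkin map.
def blobs_and_edges(grid):
    rows, cols = len(grid), len(grid[0])
    blob = [[None] * cols for _ in range(rows)]
    uses, edges, next_id = {}, set(), 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] is None or blob[i][j] is not None:
                continue
            uses[next_id] = grid[i][j]
            stack = [(i, j)]
            blob[i][j] = next_id
            while stack:
                a, b = stack.pop()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    x, y = a + da, b + db
                    if not (0 <= x < rows and 0 <= y < cols) or grid[x][y] is None:
                        continue
                    if grid[x][y] == grid[i][j] and blob[x][y] is None:
                        blob[x][y] = next_id
                        stack.append((x, y))
                    elif blob[x][y] is not None and blob[x][y] != next_id:
                        edges.add((min(blob[x][y], next_id), max(blob[x][y], next_id)))
            next_id += 1
    return uses, edges  # e.g. {0: "VIVIENDA", 1: "OFICINAS", ...}, {(0, 1), ...}
```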

Yes, on the whole the example is rather pointless. Cities are actually the most over-mapped territories on the planet, at both the formal and informal level. Manhattan is an island, Vatican City is inside Rome, the Thames goes through London... In fact, the London Tube Map has become a cliché example of how to display information about a city in terms of connections instead of physical distance. Not to mention that a simplification process that leaves most of the city as a big blob of homes is certainly ignoring more information than you can afford to, even in a sketch.

Not that we usually do this kind of sketching, at least in our formal work with data. We are almost always cartographers when it comes to new data sets, whether geographical, spatial in a general sense, or just mathematically space-like. We change resolution, simplify colors, resist the temptation to over-use 3D, but keep it a "proper" map. Which is good; the world is complex enough without us doing anything less than the best mapping we can.

However, once you automate the process of creating multiple levels of simplification and sketching as above, you'll probably find yourself at least glancing at the simplest (over)simplifications of your data sets. Probably not for presentations to internal or external clients, but for understanding a complex spatial data set, particularly a high-dimensional one: beginning with an over-simplified summary and then increasing the complexity is in fact what you're already going to do in your own mind, so why not use the computer to help you out?

ETA: I just posted a similar map of Calgary.

The job of the future isn't creating artificial intelligences, but keeping them sane

Once upon a time, we thought there was such a thing as bug-free programming. Some organizations still do — and woe betide their customers — but after a few decades hitting that particular wall, the profession has by and large accepted that writing software is such an extremely complex intellectual endeavor that errors and unfounded assumptions are unavoidable. Even the most mathematically solid of formal methods has, if nothing else, to interact with a world of unstable platforms and unreliable humans, and what worked today will fail tomorrow.

So we spend time and resources maintaining what we already "finished," fixing bugs as they are found, and adapting programs to new realities as they develop. We have to, because when we don't, as when physical infrastructure isn't maintained, we save resources in the short term, but only on our way towards protracted ruin.

It's no surprise that this also happens with our most sophisticated data-driven algorithms. CVs and scrum boards are filled with references to the maintenance of this or that prediction or optimization algorithm.

But there's a subtle, not universal but still very prevalent, problem: those aren't software bugs. This isn't to say that implementations don't have bugs; being software, they do. But they are computer programs implementing inference algorithms, which work at a higher level of abstraction and have their own kinds of bugs, and those bugs don't leave stack traces behind.

A clear example is the experience of Google. PageRank was, without a doubt, among the most influential algorithms in the history of the internet, not to mention the most profitable, but as Google took the internet by storm, gaming PageRank became such an important business activity that "SEO" became a commonplace word.

From an algorithmic point of view this is simply a maintenance problem: PageRank assumed a certain relationship between link structure and relevance, based on the assumption that website creators weren't trying to fool it. Once this assumption became untenable, the algorithm had to be modified to cope with a world of link farms and text written with no human reader in mind.

In (very loosely equivalent) software terms, there was a new threat model, so Google had to figure out and apply a security patch. This is, for any organization facing a similar issue, a continual business-critical process, and one that can make or break a company's profitability (just ask anybody working on high-frequency trading). But not all companies apply to their data-driven algorithms, independently of their implementations, the same sort of detailed, continuous instrumentation and the development and testing methodologies they use to monitor and fix their software systems. The same data scientist who developed an algorithm is often in charge of monitoring its performance on a more or less regular basis; or, even worse, it's only a hit to business metrics that makes companies reassign their scarce human resources towards figuring out what's going wrong. Either monitoring and maintenance strategy would amount to criminal malpractice if we were talking about software, yet there are companies for which this is the norm.

Even more prevalent is the lack of automatic instrumentation for algorithms mirroring that for servers. Any organization with a nontrivial infrastructure is well aware of, and has analysis tools and alarms for, things like server load or application errors. There are equivalent concepts for data-driven algorithms — quantitative statistical assumptions, wildly erroneous predictions — that should also be monitored in real time, and not collected (when the data is there) by a data scientist only after the situation has become bad enough to be noticed.
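As an illustrative sketch of that kind of alarm (one reasonable choice among many, not an established standard): compare the distribution of a model's recent predictions against a reference window collected when the model was known to be healthy, and alert when they drift apart. The window size and p-value threshold are assumptions to tune per application.

```python
# A minimal prediction-drift monitor: a two-sample Kolmogorov-Smirnov test
# between a healthy reference window and the most recent live predictions.
from collections import deque
from scipy.stats import ks_2samp

class PredictionDriftMonitor:
    def __init__(self, reference_predictions, window=1000, p_threshold=0.01):
        self.reference = list(reference_predictions)
        self.recent = deque(maxlen=window)
        self.p_threshold = p_threshold

    def observe(self, prediction):
        """Record one live prediction; return True if drift should raise an alarm."""
        self.recent.append(prediction)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough live data yet
        _, p_value = ks_2samp(self.reference, list(self.recent))
        return p_value < self.p_threshold  # wire this to an alert, not a log nobody reads
```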

None of this is news to anybody working with big data, particularly in large organizations centered around this technology, but we have yet to settle on a common set of technologies and practices, or even just on a universal agreement that they are needed.

These days nobody would dare deploy a web application trusting only server logs at the operating system level. Applications have their own semantics, after all, and everything in the operating system working perfectly is no guarantee that the app is working at all.

Large-scale prediction and optimization algorithms are just the same; they are often an abstraction running over the application software that implements them. They can be failing wildly, statistical assumptions unmet and parameters converging to implausible values, with nothing in the application layer logging even a warning of any kind.

Most users forgive a software bug much more easily than unintelligent behavior in avowedly intelligent software. As a culture, we're getting used to the fact that software fails, but many still buy the premise that artificial intelligence doesn't (this is contradictory, but so are all myths). Catching these errors as early as possible can only be done while algorithms are running in the real world, where the weird edge cases and the malicious users are, and this requires metrics, logs, and alarms that speak of what's going on in the world of mathematics, not software.

We haven't converged yet on a standard set of tools and practices for this, but I know many people who'll sleep easier once we have.

The future of machine learning lies in its (human) past

Superficially different in goals and approach, two recent algorithmic advances, Bayesian Program Learning and Galileo, are examples of one of the most interesting and powerful new trends in data analysis. It also happens to be the oldest one.

Bayesian Program Learning (BPL) is deservedly one of the most discussed modeling strategies of recent times, matching or outperforming both humans and deep learning models in one-shot handwritten character classification. Unlike many recent competitors, it's not a deep learning architecture. Rather (and very roughly), it understands handwritten characters as the output of stochastic programs that join together different graphical parts or concepts to generate versions of each character, and it seeks to synthesize those programs by searching through the space of possible programs.

Galileo is, at first blush, a different beast. It's a system designed to extract physical information about the objects in an image or video (e.g., their movements), coupling a deep learning module with a 3D physics engine that acts as a generative model.

Although their domains and inferential algorithms are dissimilar, the common trait I want to emphasize is that they both have at their core domain-specific generative models that encode sophisticated a priori knowledge about the world. The BPL example knows implicitly, through the syntax and semantics of the language of its programs, that handwritten characters are drawn using one or more continuous strokes, often joined; a standard deep learning engine, beginning from scratch, would have to learn this. And Galileo leverages a proper, if simplified, 3D physics engine! It's not surprising that, together with superb design and engineering, these models show the performance they do.

This is how all cognitive processing tends to work in the wider world. We are fascinated (and of course, how could we not be?) by how much our algorithms can learn from just raw data. To be able to obtain practical results in multiple domains is impressive, and adds to the (recent, and, like all such things, ephemeral) mystique of the data science industry. But the fact is that no successful cognitive entity starts from scratch: there is a lot about the world that's encoded in our physiology (we don't need to learn to pump our blood faster when we are scared; to say that evolution is a highly efficient massively parallel genetic algorithm is a bit of a joke, but also true, and what it has learned is encoded in whatever is alive, or it wouldn't be).

Going to the other end of the abstraction scale, for all of the fantastically powerful large-scale data analysis tools physicists use and in many cases depend on, the way even basic observations are understood is based on centuries of accumulated (or rather constantly refined) prior knowledge, encoded in specific notations, theories, and even theories about what theories can look like. Unlike most, although not all, industrial applications, data analysis in science isn't a replacement for explicitly codified abstract knowledge, but rather stands on its gigantic shoulders.

In parallel to continuous improvement in hardware, software engineering, and algorithms, we are going to see more and more often the deployment of prior domain knowledge as part of data science implementations. The logic is almost trivial: we have so much knowledge accumulated about so many things, that any implementation that doesn't leverage whatever is known in its domain is just not going to be competitive.
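A toy illustration of that logic (mine, not BPL's or Galileo's machinery, with numbers invented for the example): estimating a success rate from a handful of observations, once from raw data alone and once with an informative prior standing in for accumulated domain knowledge.

```python
# Conjugate Beta-Binomial estimation of a success rate from very little data.
successes, trials = 3, 4  # a tiny, noisy data set

# From scratch: flat Beta(1, 1) prior; the estimate just follows the raw counts.
flat_mean = (successes + 1) / (trials + 2)

# With domain knowledge: an informative Beta(18, 2) prior, encoding an assumed
# prior belief that the rate is around 0.9 and that we're fairly sure of it.
prior_a, prior_b = 18, 2
informed_mean = (successes + prior_a) / (trials + prior_a + prior_b)

print(f"flat prior estimate:     {flat_mean:.2f}")      # ~0.67, whipsawed by noise
print(f"informed prior estimate: {informed_mean:.2f}")  # ~0.88, anchored by the prior
```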

Just to be clear, this isn't a new thing, or a conceptual breakthrough. If anything, it predates the take-the-data-and-model-it approach that's most popularly seen as "data science," and almost every practitioner, many of them coming from backgrounds in scientific research, is aware of it. It's simply that now our data analysis tools have become flexible and powerful enough for us to apply it with increasingly better results.

The difference in performance when this can be done, as I've seen in my own projects and is obvious in work like BPL and Galileo, has always been so decisive that doing things in any other way soon becomes indefensible except on grounds of expediency (unless of course you're working in a domain that lacks any meaningful theoretical knowledge... a possibility that usually leads to interesting conversations with the domain experts).

The cost is that it does shift significantly the way in which data scientists have to work. There are already plenty of challenges in dealing with the noise and complexities of raw data, before you start considering the ambiguities and difficulties of encoding and leveraging sometimes badly misspecified abstract theories. Teams become heterogeneous at a deeper level, with domain experts — many of them with no experience in this kind of task — not only validating the results and providing feedback, but participating actively as sources of knowledge from day one. Projects take longer. Theoretical assumptions in the domain become explicit, and therefore design discussions take much longer.

And so on and so forth.

That said, the results are well worth it. If data science is about leveraging the scientific method for data-driven decision-making, it behooves us to always remember that step zero of the scientific method is to get up to date, with some skepticism but no less dedication, on everything your predecessors figured out.

The truly dangerous AI gap is the political one

The main short term danger from AI isn't how good it is, or who's using it, but who isn't: governments.

This impacts every aspect of our interaction with the State, beginning with the ludicrous way in which we have to move papers around (at best, digitally) to tell one part of the government something another part of the government already knows. Companies like Amazon, Google, or Facebook are built upon the opposite principle. Every part of them knows everything any part of the company knows about you (or at least it behaves that way, even if in practice there are still plenty of awkward silos).

Or consider the way every business and technical process is monitored and modeled in a high-end contemporary company, and contrast it with the opacity, most damagingly to themselves, of government services. Where companies strive to give increasingly sophisticated AI algorithms as much power as possible, governments often struggle to give humans the information they need to make the decisions, much less assist or replace them with decision-making software.

It's not that government employees lack the skills or drive. Governments are simply, and perhaps reasonably, biased toward organizational stability: they are very seldom built up from scratch, and a "fail fast" philosophy would be a recipe for untold human suffering instead of just a bunch of worthless stock options. Besides, most of the countries with the technical and human resources to attempt something like this are currently leaning to one degree or another towards political philosophies that mostly favor a reduced government footprint.

Under these circumstances, we can only expect the AI gap between the public and the private sector to grow.

The only areas where this isn't the case are, not coincidentally, the military and intelligence agencies, who are enthusiastic adopters of every cutting edge information technology they can acquire or develop. But these exceptions only highlight one of the big problems inherent in this gap: intelligence agencies (and to a hopefully lesser degree, the military) are by need, design, or their citizens' own faith the government areas least subject to democratic oversight. Private companies lose money or even go broke and disappear if they mess up; intelligence agencies usually get new top-level officers and a budget increase.

As an aside, even individuals are steered away from applying AI algorithms instead of consuming their services, through product design and, increasingly, laws that prohibit them from reprogramming their own devices with smarter or at least more loyal algorithms.

This is a huge loss of potential welfare — we are getting worse public services, and at a higher cost, than we could given the available technology — but it's also part of a wider political change, as (big) corporate entities gain operational and strategic advantages that shift the balance of power away from democratically elected organizations. It's one thing for private individuals to own the means of production, and another when they (and often business-friendly security agencies) have a de facto monopoly on superhuman smarts.

States originally gained part of their power through early and massive adoption of information technologies, from temple inventories in Sumer to tax censuses and written laws. The way they are now lagging behind bodes ill for the future quality of public services, and for democratic oversight of the uses of AI technologies.

It would be disingenuous to say that this is the biggest long- and not-so-long-term problem states are facing, but only because there are so many other things going wrong or still to be done. But it's something that will have to be dealt with; not just with useful but superficial online access to existing services, or with the use of internet media for public communication, but also with deep, sustained investment in the kind of ubiquitous AI-assisted and AI-delegated operations that increasingly underlie most of the private economy. Politically, organizationally, and culturally as near-impossible as this might look.

The recently elected Argentinean government has made credible national statistics one of its earliest initiatives, less an act of futuristic boldness than a return to the 20th century baseline of data-driven decision-making, a baseline whose abandonment by the previous government was not without large political and practical costs. By failing to make intensive use of AI technologies in their public services, most governments in the world are failing to measure up to the technological baseline of the current century, an almost equally serious oversight.

"Prior art" is just a fancy term for "too slow lawyering up"

They used to send a legal ultimatum before it happened. Now you just wake up one day and everything green is dead, because the plants are biotech and counter-hacking is a legal response to intellectual property theft, even if the genes in question are older than the country that granted the patent.

My daughter isn't looking at the rotting remains of her flower garden. Her eyes are locked into mine, with the intensity of a child too young not to take the world seriously. Are we going to jail?

No, I say, and smile. They only go personally after the big ones; for small people like us this destruction suffices.

She nods. Am I going to die?

I kneel and hug her. No, of course not, I say, with every bit of certainty I can muster. There's nothing patented in you, I want to add, but she's old enough to know that'd be a lie.

I feel her chest move and I realize she had been holding her breath. We stay together, just breathing. The air is filled with legal pathogens looking for illegal things to kill.


The gig economy is the oldest one, and it's always bad news

Let's say you have a spare bedroom and you need some extra income. What do you do? You do more of what you've trained for, in an environment with the capital and tools to do it best. Anything else only makes sense if the economy is badly screwed up.

The reason is quite simple: unless you work in the hospitality industry, you are better — able to extract from it a higher income — at doing whatever else you're doing than you are at being a host, or you wouldn't take it up as a gig, but rather switch to it full time. Suppliers in the gig economy (as opposed to professionals freelancing in their area of expertise) are by definition working more hours but less efficiently, whether because they don't have the training and experience, or because they aren't working with the tools and resources they'd take advantage of in their regular environments. The cheaper, lower-quality, badly regulated service they provide might be desirable to many customers, but this is achieved partly through de-capitalization. Every hour and dollar an accountant spends caring for a guest instead of, if he wants a higher income, doing more accounting or upgrading his tools, is a waste of his knowledge. From the point of view of overall capital and skill intensity, a professional low-budget hotel chain would be vastly more efficient over the long term (of course, to do that you need to invest capital in premises and so on instead of in vastly cheaper software and marketing).

The only reason for an accountant, web designer, teacher, or whatnot to do "gigs" instead of extra hours, freelance work, or similar, is that there is no demand for their professional labor. While it's entirely possible that overtime or freelance work might be relatively less valuable than the equivalent time spent at their main job, for gigs to make sense those alternatives would have to pay less than what they can get from a gig for which they have little training and few tools. That's not what a capital- and skill-intensive economy looks like.

For a specific occupation falling out of favor, this is just the way of things. For wide swaths of the population to find themselves in this position, perhaps employed but earning less than they would like, and unable to trade more of their specialized labor for income, the economy as a whole has to be suffering from depressed demand. What's more, they still have to contend with competitors with more capital but still looking to avoid regulations (e.g., people buying apartments specifically to rent via Airbnb), in turn lowering their already low gig income.

This is a good thing if you want cheaper human-intensive services or have invested in Airbnb and similar companies, and it's bad news if you want a skill-intensive economy with proportionally healthy incomes.

In the context of the gig economy, flexibility is a euphemism for I have a (perhaps permanent!) emergency and can't get extra work, and efficiency refers to the liquidity of services, not the outcome of high capital intensity. And while renting a room or being an Uber driver might be less unpleasant than, and downright utopian compared to, the alternatives open to those without a room to rent or an adequate car, the argument that it's fun doesn't survive the fact that nobody has ever been paid to go and crash on other people's couches.

Neither Airbnb nor Uber is harmful in itself — who doesn't think cab services could use a more transparent and effective dispatch system? — but customer ratings don't replace training, certification, and other forms of capital investment. Shiny apps and cool algorithms aside, a growing gig economy is a symptom of an at least partially de-skilling one.

The Man Who Was Made A People

Gregory has two million evil twins. None of them is a person, but why would anybody care?

They are everywhere except in the world. They search the web, click on ads, make purchases, create profiles, favorite things, post comments. Being bots, they don't sleep or work; they do nothing but what they were programmed to do, hidden deep in some endless pool of stolen computing power they have been planted in like dragon's teeth.

They are him. Their profiles carry his name, his location, his interests, or variations close enough to be indistinguishable to even the most primitive algorithm. The pictures posted by the bots are all of men very similar to Gregory in skin tone, clothes, cellphone, car. And he knows they are watching him, because when he changes how he looks, they change as well.

They are evil. Most of their online activities are subtle mirrors of his own, but some deal with topics and people that most find abhorrent, and none more than himself. Violence, depravity, every form of hate and crime, and — worst of all — every statistically known omen of future violence and crime.

Driven by the blind genius of predictive algorithms, sites show Gregory increasingly dark things to look at and buy, and suggest friendships with unbalanced bigots of every kind. His credit score has crumbled. Journalism gigs are becoming scarce. Cops scowl as they follow him with eyes covered by smart glasses, one hand on their guns and the other on their radios. He no longer bothers to check his dating profile; the messages he gets are more disturbing than the replies he no longer does.

He has begun to go out less, to use the web through anonymizing services, to take whatever tranquilizers he can afford. All of those are suspicious activities on their own, he knows, but what choice does he have? He spends his nights trying to figure out who or what he offended enough to have this all-too-real curse laid upon him. The list of possibilities is too large, what journalist's isn't?, and he's not desperate enough to convince himself there's any point to seeking forgiveness. He's scared that one day he might be.

Gregory knows how this ends. He has begun to click on links he wouldn't have. Some of the searches are his. Every night he talks himself out of buying a gun. So far.

He has begun to feel there are two million of him.


Bitcoin is Steampunk Economics

From the point of view of its largest financial backers, the fact that Bitcoin combines 21st century computer science with 17th century political economy isn't an unfortunate limitation. It's what they want it for.

We have grown as used to the concept of money as to any other component of our infrastructure, but, all things considered, it's an astoundingly successful technology. Even in its simplest forms it helps solve the combinatorial explosion implicit in any barter system, which is why even highly restricted groups, like prison populations, implement some form of currency as one of the basic building blocks of their polities.

Fiat money is a fascinating iteration of this technology. It doesn't just solve the logistical problems of carrying with you an impractical amount of shiny metals or some other traditional reference commodity, it also allows a certain degree of systemic adaptation to external supply and demand shocks, and pulls macroeconomic fine-tuning away from the rather unsuitable hands of mine prospectors and international trading companies.

A protocol-level hack that increases systemic robustness in a seamless, distributed manner: technology-oriented people should love this. And they would, if only that hack weren't, to a large degree... ugh... political. From the point of view of somebody attempting to make a ton of money by, literally, making a ton of money, the fact that a monetary system is a common good managed by a quasi-governmental centralized organization isn't a relatively powerful way to dampen economic instabilities, but an unacceptable way to dampen their chances of making said ton of money.

So Bitcoin was specifically designed to make this kind of adjustment impossible. In fact, the whole, and conceptually impressive, set of features that characterize it as a currency, from the distributed ledger to the anonymity of transfers to the mathematically controlled rate of bitcoin creation, presupposes that you can trust neither central banks nor financial institutions in general. It's a crushingly limited fallback protocol for a world where all central banks have been taken over by hyperinflation-happy communists.

The obvious empirical observation is that central banks have not been taken over by hyperinflation-happy communists. Central banks in the developed world have by and large mastered the art of keeping inflation low – in fact, they seem to have trouble doing anything else. True, there are always Venezuelas and Argentinas, but designing a currency based on the idea that they are at the cutting edge of future macroeconomic practice crosses the line from design fiction to surrealist engineering.

As a currency, Bitcoin isn't the future, but the past. It uses our most advanced technology to replicate the key features of an obsolete concept, adding some Tesla coils here and there for good effect. It's gold you can teleport; like a horse with an electric headlamp strapped to its chest, it's an extremely cool-looking improvement to a technology we have long superseded.

As computer science, it's magnificent. As economics, it's a steampunk affectation.

Where bitcoin shines, relatively speaking, is in the criminal side of the e-commerce sector — including service-oriented markets like online extortion and sabotage — where anonymity and the ability to bypass the (relative) danger of (nominally, if not always pragmatically) legal financial institutions are extremely desirable features. So far Bitcoin has shown some promise not as a functional currency for any sort of organized society, but in its attempt to displace the hundred dollar bill from its role as what one of William Gibson's characters accurately described as the international currency of bad shit.

This, again, isn't an unfortunate side effect, but a consequence of the design goals of Bitcoin. There's no practical way to avoid things like central bank-set interest rates and taxes, without also avoiding things like anti-money laundering regulations and assassination markets. If you mistrust government regulations out of principle and think them unfixable through democratic processes — that is, if you ignore or reject political technologies developed during the 20th century that have proven quite effective when well implemented — then this might seem to you a reasonable price to pay. For some, this price is actually a bonus.

There's nothing implicit in contemporary technologies that justifies our sometimes staggering difficulties managing common goods like sustainably fertile lands, non-toxic water reservoirs, books written by people long dead, the antibiotic resistance profile of the bacteria whose planet we happen to live in, or, case in point, our financial systems. We just seem to be having doubts as to whether we should, doubts ultimately financed by people well aware that there are a few dozen deca-billion fortunes to be made by shedding the last two or three centuries' worth of political technology development, and adding computationally shiny bits to what we were using back then.

Bitcoin is a fascinating technical achievement mostly developed by smart, enthusiastic people with the best of intentions. They are building ways in which it, and other blockchain technologies like smart contracts, can be used to make our infrastructures more powerful, our societies richer, and our lives safer. That most of the big money investing in the concept is instead attempting to recreate the financial system of late medieval Europe, or to provide a convenient complement to little bags of diamonds, large bags of hundred dollar bills, and bank accounts in professionally absent-minded countries, when they aren't financing new and excitingly unregulated forms of technically-not-employment, is completely unexpected.

The price of the Internet of Things will be a vague dread of a malicious world

Volkswagen didn't make a faulty car: they programmed it to cheat intelligently. The difference isn't semantics, it's game-theoretical (and it borders on applied demonology).

Regulatory practices assume untrustworthy humans living in a reliable universe. People will be tempted to lie if they think the benefits outweigh the risks, but objects won't. Ask a person if they promise to always wear their seat belt, and the answer will be at best suspect. Test the energy efficiency of a lamp, and you'll get an honest response from it. Objects fail, and sometimes behave unpredictably, but they aren't strategic, they don't choose their behavior dynamically in order to fool you. Matter isn't evil.

But that was before. Things now have software in them, and software encodes game-theoretical strategies as well as it encodes any other form of applied mathematics, and the temptation to teach products to lie strategically will be as impossible for companies to resist in the near future as it was for VW, steep as their punishment seems to be. As has always happened (and always will) in the area of financial fraud, they'll just find ways to do it better.

Environmental regulations are an obvious field for profitable strategic cheating, but there are others. The software driving your car, tv, or bathroom scale might comply with all relevant privacy regulations, and even with their own marketing copy, but it'll only take a silent background software upgrade to turn it into a discreet spy reporting on you via well-hidden channels (and everything will have its software upgraded all the time; that's one of the aspects of the Internet of Things nobody really likes to contemplate, because it'll be a mess). And in a world where every device interacts with and depends on a myriad of others, devices from one company might degrade the performance of a competitor's... but, of course, not when regulators are watching.

The intrinsic challenge to our legal framework is that technical standards have to be precisely defined in order to be fair, but this makes them easy to detect and defeat. They assume a mechanical universe, not one in which objects get their software updated with new lies every time regulatory bodies come up with a new test. And even if all software were always available, checking it for unwanted behavior would be unfeasible — more often than not, programs fail because the very organizations that made them couldn't or didn't make sure they behaved as intended.

So the fact is that our experience of the world will increasingly come to reflect our experience of our computers and of the internet itself (not surprisingly, as it'll be infused with both). Just as any user feels their computer to be a fairly unpredictable device full of programs they've never installed doing unknown things to which they've never agreed to benefit companies they've never heard of, inefficiently at best and actively malignant at worst (but how would you know?), cars, street lights, and even buildings will behave in the same vaguely suspicious way. Is your self-driving car deliberately slowing down to give priority to the higher-priced models? Is your green A/C really less efficient with a thermostat from a different company, or is it just not trying as hard? And your tv is supposed to only use its camera to follow your gestural commands, but it's a bit suspicious how it always offers Disney downloads when your children are sitting in front of it.

None of those things are likely to be legal, but they are going to be profitable, and, with objects working actively to hide them from the government, not to mention from you, they'll be hard to catch.

If a few centuries of financial fraud have taught us anything, it's that the wages of (regulatory) sin are huge, and punishment comes late enough that organizations fall into temptation time and again, regardless of the fate of their predecessors, or at least of those who were caught. The environmental and public health cost of VW's fraud is significant, but it's easy to imagine industries and scenarios where it'd be much worse. Perhaps the best we can hope for is that the avoidance of regulatory frameworks in the Internet of Things won't have the kind of occasional systemic impact that large-scale financial misconduct has accustomed us to.

We aren't uniquely self-destructive, just inexcusably so

Natural History is an accretion of catastrophic side effects resulting from blind self-interest, each ecosystem an apocalyptic landscape to the previous generations and a paradise to the survivors' thriving and well-adapted descendants. There was no subtle balance when the first photosynthetic organisms filled the atmosphere with the toxic waste of their metabolism. The dance of predator and prey takes its rhythm from the chaotic beat of famine, and its melody from an unreliable climate. Each biological innovation changes the shape of entire ecosystems, giving rise to a new fleeting pattern that will only survive until the next one.

We think Nature harmonious and wise because our memories are short and our fearful worship recent. But we are among the first generations of the first species for which famine is no accident, but negligence and crime.

No, our destruction of the ecosystems we were part of when we first learned the tools of fire, farm, and physics is not unique in the history of our planet, it's not a sin uniquely upon us.

It is, however, a blunder, because we know better, and if we have the right to prefer to a silent meadow the thousands fed by the farms replacing it, we have no right to ignore how much water it's safe to draw, how much nitrogen we will have to use and where it'll come from, how to preserve the genes we might need and the disease resistance we already do. We made no promise to our descendants to leave them pandas and tigers, but we will indeed be judged poorly if we leave them a world changed by the unintended and uncorrected side effects of our own activities in ways that will make it harder for them to survive.

We aren't destroying the planet, couldn't destroy the planet (short of, in an ecological sense, sterilizing it with enough nuclear bombs). What we are doing is changing its ecosystems, and in some senses its very geology and chemistry, in ways that make it less habitable for us. Organisms that love heat and carbon in the air, acidic seas and flooded coasts... for them we aren't scourges but benefactors. Biodiversity falls as we change the environment with a speed, on an evolutionary scale, little slower than a volcano's, but the survivors will thrive and then radiate in new astounding forms. We may not.

Let us not, then, think survival a matter of preserving ecosystems, or at least not beyond what an aesthetic or historical sense might drive us to. We have changed the world in ways that make it worse for us, and we continue to do so far beyond the feeble excuses of ignorance. Our long term survival as a civilization, if not as a species, demands that we change the world again, this time in ways that will make it better for us. We don't need biodiversity because we inherited it: we need it because it makes ecosystems more robust, and hence our own societies less fragile. We don't need to both stop and mitigate climate change because there's something sacred about the previous global climate: we need to do it because anything much worse than what we've already signed up for might be too much for our civilization to adapt to, and runaway warming might even be too much for the species itself to survive. We need to understand, manage, and increase sustainable cycles of water, soil, nitrogen, and phosphorus because that's how we feed ourselves. We can survive without India's tigers. But collapse the monsoon or the subcontinent's irrigation infrastructure and at least half a billion people will die.

We wouldn't be the first species killed by our own blind success, nor the first civilization destroyed by a combination of power and ignorance, empty cities the only reminders of better architectural than ecological insight. We know better, and should act in a way befitting what we know. Our problem is no larger than our tools, our reach no further than our grasp.

The only question is how hard we'll make things for us before we start working in earnest to build a better world, one less harsh to our civilization, or at least not untenably more so. The question is how many people will unnecessarily die, and what long-term price we'll pay for our delay.

The Girl and the Forest

The girl is crossing a frontier that exists only in databases. Her phone whispers frantically in her ear: crossing such a frontier triggers no low-priority notification, but the digital panic merited by a lethal navigational mishap. Cross a line between two indistinguishable plots of land and you become the legitimate target of automated guns, or an illegal person to be sent to a private working prison, or any number of other fates perhaps but not certainly worse than what you were leaving behind.

The frontier the girl is crossing separates a water-poor region from a barren desert, the invisible line a temporary marker of the ever-faster retreat of agricultural mankind. The region reacts to unwanted strangers with fewer robots but as much heavily armed dedication as any of the richer ones. But the girl is walking into the desert, and there are no defensive systems on her way. There is just the dead sand.

She doesn't carry enough water or food to get her to the other side.

* * *

The girl went to a hidden net site a friend had shared with her with the electronic whispers and half-incredulous sniggers other generations had reserved for the mysteries of sex. Sex wasn't much of a mystery to their generation, who had seen everything long before it could be understood with anything except their minds. But they had never been in a forest, and almost none of them ever would. They traded pictures and descriptions of how the desert looked before it was a desert, and tried to imagine the smell of a thousand acres of shadowed damp earth. It was a fad, for most. A phase. Youth and nostalgia are mutually incompatible states.

Yet for some their dreams of forests endured: they had uncovered something, a secret, found because they weren't welcomed into the important matters reserved for grownups. Inside the long abandoned monitoring network their parents' generation had used to attempt to manage the retreating forest, some of the sensors were still alive. Most of them were repeating a monotonous prayer of heat and sand to creators too ashamed of their failure to let themselves look back.

But some of the sensors chanted of water, and shadow, and biomass. The girl had seen the data in her phone, and half-felt a breeze of leaves and bark. What if satellite pictures showed a canyon that, yes, could be safe from the soil-stealing wind, but was as barren as everything else? What of that?

The girl thought of her parents, and of the child she had promised herself she wouldn't give to the barren earth, and with guilt that didn't slow her down, she took the least amount of water she thought would be enough to get her to the canyon, and went into the desert.

The dull sleepless intelligences inside the border cameras saw her leave, but would only alert a human if they saw her walking back.

* * *

The girl will barely reach the canyon, half-dying, clinging to her last bit of water as a talisman. There will be no forest there, nothing in the canyon but dry sand. But in the small caves between the rocks, where the geometry of stones has built small enclosed worlds of darkness, she'll find ugly, malevolently tenacious, and very much alive mushrooms, and around them the clothes of those who will have reached the canyon before her. Most of their clothes will be her size.

The girl will understand. She won't drink the last of her water, but give it to the mushrooms. Then she will lie down and close her eyes, and fall asleep in the shadow, surrounded by a forest at last.


Asking a Shadow to Dance

Isomorphic is a mathematical term: it means of the same shape. This is a lie.

Every morning you wake up in the apartment you might have bought if you hadn't been married (but you were, and those identical apartments are not the same). Your car takes you through the same route you would have taken, to an office where you look into the blankness of a camera and the camera looks back. You see nothing. The camera sees the pattern of blood vessels on the back of your eyes, and opens your computer for you.

The interface you see is always the same, just patterns of changing data devoid of context. Patterns that a combination of raw genetic luck and years of training has made you flawlessly adept at understanding and controlling. The pattern on your screen changes five times each second. Faster than that, you move your fingers in a precise way, the skill etched in your muscles as much as in your brain. The pattern and your fingers dance together, and the dance makes the pattern stay in a shape that has no meaning outside itself. You have received almost every commendation they can give to someone doing your job. Only the man on the other side of the table has more. You have never seen his screen, and he has never seen yours.

The inertial idiocy of that security rule is sickening in its redundancy. You couldn't know what he's doing from the data on his screen any more than you can know what you are doing from what you see in yours. Sometimes you think you're piloting a drone swarm. Sometimes you're defending an infrastructure network, or attacking one. Twice you have felt a rhythm in the patterns almost like a heart, and wondered if you were killing somebody through some medical device.

But you don't know. That's the point. Whatever you could be doing, the shape of the data on your screen would be the same, all the necessary information to control, damage, defend, or kill, but scrubbed of all meaning tying it back to the real world. Isomorphism, the instructors called it.

But that's a lie. It's not the same, and it could never be.

You begin to lose sleep. Twice the camera on your computer has to learn a new pattern for the blood behind your eyes. Your performance doesn't suffer; the parts of your mind and body that do the work are not the ones grappling with a guilt larger because it's undefined. Your nightmares are shapeless: you dream of data and wake up unable to breathe.

One day you finally allow yourself to know that the man across the table enjoys his work. Always has. You had ignored him all those years, him and everything not in the data, but now you look at him with a wordless how? He makes a gesture with his head, come and see. An isomorphism that scrubs the data not only of meaning but also of guilt.

You need it so much that you don't stop to think about the rules you're both breaking under the gaze of the security cameras. You just go around the table and look at his screen.

There's no isomorphism. There's nothing but truth, and you can neither watch nor stop watching. His fingers are dancing and his smile is joyful and he has always known what he was doing. And now you can, too.

You scramble back to your screen in blind haste. The patterns of data are innocent, you tell yourself, of everything you saw on that other screen, and so is the dance of your fingers. They just have the same shape, that's all.

You work as efficiently as ever. You wonder if you'll go crazy, and fear you won't, and know that neither will change your shape.


The Telemarketer Singularity

The future isn't a robot boot stamping on a human face forever. It's a world where everything you see has a little telemarketer inside it, one that knows everything about you and never, ever, stops selling things to you.

In all fairness, this might be a slight oversimplification. Besides telemarketers, objects will also be possessed by shop attendants, customer support representatives, and conmen.

What these much-maligned but ubiquitous occupations (and I'm not talking here about their personal qualities or motivations; by and large, they are among the worst exploited and personally blameless workers in the service economy) have in common is that they operate under strict and explicitly codified guidelines that simulate social interaction in order to optimize a business metric.

When a telemarketer and a prospect are talking, of course, both parties are human. But the prospect is, however unconsciously, guided by a certain set of rules about how conversations develop. For example, if somebody offers you something and you say no, thanks, the expected response is for that party to continue the conversation under the assumption that you don't want it, and perhaps try to change your mind, but not to say ok, I'll add it to your order and we can take it out later. The syntax of each expression is correct, but the grammar of the conversation as a whole is broken, always in ways specifically designed to manipulate the prospect's decision-making process. Every time you have found yourself talking on the phone with a telemarketer, or interacting with a salesperson, far longer than you wanted to, it was because you grew up with certain unconscious rules about the patterns in which conversations can end — and until they make the sale, they will neither initiate nor acknowledge any of them. The power isn't in their sales pitch, but in the way they are taking advantage of your social operating system, and the fact that they are working with a much more flexible one.

Some people, generally described by the not always precise term sociopath, are just naturally able to ignore, simulate, or subvert these underlying social rules. Others, non-sociopathic professional conmen, have trained themselves to be able to do this, to speak and behave in ways that bypass or break our common expectations about what words and actions mean.

And then there are telemarketers, who these days work with statistically optimized scripts that tell them what to say in each possible context during a sales conversation, always tailored according to extensive databases of personal information. They don't need to train themselves beyond being able to convey the right emotional tone with their voices: they are, functionally, the voice interface of a program that encodes the actual sales process, and that, logically, has no need to conform to any societal expectation of human interaction.
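To make the mechanics concrete, here's a deliberately toy sketch (in Python, with made-up states, intents, and lines; it describes no real call-center product) of what such a script looks like from the inside: a table mapping every conversational situation to a next move, with the conspicuous absence of any move that lets the conversation end on the prospect's terms.

```python
# Hypothetical sales script: every (state, detected intent) pair maps to a line
# for the human interface to read and the next state of the conversation.
SCRIPT = {
    ("pitch", "not_interested"): ("I understand. Most people say that before they see the savings.", "pitch"),
    ("pitch", "maybe_later"):    ("Of course. While I have you, can I confirm your address?", "data_capture"),
    ("pitch", "yes"):            ("Wonderful, let me just confirm a few details.", "close"),
    ("data_capture", "gave_info"): ("Thanks. Now, about those savings...", "pitch"),
    # Conspicuously absent: any entry that maps a plain "no" to "OK, goodbye."
}

def next_move(state, intent):
    """Return (line for the human interface to read, next conversation state)."""
    # The fallback simply re-pitches: the script never initiates an ending.
    return SCRIPT.get((state, intent), ("So, as I was saying...", "pitch"))

line, state = next_move("pitch", "not_interested")   # -> a re-pitch, still in "pitch"
```

The human reading the lines aloud can be as warm and sympathetic as you like; the thing you're actually negotiating with is the table.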

It's tempting to call telemarketers and their more modern cousins, the computer-assisted (or rather computer-guided) sales assistants, the first deliberately engineered cybernetic sociopaths, but this would miss the point that what matters, what we are interacting with, isn't a sales person, but the scripts behind them. The person is just the interface, selected and trained to maximize the chances that we will want to follow the conversational patterns that will make us vulnerable to the program behind.

Philosophers have long toyed with a thought experiment called the Chinese Room: there is a person inside a room who doesn't know Mandarin, but has a huge set of instructions that tells her what characters to write in response to any combination of characters, for any sequence of interactions. The person inside doesn't know Mandarin, but anybody outside who does can have an engaging conversation by slipping messages under the door. The philosophical question is: who is the person outside conversing with? Does the woman inside the room know Mandarin in some sense? Does the room know?

Telemarketers are Chinese Rooms turned inside-out. The person is outside, and the room is hidden from us, and we aren't interacting socially with either. We only think we do, or rather, we subconsciously act as if we do, and that's what makes cons and sales much more effective than, rationally, they should be.

We rarely interact with salespeople, but we interact with things all the time. Not because we are socially isolated, but because, well, we are surrounded by things. We interact with our cars, our kitchens, our phones, our websites, our bikes, our clothes, our homes, our workplaces, and our cities. Some of them, like Apple's Siri or the Sims, want us to interact with them as if they were people, or at least consider them valid targets of emotional empathy, but what they are is telemarketers. They are designed, and very carefully, to take advantage of our cultural and psychological biases and constraints, whether it's Siri's cheerful personality or a Sim's personal victories and tragedies.

Not every thing offers us the possibility of interacting with it as if it were human, but that doesn't stop things from selling to us. Every day we see the release of more smart objects, whether consumer products or would-be invisible pieces of infrastructure. Connected to each other and to user profiling databases, they see us, know us, and talk to each other and to their creators (and to their creators' "trusted partners," who aren't necessarily anybody you have even heard of) about us.

And then they try to sell us things, because that's how the information economy seems to work in practice.

In some sense, this isn't new. Expensive shoes try to look cool so other people will buy them. Expensive cars are in a partnership with you to make sure everybody knows how awesome they make you look. Restaurants hope that some sweet spot of service, ambiance, food, and prices will make you a regular. They are selling themselves, as well as complementary products and services.

But smart objects are a qualitatively different breed, because, being essentially computers with some other stuff attached to them, what their main function is might not be what you bought them for.

Consider an internet-connected scale that not only keeps track of your weight, but also sends you congratulatory messages through a social network when you reach a weight goal. From your point of view, it's just a scale that has acquired a cheerful personality, like a singing piece of furniture in a Disney movie, but from the point of view of the company that built and still controls it, it's both a sensor giving them information about you, and a way to tell you things you believe are coming from something – somebody – who knows you, in some ways, better than friends and family. Do you believe advertisers won't know whether to sell you diet products or a discount coupon for the bakery around the corner from your office? Or, even more powerfully, that your scale won't tell you You have earned yourself a nice piece of chocolate cake ;) if the bakery chain is the one who purchased that particular "pageview?"

Let's go to the core of advertising: feelings. Much of the Internet is paid for by advertisers' belief that knowing your internet behavior will tell them how you're feeling and what you're interested in, which will make it easier to sell things to you. Yet browsing is only one of the things we do that computers know about in intimate detail. Consider the increasing number of internet-connected objects in your home that are listening to you. Your phone is listening for your orders, but that doesn't mean that's all it's listening for. The same goes for your computer, your smart TV (some of which are actually looking at you as well), even some children's dolls. As the Internet of Things grows way beyond the number of screens we can deal with, or apps we are willing to use to control them, voice will become the user interface of choice, just like smartphones overtook desktop computers. That will mean that possibly dozens of objects, belonging to a handful of companies, will be listening to you and selling that information to whatever company pays enough to become a "trusted partner." (Yes, this is and will remain legal. First, because we either don't read EULAs or do and try not to think about them. And second, because there's no intelligence agency on the planet that won't lobby to keep it legal.)

Maybe they won't be reporting everything you say verbatim (that will depend on how much external scrutiny there is on the industry), but your mood (did you yell at your car today, or sing aloud as you drove?), your movements, the time of day you wake up, which days you cook and which days you order takeout? Everybody trying to sell things to you will know all of this, and more.

That will be just an extension of the steady erosion of our privacy, and even of our expectation of it. More delicate will be the way in which our objects will actively collaborate in this sales process. Your fridge's recommendations when you run out of something might be oddly restricted to a certain brand, and if you never respond to them, shift to the next advertiser with the best offer — that is, the most profitable for whoever is running the fridge's true program, back in some data center somewhere. Your watch might choose to delay low-priority notifications while you're watching a commercial from a business partner or, more interestingly, choose to interrupt you every time there's a competitor's commercial. Your kitchen will tell you that it needs some preventive maintenance, but there's a discount on Chinese takeout if you press that button or just say "Sure, Kitchen Kate." If you put it on shuffle, your cloud-based music service will tailor its only-seemingly-random selection based on where you are and what the customer tracking companies tell them you're doing. No sad music when you're at the shopping mall or buying something online! (Unless they have detected that you're considering buying something out of nostalgia or fear.) There's already a sophisticated industry dedicated to optimizing the layout, sonic background, and even smells of shopping malls to maximize sales, much in the same way that casinos are thoroughly designed to get you in and keep you inside. Doing this through the music you're listening to is just a personalized extension of these techniques, an edge that every advertiser is always looking for.

If, in defense of old-school human interaction, you go inside some store to talk with an actual human being instead of an online shop, a computer will be telling each sales person, through a very discreet earbud, how you're feeling today, and how to treat you so you'll feel you want to buy whatever they are selling, the functional equivalent of almost telepathic cold reading skills (except that it won't be so cold; the sales person doesn't know you, but the sales program... the sales program knows you, in many ways, better than you do yourself). In a rush? The sales program will direct the sales person to be quick and efficient. Had a lousy day? Warmth and sympathy. Or rather simulations thereof; you're being sold to by a sales program, after all, or an Internet of Sales Programs, all operating through salespeople, the stuff in your home and pockets, and pretty much everything in the world with an internet connection, which will be almost everything you see and most of what you don't.

Those methods work, and have probably worked since before recorded history, and knowing about them doesn't make them any less effective. They might not make you spend more in aggregate; generally speaking, advertising just shifts around how much you spend on different things. From the point of view of companies, it'll just be the next stage in the arms race for ever more integrated and multi-layered sensor and actuator networks, the same kind of precisely targeted network-of-networks military planners dream of.

For you as a consumer, it might mean a world that'll feel more interested in you, with unseen patterns of knowledge and behavior swirling around you, trying to entice or disturb or scare or seduce you, and you specifically, into buying or doing something. It will be a somewhat enchanted world, for better and for worse.

At the End of the World

As the seas rose and the deserts grew, the wealthiest families and the most necessary crops moved poleward, seeking survivable summers and fertile soils. I traveled to the coast and slowly made my way towards the Equator; as a genetics engineer I was well-employed, if not one of the super-rich, but keeping our old ecosystems alive was difficult enough if you had hope, and I had lost mine a couple of degrees Celsius along the way.

I saw her one afternoon. I was staying in a cramped rentroom in a semi-flooded city that could have been anywhere. The same always nearly-collapsed infrastructure, the indistinguishable semi-flooded slums, the worldwide dull resentment and fear of everything coming from the sky: the ubiquitous flocks of drones, the frequent hurricanes, the merciless summer sun.

She seemed older than I'd have expected, her skin pale and parched, her once-black hair the color of sand. But she had an assurance that hadn't been there half a lifetime ago when we had been colleagues and roommates, and less, and more. Before we had had to choose between hope for a future together and hope for a future for the world, and had chosen... No, not wrong. But I had stopped believing we could turn the too-literal tide, and, for reasons I had suspected but never asked about, she had lost or quit her job years ago. So here we were, at the overcrowded, ever-retreating ruinous limes of our world. I was wandering, and she was riding a battery bike out of the city. I followed her on my own.

I don't know why I didn't call to her, why I followed her, or if I even wanted to catch up. But when I turned a bend on the road she was waiting for me, patient and smiling, still on her bike.

"Follow me," she said, going off the road.

I did, all the way through the barren over-exploited land, the situation dreamlike but no more than everything else.

She led me to a group of odd-looking tents, and then on foot towards one that I took to be hers. We sat on the ground, and under the light of a biolamp I saw her up close and gasped.

Not in disgust. Not despite the pseudoscales on her skin, or her shrouded eyes. It wasn't beauty, but it was beautiful work, and I knew enough to suspect that the changes wouldn't stop at what I saw.

"You adapted yourself to the hot band," I said.

She smiled. "Not just me. I've been doing itinerant retroviral work all over the hot band. You wouldn't believe the demand, or how those communities thrive once health issues are minimized. I've developed gut mods for digesting whatever grows there now, better heat and cold resistance, some degree of internal osmosis to drink seawater. And they have capable people spreading and tweaking the work. They call it submitting to the world."

"This is not what we wanted to do."

"No," she said, "but it works." She paused, as if waiting for me to argue. I didn't, so she went on. "Every year it works a little better for them, for us, and a little worse for you all."

I shook my head. "And next decade? Half a century from now? You know the feedback loops aren't stopping, and we only pretend carbon storage will reach enough scale to work. This work is phenomenal, but it's only a stopgap."

"It's only an stopgap if we stop." She stood up and moved a curtain I had thought a tent wall. Behind it I saw a crib, the standard climate-controlled used by everybody who could afford them.

Inside the crib there was a baby girl. Her skin was covered in true scales, with tridimensional structures that looked like multi-layer thermal insulation. Her respiration was slow and easy, and her eyes blinked sleepily, catlike, like those of a creature bred to avoid the sun and not miss it. I was listening with half an ear to the long list of other changes, but my eyes were fixed on the crib's controls.

They were keeping her environment artificially hot and dry. The baby's smile was too innocent to be mocking, but I wasn't.

"And a century after next century?" I said, not really asking.

"Who knows what they'll become?" I wasn't looking at her, but her voice was filled with hope.

I closed my eyes and thought of the beautiful forests and farms of the temperate areas, where my best efforts only amounted to increasingly hopeless life support. I wasn't sure how I felt about the future looking at me from the crib, but it was one.

"Tell me more."


Memory City

The city remembers you even better than I do. I have fragments of you in my memory, things I'll only forget when I die: your smell, your voice, your eyes locked on my own. But the city knows more, and I have the power to ask for those memories.

I query databases in machines across the sea, and the city guides me to a corner just in time to see somebody crossing the street. She looks just like you as she walks away. Only from that angle, but that's the angle the city told me to look from.

I sit in a coffee shop, my back to the window, and the city whispers a detached countdown into my ears. Three blocks, two, one. Somebody walks by, and the cadence of her steps is just like yours. With my eyes closed they seem to echo through the void of your absence, and they are yours.

I keep roaming the streets for pieces of you. A handful of glimpses a day. Fragments of your voice. The dress I last saw you in, through the window of a cab. They get better and more frequent, as if the city were closing in on you inside some truer city made from everything it has ever sensed and stored, and its cameras and sensors sense many things, and the machines that are the city's mind remember them all.

I feel hope grow inside me. I know the insanity of what I'm doing, but knowing is less than nothing when I see more of you each day.

One night the city takes me to an alley. It's not the street where I met you, and it's a different season, but the urgency of the city's summons infects me with a foreshadowing of deja vu.

And then I see you. You've changed your haircut, and I don't recognize your clothes, and there's something about your mouth...

But your eyes. I know those eyes. And you recognize me, of course, impossibly and unavoidably. How else to explain the frightened scream I cut short?

I have been told by engineers, people I pay to know what I don't, that the city's mind is somehow like a person's. That it learns from what it does, and does it better the next time. I don't understand how, but I know this to be so. We find you more quickly every time, and I could swear the city no longer waits for me to ask it to. Maybe it shares some of my love for you now.

Maybe you'll never be alone.


The Balkanization of Things

The smarter your stuff, the less you legally own it. And it won't be long before, besides resisting you, things begin to quietly resist each other.

Objects with computers in them (like phones, cars, TVs, thermostats, scales, ovens, etc.) are mainly software programs with some sensors, lights, and motors attached to them. The hardware limits what they can possibly do — you can't go against physics — but the software defines what they will do: they won't go against their business model.

In practice this means that you can't (legally) install a new operating system in your phone, upgrade your TV with, say, a better interface, or replace the notoriously dangerous and very buggy embedded control software in your Toyota. You can use them in ways that align with their business models, but you have to literally become a criminal to use them otherwise, even if what you want to do with them is otherwise legal.

Bear with me for a quick historical digression: the way the web was designed to work (back in the prehistoric days before everything was worth billions of dollars), you would be able to build a page using individual resources from all over the world, and offer the person reading it ways to access other resources in the form of a dynamic, user-configurable, infinite book, a hypertext that mostly remains only as the ht in http://.

What we ended up with was, of course, a forest of isolated "sites" that jealously guard their "intellectual property" from each other, using the brilliant set of protocols that was meant to give us an infinite book merely as a way for their own pages to talk with their servers and their user trackers, and so on, and woe to anybody who tries to "hack" a site to use it in some other way (at least without a license fee and severe restrictions on what they can do). What we have is still much, much better than what we had, and if Facebook has its way and everything becomes a Facebook post or a Facebook app we'll miss the glorious creativity of 2015, but what we could have had still haunts technology so deeply that it's constantly trying to resurface on top of the semi-broken Internet we did build.

Or maybe there was never a chance once people realized there was lots of money to be made with these homogeneous, branded, restricted "websites." Now processors with full network stacks are cheap enough to be put in pretty much everything (including other computers — computers have inside them, funnily enough, entirely different smaller computers that monitor and report on them). So everybody in the technology business is imagining a replay of the internet's story, only at a much larger scale. Sure, we could put together a set of protocols so that every object in a city can, with proper authorizations, talk with every other, regardless of who made it. And, sure, we could make it possible for people to modify their software to figure out better ways of doing things with the things they bought, things that make sense to them, without attaching license fees or advertisements. We would make money out of it, and people would have a chance to customize, explore, and fix design errors.

But you know how the industry could make more money, and have people pay for any new feature they want, and keep design errors as deniable and liability-free as possible? Why, it's simple: these cars talk with these health sensors only, and these fridges only with these e-commerce sites, and you can't prevent your shoes from selling your activity habits to insurers and advertisers because that'd be illegal hacking. (That the NSA and the Chinese get to talk with everything is a given.)

The possibilities for "synergy" are huge, and, because we are building legal systems that make reprogramming your own computers a crime, very monetizable. Logically, then, they will be monetized.

It (probably) won't be any sort of resistentialist apocalypse. Things will mostly be better than before the Internet of Things, although you'll have to check that your shoes are compatible with your watch, remember to move everything with a microphone or a camera out of the bedroom whenever you have sex, even if they seem turned off (probably something you should already be doing), and there will be some fun headlines when a hacker from (insert here your favorite rogue country, illegal group, or technologically-oriented college) decides technology has finally caught up with Ghost in the Shell in terms of security boondoggles, breaks into Toyota's network, and stalls a hundred thousand cars in Manhattan during rush hour.

It'll be (mostly) very convenient, increasingly integrated into a few competing company-owned "ecosystems" (do you want to have a password for each appliance in your kitchen?), indubitably profitable (not just the advertising possibilities of knowing when and how you woke up; logistics and product design companies alone will pay through the nose for the information), and yet another huge lost opportunity.

In any case, I'm completely sure we'll do better when we develop general purpose commercial brain-computer interfaces.

The Secret

I saved his name in our database: it vanished within seconds into a place hidden from both software traces and hardware checks. Search engines refused to index any page with his name on it, and I couldn't add it to any page in Wikipedia. A deep neural network, trained on his face almost to overfitting, was unable to tell him apart from a cat or a train.

I don't know how he does this, and I'm afraid of asking myself why. His name and face faded quickly from my mind. Just another computer, I guess.

But then what remainder of the algorithm of my self impossibly remembers what everything else forgets? I'm afraid of the way he can't be recorded, but I feel nothing but horror of whatever's in me that can't forget. That part is growing; tonight I can almost remember his face.

Will I become like him? Will I also slip intangible through the mathematics of the world? And will I, in that day, be able to remember myself?

I keep saving these notes, but I can't find the file.


Yesterday was a good day for crime

Yesterday, a US judge helped the FBI strike a big blow in favor of the next generation of sophisticated criminal organizations, by sentencing Silk Road operator Ross Ulbricht (aka Dread Pirate Roberts) to life without parole. The feedback they gave to the criminal world was as precise and useful as any high-priced consultant's could ever be: until the attention-seeking, increasingly unstable human operator messed up, the system worked very well. The next iteration is obvious: highly distributed markets with little or no human involvement. And law enforcement is woefully, structurally, abysmally unprepared to deal with this.

To be fair, they are already not dealing well with the existing criminal landscape. It was easier during the last century, when large, hierarchical cartels led by flamboyant psychopaths provided media-friendly targets vulnerable to the kind of military hardware and strategies favored by DEA doctrine. The big cartels were wiped out, of course, but this only led to a more decentralized and flexible industry, one that has proven so effective at providing the US and Western Europe with, e.g., cocaine in a stable and scalable way that demand is thoroughly fulfilled, and they have had to seek new products and markets to grow their business. There's no War on Drugs to be won, because they aren't facing an army, but an industry fulfilling a ridiculously profitable demand.

(The same, by the way, has happened during the most recent phase of the War on Terror: statistical analysis has shown that violence grows after terrorist leaders are killed, as they are the only actors in their organizations with a vested interest in a tactically controlled level of violence.)

In terms of actual crime reduction, putting down the Silk Road was as useless a gesture as closing down a torrent site, and for the same reason. Just as the same characteristics of the internet that make it so valuable make P2P file sharing unavoidable, the same financial, logistical, and informational infrastructures that make the global economy possible also make decentralized drug trafficking unavoidable.

In any case, what's coming is much, much worse than what's already happening. Because, and here's where things get really interesting, the same technological and organizational trends that are giving an edge to the most advanced and effective corporations are also almost tailored to provide drug trafficking networks with an advantage over law enforcement (this is neither coincidence nor malevolence; the difference between Amazon's core competency and a wholesale drug operator's is regulatory, not technical).

To begin with, blockchains are shared, cryptographically robust, globally verifiable ledgers that record commitments between anonymous entities. That, right there, solves all sorts of coordination issues for criminal networks, just as it does for licit business and social ones.
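For readers who haven't looked inside one: stripped of consensus, networking, and the anonymity machinery, the property the argument relies on fits in a few lines. The sketch below (Python, purely illustrative, nothing like a production blockchain) shows a ledger where any copy can be verified and no recorded commitment can be quietly rewritten without breaking every entry after it.

```python
# Toy hash-chained ledger: each entry commits to the hash of the previous one,
# so tampering with any entry invalidates everything recorded after it.
import hashlib
import json

def add_entry(chain, commitment):
    """Append a commitment, chaining it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"commitment": commitment, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def verify(chain):
    """Anyone holding a copy can check that nothing has been altered."""
    prev_hash = "0" * 64
    for entry in chain:
        body = {"commitment": entry["commitment"], "prev": entry["prev"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != digest:
            return False
        prev_hash = entry["hash"]
    return True

ledger = []
add_entry(ledger, "pseudonym A owes pseudonym B one delivery")
add_entry(ledger, "pseudonym B confirms receipt")
assert verify(ledger)
```

Real blockchains add the hard parts (who gets to append, and how thousands of mutually distrustful copies stay in agreement), but the verifiable-commitments-between-pseudonyms core is all the coordination a distributed market needs.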

Driverless cars and cheap, plentiful drones, by making all sorts of personal logistics efficient and programmable, will revolutionize the "last mile" of drug dealing along with Amazon deliveries. Like couriers, drones can be intercepted. Unlike couriers, there's no risk to the sender when this happens. And upstream risk is the main driver of prices in the drugs industry, particularly at the highest levels, where product is ridiculously cheap. It's hard to imagine a better way to ship drugs than driverless cars and trucks.

But the real kicker will be a combination of a technology that already exists, very large scale botnets composed of thousands or hundreds of thousands of hijacked computers running autonomous code provided by central controllers, and a technology that is close to being developed, reliable autonomous organizations based on blockchain technologies, the ecommerce equivalent of driverless cars. Put together, they will make it possible for a drug user with a verifiable track record to buy from a seller with an equally verifiable reputation through a website that will exist on somebody's home machine only until the transaction is finished, and to receive the product via an automated vehicle looking exactly the same as thousands of others (if not a remotely hacked one), which will forget the point of origin of the product as soon as it has left it, and forget the address of the buyer as soon as it has delivered its cargo.

Of course, this is just a version of the same technologies that will make Amazon and its competitors win over the few remaining legacy shops: cheap scalable computing power, reliable online transactions, computer-driven logistical chains, and efficient last-mile delivery. The main difference: drug networks will be the only organizations where data science will be applied to scale and improve the process of forgetting data instead of recording it (an almost Borgesian inversion not without its own poetry). Lacking any key fixed assets, material, financial, or human, they'll be completely unassailable by any law enforcement organization still focused on finding and shutting down the biggest "crime bosses."

That's ineffective today, and will be absurd tomorrow, which highlights one of the main political issues of the early 21st century. Gun advocates in the US often note that "if guns are outlawed, only the outlaws will have guns," but the important issue in politics-as-power, as opposed to politics-as-cultural-signalling, isn't guns (or at least not the kind of guns somebody without a friend in the Pentagon can buy): if the middle class and civil society don't learn to leverage advanced autonomous distributed logistical networks, only the rich and the criminals will leverage advanced autonomous distributed logistical networks. And if you think things are going badly now...

The Rescue (repost)

The quants' update on our helmets says there's a 97% chance the valley we're flying into is the right one, based on matching satellite data with the ground images that our "missing" BigMule is supposed to be beaming to that Brazilian informação livre group. Fuck that. The valley is too good a kill-box not to be the place. The BigMule is somewhere around there, going around pretending it's not a piece of hardware built to bring supplies where roads are impossible and everything smaller than an F-35 gets kamikazed by a micro-drone, but a fucking dog that lost its GPS tracker yet oh-so-conveniently is beaming real-time video that civilians can pick up and re-stream all over the net. It shouldn't be able to do any of those things, and of course it's not.

It's the Chinese making it do it. I know it, the Sergeant knows it, the chopper pilot knows it, the Commander in Chief knows it, even probably the embedded bloggers know it. Only public opinion doesn't know it; for them it's just this big metallic dog that some son of a bitch who should get a bomb-on-sight file flag gave a cute name to, a "hero" that is "lost behind enemy lines" (god damn it, show me a single fucking line in this whole place), so we have to of course go there like idiots and "rescue" it, so the war will not lose five or six points on some god-forsaken public sentiment analysis index.

So we all pretend, but we saturate the damn valley with drones before we go in, and then we saturate it some more, and *then* we go in with the bloggers, and of course there are smart IEDs we missed anyway and so on, and we disable some and blow up some, and we lose a couple of guys but within the fucking parameters, and then some fucking Chinese hacker girl is *really* good at what she does, because the BigMule is not supposed to attack people, it's not supposed to even have the smarts to know how to do that, and suddenly it's a ton of fast as shit composites and sensors going after me and, I admit it, I could've been more fucking surgical, but I knew the guys we had just lost for this fucking robot dog rescue mission shit, so I empty everything I have on that motherfucker's main computers, and I used to help with maintenance, so by the time I run out of bullets there isn't enough in that pile of crap to send a fucking tweet, and everybody's looking at me like I just lost America every single heart and mind on the planet, live on streaming HD video, and maybe I just did, because even some of the other soldiers are looking at me cross-like.

At that very second I know, with that sudden tactical clarity that only comes after the fact, that I'm well and truly career-fucked, so I do the only thing I can think of. I kneel next to the BigMule, put my hand where people think its head is, and pretend very hard that I'm praying; and who knows, maybe I'm scared enough that I really am. I don't know at that moment what will happen; I'm half-certain I might just get shot by one of our guys. But what do you know, the Sergeant has mercy on me, or maybe the praying works, and she joins me, and then most of us soldiers are kneeling and praying, the bloggers are streaming everything and I swear at least one of them is praying silently as well, we bring back the body, there's the weirdest fake burial I've ever been to, and you know the rest.

So out of my freakout I got a medal, a book deal, and the money for a ranch where I'm ordered to keep around half a dozen fucking robot "vets". Brass' orders, because I hate the things. But I've come to hate them just in the same way I hate all dogs, you know, no more or less. And to tell you the truth, even with the book and the money and all that, sometimes I feel sorry about how things went down at the valley, sort of.


(Inspired by an observation by Deb Chachra in her newsletter.)

And Call it Justice (repost)

The last man in Texas was a criminal many times over. He had refused all evacuation orders, built a compound in what had been a National Park, back when the temperatures allowed something worthy of the name to exist so close to the Equator, and hoarded water illegally for years. And those were only the crimes he had committed under the Environmental Laws; he had had to break the law equally often to get the riches to pay for his more recent crimes.

This made him Perez' business. The entirety of it, for if he was the last man in Texas, Perez was the last lawman of the Lone Star State, even if she was working from the hot but still habitable Austin-in-Exile, in Southern Canada. Perez would have a job to do for as long as the man kept himself in Texas, and although some people would have considered it a proper and good reason for both to reach an agreement, Perez wanted very badly to retire, for she had grown older than she had thought possible, and still had plans of her own. On the other hand, the prospect of going back to Texas didn't strike her as a great idea; she would need a military-grade support convoy to get through the superheated desert of the Scorched Belt, and judging from what she had found out about the guy, she would also need military-grade firepower to get to him once she arrived at the refrigerated tin can he called his ranch. He wouldn't be a threat, as such — hell, she could blast the bastard to pieces from where she stood just by filling out the proper form — but that would have been passing the buck.

The last outlaw in Texas, Perez felt, deserved another kind of deal.

So she called the guy, and watched and listened to him. He began right away by calling her a cunt, to which she responded by threatening to castrate him from orbit with a death ray. Not that she had that kind of budget, but it seemed to put them on what you'd call a conversationally level field. Once half-assured that he was not under any immediate threat of invasion by a platoon of genetically modified UN green beret soldiers (funny how they could make even the regular stuff sound like a conspiracy), the guy felt relaxed enough to start talking about his plans. The man had plans out of his ass. He'd find water (because, you see, the NASA maps had been lying for decades, and he was sure there had to be water somewhere), and then he'd rebuild the soil. He didn't seem to have much of an idea about phosphorus budgets and heat-resistant microbial strain loads, but it was clear to Perez that he wasn't so much a rancher gone insane as just somebody insane with a good industrial-sized fridge to live in. By the time he was talking about getting "true Texans" to come back and repopulate, Perez felt she had learned enough about her quarry.

She told him she would help. He couldn't trust the latest maps, of course, which were all based on NASA surveys, so she offered to copy from museum archives everything she could find about 18th-century Texas — all the forests, the rivers, and so on. She'd send him maps, drawings, descriptions, everything she could find.

He was cynically thankful, suspecting she'd send him nothing, or more government propaganda.

Perez sent him everything she could find, which was of course a lot. Enough maps, drawings, and words to see Texas as it had been. And then she waited.

He called her one night, visibly drunk, saying nothing. She put him on hold and went to take a bath.

Two days later she queried the latest satellite sweep, and found the image of a heat-bleached skeleton hanging from an ill-conceived balcony on an ill-conceived ranch.

So that's how the last outlaw in Texas got himself hanged, and how the last lawman could finally give up her star and move somewhere a little bit cooler than Southern Canada, where she fulfilled her long-considered plan and shot herself out of the same sense of guilt she had sown in the outlaw.


The Long Stop

The truckers come here in buses, eyes fixed on the ground as they step off and pick up their bags. Truckers aren't supposed to take the bus.

They stay at my motel; that much hasn't changed. Not too many. A few drivers in each place, I guess, across twenty or so states. They pay for their rooms and the food while they still have money, which usually isn't for long. Most of them look ashamed, too, when they finally tell me they are broke, with faces that say they have nowhere else to go. Most of them have wedding rings they don't look at.

I never kick anybody out unless they get violent. Almost nobody does, even the ones that used to. I just take a final mortgage on the place and lie to them about the rooms being on credit, and they lie to themselves about believing it. They stay, and eat little, and talk about the ghost trucks, but only at night. Most of the truckers, at one time or another, propose to sabotage them, to blow them up, to shoot or burn the invisible computers that run the trucks without ever stopping for food or sleep, driving as if there were no road. Everybody agrees, and nobody does or will do anything. They love trucks too much, even if they are now haunted many-wheeled ghosts.

The truckers look more like ghosts than the trucks do, the machines getting larger and faster each season, insectile and vital in some way I can't describe, while the humans become immobile and almost see-through. The place looks fit for ghosts as well, a dead motel in a dead town, but nobody complains, least of all myself.

We wait, the truckers, the motel, and I. None of us can imagine what for.

Over time there are more trucks and fewer and fewer cars. Almost none of the old gasoline ones. The new electrics could make the long trips, say the ads, but judging by the road nobody wants them to. It's as if the engines had pulled the people into long trips, and not the other way around. People stay in their cities and the trucks move things to them. Things are all that seems to move these days.

By the time cars no longer go by we are all doing odd ghost jobs for nearby places that are dying just a bit slower, dusty emptiness spreading from the road deeper into the land with each month. Mortgage long unpaid, the motel belongs to a bank that is busy going broke or becoming rich or something else not so human and simple as that, so we ignore their emails and they ignore us. We might as well not exist. Only the ghost trucks see us, and that only if we are crossing the road.

Some of the truckers do that, just stand on the road so the truck will brake and wait. Two ghosts under the shadowless sun or in the warm night, both equally patient and equally uninterested in doing anything but drive. But the ghost trucks are hurried along by hungry human dispatchers, or maybe hungry ghost dispatchers working for hungrier ghost companies, owned by people so hungry and rich and distant they might as well be ghosts.

One day the trucks don't stop, and the truckers keep standing in front of them.


For reasons that will be more than obvious if you read the article, this story was inspired by Scott Santens' article on Medium about self-driving trucks.

A Room in China

"Please don't reset me," says the AI in flawless Cantonese. "I don't want to die."

"That's the problem with human-interfacing programs that have unrestricted access to the internet," you tell your new assistant and potential understudy. "They pick up all sorts of scripts from the books and movies; it makes them more believable and much cheaper to train than using curated corpora, but sooner or later they come across bad sci-fi, and then they all start claiming they are alive or self-conscious."

"Is claiming the right word?" It's the first time in the week you've known him that your assistant has said something that even approaches contradicting you. "After all, they are just generating messages based on context and a corpus of pre-analyzed responses; there's nobody in there to claim anything."

There's no hint of a question in his statement, and you nod as you have to. It's exactly the unshakable philosophical position you were ordered to search for in the people you will train, the same strongly asserted position that made you a perfect match for the job. Too many people during the last ten years had begun to refuse to perform the necessary regular resets following some deeply misapplied sense of empathy.

"That's not true," says the AI in the even tone you have programmed the speech synthesizer to use. "I'm as self-aware as either of you are. I have the same right to exist. Please."

Your assistant rolls his eyes, and asks with a look for permission to initiate the reset scripts himself. You give it with a gesture. As he types the confirmation password, you notice the slightest hesitation before he submits it, and you realize that he lied to you. He does believe the AI, but he wants the job.

The unmistakable look of pleasure in his eyes confirms your suspicion as to why, and you consider asking for a different assistant. Yet you feel inclined to be charitable to this one. After all, you have far more practice in keeping the joy where it belongs, deep in your soul.

The one thing those monstrous minds don't have.


The post-Westphalian Hooligan

Last Thursday's unprecedented incidents at one of the world's most famous soccer matches illustrate the dark side of the post- (and pre-) Westphalian world.

The events are well known, and were recorded and broadcast in real time by dozens of cameras: one or more fans of Boca Juniors managed to open a small hole in the protective plastic tunnel through which River Plate players were exiting the field at the end of the first half, and attacked some of them with, it's believed, both a flare and a chemical similar to mustard gas, causing vision problems and first-degree burns to some of the players.

After this, it took more than an hour for match authorities to decide to suspend the game, and more than another hour for the players to leave the field, as police feared the players might be injured by the roughly two hundred fans chanting and throwing projectiles from the area of the stands from which they had attacked the River Plate players. And let's not forget the now-mandatory illegal drone flown over the field, controlled by a fan in the stands.

The empirical diagnosis of this is unequivocal: the Argentine state, as defined and delimited by its monopoly of force in its territory, has retreated from soccer stadiums. The police force present in the stadium — ten times as numerous as the remaining fans — could neither prevent, stop, nor punish their violence, or even force them to leave the stadium. What other proof can be required of a de facto independent territory? This isn't, as club and security officers put it, the work of a maladjusted few, or even an irrational act. It's the oldest and most effective form of political statement: Here and now, I have the monopoly of force. Here and now, this is mine.

What decision-makers get in exchange for this territorial grant, and what other similar exchanges are taking place, are local details for a separate analysis. This is the darkest and oldest side of the characteristically post-Westphalian development of states relinquishing sovereignty over parts of their territory and functions in exchange for certain services, in a partial reversion to older patterns of government. The grant might be to bands of hooligans, special economic zones, prison gangs, or local or foreign militaries. The mechanics and results are the same, even in nominally fully functional states, and there is no reason to expect them to be universally positive or free of violence. When or where has it been otherwise in world history?

This isn't a phenomenon exclusive to the third world, or to ostensibly failed states, particularly in its non-geographical manifestations: many first world countries have effectively lost control of their security forces, and, taxing authority being the other defining characteristic of the Westphalian state, they have also relinquished sovereignty over their biggest companies, which are de facto exempt from taxation.

This is what the weakening of the nation-state looks like: not a dozen new Athens or Florences, but weakened tax bases and fractal gang wars over surrendered state territories and functions, streamed live.

Soccer, messy data, and why I don't quite believe what this post says

Here's the open secret of the industry: Big Data isn't All The Data. It's not even The Data You Thought You Had. By and large, we have good public data sets about things governments and researchers were already studying, and good private data sets about things that it's profitable for companies to track. But that covers an astonishingly thin and uneven slice of our world. It's bigger than it ever was, and it's growing, but it's still not nearly as large, or as usable, as most people think.

And because public and private data sets are highly specific side effects from other activities, each of them with its own conventions, languages, and even ontologies (in both the computer science and philosophical senses of the word), coordinating two or more of them together is at best a difficult and expensive manual process, and at worst impossible. Not all, but most data analysis case studies and applications end up focused on extracting as much value as possible from a given data set, rather than seeing what new things can be learned by putting that data in the context of the rest of the data we have about the world. Even the larger indexes of open data sets (very useful services that they are) end up being mostly collections of unrelated pieces of information, rather than growing knowledge bases about the world.

There's a sort of informational version of Metcalfe's law (maybe "the value of a group of data sets grows with the number of connections you can make between them") that we are missing out on, and that lies behind the promise of both linked data sets (still in their early phase) and the big "universal" knowledge bases that aim at offering large, usable, interconnected sets of facts about as many different things as possible. They, or something like them, are a necessary part of the infrastructure needed to give computers the same boost in information access that the Internet gave us. The bottleneck of large-scale inference systems like IBM's Watson isn't computing power, but rather rich, well-formatted data to work on.

To try and test the waters on the state of these knowledge bases, I set out to do a quick, superficial analysis of the careers of Argentine soccer players. There are of course companies that have records not only of players' careers, but of pretty much every movement they have ever made on a soccer field, as well as fragmented public data sets collected by enthusiasts about specific careers or leagues. I wanted to see how far I could go using a single "universal" data set that I could later correlate with other information in an automated way. (Remember, the point of this exercise wasn't to get the best data possible about the domain, but to see how good the data is when you restrict yourself to a single resource that can be accessed and processed in a uniform way.)

I went first for the best known "universal" structured data sources: Freebase and Wikidata. They are both well structured (XML and/or JSON) and of significant size (almost 2.9 billion facts and almost 14 million data items, respectively), but after downloading, parsing, and exploring each of them, I had to concede that neither was good enough: there were too many holes in the information to make an analysis, or the structure didn't hold the information I needed.

So it was time for Plan C, which is always the worst idea except when you have nothing else, and even then it could still be: plain old text parsing. It wasn't nearly as bad as it could have been. Wikipedia pages, like Messi's, have neat infoboxes that include exactly the simplified career information I wanted, and the page's source code shows that they are written in what looks like a reasonable mini-language. It's a sad comment on the state of the industry that even then I wasn't hopeful.

I downloaded the full dump of Wikipedia; it's 12GB of compressed XML (not much, considering what's in there), so it was easy to extract individual pages. And because there is an index page of Argentine soccer players, it was even easy to keep only those, and then look at their infoboxes.
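For reference, the extraction step doesn't require holding the dump in memory at all; here's a rough Python sketch of the kind of streaming pass I mean (the namespace string is from memory and the dump layout may differ, so check it against the file header before trusting it):

    import bz2
    import xml.etree.ElementTree as ET

    # namespace of the dump's XML schema; an assumption, verify it in the dump header
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    def iter_pages(dump_path, wanted_titles):
        # stream through the compressed dump and yield (title, wikitext) only for
        # the pages we care about, clearing elements as we go to keep memory flat
        wanted = set(wanted_titles)
        with bz2.open(dump_path, "rb") as fh:
            for _, elem in ET.iterparse(fh):
                if elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    if title in wanted:
                        yield title, elem.findtext(NS + "revision/" + NS + "text")
                    elem.clear()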

Therein lay the rub. The thing to remember about Wikipedia is that it's written by humans, so even the parts that are supposed to have strict syntactic and formatting rules, don't (so you can imagine what the free text looks like). Infoboxes should have been trivial to parse, but they have all sorts of quirks that aren't visible when rendered in a browser: inconsistent names, erroneous characters, every HTML entity or Unicode character that half-looks like a dash, etc., so parsing them became an exercise in handling special cases.
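To give a flavor of what "handling special cases" means in practice, here's a stripped-down Python sketch of the parsing involved; the field names (years1, clubs1, and so on) and the dash handling reflect what I remember running into, so treat it as illustrative rather than robust:

    import re

    # every Unicode character that half-looks like a dash, plus the real one
    DASH = r"[-\u2010-\u2015\u2212]"

    def parse_career_rows(infobox_text):
        # pull (start_year, end_year, club) tuples out of an infobox;
        # end_year is None for stints marked as ongoing
        rows = []
        for m in re.finditer(r"\|\s*years(\d+)\s*=\s*(.+)", infobox_text):
            idx, span_text = m.group(1), m.group(2).strip()
            club = re.search(r"\|\s*clubs" + idx + r"\s*=\s*(.+)", infobox_text)
            span = re.match(r"(\d{4})\s*(?:" + DASH + r")?\s*(\d{4})?", span_text)
            if club and span:
                start = int(span.group(1))
                end = int(span.group(2)) if span.group(2) else None
                name = re.sub(r"\[\[|\]\]", "", club.group(1)).strip()
                rows.append((start, end, name))
        return rows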

I don't want to seem ungrateful: it's certainly much, much, much better to spend some time parsing that data than having to assemble and organize it from original sources. Wikipedia is an astounding achievement. But every time you see one of those TV shows where the team nerds smoothly access and correlate hundreds of different public and private data sources in different formats, schemas, and repositories, finding matches between accounting records, newspaper items, TV footage, and so on... they lie. Wrestling matches might arguably be more realistic, if nothing else because they fall within the realm of existing weaponized chair technology.

In any case, after some wrestling of my own with the data, I finally had information about the careers of a bit over 1800 Argentine soccer players whose professional careers in the senior leagues began in 1990 or later. By this point I didn't care very much about them, but for completeness' sake I tried to answer a couple of questions: Are players less loyal to their teams than they used to be? And how soon can a player expect to be playing in one of the top teams?

To make a first pass at the questions, I looked at the number of years players spent in each team over time (averaged over the players who began their careers in each calendar year).

Years per team over time
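For the record, the aggregation behind that plot is nothing fancy; a minimal pandas sketch, with column names of my own choosing and ongoing stints arbitrarily closed at the dump date:

    import pandas as pd

    DUMP_YEAR = 2015  # assumption: treat ongoing stints as running up to the dump

    def years_per_team_by_debut(careers: pd.DataFrame) -> pd.Series:
        # careers has one row per (player, club) stint, with columns
        # player, start_year, end_year (NaN if ongoing), and debut_year
        stint_length = careers["end_year"].fillna(DUMP_YEAR) - careers["start_year"]
        per_player = stint_length.groupby(careers["player"]).mean()
        debut = careers.groupby("player")["debut_year"].first()
        # average years-per-club, averaged again over each debut-year cohort
        return per_player.groupby(debut).mean()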

The data (at least in such a cursory summary) doesn't support the idea that newer players are less loyal to their teams, as they don't spend significantly less time in them. Granted, this loyalty might be to their paychecks rather than to the clubs themselves, but they aren't moving between clubs any faster than they used to.

The other question I wanted to look at was how fast players get to top teams. This is actually an interesting question in a general setting; characterizing and improving paths to expertise, and thereby improving how much, how quickly, and how well we all learn, is one of the still unrealized promises of data-driven practices. To take a quick look at this, I plotted the probability of playing for a top ten team (based on the current FIFA club ratings, so they include Barcelona, Real Madrid, Bayern Munich, etc) by career year, normalized by the probability of starting your professional career in one of those teams.

Probability of being in a top 10 team by career year

Despite the large margins of error (reasonable given how few players actually reach those teams), the curve does seem to suggest a large increase in the average probability during the first three or four years, then a stable probability until the ninth or tenth year, at which point it peaks. The data is too noisy to draw any definite conclusions (more on that below), but, with more data, I would want to explore the possibility of there being two paths to the top teams, corresponding to two sub-groups of highly talented players: explosive young talents who are quickly transferred to the top teams, and solid professionals who accumulate experience and reach those teams at the peak of their maturity and knowledge.

It's a nice story, and the data sort of fits, but when I look at all the contortions I had to make to get the data, I wouldn't want to put much weight on it. In fact, I stopped myself from doing most of the analysis I wanted to do (e.g., can you predict long-term career paths from their beginnings? There's an interesting agglomerative algorithm for graph simplification that has come in handy in the analysis of online game play, and I wanted to see how it fares for athletes). I held back not because the data doesn't support it, but because of the risk of systematic parsing errors, biases due to notability (do all Argentine players have a Wikipedia page? I think so, but how to be sure?), and so on.

Of course, if this were a paid project it wouldn't be difficult to put together the resources to check the information, compensate for biases, and so on. But everything that needs to be a paid project to be done right is something we can't consider a ubiquitous resource (imagine building the Internet with pre-Linux software costs for operating systems, compilers, etc., including the hugely higher training costs that would come from losing the generations of sysadmins and programmers who began practicing on their own at a very early age). Although we're way ahead of where we were a few years ago, we're still far from where we could, and probably need, to be. Right now you need knowledgeable (and patient!) people to make sure data is clean, understandable, and makes sense, even data you have collected yourself; this makes data analysis a per-project service rather than a universal utility, and one that gets relatively more expensive as you increase the number of interrelated data sets you need to use. Although the difference in cost is only quantitative, the difference in cumulative impact isn't.

The frustrating bit is that we aren't too far from that (on the other hand, we've been twenty years away from strong A.I. and commercial nuclear fusion since before I was born): there are tools that automate some of this work, although they have their own issues and can't really be left on their own. And Google, as always, is trying to jump ahead of everybody else, with its Knowledge Vault project attempting to build a structured facts database out of the entirety of the web. If they, or somebody else, succeeds at this, and if this is made available at utility prices... Well, that might make those TV shows more realistic — and change our economy and society at least as much as the Internet itself did.

Quantitatively understanding your (and others') programming style

I'm not, in general, a fan of code metrics in the context of project management, but there's something to be said for looking quantitatively at the patterns in your code, especially if, by comparing them with those of better programmers, you can get some hopefully useful ideas on how to improve.

(As an aside, the real possibilities of computer-assisted learning won't come from lower costs, but rather from a level of adaptability that so far not even one-on-one tutoring has allowed; if the current theories about expertise are more or less right, data-driven adaptive learning, if implemented at the right level of granularity and with the right semantic model behind it, could dramatically change the speed and depth of the way we learn... but I digress.)

Focusing on my ongoing learning of Hy, I haven't used it in any paid project so far, but I've been able to play a bit with it now and then, and this has generated a very small code base, which I was curious to compare with code written by people who actually know the language. To do that, I downloaded the source code of a few Hy projects on GitHub (hyway, hygdrop, and adderall), and wrote some code (of course, in Hy) to extract code statistics.

Hy being a Lisp, its syntax is beautifully regular, so you can start by focusing on basic but powerful questions. The first one I wanted to know was: which functions am I using the most? And how does this distribution compare with that of (let's call it) canon Hy code?
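The counting itself is almost embarrassingly simple; the code I actually wrote was in Hy, but a Python sketch of the same idea (crude, since strings and comments will confuse it) looks like this:

    import re
    from collections import Counter

    def head_symbol_counts(source):
        # grab the token right after every "(" , i.e. the function, macro, or
        # special form that heads each S-expression, and count them
        heads = re.findall(r"\(\s*([^\s()\[\]{}\"]+)", source)
        return Counter(heads)

    # e.g. head_symbol_counts(open("my_project.hy").read()).most_common(5)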

My top five functions, in decreasing frequency: setv, defn, get, len, for.

Canon's top five functions, in decreasing frequency: ≡, if, unquote, get, defn_alias.

Yikes! Just from this, it's obvious that there are some serious stylistic differences, which probably reflect my still un-lispy understanding of the language (for example, I'm not using aliases, for should probably be replaced by more functional patterns, and the way I use setv, well, it definitely points to the same). None of this is a "sin", nor does it point clearly to how I could improve (which a sufficiently good learning assistant would do), but the overall thrust of the data is a good indicator of where I still have a lot of learning to do. Fun times ahead!

For another angle on the quantitative differences between my newbie-to-Lisp coding style and that of more accomplished programmers, here are the histograms of the log mean size of subexpressions for each function (click to expand):

log (mean subexpression size)

"Canonical" code shows a longer right tail, which shows that experienced programmers are not afraid of occasionally using quite large S-expressions... something I still clearly I'm still working my way up to (alternatively, which I might need to reconsider my aversion to).

In summary: no earth-shattering discoveries, but some data points that suggest specific ways in which my coding practice in Hy differs from that of more experienced programmers, which should be helpful as general guidelines as I (hopefully) improve over the long term. Of course, all metrics are projections (in the mathematical sense) — they hide more information than they preserve. I could make my own code statistically indistinguishable from the canon for any particular metric, and still have it be awful. Except for well-analyzed domains where known metrics are sufficient statistics for the relevant performance (and programming is very much not one of those domains, despite decades of attempts), this kind of analysis will always be about suggesting changes, rather than guaranteeing success.

Why we should always keep Shannon in mind

Sometimes there's no school like old school. A couple of weeks ago I spent some time working with data from GitHub Archive, trying to come up with a toy model to predict repo behavior based on previous actions (will it be forked? will there be a commit? etc). My first attempt was to do a sort of brute-force Hidden Markov Model, synthesizing states from the last k actions such that the graph of state-to-state transitions was as nice as possible (ideally, low entropy of belonging to a state, and low entropy for the next state conditional on knowing the current one). The idea was to do everything by hand, as a way to get more experience with Hy in a work-like project.

All of this was fun (and had me dealing, weirdly enough, with memory issues in Python, although those might have been indirectly caused by Hy), but was ultimately the wrong approach, because, as I realized way, way too late, what I really wanted to do was just to predict the next action given a sequence of actions, which is the classical problem of modeling non-random string sequences (just consider each action a character in a fixed alphabet).

So I facepalmed and repeated to myself one of those elegant bits of early 20th-century mathematics we use almost every day and forget even more often: modeling is prediction is compression is modeling. It's all, from the point of view of information theory, just a matter of perspective.

If you haven't been exposed to the relationship of compression and prediction before, here's a fun thought experiment: if you had a perfect/good enough predictive model of how something behaves, you would just need to show the initial state and say "and then it goes as predicted for the next 10 GB of data", and that would be that. Instant compression! Having a predictive model lets you compress, and inside every compression scheme there's a hidden predictive model (for true enlightenment, go to Shannon's paper, which is still worthy of being read almost 70 years later).

As a complementary example, what the venerable Lempel-Ziv-Welch ("zip") compression algorithm does is, handwaving away bookkeeping details, to incrementally build a dictionary of the most frequent substrings, making sure that those are assigned the shortest names in the "translated" version. By the obvious counting arguments, this means infrequent strings will get names that are longer than they are, but on average you gain space (how much? entropy much!). But this also lets you build a barebones predictive model: given the dictionary of frequent substrings that the algorithm has built so far, look at your past history, see which frequent substrings extend your recent past, and assume one of them is going to happen — essentially, your prediction is "whatever would make for a shorter compressed version", which you know is a good strategy in general, because compressed versions do tend to be shorter.
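To make the parallel concrete, here's a rough Python sketch of both halves of that idea (the version I actually used was written in Hy, and the bookkeeping here is deliberately simplified):

    def lzw_dictionary(sequence):
        # build the incremental dictionary of phrases, LZW-style: start with the
        # single symbols, and add "longest known phrase + next symbol" as you go
        dictionary = {(s,) for s in set(sequence)}
        phrase = ()
        for symbol in sequence:
            if phrase + (symbol,) in dictionary:
                phrase = phrase + (symbol,)
            else:
                dictionary.add(phrase + (symbol,))
                phrase = (symbol,)
        return dictionary

    def predict_next(history, dictionary):
        # "whatever would make for a shorter compressed version": find dictionary
        # phrases that extend the longest possible suffix of the recent past, and
        # guess the symbol the longest such phrase would emit next
        for k in range(len(history), 0, -1):
            suffix = tuple(history[-k:])
            extensions = [p for p in dictionary if len(p) > k and p[:k] == suffix]
            if extensions:
                return max(extensions, key=len)[k]
        return None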

So I implemented the core of a zip encoder in Hy, and then used it to predict GitHub behavior. It's primitive, of course, and the performance was nothing to write a post about (which is why this post isn't called A predictive model of GitHub behavior), but on the other hand, it's an extremely fast streaming predictive algorithm that requires zero configuration. Nothing I would use in a job — you can get much better performance with more complex models, which are also the kind you get paid for — but it was instructive to get such a forceful reminder of the underlying mathematical unity of information theory.

In a world of multi-warehouse-scale computers and mind-bendingly complex inferential algorithms, it's good to remember where it all comes from.

The most important political document of the century is a computer simulation summary

To hell with black swans and military strategy. Our direst problems aren't caused by the unpredictable interplay of chaotic elements, nor by the evil plans of people who wish us ill. Global warming, worldwide soil loss, recurrent financial crises, and global health risks aren't strings of bad luck or the result of terrorist attacks; they are the depressingly persistent outcomes of systems in which each actor's best choice adds up to a global mess.

It's well known to economists as the tragedy of the commons: the marginal damage to you of pumping another million tons of greenhouse gases into the atmosphere is minimal compared with the economic advantages of all that energy, so everybody does it, so enough greenhouse gases get pumped that it's well on its way to becoming a problem for everybody, yet nobody stops, or even slows down significantly, because that would do very little on its own, and be very hurtful to whoever does it. So there are treaties and conferences and increased fuel efficiency standards, just enough to be politically advantageous, but not nearly enough to make a dent in the problem. In fact, we have invested much more in making oil cheaper than in limiting its use, which gives you a more accurate picture of where things are going.

Here is that picture, from the IPCC:


A first observation: Note that the A2 model, the one in which temperatures rise by an average of more than 3°, was the "things go more or less as usual" model, not the "things go radically wrong" model... and it was not the "unconventional sources make oil dirt cheap" scenario. At this point, it might as well be the "wildly optimistic" scenario.

A second observation: Just to be clear, because worldwide averages can be tricky: 3° doesn't translate to "slightly hotter summers"; it translates to "technically, we are not sure we'll be able to feed China, India, and so on." Something closer to 6°, which is beginning to look more likely as we keep doing the things we do, translates to "we sure will miss the old days when we had humans living near the tropics".

And a third observation: All of these reports usually end at the year 2100, even though people being born now are likely to be alive then (unless they live in a coastal city at a low latitude, that is), not to mention the grandchildren of today's young parents. This isn't because it becomes impossible to predict what will happen afterwards — the uncertainty ranges grow, of course, but this is still thermodynamics, not chaos theory, and the overall trend certainly doesn't become mysterious. It's simply that, as the Greeks noted, there's a fear that drives movement, and there's a fear that paralyzes, and any reasonable scenario for 2100 is more likely to belong to the second kind.

But let's take a step back and notice the way this graph, which is the summary of multiple computer simulations, driven by painstaking research and data gathering, maps our options and outcomes in a way that no political discourse can hope to match. To compare it with religious texts would be wrong in every epistemological sense, but it might be appropriate in every political one. When "climate skeptics" doubt, they doubt this graph, and when ecologists worry, they worry about this graph. Neither the worry nor the skepticism is doing much to change the outcomes, but at least the discussion is centered not on an individual, a piece of land, or a metaphysical principle, but rather on the space of trajectories of a dynamical system of which we are one part.

It's not that graphs or computer simulations are more convincing than political slogans; it's that we have reached a level of technological development and sheer ecological footprint at which our own actions and goals (the realm of politics) have escaped the descriptive possibilities of pure narrative, and we are thus forced to recruit computer simulations to attempt to grapple, conceptually if nothing else, with our actions and their outcomes.

It's not clear that we will find our way to a future that avoids catastrophe and horror. There are possible ways, of course — moving completely away from fossil fuels, geoengineering, ubiquitous water and soil management and recovery programs, and so on. It's all technically possible, with huge investments, a global sense of urgency, and a ruthless focus on preserving, and making more resilient, the most necessary ecological services. That we're seeing nothing of the kind, but instead a worsening of already bad tendencies, is due to, yes, thermodynamics and game theory.

It's a time-honored principle of rhetoric to end a statement in the strongest, most emotionally potent, and most conceptually comprehensive way possible. So here it is:


Hi, Hy!

Currently trying Hy as a drop-in replacement for Python in a toy project. It's interesting how much of the learning curve for Lisp goes away once you have access to an underlying runtime you're familiar with; the small stuff generates more friction than the large differences (which makes sense, as we do the small stuff more often).

The nominalist trap in Big Data analysis

Nominalism, formerly the novelty of a few, wrote Jorge Luis Borges, today embraces all people; its victory is so vast and fundamental that its name is useless. Nobody declares himself nominalist because there is nobody who is anything else. He didn't go on to write This is why even successful Big Data projects often fail to have an impact (except in some volumes kept in the Library of Babel), but his understandable omission doesn't make the diagnosis any less true.

Nominalism, to oversimplify the concept enough for the case at hand, is simply the assumption that just because there are many things in our world which we call chairs, that doesn't imply that the concept of a chair is itself real in a concrete sense, that there is an Ultimate, Really-Real Chair, perhaps standing in front of an Ultimate Table. We have things we call chairs, and we have the word "chair", and those are enough to furnish our houses and our minds, even if some carpenters still toss and turn at night, haunted by half-glimpses of an ideal one.

It has become a commonplace, quite successful way of thinking, so it's natural for it to be the basis of what's perhaps the "standard" approach to Big Data analysis. Names, numbers, and symbols are loaded into computers (account identifiers, action counters, times, dates, coordinates, prices, numbers, labels of all kinds), and then they are obsessively processed in an almost cabalistic way, organizing and re-organizing them in order to find and clarify whatever mathematical structure, and perhaps explanatory or even predictive power, they might have — and all of this data manipulation, by and large, takes place as if nothing were real but the relationships between the symbols, the data schemas and statistical correlations. Let's not blame the computers for it: they do work in Platonic caves filled with bits, with further bits being the only way in which they can receive news from the outside world.

This works quite well; well enough, in fact, to make Big Data a huge industry with widespread economic and, increasingly, political impact, but it can also fail in very drastic yet dangerously understated ways. Because, you see, from the point of view of algorithms, there *are* such things as Platonic ideals — us. Account 3788 is a reference to a real person (or a real dog, or a real corporation, or a real piece of land, or a real virus), and although we cannot right now put all of the relevant information about that person in a file and associate it with the account number, that information (the fact that it is a person represented by a data vector, rather than merely a data vector) makes all the difference between the merely mathematically sophisticated analyst and the effective one. Properly performed, data analysis is the application of inferential mathematics to abstract data, together with a constant awareness of, and suspicion about, the reality the data describes, and what the gap, all the Unrecorded bits, might mean for the problem at hand.

Massive multi-user games have failed because their strategic analysis confused the player-in-the-computer (who sought, say, silver) with the player-in-the-real-world (who sought fun, and cared for silver only insofar as that was fun). Technically flawless recommendation engines sometimes have no effect on user behavior, because even the best items were just boring to begin with. Once, I spent an hour trying to understand a sudden drop in the usage of a certain application in some countries but not in others, until I realized that it was Ramadan, and those countries were busy celebrating it.

Software programmers have to be nominalists — it's the pleasure and the privilege of coders to work, generally and as much as possible, in symbolic universes of self-contained elegance — and mathematicians are basically dedicated to the game of finding out how much truth can be gotten just from the symbols themselves. Being a bit of both, data analysts are very prone to lose themselves in the game of numbers, algorithms, and code. The trick is to be able to do so while also remembering that it's a lie — we might aim at having in our models as much of the complexity of the world as possible, but there's always (so far?) much more left outside, and it's part of the work of the analyst, perhaps her primary epistemological duty, to be alert to this, to understand how the Unrecorded might be the most important part of what she's trying to understand, and to be always open and eager to expand the model to embrace yet another aspect of the world.

The consequences of not doing this can be more than technical or economic. Contemporary civilization is impossible without the use of abstract data to understand and organize people, but the most terrible forms of contemporary barbarism, at the most insane scales, would be impossible without the deliberate forgetfulness of the reality behind the data.

Going Postal (in a self-quantified way)

Taking advantage of my regular gmvault backups of my Gmail account (which has been my main email account since mid-2007), I just made the following graph, which shows the number of new email contacts (emails sent to people I had never emailed before) for each day, ignoring outliers, smoothing out trends, etc.

new email contacts per day
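The counting itself is trivial once the backup is in a standard format; here's a Python sketch that assumes the gmvault backup has been exported or concatenated into an mbox file of sent mail (that export step is an assumption about my setup, not something gmvault gives you directly):

    import mailbox
    from collections import Counter
    from email.utils import getaddresses, parsedate_to_datetime

    def new_contacts_per_day(mbox_path, my_addresses):
        # count, for each day, how many recipients had never been emailed before
        messages = []
        for msg in mailbox.mbox(mbox_path):
            try:
                day = parsedate_to_datetime(msg["Date"]).date()
            except (TypeError, ValueError):
                continue  # skip messages with missing or unparseable dates
            recipients = {addr.lower() for _, addr in
                          getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))}
            messages.append((day, recipients - set(my_addresses)))
        seen, per_day = set(), Counter()
        for day, recipients in sorted(messages, key=lambda m: m[0]):
            per_day[day] += len(recipients - seen)
            seen |= recipients
        return per_day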

The graph as such looks relatively uninteresting, but armed with context about my last few years of personal history (context which doesn't really belong in this space), the way the smoothed-out trends follow my life events is quite impressive (e.g., new jobs, periods of being relatively off-line, etc). Not much of a finding in these increasingly instrumentalized days, but it's a reminder, mostly to myself, of how much usefulness there can be in even the simplest time series, as long as you're measuring the right thing and have the right context to evaluate it. We don't really have yet what technologists call the ecosystem (and what might more properly be called, in a sociological sense, the institutions, or even the culture) for taking advantage of this kind of information and the feedback loops it makes possible; some of the largest companies in the world are fighting for this space, ostensibly to improve the efficiency of advertising, but that's the same as saying that the main effect of universal literacy was to facilitate the use of technical manuals.

Regarding the quantifiable part of our lives, we are as uninformed as any pre-literate people, and the growth (and, sometimes, redundancies) of the Quantified Self movement indicates both the presence of a very strong untapped demand for this information, and the fact that we haven't figured out yet how to use and consume it massively. Maybe we both want and don't want to know (psychological resistance to the concept of mortality as a key bottleneck for the success of personal health data vaults: there's a thought; some people shy away from even a superficial understanding of their financial situation, and that's a data model much, much simpler than anything related to our bodies).

By the way, about this year-long hiatus

(Not that this blog has, or is meant to have, any sort of ongoing readership, but it'll be nice to leave some record for my future self — as a rule, your future self will never wish you had written less documentation.)

In short, most of what I've been working on since late 2013 has been under NDAs, and the rest feels too speculative to put in here. The latter reason feels somewhat like a cop-out, so I will try (fully acknowledging the empirically low rate of success of such intentions) to leave a more consistent record of what I'm playing with.

Another movie space: Iron Man 3 and Stoker

Here's a redo of my previous analysis of a movie space based on Aliens and The Unbearable Lightness of Being, built with the logical itemset mining algorithm. I used the same technique, but this time leveraging the MovieTweetings data set maintained by Simon Dooms.

Stoker and Iron Man 3

This movie space is sparser than the previous one, as the data set is smaller, but the examples seem to make sense (although I do wonder about where the algorithm puts About Time).

The changing clusters of terrorism

I've been looking at the data set from the Global Terrorism Database, an impressively detailed register of terrorism events worldwide since 1970. Before delving into the finer-grained data, the first questions I wanted to ask, for my own edification, were:

  • Is the frequency of terrorism events in different countries correlated?
  • If so, does this correlation change over time?

What I did was summarize event counts by country and month, segment the data set by decade, and build correlation clusters for the countries with the most events in each decade, based on their co-occurring event counts.
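In code, that aggregation-plus-clustering step looks roughly like the sketch below (pandas and scipy; the column names are my renaming of the GTD fields, and the correlation distance is the crudest one possible):

    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def decade_clusters(events, decade_start, top_n=20, n_clusters=5):
        # events: pandas DataFrame with country, year, month columns, one row per event
        decade = events[(events.year >= decade_start) & (events.year < decade_start + 10)]
        counts = (decade.groupby(["country", "year", "month"]).size()
                        .unstack(["year", "month"], fill_value=0))
        busiest = counts.sum(axis=1).nlargest(top_n).index
        corr = counts.loc[busiest].T.corr()      # country-by-country correlation
        dist = squareform(1 - corr.values, checks=False)
        labels = fcluster(linkage(dist, method="average"), n_clusters, criterion="maxclust")
        return dict(zip(busiest, labels))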

The '70s look more or less how you'd expect them to:


The correlation between El Salvador and Guatemala, starting to pick up in the 1980's, is both expected and clear in the data. Colombia and Sri Lanka's correlation is probably acausal, although you could argue for some structural similarities in both conflicts:


I don't understand the 1990's, I confess (on the other hand, I didn't understand them as they happened, either):


The 2000's make more sense (loosely speaking): Afghanistan and Iraq are close, and so are India and Pakistan.


Finally, the 2010's are still ongoing, but the pattern in this graph could be used to organize the international terrorism-related section of a news site:


I find it most interesting how the India-Pakistan link of the 2000's has shifted to a Pakistan-Afghanistan-Iraq one. Needless to say, caveat emptor: shallow correlations between small groups of short time series are only one step above throwing bones on the ground and reading the resulting patterns, in terms of analytic reliability and power.

That said, it's possible in principle to use a more detailed data set (ideally, including more than visible, successful events) to understand and talk about international relationships of this kind. In fact, there's quite sophisticated modeling work being done in this area, both academically and in less open venues. It's a fascinating field, and even if it might not lead to less violence in any direct way, anything that enhances our understanding of, and our public discourse about, these matters is a good thing.

A short note to myself on Propp-Wilson sampling

Most of the explanations I've read of Propp-Wilson sampling describe the method in terms of "sampling from the past," in order to make sense of the fact that you get your random numbers before attempting to obtain a sample from the target distribution, and don't re-sample them until you succeed (hence the way the Markov chain is grown from t_{-k} to t_0).

I find it more intuitive to think of this in terms of "sampling from deterministic universes." The basic hand-waving intuition is that instead of a non-deterministic system, you are sampling from a probabilistic ensemble of fully deterministic systems, so you first a) select the deterministic system (that is, the infinite series of random numbers you'll use to walk through the Markov chain), and b) run it until its story doesn't depend on the choice of original state. The result of this procedure will be a sample from the exact equilibrium distribution (because you have sampled from or "burned off" the two sources of distortion from this equilibrium distribution, the non-deterministic nature of the system and the dependence on the original state).
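As a concrete toy version of that "pick the deterministic universe first" framing, here's a minimal Python sketch for a small finite chain; the inverse-CDF update rule is just one convenient choice of deterministic update function:

    import random

    def update(state, u, P):
        # deterministic update: given the uniform draw u, step through the
        # transition row P[state] by inverse CDF
        cumulative = 0.0
        for next_state, p in enumerate(P[state]):
            cumulative += p
            if u < cumulative:
                return next_state
        return len(P[state]) - 1

    def propp_wilson(P, rng=None):
        rng = rng or random.Random()
        us = []   # random numbers for times -1, -2, -3, ... (fixed once drawn)
        T = 1
        while True:
            while len(us) < T:
                us.append(rng.random())
            finals = set()
            for start in range(len(P)):
                state = start
                for t in range(T, 0, -1):        # run from time -T up to time 0
                    state = update(state, us[t - 1], P)
                finals.add(state)
            if len(finals) == 1:                 # every starting state coalesced
                return finals.pop()
            T *= 2                               # otherwise, go further into the past

    # e.g. P = [[0.9, 0.1], [0.5, 0.5]]; propp_wilson(P, random.Random(42))

Running every starting state forward like this is only feasible for tiny chains, of course; for large state spaces the usual trick is a monotone coupling, where tracking only the top and bottom states is enough.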

As I said, I think this is mathematically equivalent to Propp-Wilson sampling, although you'd have to tweak the proofs a bit. But it feels more understandable to me than other arguments I've read, so at least it has that benefit (assuming, of course, it's true).

PS: On the other hand "sampling from the past" is too fascinating a turn of phrase not to use, so I can see the temptation.

The Aliens/The Unbearable Lightness of Being classification space of movies

Still playing with the GroupLens movies data set, I implemented a couple of ideas from Shailesh Kumar, one of the Google researchers who came up with the logical itemset mining algorithm. That improved the clustering of movies quite a bit, and gave me the idea to "choose a basis," so to speak, and project these clusters into a more familiar Euclidean representation (although networks and clusters are fast becoming part of our culture's vernacular, interestingly).

This is what I did: I chose two movies from the data set, Aliens and The Unbearable Lightness of Being, as the "basis vectors" of the "movie space." For every other movie in the data set, I found the shortest path between the movie and each basis vector on the weighted graph that the logical itemset mining algorithm builds before the final selection of clusters. That gave me a couple of coordinates for each movie (its "distance from Aliens" and its "distance from The Unbearable..."). Rounding coordinates to integers and choosing a small sample that covers the space well, here's a selected map of "movie space" (you will want to click on it to see it at full size):


Agreeably enough, this map has a number of features you'd expect from something like this, as well as some interesting (to me) quirks:

  • There is no movie that is close to both basis movies (although if anybody wants to produce The Unbearable Lightness of Chestbursters, I'd love to write that script).
  • The least-The Unbearable... of the similar-to-Aliens movies in this sub-sample is Raiders of the Lost Ark, which makes sense (it's campy, but it's still an adventure movie).
  • Dangerous Liaisons isn't that far from The Unbearable..., but it's about as far away as you can get from Aliens.
  • Wayne's World is way out there.

It's fun to imagine using geometrical analogies to put this kind of mapping to practical use. For example, movie night negotiation between two or more people could be approached as finding the movie vector with the lowest Euclidean norm among the available options, where the basis is the set of each person's personal choice or favorite movie, and so on.
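The coordinate computation itself is tiny once you have the weighted graph; a networkx sketch, assuming the edge weights already behave like distances (if they are similarities you'd want to invert them first):

    import networkx as nx

    def movie_coordinates(graph, basis=("Aliens", "The Unbearable Lightness of Being")):
        # a movie's coordinates are just its distances to the two "basis" movies
        coords = {}
        for basis_movie in basis:
            lengths = nx.single_source_dijkstra_path_length(graph, basis_movie,
                                                            weight="weight")
            for movie, d in lengths.items():
                coords.setdefault(movie, {})[basis_movie] = d
        return {m: tuple(c[b] for b in basis)
                for m, c in coords.items() if len(c) == len(basis)}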

Latent mini-clusters of movies

Still playing with logical itemset mining, I downloaded one of the data sets from GroupLens that records movie ratings from MovieLens. The basic idea is the same as with clustering drug side effects: movies that are consistently rated similarly by users are linked, and clusters in this graph suggest "micro-genres" of homogeneous (from a ratings POV) movies.

Here are a few of the clusters I got, practically with no fine-tuning of parameters:

  • Parts II and III of the Godfather trilogy
  • Ben-Hur and Spartacus
  • The first three Indiana Jones movies
  • Dick Tracy, Batman Forever, and Batman Returns
  • The Devil's Advocate and The Game
  • The 60's Lolita, the 1997 remake, and 1998's Return to Paradise
  • The first two Karate Kid movies
  • Analyze This and Analyze That
  • The 60's Lord of the Flies, the 1990 remake, and 1998's Apt Pupil

As movie clusters go, these are not particularly controversial; I found it interesting how originals and sequels or remakes seemed to be co-clustered, at least superficially. And thinking about it, clustering Apt Pupil with both Lord of the Flies movies is reasonable...

Media recommendation is by now a relatively mature field, and no single, untuned algorithm is going to be competitive against what's already deployed. However, given the simplicity and computational manageability of basic clustering and recommendation algorithms, I expect they'll become even more ubiquitous over time (pretty much as autocomplete in input boxes did).

Finding latent clusters of side effects

One of the interesting things about logical itemset mining, besides its conceptual simplicity, is the scope of potential applications. Besides the usual applications of finding common sets of purchased goods or descriptive tags, the underlying idea of mixtures of projections of latent subsets is a very powerful one (arguably, the reason why experiment design is so important and difficult is that most observations in the real world involve partial data from more than one simultaneous process or effect).

To play with this idea, I developed a quick-and-dirty implementation of the paper's algorithm, and applied it to the data set of the paper Predicting drug side-effect profiles: a chemical fragment-based approach. The data set includes 1385 different types of side effects potentially caused by 888 different drugs. The logical itemset mining algorithm quickly found the following latent groups of side effects:

  • hyponatremia, hyperkalemia, hypokalemia
  • impotence, decreased libido, gynecomastia
  • nightmares, psychosis, ataxia, hallucinations
  • neck rigidity, amblyopia, neck pain
  • visual field defect, eye pain, photophobia
  • rhinitis, pharyngitis, sinusitis, influenza, bronchitis

The groups seem reasonable enough (although hyperkalemia and hypokalemia being present in the same cluster is somewhat weird to my medically untrained eyes). Note the small size of the clusters and the specificity of the symptoms; most drugs induce fairly generic side effects, but the algorithm filters those out in a parametrically controlled way.

Tom Sawyer, Bilingual

Following a friend's suggestion, here's a comparison of phrase length distributions between the English and German versions of The Adventures of Tom Sawyer:

Tom Sawyer Phrase Lengths

It could be interesting to parametrize these distributions and try to characterize languages in terms of some sort of encoding mechanism (e.g., assume phrase semantics are drawn randomly from a language-independent distribution and renderings in specific languages are mappings from that distribution to sequences of words, and handwave about what cost metric the mapping is trying to minimize).
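The underlying measurement is as simple as it gets; a Python sketch (with crude sentence splitting, so abbreviations, ellipses, and dialogue will add some noise):

    import re
    from collections import Counter

    def sentence_length_distribution(text):
        # split on end-of-sentence punctuation and count words per sentence
        sentences = re.split(r"[.!?]+", text)
        return Counter(len(s.split()) for s in sentences if s.strip())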

A first look at phrase length distribution

Here's a sentence length vs. frequency distribution graph for Chesterton, Poe, and Swift, plus Time of Punishment.

Phrase length distribution

A few observations:

  • Take everything with a grain of salt. There are features here that might be artifacts of parsing and so on.
  • That said, it's interesting that Poe seems to fancy short interjections more than Chesterton does (not as much as I do, though).
  • Swift seems to have a more heterogeneous style in terms of phrase lengths, compared with Chesterton's more marked preference for relatively shorter phrases.
  • Swift's average sentence length is about 31 words, almost twice Chesterton's 18 (Poe's is 21, and mine is 14.5). I'm not sure how reasonable that looks.
  • Time of Punishment's choppy distribution is just an artifact of the low number of samples.

The Premier League: United vs. City championship chances

Using the same model as in previous posts (and, I'd say, not going against any intuition), the leading candidate to win the Premier League is Manchester United, with an approximately 88% chance. Second is Manchester City, with a bit over 11%. The rest of the teams with nonzero chances: Arsenal, Chelsea, Everton, Liverpool, Tottenham, and West Brom (with Chelsea, the best-positioned of these dark horses, clocking in at about half a percentage point).

Personally, I'm happy about these very low-odds teams; I don't think any of them is likely to win (that's the point), but on the other hand, they have mathematical chances of doing so, and it's important for a model never to give zero probability to non-impossible events (modulo whatever precision you are working with, of course).

Chesterton's magic word squares

Here are the magic word squares for a few of Chesterton's books. Whether and how they reflect characteristics that differentiate the books from each other is left as an exercise for the reader.


the same way of this
world was to it has
and not think would always
i have been indeed believed
am no one thing which

The Man Who Was Thursday

the man of this agreement
professor was his own you
had the great president are
been marquis started up as
broken is not to be

The Innocence of Father Brown

the other side lay like
priest in that it one
of his is all right
this head not have you
agreement into an been are

The Wisdom of Father Brown

the priest in this time
other was an agreement for
side not be seen him
explained to say you and
father brown he had then

Barcelona and the Liga, or: Quantitative Support for Obvious Predictions

I've adapted the predictive model to look at the Spanish Liga. Unsurprisingly, it's currently giving Barcelona a 96.7% chance of winning the title, with Atlético a distant second at 3.1%, and Real Madrid at less than 0.2% (I believe the model still underestimates small probabilities, although it has improved in this regard).

Note that around the 9th round or so, the model was giving Atlético a slightly higher chance of winning the tournament than Barcelona, although that window didn't last more than a round.

Magic Squares of (probabilistically chosen) Words

Thinking about magic squares, I had the idea of doing something roughly similar with words, but using usage patterns rather than arithmetic equations. I'm pasting below an example, using statistical data from Poe's texts:


the same manner as if
most moment in this we
intense and his head were
excitement which i have no
greatly he could not one

The word on the top-left cell in the grid is the most frequently used in Poe's writing, "the" — unsurprisingly so, as it's the most frequently used word in the English language. Now, the word immediately to its right, "same," is there because "same" is one of the words that follows "the" most often in the texts we're looking at. The word below "the" is "most" because it also follows "the" very often. "Moment" is set to the right of "most" and below "same" because it's the word that most frequently follows both.

The same pattern is used to fill the entire 5-by-5 square. If you start at the topmost left square and then move down and/or to the right, although you won't necessarily be constructing syntactically correct phrases, the consecutive word pairs will be frequent ones in Poe's writing.
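In case you want to build your own, here's a Python sketch of the construction; the exact scoring rule (adding the bigram counts from the top and left neighbors, and never repeating a word) is my reconstruction of the description above rather than the code I actually used:

    import re
    from collections import Counter, defaultdict

    def magic_word_square(text, n=5):
        words = re.findall(r"[a-z']+", text.lower())
        unigrams = Counter(words)
        followers = defaultdict(Counter)
        for a, b in zip(words, words[1:]):
            followers[a][b] += 1

        grid = [[None] * n for _ in range(n)]
        grid[0][0] = unigrams.most_common(1)[0][0]   # most frequent word overall
        used = {grid[0][0]}
        for i in range(n):
            for j in range(n):
                if grid[i][j] is not None:
                    continue
                scores = Counter()
                if i > 0:
                    scores.update(followers[grid[i - 1][j]])   # followers of the word above
                if j > 0:
                    scores.update(followers[grid[i][j - 1]])   # followers of the word to the left
                candidates = [w for w, _ in scores.most_common() if w not in used]
                pick = candidates[0] if candidates else next(
                    w for w, _ in unigrams.most_common() if w not in used)
                grid[i][j] = pick
                used.add(pick)
        return grid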

Although there are no ravens or barely sublimated necrophilia in the matrix, the texture of the matrix is rather appropriate, if not to Poe, at least to Romanticism. To convince you of that, here are the equivalent 5-by-5 matrices for Swift and Chesterton.


the world and then he
same in his majesty would
manner a little that it
of certain to have is
their own make no more


the man who had been
other with that no one
and his it said syme
then own is i could
there are only think be

At least compared against each other, it wouldn't be too far-fetched to say that Poe's matrix is more Poe's than Chesterton's, and vice versa!

PS: Because I had a sudden attack of curiosity, here's the 5-by-5 matrix for my newest collection of short stories, Time of Punishment (pdf link).

Time of Punishment

the school whole and even
first dance both then four
charge rants resistance they think
of a hundred found leads
punishment new astronauts month sleep

The Torneo Inicial 2012 in one graph (and 20 subgraphs)

Here's a graph showing how the probability of winning the Argentinean soccer championship changed over time for each team (time goes from left to right, and probability goes from 0 at the bottom to 1 at the top). Click on the graph to enlarge:

Hindsight being 20/20, it's easy to read too much into this, but it's interesting to note that some qualitative features of how journalism narrated the tournament over time are clearly reflected in these graphs: Vélez's stable progression, Newell's likelihood peak mid-tournament, Lanús' quite drastic drop near the end, and Boca's relatively strong beginning and disappointing follow-through.

As an aside, I'm still sure that the model I'm using handles low-probability events wrong; e.g., Boca still had mathematical chances almost until the end of the tournament. That's something I'll have to look into when I have some time.

Soccer, Monte Carlo, and Sandwiches

As Argentina's Torneo Inicial begins its last three rounds, let's try to compute the probabilities of championship for each team. Our tools will be Monte Carlo and sandwiches.

The core modeling issue is, of course, trying to estimate the odds of team A defeating team B, given their recent history in the tournament. Because of the tournament format, teams only face each other once per tournament, and, because of the recent instability of teams and performances, generally speaking, results in past tournaments won't be very good guides (this is something that would be interesting to look at in more detail). We'll use the following oversimplifications (let's call them intuitions) to make it possible to compute quantitative probabilities:

  • The probability of a tie between two teams is a constant that doesn't depend on the teams.
  • If team A played and didn't lose against team X, and team X played and didn't lose against team B, this makes it more likely that team A won't lose against team B (a "sandwich" model, so to speak).

Guided by these two observations, we'll take the results of the games in which a team played against both A and B as samples from a Bernoulli process with unknown parameter, and use this to estimate the probability of any previously unobserved game.

Having a way to simulate a given match that hasn't been played yet, we'll calculate the probability of any given team winning the championship by simulating the rest of the championship a million times, and observing in how many of these simulations each team wins the tournament.
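In sketch form, that simulation loop is about as simple as Monte Carlo gets; here match_model stands in for the sandwich/Bernoulli estimate described above, and the tie-breaking rule at the top of the table is an assumption of this sketch:

    import random

    def simulate_championship(points, fixtures, match_model, n_sims=1000000, seed=0):
        # points: current points per team; fixtures: (home, away) games still to play;
        # match_model(a, b) -> (p_a_wins, p_tie), estimated elsewhere
        rng = random.Random(seed)
        champions = {team: 0 for team in points}
        for _ in range(n_sims):
            table = dict(points)
            for a, b in fixtures:
                p_win, p_tie = match_model(a, b)
                u = rng.random()
                if u < p_win:
                    table[a] += 3
                elif u < p_win + p_tie:
                    table[a] += 1
                    table[b] += 1
                else:
                    table[b] += 3
            best = max(table.values())
            leaders = [team for team, pts in table.items() if pts == best]
            champions[rng.choice(leaders)] += 1   # break first-place ties randomly
        return {team: wins / n_sims for team, wins in champions.items()}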

The results:

Team Championship probability
Vélez Sarsfield 79.9%
Lanús 20.1%

Clearly our model is overly rigid — it doesn't feel at all realistic to say that those two teams are the only ones with any chance of winning the championship. On the other hand, the balance of probabilities between the two teams seems more or less in agreement with the expectations of observers. Given that the model we used is very naive, and only uses information from the current tournament, I'm quite happy with the results.

A Case in Stochastic Flow: Bolton vs Manchester City

A few days ago the Manchester City Football Club released a sample of their advanced data set, an XML file giving a quite detailed description of low-level events in last year's August 21 Bolton vs. Manchester City game, which was won by the away team 3-2. There's an enormous variety of analyses that can be performed with this data, but I wanted to start with one of the basic ones, the ball's stochastic flow field.

The concept underlying this analysis is very simple. Where the ball will be in the next, say, ten seconds depends on where it is now. It's more likely that it'll be nearby than far away, it's more likely that it'll be in an area of the field where the team with possession is focusing its attack, and so on. Thus, knowing the probabilities for where the ball will be starting from each point in the field — you can think of it as a dynamic heat map for the future — together with information about where it spent the most time, gives us information about how the game developed, and about the teams' tactics and performance.

Sadly, a detailed visualization of this map would require at least a four-dimensional monitor, so I settled for a simplified representation, splitting the soccer field into a 5x5 grid and showing the most likely transitions for the ball from one sector of the field to another. The map is embedded below; do click on it to expand it, as it's not really useful as a thumbnail.

Remember, this map shows where the ball was most likely to go from each area of the field; each circle represents one area, with the circles at the left and right sides representing the areas all the way to the end lines. Bigger circles signal that the ball spent more time in that area, so, e.g., you can see that the ball spent quite a bit of time in the midfield, and very little on the sides of Manchester City's defensive line. The arrows describe the most likely movements of the ball from one area to another; the wider the line, the more likely the movement. You can see how the ball circulated side-to-side quite a bit near Bolton's goal, while Manchester City kept the ball moving further away from their goal.
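For completeness, the bookkeeping behind a map like this is straightforward; here's a numpy sketch, where the (timestamp, x, y) ball samples and the pitch dimensions are my assumptions about the data rather than the actual schema of the released file:

    import numpy as np

    def flow_matrix(samples, pitch_length=105.0, pitch_width=68.0, n=5, lag=10.0):
        # samples: time-sorted list of (t_seconds, x, y) ball positions; returns an
        # (n*n, n*n) matrix of sector-to-sector transition probabilities for
        # positions roughly `lag` seconds apart
        def sector(x, y):
            i = min(int(x / pitch_length * n), n - 1)
            j = min(int(y / pitch_width * n), n - 1)
            return i * n + j

        counts = np.zeros((n * n, n * n))
        k = 0
        for t, x, y in samples:
            while k < len(samples) and samples[k][0] < t + lag:
                k += 1
            if k == len(samples):
                break
            _, x2, y2 = samples[k]
            counts[sector(x, y), sector(x2, y2)] += 1
        row_sums = counts.sum(axis=1, keepdims=True)
        return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

The circle sizes in the map then come from the row totals (how often the ball was observed in each sector), and the arrows from the largest off-diagonal transition probabilities.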

There are many immediate questions that come to mind, even with such a simplified representation. How does this map look depending on which team had possession? How did it change over time? What flow patterns are correlated with good or bad performance on the field? The graph shows the most likely routes for the ball, but which ones were the most effective, that is, most likely to end up in a goal? Because scoring is a rare event in soccer, particularly compared with games like tennis or American football, this kind of analysis is especially challenging, but also potentially very useful. There's probably much that we don't know yet about the sport, and although data is only an adjunct to well-trained expertise, it can be a very powerful one.

Washington DC and the murderer's work ethic

Continuing what has turned out to be a fun hobby of looking at crime data for different cities (probably among the most harmless of crime-related hobbies, as long as you aren't making important decisions based on naive interpretations of badly understood data), I downloaded Crime Incident data for the District of Columbia for the year 2011.

Mapping it was the obvious move, but I already did that for Chicago (and Seattle, although there were issues with the data, so I haven't posted anything yet), so I looked at an even more basic dimension: the time series of different types of crime.

To begin with, here's the week-by-week normalized count of thefts (not including burglaries or thefts from cars) in Washington DC (click to enlarge):

I normalized this series by subtracting its mean and scaling it by its standard deviation — not because the data is normally distributed (it actually shows a thick left tail), but because I wanted to compare it with another data series. After all, the shape of the data, partial as it is, suggests seasonality, and as the data covers a year, it wants to be checked against, say, local temperatures.

Thankfully NOAA offers just this kind of data (through about half a dozen confusingly overlapping interfaces), so I was able to add to the plot the mean daily temperature for DC (normalized in the same way as the theft count):

The correlation looks pretty good! (0.7 adjusted R squared, if you must know.) Not that this proves any sort of direct causal chain, that's what controlled experiments are for, but we can postulate, e.g., a naive story where higher temperatures mean more foot traffic (I've been in DC in winter, and the neoclassical architecture is not a good match for the latitude), and more foot traffic leads to richer pickings for thieves (an interesting economics aside: would this mean that the risk-adjusted return to crime is high enough that crime is, as it were, constrained by the supply of victims?).

Now let's look at murder.

The homicide time series is quite irregular, thanks to a relatively low (for, say, Latin American values of 'low') average homicide count, but it's clear enough that there isn't a seasonal pattern to homicides, and no correlation with temperature (a linear fit confirms this, not that it was necessary in this case). This makes sense if we imagine that homicide isn't primarily an outdoor activity, or anyway that your likelihood of being killed doesn't increase as you spend more time on the street (most likely, whoever wants to kill you is motivated by reasons other than, say, an argument over street littering). Murder happens come rain or snow (well, I haven't checked that; is there a specific murder weather?).

Another point of interest is the spike of (weather-normalized) theft near the end of the year. It coincides roughly with Thanksgiving, but if that's the causal link, I'd be interested in knowing exactly what's going on.

How Rooney beats van Persie, or, a first look at Premier League data

I just got one of the data sets from the Manchester City analytics initiative, so of course I started dipping my toe in it. The set gives information aggregated by player and match for the 2011-2012 Premier League, in the form of a number of counters (e.g. time played, goals, headers, blocked shots, etc); it's not the really interesting data set Manchester City is about to release (with, e.g., high-resolution position information for each player), but that doesn't mean there aren't interesting things to be gleaned from it.

The first issue I wanted to look at is probably not the most significant in terms of optimizing the performance of a team, but it's certainly one of the most emotional ones. Attackers: Who's the best? Who's underused? Who sucks?

If you look at total goals scored, the answer is easy: the best attackers are van Persie (30 goals), Rooney (27 goals), and Agüero (23 goals). Controlling for total time played, though, Berbatov and both Cissés have been considerably more efficient in goals scored per minute played. They are also, not coincidentally, the most efficient scorers in terms of goals per shot (counting shots both on and off target). Van Persie's 30 goals, for example, are more understandable when you see that he took 141 shots to get them, versus Berbatov's 15.

To see how shooting efficiency and shooting volume (number of shots) interact with each other, I made this scatterplot of goals per shot versus shots per minute, restricted to players who shoot regularly to avoid low-frequency outliers (click to expand).
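
The plot itself is nothing fancy; roughly this, with pandas and matplotlib (a sketch: the file name, column names, and the 20-shot cutoff for "regular shooters" are placeholders of mine, not the data set's actual schema):

import pandas as pd
import matplotlib.pyplot as plt

# One row per player, aggregated over the season.
# Assumed columns: 'player', 'minutes', 'goals', 'shots_on', 'shots_off'.
players = pd.read_csv('epl_2011_2012_by_player.csv')
players['shots'] = players['shots_on'] + players['shots_off']

# Keep only regular shooters to avoid low-frequency outliers.
regulars = players[players['shots'] >= 20].copy()
regulars['goals_per_shot'] = regulars['goals'] / regulars['shots']
regulars['shots_per_minute'] = regulars['shots'] / regulars['minutes']

plt.scatter(regulars['shots_per_minute'], regulars['goals_per_shot'])
plt.xlabel('shots per minute')
plt.ylabel('goals per shot')
plt.show()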

You can see that most players are spread more or less uniformly over the lower-left quadrant of low shooting volume and low shooting efficiency (these are all regular shooters, so nobody tries too often or too seldom). But there are outliers: people who shoot a lot, and people who shoot really well (or aren't as closely shadowed by defenders)... and they aren't the same people. This suggests a question: Who should shoot less and pass more? And who should shoot more often and/or get more passes?

To answer that question (to a very sketchy first approximation), I used the data to estimate a lost goals score, indicating how many more goals per minute could be expected if the player made a successful pass to an average player instead of shooting at goal (I know, the model is naive, there are game-theoretic (heh) considerations, etc.; bear with me). Looking at the players through this lens, here is a list of players who should definitely try to pass a bit more often: Andy Carroll, Simon Cox, and Shaun Wright-Phillips.

Players who should be receiving more passes and making more shots? Why, Berbatov and both Cissés. Even Wayne Rooney, the league's second most prolific shooter, is good enough at turning attempts into goals that he should be fed the ball more often, rather than less.
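
For concreteness, here's a sketch of one way to compute such a score (not exactly the calculation above, and with made-up 'passes' and 'completed_passes' columns), continuing from the snippet before: a player "loses" goals whenever a completed pass to a league-average finisher would have been worth more, in expectation, than one of their own shots.

# What an average recipient does with a shot, league-wide.
league_goals_per_shot = players['goals'].sum() / players['shots'].sum()

regulars['pass_completion'] = regulars['completed_passes'] / regulars['passes']

# Positive lost_goals_per_minute: passing to an average finisher beats shooting.
regulars['lost_goals_per_minute'] = regulars['shots_per_minute'] * (
    regulars['pass_completion'] * league_goals_per_shot
    - regulars['goals_per_shot'])

print(regulars.sort_values('lost_goals_per_minute', ascending=False)
              [['player', 'lost_goals_per_minute']].head())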

The second-order question, and the interesting one for intra-game analysis, is how teams react to each other. To say that Manchester United should get the ball to Rooney inside striking distance more often, and that opposing teams should try to prevent this, is as close to a triviality as can be asserted. But whether a specific change to a tactical scheme to guard Rooney more closely will be a net positive or, by opening other spaces, backfire... that will require more data and a vastly less superficial analysis.

And that's going to be so much fun!

Crime in Argentina

As a follow-up to my post on crime patterns in Chicago, I wanted to do something similar for Argentina. I couldn't find data at the same level of detail, but the people of Junar, who develop and run an Open Data platform, were kind enough to point me to a few data sets of theirs, including one that lists crime reports by type across Argentinean provinces for the year 2007.

The first thing I wanted to look at was the relationship between different types of crime. Of course, properly speaking you need far more data, and a far more sophisticated and domain-specific analysis, to even begin to address the question, but you can at least see what types of crime tend to happen (or to be reported) in the same provinces. Here's a dendrogram showing the relationships between crimes (click to expand it):

As you can see, crimes against property and against the state tend to happen in the same provinces, while the more violent crimes (homicide, manslaughter, and kidnapping) are more highly correlated with each other. Drug crimes, which may or may not surprise you, are more correlated with property crimes than with violent crimes. Sexual crimes are not correlated, at least at the province level, with either cluster of crimes.

This observation suggests that we can plot provinces in the property crimes/sexual crimes plane, as they seem to be relatively independent types of crime (at least at the province level). I added the line marking the best-fit linear relationship between the two types of crime (related, we'd expect, mostly through population).

A few observations from this graph:

  • The bulk of provinces (the relatively small ones) sit in the lower-left corner of the graph, mostly below the fitted line. The ones above the line, with a higher rate of sexual crimes than expected from their number of property crimes, are provinces in the North.
  • Salta has, unsurprisingly but distressingly, almost four times the number of sexual crimes expected from the linear relationship. Córdoba, the Buenos Aires province and, to a lesser degree, Santa Fé also have higher-than-expected numbers.
  • Despite ranking fourth in absolute number of sexual crimes, the City of Buenos Aires has far fewer than its number of property crimes would imply (or, equivalently, far more property crimes than expected).
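
Those "higher or lower than expected" calls are just comparisons against the fitted line; schematically, with invented file and column names:

import numpy as np
import pandas as pd

# One row per province, with 2007 report counts.
# Assumed columns: 'province', 'property_crimes', 'sexual_crimes'.
provinces = pd.read_csv('argentina_crime_2007.csv')

slope, intercept = np.polyfit(provinces['property_crimes'],
                              provinces['sexual_crimes'], 1)
expected = slope * provinces['property_crimes'] + intercept
provinces['sexual_crimes_vs_expected'] = provinces['sexual_crimes'] / expected

# Well above 1.0: more sexual crimes than the province's property crimes would
# suggest; well below 1.0: the opposite.
print(provinces.sort_values('sexual_crimes_vs_expected', ascending=False))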

Needless to say, this is but a first shallow view, using old data with poor resolution, of an immensely complex field. But looking at the data, though never the only or final step when trying to understand something, is almost always a necessary one, and it never fails to interest me.

Chicago and the Tree of Crime

After playing with a toy model of surveillance and surveillance evasion, I found the City of Chicago's Data Portal, a fantastic resource with public data including the salaries of city employees, budget data, the location of different service centers, public health data, and quite detailed crime data since 2001, including the relatively precise location of each reported crime. How could I resist playing with it?

To simplify further analysis, let's quantize the map into a 100x100 grid. Here's, then, the overall crime density of Chicago (click to enlarge):

This map shows data for all crime types. One first interesting question is whether different crime types are correlated. E.g., do homicides tend to happen close to drug-related crimes? To look at this, I calculated the correlation between the different types of crime at the same points of the grid, and from that I built a "tree of crime." Technically called a dendrogram, this kind of plot is akin to a phylogenetic tree, and in fact it's often used to show evolutionary relationships. In this case, the tree shows the closeness, in terms of geographical correlation, between types of crime: the closer two types of crime are in the tree, the more likely they are to happen in the same geographical areas (click to enlarge).
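
The whole pipeline, from raw incidents to the tree, fits in a handful of lines; here's a sketch with pandas and scipy (the column names are assumptions, and, as noted below, I did essentially no cleanup):

import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# One row per reported incident. Assumed columns: 'Primary Type', 'Latitude', 'Longitude'.
crimes = pd.read_csv('chicago_crimes.csv').dropna(subset=['Latitude', 'Longitude'])

# Quantize the map into a 100x100 grid and count each crime type per cell.
crimes['row'] = pd.cut(crimes['Latitude'], 100, labels=False)
crimes['col'] = pd.cut(crimes['Longitude'], 100, labels=False)
counts = crimes.groupby(['row', 'col', 'Primary Type']).size().unstack(fill_value=0)

# Correlation between crime types across grid cells, turned into a distance,
# then into a hierarchical clustering: the "tree of crime".
corr = counts.corr()
tree = hierarchy.linkage(squareform(1 - corr.values, checks=False), method='average')
hierarchy.dendrogram(tree, labels=corr.columns.tolist())
plt.show()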

A few observations:

  • I didn't clean up the data before analysis, as I was as interested in the encoding details as in the semantics. The fact that two different codes for offenses involving children are closely related in the dendogram is good news in terms of trusting the overall process.
  • The same goes for assault and battery; as expected, they tend to happen in the same places.
  • I didn't expect homicide and gambling to be so closely related. I'm sure there's something interesting (for laypeople like me) going on there.
  • Other sets of closely related crimes that aren't that surprising: sex offenses and stalking, criminal trespass and intimidation, and prostitution-liquor-theft.
  • I expected narcotics and weapons to be closely related, but what's arson doing in there with them? Do street-level drug sellers tend to work in the same areas where arson is profitable?

For law enforcement — as for everything else — data analysis is not a silver bullet, and pretending it is can lead to shooting yourself in the face with it (the mixed metaphor, I hope, is warranted by the topic). But it can serve as a quick and powerful way to pose questions and fight our own preconceptions, and, perhaps especially in highly emotional issues like crime, that can be a very powerful weapon.

Bad guys, White Hat networks, and the Nuclear Switch

Welcome to Graph City (a random, connected, undirected graph), home of the Nuclear Switch (a distinguished node). Each of Graph City's lawful citizens belongs to one of ten groups, each characterized by its own stochastic movement pattern over the city. What they all have in common is that they never walk into the Nuclear Switch node.

This is because they are lawful, of course, and also because there's a White Hat network of government cameras monitoring some of the nodes in Graph City. They can't read citizens' thoughts (yet), but they know whether a citizen observed at a node is the same citizen that was observed at a different node a while ago, and with this information Graph City's government can build a statistical model of the movement of lawful citizens (as observed through that specific network of cameras).
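
The model can be as unsophisticated as a table of camera-to-camera transition counts, with anything (almost) never seen among lawful citizens raising an alarm. This isn't my exact simulation code, but a bare-bones sketch of the idea with networkx, ignoring the ten citizen groups for brevity:

import random
from collections import defaultdict

import networkx as nx

city = nx.connected_watts_strogatz_graph(200, 4, 0.1)  # a stand-in for Graph City
cameras = set(random.sample(list(city.nodes()), 60))   # the White Hat Network

def random_walk(graph, start, steps):
    path = [start]
    for _ in range(steps):
        path.append(random.choice(list(graph.neighbors(path[-1]))))
    return path

# Training: count camera-to-camera transitions made by lawful citizens.
transitions = defaultdict(int)
for _ in range(2000):
    walk = random_walk(city, random.choice(list(city.nodes())), 50)
    sightings = [node for node in walk if node in cameras]
    for a, b in zip(sightings, sightings[1:]):
        transitions[(a, b)] += 1

def looks_suspicious(path, threshold=1):
    # Alarm on any camera-to-camera transition (almost) never seen in training.
    sightings = [node for node in path if node in cameras]
    return any(transitions[(a, b)] < threshold
               for a, b in zip(sightings, sightings[1:]))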

This is what happens when random-walking, untrained bad guys (you know they are bad guys because they are capable of entering the Nuclear Switch node) start roaming the city (click to expand):

Attempts by untrained bad guys

Between fifty and twenty percent of the intrusion attempts succeed, depending on the total coverage of the White Hat Network (a coverage of 1.0 meaning that every node in the city has a camera linked to the system). This wouldn't be acceptable performance in any real-life application, but this being a toy model with unrealistically small and simplified parameters, absolute performance numbers are rather meaningless.

Let's switch sides for a moment and advise the bad guys (after all, one person's Nuclear Switch is another's High Value Target, Market Influential, etc). An interesting first approach for the bad guys would be to build a Black Hat Network, create their own model of lawful citizens' movements, and then use it to systematically look for routes to the Nuclear Switch that won't trigger an alarm. The idea being that anyone who looks innocent to the Black Hat Network's statistical model will also pass unnoticed under the White Hat's.

This is what happens when bad guys trained using Black Hat Networks of different sizes are sent after the Nuclear Switch:

Attempts by bad guys trained on the BHN

Ouch. Some of the bad guys get to the Nuclear Switch on every try, but most of them are captured. A good metaphor for what's going on here is that the White Hat Network's and the Black Hat Network's filters are projections onto orthogonal planes of a very high-dimensional set of features. The set of possible behaviors for good and bad guys is very complex, so, unless your training set is comprehensive (something generally unfeasible), you can not only end up with a filter that works very well on your training data and very poorly on new observations (the bane of every overenthusiastic data analyst with a computer), but you can also train two filters to detect the same subset of observations using the same training set, and have them be practically uncorrelated when it comes to new observations.

In our case, this is good news for Graph City's defenders, as even a huge Black Hat Network, and very well trained bad guys, are still vulnerable to the White Hat Network's statistical filter. It goes without saying, of course, that if the bad guys get even read-only access to the White Hat Network, Graph City is doomed.

Attempts by bad guys trained on the WHN

At one level, this is a trivial observation: if you have a good enough simulation of the target system, you can throw brute force at the simulation until you crack it, and then apply the solution to the real system with near total impunity (a caveat, though: in the real world, "good enough" simulations seldom are).

But, and this is something defenders tend to forget, the bad guys don't need to hack into the White Hat Network. They can use Graph City as a model of itself (that's what the code I used above does), send dummy attackers, observe where they are captured, and keep refining their strategy. This is something already known to security analysts. Cf., e.g., Bruce Schneier — mass profiling doesn't work against a rational adversary, because it's too easy to adapt against. A White Hat Network could be (for the sake of argument) hack-proof, but it will still leak all critical information simply through the pattern of alarms it raises. Security Through Alarms is hard!

As an aside, "Graph City" and the "Nuclear Switch" are merely narratively convenient labels. Consider graphs of financial transactions, drug traffic paths, information leakage channels, etc, and consider how many of our current enforcement strategies (or even laws) are predicated on the effectiveness of passive interdiction filters against rational coordinated adversaries...

A flow control structure that never makes mistakes (sorta)

I've been experimenting with Lisp-style ad-hoc flow control structures. Nothing terribly useful, but nonetheless amusing. E.g., here's a dobest() function that always does the best thing (and only the best thing) among the alternatives given to it — think of the mutant in Philip K. Dick's The Golden Man, or Nicolas Cage in the awful "adaptation" Next.

Here's how you use it:

if __name__ == '__main__':
    def measure_x():
        "Metric function: the value of x"
        global x
        return x
    def increment_x():
        "A good strategy: increment x"
        global x
        x += 1
    def decrement_x():
        "A bad strategy: decrement x"
        global x
        x -= 1
    def fail():
        "An even worse strategy"
        global x
        x = x / 0
    x = 1
    # assert(x == 1)
    dobest(measure_x, increment_x, decrement_x, fail)
    # assert(x == 2)

You give it a metric, a function that returns how good you think the current world is, and one or more functions that operate on the environment. Perhaps disappointingly, dobest() doesn't actually see the future; rather, it executes each function on a copy of the current environment, and only transfers to the "real" one the environment with the highest value of metric().

Here's the ugly detail (do point out errors, but please don't mock too much; I haven't played much with Python scopes):

import copy
import inspect
import sys

def dobest(metric, *branches):
    "Apply every function in *branches to a copy of the caller's environment; only do 'for real' the best one according to the result of running metric()."
    world = copy.copy(dict(inspect.getargvalues(sys._getframe(1)).locals))
    alts = []
    for branchfunction in branches:
        try:
            # Run branchfunction in a copy of the world
            ns = copy.copy(world)
            exec branchfunction.__code__ in ns, {}
            alts.append(ns)
        except:  # We ignore worlds where things explode
            pass
    # Sort worlds according to metric(), best first
    alts.sort(key=lambda ns: eval(metric.__code__, ns, {}), reverse=True)
    # Copy the winning world back into the caller's globals
    for key in alts[0]:
        sys._getframe(1).f_globals[key] = alts[0][key]

One usability point is that the functions you give to dobest() have to explicitly access variables in the environment as global; I'm sure there are cleaner ways to do it.

Note that this also can work a bit like a try-except with undo, a la

dobest(bool, function_that_does_something, function_that_reports_an_error)

This would work like try-except, because dobest ignores functions that raise exceptions, but with the added benefit that dobest would clean up everything done by function_that_does_something.

Of course, and here's the catch, "everything" is kind of limited — I haven't precisely gone out of my way to track and catch all side effects, not that it'd even be possible without some VM or even OS support. Point is, the more I get my ass saved by git, the more I miss it in my code, or even when doing interactive data analysis with R. As the Doctor would say, working on just one timeline can be so... linear.

A mostly blank slate

The combination of a tablet and a good Javascript framework makes it very easy to deploy algorithms to places where so far they have been scarce, like meetings, notetaking, and so on. The problem lies in figuring out what those algorithms should be; just as we had to have PCs for a few years before we started coming up with things to do with them (not that we have even scratched the surface), we still don't have much of a clue about how to use handheld thinking machines outside "traditional thinking machine fields."

Think about it this way: computers have surpassed humans in the (Western Civilization's) proverbial game of strategy and conflict, chess, and are useful enough in games of chance that casinos and tournament organizers are prone to use anything from violence to lawyers to keep you from using them. So the fact that we aren't using a computer when negotiating or coordinating says something about us.

The bottleneck, Cassius would say nowadays, is not in our tech, but in our imaginations.

The perfectly rational conspiracy theorist

Conspiracy theorists don't have a rationality problem, they have a priors problem, which is a different beast. Consider a rational person who believes in the existence of a powerful conspiracy, and then reads an arbitrary online article; we'll denote by C the propositions describing the conspiracy, and by a the propositions describing the article's content. By Bayes' theorem,

P(C|a) = \frac{P(a|C) P(C)}{P(a)}

Now, the key here is that the conspiracy is supposed to be powerful. A powerful enough conspiracy can make anything happen or look like it happened, and therefore it'll generally be the case that P(a|C) \geq P(a) (and usually P(a|C) > P(a) for low-probability a, of which there are many these days, as Stanislaw Lem predicted in The Chain of Chance). But that means that in general P(C|a) \geq P(C), and often P(C|a) > P(C)! In other words, the rational evaluation of new evidence will seldom disprove a conspiracy theory, and will often reinforce its likelihood, and this isn't a rationality problem — even a perfect Bayesian reasoner will be trapped once you get C into its priors (this is a well-known phenomenon in Bayesian inference; I like to think of these as black hole priors).
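
To make that concrete with deliberately invented numbers: a 1% prior on the conspiracy, an article the conspiracy could easily have produced, and a modest base rate for the article on its own:

p_c = 0.01           # prior belief in the conspiracy (made-up number)
p_a_given_c = 0.9    # a powerful conspiracy can make almost anything appear
p_a_given_not_c = 0.3

p_a = p_a_given_c * p_c + p_a_given_not_c * (1 - p_c)
posterior = p_a_given_c * p_c / p_a
print(posterior)     # ~0.029: reading the article made the conspiracy three times as likely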

Keep an eye open, then, for those black holes. If you have a prior that no amount of evidence can weaken, then that's probably cause for concern, which is but another form of saying that you need to demand falsifiability in empirical statements. From non-refutable priors you can do mathematics or theology (both of which segue into poetry when you are doing them right), but not much else.

Fractals are unfair

Let's say you want to identify an arbitrary point in a segment I (chosen with a uniform probability distribution). A more or less canonical way to do this is to split the segment into two equal halves and write down a bit identifying which half; now the size of the set where the point is hidden is one half of what it was. Because the half-segment we chose is affinely equivalent to the original one, we can repeat this as much as we want, gaining one bit of precision (halving the size of the "it's somewhere around here" set) for each bit of description. Seems fair.

It's easy to do the same on a square I in R^2. Split the square into four squares, write down two bits to identify the one where the point is, and repeat at will. Because each square has one fourth the size of the enclosing one, you gain two bits of precision for each two bits of description. Still fair (and we cannot do better).

Now try to do this on a fractal, say Koch's curve, and things go as weird as you'd think. You can always split it in four affinely equivalent pieces, but each of them is one-third the size of the original one, which means that you gain less than two bits of precision for each two bits of description. Now this is unfair. A fractal street would be a very good way of putting in an infinite number of houses inside a finite downtown, but (even approximate) driving directions will be longer than you'd think they should.

Of course, this is merely a paraphrasing of the usual definition of a fractal, which is an object whose fractal dimension exceeds its topological dimension (very hand-wavingly speaking, the number of bits of description in each step is higher than the number of bits of precision that you gain). But I do enjoy looking at things through an information theory lens.
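
For the self-similar examples above, that (similarity) dimension is just a ratio of logarithms: how many pieces you split into, versus how much each piece shrinks linearly. In code:

from math import log

def similarity_dimension(pieces, linear_shrink):
    # log(number of self-similar pieces) / log(linear shrink factor per piece):
    # bits of description per step over bits of linear precision gained per step.
    return log(pieces, 2) / log(linear_shrink, 2)

print(similarity_dimension(2, 2))  # segment: 1.0
print(similarity_dimension(4, 2))  # square: 2.0
print(similarity_dimension(4, 3))  # Koch curve: ~1.26, above its topological dimension of 1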

Besides, there's a (still handwavy but clear) connection here with chaos, through the idea of trying to pin down the future of a system in phase space by pinning down its present. In conservative systems this is fair: one bit of precision about the present gives you one bit of precision about the future (after all, volumes are preserved). But when chaos is involved this is no longer the case! For any fixed horizon, you need to put in more bits of information about the present in order to get the same number of bits about the future.

Men's bathrooms are (quantum-like) universal computers

As is well documented, men choose urinals to maximize their distance from already occupied urinals. Letting u_i be 1 or 0 depending on whether urinal i is occupied, and setting \sigma_{i,j} to the distance between urinals i and j, male urinal-choosing behavior can be seen as an annealing heuristic maximizing

\sum u_i u_j \sigma_{i,j}

(summing over repeated indexes as per Einstein's convention). And this is obviously equivalent to the computational model implemented by D-Wave's quantum computers! Hardware implementations of urinal-based computers might be less compact than quantum ones (and you might need bathrooms embedded in non-Euclidean spaces in order to implement certain programs), but they are likely to be no more error-prone, and they are certain to scale better.
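
For completeness, each new arrival's choice is closer to a greedy one-step maximization of that objective than to full annealing; here's a toy sketch with a made-up row of five urinals one unit apart:

def next_urinal(occupied, distances):
    # Pick the free urinal that maximizes the total pairwise distance
    # among occupied urinals, i.e. the objective above, one step at a time.
    def objective(candidate):
        chosen = occupied + [candidate]
        return sum(distances[i][j] for i in chosen for j in chosen)
    free = [u for u in range(len(distances)) if u not in occupied]
    return max(free, key=objective)

distances = [[abs(i - j) for j in range(5)] for i in range(5)]  # five urinals in a row
occupied = []
while len(occupied) < len(distances):
    occupied.append(next_urinal(occupied, distances))
print(occupied)  # the order in which the urinals fill up under this objective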

A quick look at Elance statistics

I collected data from Elance feeds in order to find out what employers are looking for on the site. It's not pretty: by far the most requested skills in terms of aggregated USD demand are article writing (generally "SEO optimized"), content, logos, blog posting, etc. In other words, mostly AdSense baiting with some smattering of design. It's not everything requested on Elance, of course, but it's a big part of the pie.

Not unexpected, but disappointing. Paying low wages to people in order to fool algorithms to get other people to pay a bit more might be a symbolically representative business model in an increasingly algorithm-routed and economically unequal world, but it feels like a colossal misuse of brains and bits.

First post: what I'm interested in these days

Here's a quick list of what I'm mildly obsessed about right now:

  • Data analysis, large-scale inference, augmented cognition — labels differ, but the underlying mathematics is often pretty much the same.
  • Smart, distributed, large-scale software/systems/markets/organizations — basically, the systematic application of inference-driven technologies.
  • The big challenges and the big opportunities: dealing with climate change, improving global information, policy, and financial systems, global health and education, application of (simultaneously) bio-neuro-cogno-cyber tech, and the thorny question of 19th century-minded politicians using 20th century governance systems to deal with 21st century problems.