Artificial Intelligence without the Data

2022-02-16

Data is the new oil: best avoided if you can.

The importance, indeed the necessity, of data to develop and apply AI technology is more than a technical assumption: it pervades everything from discussions of ethical trade-offs to public and private decisions about which projects to undertake. More projects are probably delayed or never begun because "the data" is missing than for lack of money or interest. This never refers to future data to be recorded and used by the new system, but rather to previously gathered data sets to be used to build the AI.

This asymmetry is, in a way, mathematically meaningless: most AI training algorithms can't tell whether information comes from an old data set or from a freshly recorded observation. Tautologically, statistically relevant information is statistically relevant information, period. And in fact some of the most effective AIs currently built, along the lines of AlphaZero, weren't trained with preexisting data but with interactive observations.
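A minimal sketch of this point (my own illustration, not from any particular system): an incremental estimator updated one observation at a time cannot tell whether its input comes from a stored data set or a live stream. Here a running mean stands in for any SGD-style training loop; the `sensor` generator is a hypothetical live source.

```python
class RunningMean:
    """Online estimator: incorporates one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n


historical = [2.0, 4.0, 6.0, 8.0]   # a "data set" gathered in advance

# Front-loaded: replay the stored data set.
batch = RunningMean()
for x in historical:
    batch.update(x)


def sensor():
    # Hypothetical live source yielding the same observations as they occur.
    yield from historical


# Data-set-free: consume observations as they arrive.
stream = RunningMean()
for x in sensor():
    stream.update(x)

# Statistically relevant information is statistically relevant information:
# both estimators end up in exactly the same state.
assert batch.mean == stream.mean == 5.0
```

The update rule never inspects where `x` came from; the "old data set vs. fresh observation" distinction exists only outside the training loop.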

I believe this assumed dependency on having data before starting to build an AI is due to various perceived or implicit forms of front-loading:

This isn't an awful model. The AI industry wouldn't be where it is without this development pattern, and there are areas — like NLP — where the past is so big and rich in data that it's the natural place to begin, or where isolated training is the only feasible approach.

But this isn't to say that the front-loading model doesn't have its own costs and limitations. We could categorize them into two groups: underestimated risks and opportunity costs.

Underestimated risks are problems that the front-loading strategy is assumed to help with but can in fact be worsened by it:

Some of these problems are already well understood by the industry, regulators, policy experts, etc., but they aren't always framed as being related to a data-set-first AI development model.

The other set of problems with this model is simply opportunity cost: if you wait until you have data to build an AI, you'll either build it much later or not at all, often at enormous cost to your stakeholders. This is rarely mentioned as a problem, because it's thought to be unavoidable and because the loss is potential and hence invisible; but, given the still very low prevalence of AI projects in most organizations, I believe it's systemically significant.

Is an alternative, or rather a complementary, AI development model at all possible? Yes. Unlike the typical examples of AI development, a large number of practical problems (including many in management, strategy, education, etc.) have the following characteristics:

This allows organizations to flip the AI development pipeline:

This model won't always be possible or optimal, as it lacks some of the advantages of the existing front-loading model: you have to dedicate resources to building and integrating an AI that will have no impact at the beginning and that might not work over the long term, and you'll only know this after it has been running for a while.

But it also has advantages: the organization has to solve first, because it has to do first, the problems of deployability and safety. It's not deploying an AI that it assumes works well and then half-heartedly adding some guardrails around it (if it has time and budget, which often means after something bad has happened). It's deploying something it knows doesn't work, so the guardrails have to be there from the beginning. And the AI won't be trained with information or context irrelevant to the organization, because it'll be trained precisely in the context and with the information it'll work with.
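One way to make "guardrails from the beginning" concrete (an illustrative sketch of my own, with hypothetical names and thresholds, not a prescription from the article): a decision-making component that defers to a known-safe default action until it has gathered enough of its own evidence, rather than trusting an untrained estimate on day one.

```python
import random


class GuardedBandit:
    """A two-or-more-action learner deployed before it works.

    Until every action has at least `min_trials` observations, it plays a
    known-safe default action, with only a small bounded exploration budget.
    The guardrail is structural: it exists because the learner is deployed
    precisely when it is known not to work yet.
    """

    def __init__(self, n_actions, safe_action, min_trials=30, epsilon=0.1):
        self.counts = [0] * n_actions     # observations per action
        self.values = [0.0] * n_actions   # running mean reward per action
        self.safe_action = safe_action    # known-safe default (assumption)
        self.min_trials = min_trials      # evidence threshold (assumption)
        self.epsilon = epsilon            # exploration budget (assumption)

    def act(self):
        # Guardrail: not enough evidence yet -> mostly play it safe.
        if min(self.counts) < self.min_trials:
            if random.random() < self.epsilon:
                return random.randrange(len(self.counts))
            return self.safe_action
        # Enough evidence: exploit the estimates learned in deployment.
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def learn(self, action, reward):
        # Incremental update: trained in the context it works in.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```

Usage: an untrained `GuardedBandit(2, safe_action=0, epsilon=0.0)` returns the safe action; after enough `learn` calls for every action, `act` switches to whichever action has accumulated the better running reward. The point is architectural, not algorithmic: because the system ships before it works, the fallback path is part of the design rather than an afterthought.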

Most importantly, it's a process that avoids having to have a data set at the beginning. This enormously expands the range of possible applications and organizations: we have comparatively little publicly usable information about interesting processes, and few organizations have the resources to collect it deliberately at "big data" scale. Building AIs before having the data requires more knowledge about the systems they'll interact with, and more patience until you start seeing results, but it's a safer path, and for many organizations the only, if untrodden, one.

It's not, in general, the only possible model, and front-loading and data-set-free aren't mutually exclusive approaches: if you have relevant data, it's malpractice not to use it. But any organization that's not deploying AIs in key processes because it "doesn't have the data yet" is locking itself out of a potential development path that, with its own trade-offs, can be just as effective as the usual one.