Data is the new oil: best avoided if you can.
The importance, the necessity of data to develop and apply AI technology is more than a technical assumption: it pervades everything from discussions of ethical trade-offs to public and private decisions about which projects to undertake. More projects are probably delayed or never begun because "the data" is missing than because of lack of money or interest. This never refers to future data to be recorded and used by the new system, but rather to previously gathered data sets to be used to build the AI.
This asymmetry is, in a way, mathematically meaningless: most AI training algorithms can't tell whether information comes from an old data set or from a freshly recorded observation. Tautologically, statistically relevant information is statistically relevant information, period. And in fact some of the most effective AIs currently built, along the lines of AlphaZero, weren't trained with preexisting data but with interactive observations.
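To make that concrete, here is a minimal sketch (all names and numbers are illustrative): the update step of a simple online learner never inspects where a sample came from, so a stored data set and a stream of fresh observations are interchangeable sources.

```python
from typing import Iterable, Tuple
import random

Sample = Tuple[float, float]  # (input, target); one-dimensional for brevity

def sgd_step(w: float, b: float, x: float, y: float, lr: float = 0.1) -> Tuple[float, float]:
    """One stochastic gradient step for a 1-D linear model under squared error."""
    err = (w * x + b) - y
    return w - lr * err * x, b - lr * err

def train(stream: Iterable[Sample]) -> Tuple[float, float]:
    """The training loop is indifferent to where the stream comes from."""
    w, b = 0.0, 0.0
    for x, y in stream:
        w, b = sgd_step(w, b, x, y)
    return w, b

# Source 1: a preexisting data set, gathered and stored in the past.
dataset = [(i / 1000, 2 * (i / 1000) + 1) for i in range(1000)]

# Source 2: freshly recorded observations, produced while training runs.
def live_observations(n: int) -> Iterable[Sample]:
    for _ in range(n):
        x = random.uniform(0.0, 1.0)
        yield x, 2 * x + 1

print(train(dataset))                   # same algorithm,
print(train(live_observations(1000)))   # two different data origins
```

The design point is the signature of `train`: anything iterable works, so "having the data" and "recording the data as you go" look identical to the algorithm.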
I believe this assumed dependency on having data before starting to build an AI is due to various perceived or implicit forms of front-loading:
- Beginning with a data set reduces the need for task-specific early design. Gathering information about the domain of interest is assumed to have been taken care of by collecting the data set, so the only(?) thing left to do is the AI training itself. This separation of concerns is organizationally useful, and if you can get access to a data set for free or at low cost it can feel like a bargain.
- Using existing data sets to train AIs also has the benefit of front-loading risks: AIs can be worked on in the privacy of a company until they perform well on the data, and only then released to the public.
- Finally, it allows organizations to start working on AIs with little internal engagement. An "AI team" can start working using existing data without any commitment from the parts of the organization that will be involved in its deployment and use. This reduces the internal friction that must be overcome to launch the project.
This isn't an awful model. The AI industry wouldn't be where it is without this development pattern, and there are areas — like NLP — where the past is so big and rich in data that it's the natural place to begin, or where isolated training is the only feasible approach.
But this isn't to say that the front-loading model doesn't have its own costs and limitations. We could categorize them into two groups: underestimated risks and opportunity costs.
Underestimated risks are problems that the front-loading strategy is assumed to help with but can in fact be worsened by it:
- Domain information: Having a data set is assumed to eliminate the risk of not having enough information to understand a system; in practice, unless a data set has been explicitly built for a given AI design problem, it's very unlikely to have complete or unbiased information about it.
- Deployment consistency: Developing an AI in isolation and then integrating it with the rest of the systems of an organization — and even more so with the rest of the world — are very different but not independent matters. AIs trained to perform well under certain assumptions about their "senses" (what information they will have access to) and "hands" (what actions they can take) can work very badly, or be impossible to implement, under the real constraints of flaky systems, reduced budgets, poor training, etc.
- Performance consistency: Testing an AI developed with a data set against other data sets can give statistical confidence only under very unrealistic assumptions about both the data sets and the stability of the world. Even AIs carefully trained to avoid biased or dangerous decisions can manifest problems as they are moved from data sets to interaction with the real world.
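As a minimal sketch of what guarding against the last point implies (the feature values and threshold here are hypothetical), one crude form of monitoring compares the distribution of an input feature in live traffic against the distribution it had in the training data, before trusting that held-out accuracy carries over:

```python
import statistics

def drift_score(train_values, live_values):
    """Distance between means, in units of the training standard deviation.

    A crude single-feature check; a real system would run a proper
    two-sample test (e.g. Kolmogorov-Smirnov) over many features.
    """
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train_ages = [34, 41, 29, 38, 45, 31, 36, 40]  # feature values at training time
live_ages = [62, 58, 65, 60, 59, 63, 61, 64]   # feature values after deployment

if drift_score(train_ages, live_ages) > 3.0:   # hypothetical alert threshold
    print("input distribution has shifted; offline test accuracy is suspect")
```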
Some of these problems are already well understood by the industry, regulators, policy experts, etc., but they aren't always framed as being related to a data-set-first AI development model.
The other set of problems that comes from this model is simply that of opportunity costs: if you wait until you have data to build an AI, you'll either do it much later or won't do it at all, with often enormous costs to your stakeholders. It's not often mentioned as a problem because it's thought to be unavoidable and because the loss is potential and hence invisible, but, given the still very low prevalence of AI projects in most organizations, I believe it's systemically significant.
Is an alternative, or rather complementary, AI development model at all possible? Yes. Unlike the typical examples of AI development, a large number of practical problems (including many in management, strategy, education, etc.) have the following characteristics:
- Even without any data set, we already know a lot about them.
- In fact, we know how to solve them _without_ AIs - only not as well, we hope, as we can with (good) AIs.
- We care about contextual rather than universal solutions; in other words, we are building the AI as an internal tool or process, not as a product to sell.
This allows organizations to flip the AI development pipeline:
- First build a zero-data AI that simply represents the standard knowledge they already have.
- Integrate this AI with the existing processes and systems in a way that makes it advisory to or controlled by the existing process (that is, it's not assumed to work well because it's known not to).
- Use ongoing interactions of the AI with the rest of the system and the world to keep refining it, gradually giving it more influence as its performance improves (a minimal sketch of this pipeline follows the list).
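Here is a minimal sketch of the flipped pipeline; every rule, threshold, and name is hypothetical, standing in for an organization's actual domain knowledge. A zero-data policy encodes what is already known, runs purely in advisory mode, and earns influence only as its recorded agreement with the existing process grows:

```python
from dataclasses import dataclass, field
from typing import List

def zero_data_policy(order_value: float) -> str:
    """Step 1: encode standard knowledge directly; no training data needed."""
    return "manual_review" if order_value > 10_000 else "auto_approve"

@dataclass
class AdvisoryAI:
    """Steps 2 and 3: advisory integration with a gradual trust ramp."""
    trust: float = 0.0                        # starts with zero influence
    agreements: List[bool] = field(default_factory=list)

    def decide(self, order_value: float, human_decision: str) -> str:
        suggestion = zero_data_policy(order_value)
        self.agreements.append(suggestion == human_decision)
        # Guardrail from day one: the existing process wins until the AI
        # has demonstrably earned the right to decide on its own.
        return suggestion if self.trust > 0.95 else human_decision  # illustrative cutoff

    def update_trust(self, window: int = 100) -> None:
        """Influence grows only with sustained agreement over recent cases."""
        if len(self.agreements) >= window:
            self.trust = sum(self.agreements[-window:]) / window

ai = AdvisoryAI()
decision = ai.decide(order_value=12_500, human_decision="manual_review")
ai.update_trust()
```

Note that the interaction records driving `update_trust` double as the training data the organization never had at the start: the data set becomes a by-product of deployment rather than a precondition for it.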
This model will not always be possible or optimal, as it lacks some of the advantages of the existing front-loading model: You have to dedicate resources to build and integrate an AI that will have no impact at the beginning and that might not work over the long term, but you'll only know this after it has been running for a while.
But it also has advantages: The organization has to solve the problems of deployability and safety first, because it has to face them first. It's not deploying an AI that it assumes works well and then half-heartedly adding some guardrails around it (if it has time and budget, which often means after something bad has happened). It's deploying something it knows doesn't work yet, so the guardrails have to be there from the beginning. And the AI won't be trained with information or context irrelevant to the organization, because it'll be trained precisely in the context and with the information it'll work with.
Most importantly, it's a process that avoids having to have a data set at the beginning. This expands enormously the range of possible applications and organizations: we have comparatively little publicly usable information about interesting processes, and few organizations have the resources to deliberately collect it at "big data" scale. Building AIs before having the data requires more knowledge about the systems they'll interact with, and more patience until you start seeing results, but it's a safer path and, for many organizations, the only, if untrodden, one.
It's not, in general, the only possible model, and front-loading and data-set-free approaches aren't mutually exclusive: if you have relevant data, it's malpractice not to use it. But any organization that's not deploying AIs in key processes because it "doesn't have the data yet" is locking itself out of a potential development path that, with its own trade-offs, can be just as effective as the usual one.