A useful way to think about AI/model training is to picture it as the problem of picking the best possible program (a fully parameterized network, say) from a bag of possibilities.
- The AI architecture/model specification defines what this bag looks like: whatever you deploy at the end of the project will come from here.
- The data you have or gain through experiments and simulations is what lets you pick the best program for your purposes out of the bag.
- The technical choices of algorithm, loss function, etc. describe how you're going to use the data to pick from the bag.
The importance of the data is well known, oscillating between axiom and fetish, and every experienced engineer or data scientist will tell you that getting an algorithm to work on a given combination of data and model specification is sometimes a push-button task and sometimes a black art. Much less attention is given to the shape of the bag, which is a problem because, to repeat:
Whatever you deploy at the end of a project has to be in the bag of potential outcomes to begin with.
The first implication is that the bag has to be large enough, meaning your AI architecture or model specification has to be expressive enough, to include a program that adequately reflects whatever it is you're trying to understand or do in the real world; you won't get a good model for an inherently non-linear process from the bag of all linear models, no matter how much data and compute power you throw at it.
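Here's a minimal sketch of that failure mode, using a toy sine process and scikit-learn's ordinary linear regression; the process, noise level, and sample sizes are illustrative assumptions, not taken from any real project:

```python
# Toy illustration: the target process is nonlinear (y = sin(x)), but the
# hypothesis class ("the bag") contains only linear models, so the test
# error plateaus no matter how much data we add.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def linear_fit_error(n_samples):
    x = rng.uniform(-np.pi, np.pi, size=(n_samples, 1))
    y = np.sin(x).ravel() + rng.normal(0.0, 0.05, size=n_samples)
    model = LinearRegression().fit(x, y)
    x_test = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1)
    return np.mean((model.predict(x_test) - np.sin(x_test).ravel()) ** 2)

for n in (100, 10_000, 1_000_000):
    print(n, round(linear_fit_error(n), 4))
# The error stops improving early: the program we need was never in the bag.
```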
On the other, equally important hand, larger bags with more expressive possibilities need dramatically more data to get good results: they contain so many programs consistent with a small data set that you need a large sample before the set of plausible choices narrows down. This mismatch happens almost all the time. It's much easier to deploy and train a more complex architecture than it is to get more data (the first part is push-button, often just an API call away), so, relative to the complexity of the models they are training, the vast majority of companies have much less data than they think.
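To make that concrete, here's a small sketch (again with made-up numbers) that fits a deliberately over-expressive bag, a degree-9 polynomial family, to samples from a gentle quadratic process; with few points it tends to latch onto noise, and the test error only settles down once the sample gets large:

```python
# Toy illustration: a big, flexible hypothesis class needs a lot of data
# before the data can single out a good program from it.
import numpy as np

rng = np.random.default_rng(1)

def polynomial_test_error(n_samples, degree=9):
    x = rng.uniform(-1.0, 1.0, size=n_samples)
    y = x**2 + rng.normal(0.0, 0.1, size=n_samples)   # true process: a quadratic
    coeffs = np.polyfit(x, y, deg=degree)              # pick a program from the big bag
    x_test = np.linspace(-1.0, 1.0, 1000)
    return np.mean((np.polyval(coeffs, x_test) - x_test**2) ** 2)

for n in (20, 200, 20_000):
    print(n, round(polynomial_test_error(n), 4))
# The expressive class gets there eventually, but it needs far more data
# than the underlying problem "deserves".
```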
Worse, it's not just a matter of size. You don't just want the bag to hold a lot of potential outcomes; you need it to include at least one that's useful for what you are doing. Those aren't exactly the same thing: a very flexible nonlinear model trained on realistic amounts of data will often do a worse job on a simple linear system than a plain linear model would, and no matter how large your neural network, you'll never get a good result from it if its architecture is at odds with the structure of the problem.
This is one of the fundamental trade-offs in AI design at whatever scale and with whatever technology:
- A. The bag has to contain a good program.
- B. Conditional on (A), you want it to be as small as possible.
- C. The less you know about what a good program will look like — i.e. the less you know about the process you're modeling — the larger the bag will have to be to have a good chance of getting (A).
When you know a lot about a problem, e.g. when you're modeling a well-understood physics problem with a strong theory, you can design a very small bag with just the right programs, and get good results with very little data. Sometimes, of course, you don't know much about a problem, and then you take the biggest bag you can with the most flexible models you can afford, throw at it as much data as you can get, keep your AI engineers caffeinated enough for everything to work in a technical sense, and hope the end result is useful. The most famous AI areas right now, natural language and image generation and analysis, are like that. Their astounding scale is a monumental engineering triumph, but it's also a declaration of ignorance: we had to do it that way because, after decades of attempts, it turned out we just don't know enough about how language and vision work to do it any other way.
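As a sketch of the small-bag end of that spectrum (with invented measurements): if theory tells you the process is free fall, the entire bag can be the one-parameter family d = ½·g·t², and a handful of noisy observations pins the parameter down:

```python
# Toy illustration: a strong theory shrinks the bag to one free parameter,
# so five noisy measurements are enough to recover g accurately.
import numpy as np

rng = np.random.default_rng(2)

t = np.linspace(0.5, 2.0, 5)                        # five timing measurements (s)
d = 0.5 * 9.81 * t**2 + rng.normal(0.0, 0.05, 5)    # noisy observed distances (m)

# Least squares over the single parameter g: the whole "bag" is one axis.
x = 0.5 * t**2
g_hat = np.sum(d * x) / np.sum(x**2)
print(round(g_hat, 3))                              # lands close to 9.81
```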
Most projects in most companies are somewhere in the middle. Extremely well-understood problems are already automated, and extremely complex problems nobody knows anything about rarely have companies working on them. Unless you're in a research lab, whatever you're doing is something people have been working on for years, decades, or even longer. They might not think about it in those terms, but they have figured out much about the shape of the bag you'll want to work with: they know which variables matter and which don't, they have identified groups of cases with very dissimilar behaviors and learned how to tell them apart, and they have learned the hard way about dead ends.

Without that knowledge you are essentially attempting to replicate all that time and work: because you don't know the shape of the bag, you will either build models much more primitive than what they already know (which won't impress anybody much, and can shut you out of a market as soon as the "but it's AI" shine goes away), or you will start with overly flexible architectures that your data can't properly pick a good program from, assuming there is one in the bag. The latter often shows up as models that work more or less well and pass testing, but then degrade quickly in the field and blow up the moment they encounter something every expert in the field knows about but that didn't happen to be in your data.
The resolution of this tension isn't technically hard; it's just culturally alien to the ethos of IT companies in general and of AI in particular. The promise is that if you have enough data, computing power, and AI engineers, you don't need that "legacy know-how." But in that legacy know-how is the knowledge that can make the difference between a long, expensive AI project that delivers everything except real-world performance, and a more compact one that begins somewhere near the limit of what's currently possible and then tries to take a step forward.
For AI engineers and similar roles, the takeaway is that the most useful optimization skill, and maybe the one with the longest shelf life, is being able to read up on and digest knowledge about things that aren't AI and aren't even about computers. A couple of paragraphs of a review paper can be more effective for feature engineering than any amount of data, and starting from the right model specification/AI architecture (something that only reading about whatever it is you're trying to model can help you do) is much more important than the choice of framework or service.
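As an entirely hypothetical sketch of what a couple of paragraphs of reading can buy: suppose the literature says an outcome is driven by the ratio of two raw measurements. Adding that single domain-informed feature lets a plain linear model capture what the raw features alone would need a far bigger bag, and far more data, to approximate:

```python
# Toy illustration: one feature suggested by domain reading does more work
# than the raw measurements alone. The "ratio drives the outcome" process
# is a made-up stand-in for real domain knowledge.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
a = rng.uniform(1.0, 10.0, size=500)
b = rng.uniform(1.0, 10.0, size=500)
y = a / b + rng.normal(0.0, 0.05, size=500)    # toy process: outcome ~ a/b

raw = np.column_stack([a, b])                  # what you'd use without reading
informed = np.column_stack([a, b, a / b])      # plus the feature the review suggested

# In-sample R² is enough for this toy contrast.
for name, X in (("raw features", raw), ("with domain feature", informed)):
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(name, round(r2, 3))
# The informed model fits almost perfectly; the extra column came from
# reading, not from more data.
```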
For managers, the takeaway is that data, computing power, and AI expertise aren't enough: your team needs knowledge. At a minimum, the first part of any project should have engineers and data scientists neither coding nor looking at the data but instead reading papers, books, and whatever else they can find about the problem. Going outside to look firsthand at what's happening, and even talking with the non-tech people making it happen, while almost unheard of, wouldn't be a bad idea either.
All of this is likely to need not just permission but encouragement. Stressed people don't learn well, so carrots are suggested over sticks; most important of all, explicitly allocate time for this, don't just add a note somewhere while keeping the schedule as it was. Learning about something new takes time, and "think about the business problem first" is meaningless unless you give people the tools and time to learn about that business problem in enough depth to translate that knowledge into meaningful AI architectural choices.
No code will be written during that time, no data will be analyzed, and meetings will be rather weird. But the final deliverable will be better in ways no amount of purely technical effort could achieve.