How to be data-driven without data…

…and then make better use of the data you get.

The usefulness of data science begins long before you collect the first data point. It can be used to state your questions and your assumptions very clearly, and to analyze in a consistent manner what they imply. This is neither a simple exercise nor an academic one: informal approaches are notoriously bad at handling the interplay of complex probabilities. Yet the a priori knowledge embedded in personal experience and publicly available research, when properly organized and queried, can answer many questions that mass quantities of carelessly processed data can’t, and can suggest which measurements should be attempted first, and to what end.

The larger the gap between the complexity of a system and the existing data capture and analysis infrastructure, the more important it is to set up initial data-free (which doesn’t mean knowledge-free) formal models as a temporary bridge between the two. Toy models are a good way to begin; as the British statistician George E. P. Box wrote, all models are wrong, but some are useful (at least for a while, we might add, but that’s as much as we can ask of any tool).

Let’s say you’re evaluating an idea for a new network-like service for specialized peer-to-peer consulting that will be able to monetize a certain percentage of the interactions between users. You will, of course, capture all of the relevant information once the network is running — and there’s no substitute for real data — but that doesn’t mean you have to wait until then to start thinking about it as a data scientist, which in this context means probabilistically.

Note that the following numbers are wrong: it takes research, experience, and time to figure out useful guesses. What matters for the purposes of this post is describing the process, oversimplified as it will be.

You don’t know a priori how large the network will be after, say, one year, but you can look at other competitors, the size of the relevant market, and so on, and guess, not a number (“our network in one year will have a hundred thousand users”), but the relative likelihood of different values.

The graph above shows one possible set of guesses. Instead of giving a single number, it “says” that there’s a 50% chance that the network will have at least a hundred thousand users, and a 5.4% chance that it’ll have at least half a million (although note that decimal points in this context are rather pointless; a guess based on experience and research can be extremely useful, but will rarely be this precise). On the other hand, there’s almost a 25% chance that the network will have less than fifty thousand users, and a 10% chance that it’ll have less than twenty-eight thousand.
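
One concrete way to encode exactly this set of guesses, for instance, is a lognormal distribution with a median of a hundred thousand users and a log-scale standard deviation of about one; all the percentiles above fall out of it. The short Python sketch below is only an illustration of that encoding, not the code behind these figures, but it recovers the same numbers by simulation.

```python
import numpy as np

# A lognormal with a median of 100,000 users and sigma = 1 on the log scale
# reproduces the percentiles quoted above; this is one possible encoding of
# that guess, not necessarily the one behind the original chart.
rng = np.random.default_rng(42)
network_size = rng.lognormal(mean=np.log(100_000), sigma=1.0, size=100_000)

print(f"P(at least 100k users): {np.mean(network_size >= 100_000):.2f}")   # ~ 0.50
print(f"P(at least 500k users): {np.mean(network_size >= 500_000):.3f}")   # ~ 0.054
print(f"P(under 50k users):     {np.mean(network_size < 50_000):.2f}")     # ~ 0.24
print(f"P(under 28k users):     {np.mean(network_size < 28_000):.2f}")     # ~ 0.10
```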

How do you build such a graph, or rather, how do you assemble the information represented on it? The answer will probably look surprisingly old-fashioned: by learning as much as you can about the topic, talking with people who know about it, exercising your judgment, and then using formal mathematics to force yourself to write your best guess in a way that’s explicitly clear about what it says and what it doesn’t. The first steps are things you were already doing to help you with your problem, but the last one is what will allow you to coordinate knowledge and experience from different sources to give you the best possible answer to your question, given whatever you know at that moment.

You can use the same process to codify your educated guesses about other key aspects of the application, like the rate at which members of the network will interact, and the average revenue you’ll be able to get from each interaction. As always, neither these numbers nor the specific shape of the curves matter for this toy example, but note how different degrees and forms of uncertainty are represented through different types of probability distributions:

Clearly, in this toy model we’re sure about some things, like the interaction rate (measured, say, in interactions per month), and very unsure about others, like the average revenue per interaction. Thinking through the implications of multiple uncertainties is one of the toughest cognitive challenges we face: humans tend to reason in terms of one, or at best a couple, of concrete states of the world we expect to happen, but when there are multiple interacting variables, even the most likely scenario might have a very low absolute probability.
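
To make that contrast concrete, here’s one way such guesses could be written down. Every shape and number below is a stand-in chosen for illustration rather than taken from the original curves: the interaction rate gets a bounded, fairly tight distribution, while the monetized share and the revenue per monetized interaction get wide, skewed ones.

```python
import numpy as np

# Illustrative stand-ins only: these are not the numbers behind the original
# curves, just one way to write down guesses with different amounts of
# uncertainty.
rng = np.random.default_rng(1)
n = 100_000

network_size = rng.lognormal(mean=np.log(100_000), sigma=1.0, size=n)  # users, as in the earlier sketch
interaction_rate = rng.triangular(1.0, 2.0, 5.0, size=n)               # interactions per user per month, bounded
monetized_share = rng.beta(2.0, 198.0, size=n)                         # fraction of interactions monetized (~1% on average)
revenue_per_interaction = rng.lognormal(np.log(10.0), 0.6, size=n)     # USD per monetized interaction
```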

Simulation software, though, makes this nearly trivial even for the most complex models. Here, for example, is the probability distribution of the monthly revenue, as necessarily implied by our assumptions about the other variables:
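
Continuing the stand-in draws from the sketch above, the Monte Carlo version of that computation is just one multiplication per simulated world; the spread of the resulting hundred thousand numbers is the implied distribution of monthly revenue:

```python
# One revenue figure per simulated world, using the arrays drawn above; the
# spread of these numbers is the implied distribution of monthly revenue.
monthly_revenue = (network_size * interaction_rate
                   * monetized_share * revenue_per_interaction)

for q in (10, 50, 90, 99):
    print(f"{q}th percentile of monthly revenue: USD {np.percentile(monthly_revenue, q):,.0f}")
```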

There are scenarios where your revenue is more than USD 10M per month, and you’re of course free to pick values for the other variables so that this becomes one of the handful of specific scenarios you describe (perhaps the most common and powerful of the ways in which people pitching a product or idea exploit the biases and limitations of human cognition). But doing this sort of quantitative analysis forces you to be honest, at least with yourself: if what you know and don’t know is described by the distributions above, then you aren’t free to tell yourself that your chance of hitting it big is anything other than microscopic, no matter how clear the image might be in your mind.

That said, not getting USD 10M a month doesn’t mean the idea is worthless; maybe you can break even and use that time to pivot or sell it, or maybe you just want to create something that works and is useful, and then grow it over time. Either way, let’s assume your total costs are expected to be USD 200k per month (if this were a proper analysis and not a toy example, this wouldn’t be a specific guess, but another probability distribution based on educated guesses, expert opinions, market surveys, etc.). How do the probabilities look then?

You can answer this question using the same sort of analysis:

The inescapable consequence of your assumptions is that your chances of breaking even are 1 in 20. Can they be improved? One advantage of fully explicit models is that you can ask not just about the probability of something happening, but also about how the different variables depend on each other.
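
With the monthly_revenue samples simulated earlier, that probability is a one-line computation; under the stand-in guesses used here it lands in the same neighborhood as the one-in-twenty figure above:

```python
# Reusing the monthly_revenue samples simulated above.
monthly_costs = 200_000  # the point guess from the text; ideally this would be a distribution too
p_break_even = np.mean(monthly_revenue >= monthly_costs)
print(f"Probability of breaking even: {p_break_even:.3f}")  # roughly 1 in 20 with these stand-in guesses
```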

Here are the relationships between the revenue, according to the model, and each of the main variables, with a linear best fit approximation superimposed:

As you can see, network size has the clearest relationship with revenue. This might look strange: wouldn’t, in this kind of simple model, multiplying the number of interactions by ten while keeping the monetization rate also multiply the revenue by ten? Yes, but your assumptions say you can’t multiply the number of interactions by more than a factor of five, which, together with your other assumptions, isn’t enough to move your revenue very far. So it isn’t unreasonable to consider increasing interactions significantly to improve your chances of breaking even (or even of getting to USD 10M). But if you plan to push them outside the explicit range encoded in your assumptions, you have to explain why those assumptions were wrong. Always be careful when you do this: changing your assumptions to make possible something that would be useful if it were possible is one of humankind’s favorite ways of driving directly into blind alleys at high speed.
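
A rough numerical counterpart to those plots is the correlation between each simulated input and the simulated revenue, reusing the arrays from the earlier sketches. Under the stand-in guesses used here, network size comes out clearly on top as well; the log scale is used because that’s where the relationships in a multiplicative model are approximately linear:

```python
# Correlations on the log scale as a crude stand-in for the fitted lines in
# the plots. Reuses the arrays simulated in the sketches above.
inputs = {
    "network size": network_size,
    "interaction rate": interaction_rate,
    "monetized share": monetized_share,
    "revenue per interaction": revenue_per_interaction,
}
for name, samples in inputs.items():
    r = np.corrcoef(np.log(samples), np.log(monthly_revenue))[0, 1]
    print(f"{name:>24}: correlation with log revenue = {r:+.2f}")
```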

It’s key to understand that none of this is really a prediction about the future. Statistical analysis doesn’t really deal with predicting the future, or even with getting information about the present: it’s all about clarifying the implications of your observations and assumptions. It’s your job to make those observations and assumptions as good and relevant as possible, neither leaving out anything you know, nor pretending you know what you don’t, or that you are more certain about something than you should be.

This problem is somewhat mitigated in domains where we have vast amounts of information, including, recently, areas like computer vision and robotics. But we have yet to achieve the same level of data collection in other key areas like business strategy, so there’s no way around using expert knowledge… which doesn’t mean, as we saw, that we have to ditch quantitative methods.

Ultimately, successful organizations do the entire spectrum of analysis activities: they build high-level explicit models, encode expert knowledge, collect as much high-quality data as possible, train machine learning models on that data, and exploit all of it for strategic analysis, automation, predictive modeling, etc. There are no silver bullets, but you probably have more ammunition than you think.