Quick link: The variable is not the thing

2024-10-27

The link: Measurement error and information bias in causal diagrams: mapping epidemiological concepts and graphical structures (International Journal of Epidemiology)

What it says: It describes how to use graphical causal models to reason about inferential problems caused, in informal terms, by the fact that we often have measurements and labels, not direct access to the reality we're modeling. This introduces issues we have to deal with if we want to improve our understanding of what's going on, or at least avoid some potentially catastrophic errors (that last bit is my comment, not the paper's).

Why this matters: It's not a new or sophisticated issue: I've been writing about it in the context of data analysis for at least ten years, and it's in fact a foundational problem not just in physics and engineering but also in philosophy. Still, data scientists are often pushed to ignore it in the name of faster and more "interesting" "insights." The simplest version of the problem in the context of data analysis can be summarized as: just because a variable in your data set is called something, that doesn't mean it's an accurate representation of what the same term refers to in the real world, and if you don't explicitly track down and model that relationship you can get into serious trouble. Obvious when stated explicitly, but very seldom accounted for in actual projects, where the relationship between terms in slides and the real world can be as creative as any non-GAAP number (and often for similar reasons and with similar goals).
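A minimal simulation makes the point concrete. The setup below is my own illustrative sketch, not from the paper: a column labeled `x` in the data set is actually a noisy proxy for the real quantity, and a regression on the proxy silently understates the true effect (classical measurement error causes attenuation bias).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# "Reality": a true driver x and an outcome y it causes, with slope 2.
x_true = rng.normal(0, 1, n)
y = 2.0 * x_true + rng.normal(0, 1, n)

# What lands in the data set: a noisy measurement that shares x's name.
x_measured = x_true + rng.normal(0, 1, n)  # classical measurement error


def ols_slope(a, b):
    """Slope of the least-squares regression of b on a."""
    return np.polyfit(a, b, 1)[0]


slope_true = ols_slope(x_true, y)        # close to the true 2.0
slope_measured = ols_slope(x_measured, y)  # attenuated toward zero
print(slope_true, slope_measured)
```

With the measurement noise variance equal to the signal variance, the estimated slope on the proxy sits near half the true effect: the analysis runs, the column name looks right, and the number is confidently wrong.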

Side notes: