The link: Measurement error and information bias in causal diagrams: mapping epidemiological concepts and graphical structures (International Journal of Epidemiology)
What it says: It describes how to use graphical causal models to reason about the inferential problems that arise because, informally, we usually have measurements and labels rather than direct access to the reality we're modeling. We have to deal with these problems to improve our understanding of what's going on, or at least to avoid some potentially catastrophic errors (that last bit is my comment, not the paper's).
Why this matters: This isn't a new or sophisticated issue: I've been writing about it in the context of data analysis for at least ten years, and it is in fact a foundational problem not just in physics and engineering but also in philosophy. Still, data scientists are often pushed to ignore it in the name of faster and more "interesting" "insights." The simplest version of the problem in data analysis: just because a variable in your data set is called something doesn't mean it's an accurate representation of what that term refers to in the real world, and if you don't explicitly track down and model that relationship you can get into serious trouble. It's obvious when put explicitly, but very seldom accounted for in actual projects, where the relationship between terms in slides and the real world can be as creative as any non-GAAP number (and often for similar reasons and with similar goals).
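To make the gap between a column's name and the quantity it stands for concrete, here is a minimal simulation sketch. It is not taken from the paper: it assumes the simplest textbook case (independent, additive noise on the recorded variable), and every name and number in it (`true_x`, `recorded_x`, `true_effect`, the noise levels) is made up for illustration.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20_000
true_effect = 2.0  # hypothetical effect of the real quantity on the outcome

# The quantity we care about, which we never observe directly.
true_x = [rng.gauss(0, 1) for _ in range(n)]

# The column in the data set that carries the same name: truth plus noise.
# Assumption: the noise is independent of everything else and has the same
# variance as the true quantity.
recorded_x = [x + rng.gauss(0, 1) for x in true_x]

# The outcome depends on the real quantity, not on the recorded one.
outcome = [true_effect * x + rng.gauss(0, 1) for x in true_x]

slope_true = ols_slope(true_x, outcome)
slope_recorded = ols_slope(recorded_x, outcome)

print(f"slope using the real quantity:   {slope_true:.2f}")
print(f"slope using the recorded column: {slope_recorded:.2f}")
```

Under these assumptions the first slope should land near the true effect of 2, while the second should land near 1, because the noise variance equals the signal variance and the estimate shrinks by the ratio var(true) / (var(true) + var(noise)). Nothing in the data set itself flags the problem: the recorded column has the right name, and the regression runs without complaint.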
Side notes:
- In practical terms, ignoring this problem is one of the mechanisms of social bias in data analysis. It's an insidious, indirect form of dataset bias.
- The LLM-driven AI craze has turned the nominalist trap from a hole we often fall into into the environment we inhabit. Mistaking statistical relationships between sequences extracted from a large corpus for effective inferences about real-world phenomena is the unspoken foundation of the whole bubble, and I'm afraid it's not going to be without toxic side effects (leaving aside the economic blast radius of the end of the hype cycle).