Quick notes on the danger of LLM benchmarks

2025-04-21

All model benchmarks have limitations — model performance is rarely unidimensional or independent of context — but LLM benchmarks have an intrinsic weakness that interacts very dangerously with their rhetorical usefulness.

I don't mean manipulation or cherry-picking (although that certainly happens). The problem is that text generators, unlike, say, classifiers, have absurdly unbounded failure modes. Knowing that your classifier gets it right 78% of the time is useful because you know roughly what the other 22% costs you: perhaps a missed high-risk customer. But what does it mean that your LLM gets 78% of an educational benchmark right if in some of the other 22% it's promoting genocide to a kid? Or that it has a 95% success rate giving medical advice if among those 5% there's lethal nonsense?
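
A toy sketch of the point, with made-up numbers: two systems can report the same accuracy while carrying wildly different expected harm, because the aggregate score says nothing about how bad the failures are.

```python
# Toy illustration (hypothetical numbers): identical accuracy, very different risk.
# Each failure is tagged with an invented severity cost; the headline score ignores it.

failures_a = [1, 1, 1, 1, 1]     # classifier-style misses: bounded, similar cost
failures_b = [1, 1, 1, 1, 1000]  # text-generator misses: one catastrophic output

total_items = 100  # both systems answered 95 of 100 items correctly

def accuracy(failures, total):
    return 1 - len(failures) / total

def expected_harm(failures, total):
    return sum(failures) / total

for name, failures in [("A", failures_a), ("B", failures_b)]:
    print(name,
          f"accuracy={accuracy(failures, total_items):.0%}",
          f"expected harm per item={expected_harm(failures, total_items):.2f}")
# Both print accuracy=95%, but B's expected harm is two orders of magnitude higher.
```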

This isn't a critique of benchmarking as a methodology but of how benchmarks are used: improvements in benchmark results are invoked by companies, investors, the media, and in internal company politics to support a narrative of continuous advances, while eliding that the fundamental limitation of LLMs, the thing that constrains their safe use, isn't a few percentage points on a benchmark. It's the fact that they are linguistic processors. We have centuries of practical experience showing that linguistic capability does not translate into domain understanding, and that failures in domain understanding can be arbitrarily deep and damaging. An internal report or published article built around a nonsensical conceptual mistake can propagate its influence very far inside an organization, or damage its reputation significantly.

Improved benchmarks, or even measurements of how often a model hallucinates, can give an idea of how often this sort of thing might happen, but not what form it will take or how serious it will be. Benchmarks are more directly informative when LLMs are used in tight suggest-verify-iterate cycles (drafting code or email), not for large-scale applications or blind summarization. Yet those are precisely the use cases investors and companies most want to see deployed (that's where the big shifts in wages and capex come from), which makes the usefulness of benchmarks in supporting them, together with their inability to meaningfully assess risk, a dangerous factor in the dynamics of AI adoption.