Here's an interesting pair:
Inferring Capabilities from Task Performance with Bayesian Triangulation: The paper describes a way to use hierarchical Bayesian networks to infer the capability profiles of AI agents from their task performances. It's not purely data-driven; you have to bring your own hypothesis about the structure of cognitive demands and their relationship with task characteristics, but in exchange you get a very flexible way to poke inside the minds of agents. Nothing here is unique to AIs, of course: the same general strategy applies to profiling individual human capabilities. Most interestingly: why not both humans and AIs on the same tasks?
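To give a flavor of the approach, here's a minimal sketch (my illustration, not the paper's actual model): assume a hypothesized two-dimensional space of cognitive demands, a logistic link from the capability-minus-demand margin to success probability, and recover a posterior over each agent's capability profile by grid approximation. The demand vectors, the link function, and the agents are all made-up assumptions.

```python
import numpy as np

# Hypothetical demand structure: each task loads on two cognitive
# dimensions (say, memory and planning). These vectors and the
# logistic link below are illustrative assumptions, not the paper's model.
task_demands = np.array([
    [0.2, 1.5],   # task 1: mostly planning
    [1.0, 1.0],   # task 2: mixed
    [1.8, 0.3],   # task 3: mostly memory
    [0.5, 0.5],   # task 4: light on both dimensions
])

# Observed outcomes: rows = agents (human or AI), cols = tasks, 1 = success.
outcomes = np.array([
    [1, 1, 0, 1],   # agent A
    [0, 1, 1, 1],   # agent B
])

def success_prob(theta, demands):
    """P(success) rises as capability exceeds demand across dimensions."""
    margin = (theta - demands).sum(axis=-1)       # capability minus demand
    return 1.0 / (1.0 + np.exp(-margin))          # logistic link

# Grid approximation to the posterior over each agent's 2-D capability.
grid = np.linspace(-3, 3, 121)
g1, g2 = np.meshgrid(grid, grid, indexing="ij")
theta_grid = np.stack([g1, g2], axis=-1)          # (121, 121, 2)
log_prior = -0.5 * (theta_grid ** 2).sum(axis=-1) # standard normal prior

for name, y in zip("AB", outcomes):
    p = success_prob(theta_grid[..., None, :], task_demands)  # (121, 121, 4)
    log_lik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=-1)
    post = np.exp(log_prior + log_lik)
    post /= post.sum()
    mean = (post[..., None] * theta_grid).sum(axis=(0, 1))
    print(f"agent {name}: posterior mean capability = {mean.round(2)}")
```

The real model's hierarchical sharing of structure across agents and tasks is omitted here; the point is just that the demand hypothesis you bring in is what lets sparse performance data triangulate latent capabilities, and nothing in the code cares whether an agent row comes from a human or an AI.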
Limited information-processing capacity in vision explains number psychophysics: This paper has, in principle, nothing to do with AI. It looks at a very interesting model of a human visual capability: how well (or badly) we can tell at a glance how many objects we are seeing. That may not sound Earth-shattering, but for all the talk, now and then, of Matrix-style brain-computer interfaces, "mind uploading," etc., we still don't have a good grasp of even this sort of basic functionality. In any case, the model they propose is conceptually striking: the visual system operates pretty close to what you would expect of a Bayes-like statistical inference system running under a limited information bound. Put like this it seems natural, but it's an elegant formulation (the mathematics in the paper are also quite nice), and people familiar with information bottleneck AI architectures will have already recognized the familiar tune.

A more generic point, though: Bayes-like statistical inference under a limited information bound is likely to be a good model for multiple cognitive capabilities in humans and AIs (and, I'd say, in animals and organizations too). Comparisons of priors and information bounds, and of course of deviations from optimality, are probably a good parameter system to compare different thinking systems engaged in the same task.
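As an illustration of the general idea (a deliberately simplified stand-in, not the paper's actual derivation): suppose the visual system encodes the log of the count through a noisy channel, with the noise level standing in for the information bound, and then decodes it by Bayesian inference under a heavy-tailed prior over counts. The prior exponent, the noise level, and the log encoding are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Support of possible numerosities and a power-law prior over them
# (heavy-tailed priors over counts are a common assumption; the
# exponent here is illustrative).
ns = np.arange(1, 101)
prior = 1.0 / ns.astype(float) ** 2
prior /= prior.sum()

SIGMA = 0.25  # channel noise on log-numerosity; stands in for the information bound

def estimate(true_n, trials=2000):
    """Bayesian posterior-mean decoding of a noisy log-scale encoding."""
    # Encode: one noisy sample of log(n) per trial (the capacity-limited channel).
    m = np.log(true_n) + SIGMA * rng.normal(size=trials)
    # Decode: posterior over n given each sample, under the prior.
    log_lik = -0.5 * ((m[:, None] - np.log(ns)) / SIGMA) ** 2   # (trials, 100)
    post = prior * np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    return post @ ns  # posterior-mean estimate per trial

for n in (3, 8, 20, 50):
    est = estimate(n)
    # The coefficient of variation stays roughly constant across n:
    # Weber-like "scalar variability" falls out of the bounded channel.
    print(f"n={n:3d}  mean estimate={est.mean():6.2f}  CV={est.std()/est.mean():.3f}")
```

Two things to notice: tightening SIGMA (i.e., raising the effective capacity) shrinks the CV, and the heavy-tailed prior pulls estimates below the true count for larger n, which matches the well-documented underestimation of large numerosities. Fitted priors and bounds like these are exactly the comparison parameters suggested above.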
"Compare different thinking systems engaged in the same task," by the way, is fast becoming the core managerial activity in the post-AI economy (although historians will tell you it always was).
There are three things you can do with the papers in this post:
The first one is, let's say, the usual up-to-2020 reflex; the second one is what most everybody in 2023 will be tempted to do; but the third one, I think, is what by 2025 you will want to have already been doing.