The goal is simple: keep an eye on arXiv to spot topics, tools, etc. that are picking up steam but aren't, and might never be, massive in any absolute sense. Everything big enough finds its way to us sooner rather than later, but most of the interesting things happen in specialized niches. Don't tell me what's doing the rounds everywhere; tell me what a hundred overspecialized nerds are going crazy about.
(To be honest: I already spend too much time following too many topics across too many sources, some of them quite niche. Something like this tool is the last thing I need. But if I were the sort of person that could make the reasonable choice in this situation I wouldn't be the sort of person who'd be in this situation in the first place.)
Once I'd settled on the idea, the implementation turned out not to be overly complex:
- Use arXiv's OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) interface to keep a local copy of article metadata.
- Use spaCy to extract potential terms, phrases, etc. of interest. (*Not* summaries, LLM or otherwise. I'm not just interested in the main topic of an article, but also in the tools and concepts creeping about: not necessarily being talked about, but being used.)
- Run some very ad hoc metrics to pick up "hot" topics according to whatever set of personal filters and heuristics feel right. YMMV - I don't claim to have built a tool with any user in mind but myself, although the concept, I think, might be of more general interest.
- Filter out terms that are too correlated with others in the list, or that I've already listed. This isn't a tool to keep track of "who's winning" but to surface "what's new."
- Do a bit of manual pruning. Taste matters. And besides, some of the changes are of sociological interest (you can probably gauge the mood of a field by keeping track of purely idiomatic changes, as long as you don't take them at face value) but don't point to what I'm interested in here.
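
For the curious, here's a rough sketch of what the "hot topics" and "too correlated" steps could look like. It is not the actual set of heuristics, just one plausible shape for them under stated assumptions: each article has already been reduced to a list of candidate terms (for instance from spaCy's `doc.noun_chunks`), "hot" means the share of articles mentioning a term grew between two periods, and "correlated" is approximated by Jaccard overlap of the articles two terms appear in. The function names, thresholds, and sample data are all made up for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents mention it at least once."""
    counts = Counter()
    for terms in docs:
        counts.update(set(terms))
    return counts


def hot_terms(old_docs, new_docs, min_new=3, max_share=0.5, smoothing=1.0):
    """Rank terms by growth in document share from old_docs to new_docs.

    min_new:   skip terms seen in fewer than this many recent documents.
    max_share: skip terms present in more than this share of recent
               documents (already too big to count as a niche).
    smoothing: additive constant so brand-new terms don't divide by zero.
    """
    old_df = document_frequency(old_docs)
    new_df = document_frequency(new_docs)
    n_old, n_new = len(old_docs), len(new_docs)
    scores = {}
    for term, count in new_df.items():
        if count < min_new or count / n_new > max_share:
            continue
        old_rate = (old_df[term] + smoothing) / (n_old + smoothing)
        new_rate = (count + smoothing) / (n_new + smoothing)
        scores[term] = new_rate / old_rate
    return sorted(scores, key=scores.get, reverse=True)


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_redundant(ranked, docs, seen=(), max_overlap=0.8):
    """Greedily keep ranked terms, skipping ones already reported (seen)
    or ones that co-occur too heavily with a higher-ranked kept term."""
    doc_sets = {
        term: {i for i, terms in enumerate(docs) if term in terms}
        for term in ranked
    }
    kept = []
    for term in ranked:
        if term in seen:
            continue
        if any(jaccard(doc_sets[term], doc_sets[k]) >= max_overlap for k in kept):
            continue
        kept.append(term)
    return kept


# Made-up sample data: one list of extracted terms per article.
old_docs = [
    ["transformer", "attention"],
    ["transformer"],
    ["graph neural network"],
    ["transformer", "attention"],
]
new_docs = [
    ["transformer", "state space model", "mamba"],
    ["transformer", "state space model", "mamba"],
    ["state space model", "mamba"],
    ["transformer", "attention"],
    ["transformer", "kolmogorov-arnold network"],
    ["transformer", "kolmogorov-arnold network", "state space model"],
]

ranked = hot_terms(old_docs, new_docs, min_new=2, max_share=0.75)
shortlist = drop_redundant(ranked, new_docs, seen=set(), max_overlap=0.7)
print(ranked)
print(shortlist)
```

With this toy data the ubiquitous term gets excluded for being too big, the one-off mention gets excluded for being too small, and a term that mostly travels together with a higher-ranked one gets folded into it. Every threshold here is a knob to fiddle with, which is rather the point.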
The first result of this process is here. Nothing earth-shattering (even after a lot of adjustments most of "what's hot" is depressingly AI-related), but it did lead me to a couple of interesting new things, so I'll probably keep doing it, at least for a while. And there's still an infinite number of tweaks to try in order to get a higher share of interesting results (as a lot of people have said, a program, like a poem, is never finished, only abandoned).