Quick link: Lovecraftian data sets

2024-07-22

The link: Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures

From the abstract: The enduring legacy of Euclidean geometry underpins classical machine learning, which, for decades, has been primarily developed for data lying in Euclidean space. Yet, modern machine learning increasingly encounters richly structured data that is inherently non-Euclidean. This data can exhibit intricate geometric, topological and algebraic structure [...] Echoing the 19th-century revolutions that gave rise to non-Euclidean geometry, an emerging line of research is redefining modern machine learning with non-Euclidean structures. [...] In this review, we provide an accessible gateway to this fast-growing field and propose a graphical taxonomy that integrates recent advances into an intuitive unified framework.

Why you should read it: The abstract is actually conservative. Most interesting data sets don't have a natural representation in a Euclidean space (or as natural language, for that matter), and working on the wrong algebraic structure leads at best to poor model performance and at worst to catastrophic failures to generalize, even in data-rich regimes. There's less push-button ML support for non-Euclidean data (although if you depend on push-button ML your problems go deeper than algebra), but an awareness of the issue can suggest ad hoc ways to mitigate it. You'd be surprised how much the right algebraic transform, guided by your understanding of the domain and the semantics of the data, can improve the performance of an ML model.
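
A toy illustration of what such a transform can look like, not taken from the paper: compass headings live on a circle, so treating them as plain numbers gets both averages and distances wrong near the 0/360 wrap-around. The sketch below assumes a hypothetical feature measured in degrees and maps each heading to a point on the unit circle; all function names are made up for this example.

```python
import math

def euclidean_mean_deg(angles_deg):
    """Arithmetic mean that ignores the circular structure of angles."""
    return sum(angles_deg) / len(angles_deg)

def circular_mean_deg(angles_deg):
    """Mean direction: average the unit vectors, then read back the angle."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def encode_angle(angle_deg):
    """Embed an angle as (cos, sin), a point on the unit circle."""
    r = math.radians(angle_deg)
    return (math.cos(r), math.sin(r))

def chord_distance(a_deg, b_deg):
    """Euclidean distance between the embedded points of two angles."""
    (ax, ay), (bx, by) = encode_angle(a_deg), encode_angle(b_deg)
    return math.hypot(ax - bx, ay - by)

# Two headings just either side of north (hypothetical data).
headings = [350.0, 20.0]
naive = euclidean_mean_deg(headings)    # points roughly south
circular = circular_mean_deg(headings)  # points roughly north

# On raw numbers 350 looks farther from 20 than from 170;
# after the embedding the ordering matches the geometry.
near = chord_distance(350.0, 20.0)
far = chord_distance(350.0, 170.0)

print(naive, circular, near, far)
```

Feeding the (cos, sin) pair to a model instead of the raw degree value is the usual form of this trick; whether it helps on a given problem depends on the domain, which is rather the point.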