The Aliens/The Unbearable Lightness of Being classification space of movies

Still playing with the GroupLens movies data set, I implemented a couple of ideas from Shailesh Kumar, one of the Google researchers who came up with the logical itemset mining algorithm. That improved the clustering of movies quite a bit, and gave me the idea to "choose a basis," so to speak, and project these clusters into a more familiar Euclidean representation (although, interestingly, networks and clusters are fast becoming part of our culture's vernacular).

This is what I did: I chose two movies from the data set, Aliens and The Unbearable Lightness of Being, as the "basis vectors" of the "movie space." For every other movie in the data set, I found the shortest path between that movie and each basis movie on the weighted graph that the logical itemset mining algorithm builds and that underlies the final selection of clusters. That gave me a pair of coordinates for each movie (its "distance from Aliens" and "distance from The Unbearable..."). Rounding coordinates to integers and choosing a small sample that covers the space well, here's a selected map of "movie space" (you will want to click on it to see it at full size):

[Image: movie_space_plot]
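
For the curious, the coordinate computation described above can be sketched roughly like this. This is not my actual code, just a minimal illustration: the toy graph and its edge weights are made up, the function names are hypothetical, and it assumes the edge weights have already been turned into distances (lower means more similar), which the real logical itemset mining graph may not give you directly.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest weighted path length from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def movie_coordinates(graph, basis_movies):
    """Map each non-basis movie to its rounded distances from every basis movie."""
    per_basis = [shortest_distances(graph, b) for b in basis_movies]
    coords = {}
    for movie in graph:
        if movie in basis_movies:
            continue
        # Skip movies that cannot be reached from every basis movie.
        if all(movie in dist for dist in per_basis):
            coords[movie] = tuple(round(dist[movie]) for dist in per_basis)
    return coords


# Toy undirected graph with invented weights, purely for illustration.
toy_graph = {
    "Aliens": {"Raiders of the Lost Ark": 1.2, "Wayne's World": 6.0},
    "Raiders of the Lost Ark": {"Aliens": 1.2, "Dangerous Liaisons": 5.1},
    "Dangerous Liaisons": {
        "Raiders of the Lost Ark": 5.1,
        "The Unbearable Lightness of Being": 1.8,
    },
    "The Unbearable Lightness of Being": {
        "Dangerous Liaisons": 1.8,
        "Wayne's World": 7.0,
    },
    "Wayne's World": {"Aliens": 6.0, "The Unbearable Lightness of Being": 7.0},
}

basis = ["Aliens", "The Unbearable Lightness of Being"]
coords = movie_coordinates(toy_graph, basis)
for movie, (d_aliens, d_unbearable) in coords.items():
    print(f"{movie}: ({d_aliens}, {d_unbearable})")
```

Each movie ends up with one coordinate per basis movie, so nothing stops you from using three or more basis movies; you just can't plot the result on a flat map anymore.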

Agreeably enough, this map has a number of features you'd expect from something like this, as well as some interesting (to me) quirks:

  • There is no movie that is close to both basis movies (although if anybody wants to produce The Unbearable Lightness of Chestbursters, I'd love to write that script).
  • The least-The Unbearable... of the similar-to-Aliens movies in this sub-sample is Raiders of the Lost Ark, which makes sense (it's campy, but it's still an adventure movie).
  • Dangerous Liaisons isn't that far from The Unbearable..., but as far away as you can get from Aliens.
  • Wayne's World is way out there.

It's fun to imagine practical applications of this kind of mapping by way of geometrical analogies. For example, movie night negotiation between two or more people could be approached as finding, among the available options, the movie vector with the lowest Euclidean norm, where the basis is the set of each person's personal choice or favorite movie, and so on.
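
As a rough sketch of that movie night idea (again with invented numbers and a hypothetical function name, and assuming each candidate movie already has one distance per person's favorite), picking the lowest-norm movie could look like this:

```python
import math


def pick_movie(candidates):
    """Return the candidate whose distance vector has the smallest Euclidean norm.

    candidates maps a movie title to a tuple of distances, one per person's favorite movie.
    """
    def norm(vector):
        return math.sqrt(sum(x * x for x in vector))

    return min(candidates, key=lambda movie: norm(candidates[movie]))


# Invented coordinates: (distance from person A's favorite, distance from person B's favorite).
tonight = {
    "Raiders of the Lost Ark": (1, 7),
    "Dangerous Liaisons": (6, 2),
    "Wayne's World": (6, 7),
}

choice = pick_movie(tonight)
print(choice)
```

The Euclidean norm punishes a movie that is very far from any one person's taste, so it leans toward compromises; a different norm (say, the maximum coordinate) would encode a different idea of fairness.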