Adventures in Machine Learning

Lately I have been thinking about how to recommend movies to movie watchers, purchases to shoppers, artists to music lovers. In general, if you have a bunch of items and a bunch of users, how do you figure out which items to recommend to which users?

There are many solutions to this problem. One extreme solution is to ask a movie connoisseur to learn the movie watcher's tastes, habits, lifestyle, and more. Then the connoisseur carefully picks out a few movies they think the movie watcher would enjoy. While this solution can produce incredibly personalized results, it is very time-consuming and does not scale.

The solution on the opposite end of the spectrum is collaborative filtering. In collaborative filtering, you take all the existing data on which movie watchers like which movies, and feed it into an algorithm. Then you ask the algorithm to recommend movies for a movie watcher, and it produces results in a matter of milliseconds. Once set up properly, a collaborative filtering recommender can produce excellent real-time recommendations with very little human effort.

Collaborative filtering works in three steps:

1. You send the recommender engine a list of user-item preferences. For example: Johan likes Toy Story, Quyen likes The Matrix, Frida likes Pulp Fiction, etc.
2. The recommender engine builds itself from these preferences.
3. You ask the recommender engine to recommend new items to specific users. For example: show me 10 movies that I should recommend to Quyen.

For many recommenders, you can add new user-item preferences and the recommender engine will update itself in real time.
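
To make the workflow concrete, here is a minimal sketch in Python. It is not how Mahout or any other library does it: the class name TinyRecommender and its two methods are hypothetical, the scoring rule (counting how often two items are liked by the same user) was chosen only because it is short, and the preferences beyond the three mentioned above are invented sample data.

```python
from collections import defaultdict


class TinyRecommender:
    """Hypothetical in-memory recommender, for illustration only."""

    def __init__(self):
        self.items_by_user = defaultdict(set)
        # cooccurrence[a][b] = number of users who like both a and b
        self.cooccurrence = defaultdict(lambda: defaultdict(int))

    def add_preference(self, user, item):
        """Steps 1 and 2: record a preference and update the counts right away."""
        liked = self.items_by_user[user]
        if item in liked:
            return  # a repeated preference changes nothing
        for other in liked:
            self.cooccurrence[item][other] += 1
            self.cooccurrence[other][item] += 1
        liked.add(item)

    def recommend(self, user, n=10):
        """Step 3: rank unseen items by co-occurrence with the user's items."""
        liked = self.items_by_user.get(user, set())
        scores = defaultdict(int)
        for item in liked:
            for other, count in self.cooccurrence[item].items():
                if other not in liked:
                    scores[other] += count
        # Highest score first; ties are broken alphabetically.
        return sorted(scores, key=lambda i: (-scores[i], i))[:n]


engine = TinyRecommender()
sample_preferences = [  # invented sample data
    ("Johan", "Toy Story"),
    ("Johan", "The Matrix"),
    ("Quyen", "The Matrix"),
    ("Frida", "Pulp Fiction"),
    ("Frida", "The Matrix"),
    ("Frida", "Toy Story"),
]
for user, item in sample_preferences:
    engine.add_preference(user, item)

print(engine.recommend("Quyen", n=10))  # ['Toy Story', 'Pulp Fiction']
```

Because add_preference updates the counts as each preference arrives, there is no separate rebuild step in this sketch, which is the "update itself in real time" behavior in miniature.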

There are a few different flavors of collaborative filtering recommenders:

User-based: If I have bought 50 items on Amazon, the recommender will find other Amazon users who have also bought many of those same items. The other items that those users have bought will be recommended to me.
Item-based: If I have bought 50 items on Amazon, the recommender will go through each of those items and find other items that were frequently purchased by the users who purchased it. These other items will be recommended to me.
Model-based: An algorithm performs some sort of dimension reduction, figures out latent factors embedded in the data, and uses those to produce recommendations. A latent factor could be as intuitive as “people who buy candles also buy picture frames”, but more often the latent factors don’t have an obvious, intuitive meaning.
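
As a rough illustration of the user-based flavor, here is a short Python sketch. Everything in it is an assumption made for the example: the function names, the choice of Jaccard similarity to measure how much two users' purchases overlap, and the tiny purchase history. Real recommenders use a variety of similarity measures and far more data.

```python
def jaccard(a, b):
    """Overlap between two item sets: 0.0 if disjoint, 1.0 if identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(purchases, user, n=10):
    """Score each unseen item by the similarity of the users who bought it."""
    mine = purchases[user]
    scores = {}
    for other, theirs in purchases.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in theirs - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


purchases = {  # invented sample data
    "me": {"candles", "picture frame", "lamp"},
    "ana": {"candles", "picture frame", "vase"},
    "ben": {"lamp", "desk"},
    "cy": {"socks"},
}

print(recommend_user_based(purchases, "me"))  # ['vase', 'desk']
```

Here "ana" shares two of my three purchases, so her vase outranks the desk bought by "ben", who shares only one. An item-based version would instead compare items to one another by the sets of users who bought them.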

User-based and item-based recommenders are very intuitive and run fast on “small” data sets (where small can mean up to 10 or 20 million user-item preferences). Model-based recommenders often produce less intuitive (but valuable) results, and can be scaled up to handle much larger data sets by distributing the computation.

Say you would like to incorporate a collaborative filtering recommender into your web application. How might you accomplish this? My tool of choice is Apache Mahout, a popular open-source library that implements scalable machine learning algorithms. More on Mahout in future posts.