Text Mining in Apache Mahout


Posted by Anonymous on 29 Aug 2013 at 18:51

Lately we've been working on text mining, using clustering techniques to group similar documents together. Apache Mahout has proven an excellent tool for this. Mahout is an open-source library that implements scalable machine learning algorithms. It is fast and integrates well with other popular open-source Apache projects, such as Hadoop and Lucene. One of Mahout's core capabilities is clustering. To perform text mining, take a collection of text documents, represent each document as a feature vector that records which words the document contains, and apply a clustering algorithm. One possible application is sorting blogs into groups that can be targeted for ads.
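To make the idea concrete, here is a minimal sketch in plain Python rather than Mahout itself. It does not use Mahout's API; the sample documents, the function names, and the farthest-point seeding are all assumptions made for illustration. Each document becomes a binary vector over the vocabulary, and a small k-means loop groups the vectors.

```python
def to_binary_vector(text, vocabulary):
    """Return 1.0 for each vocabulary term the text contains, else 0.0."""
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocabulary]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=10):
    """Tiny k-means; returns one cluster label per input vector."""
    # Deterministic seeding (an assumption for this sketch): start from the
    # first vector, then repeatedly add the vector farthest from all seeds.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        )

    labels = []
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = mean_vector(members)
    return labels


# Hypothetical sample documents, invented for this example.
documents = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
vectors = [to_binary_vector(doc, vocabulary) for doc in documents]
labels = kmeans(vectors, k=2)

for doc, label in zip(documents, labels):
    print(label, doc)
```

On a real corpus you would want term weighting such as TF-IDF and a distributed implementation, which is where a library like Mahout comes in.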
Here's the basic workflow in Mahout:
1. Start with a dataset, i.e. a...

Adventures in Machine Learning


Posted by Anonymous on 26 Jul 2013 at 23:07

Lately I have been thinking about how to recommend movies to movie watchers, purchases to shoppers, and artists to music lovers. More generally, given a set of items and a set of users, how do you figure out which items to recommend to which users?

There are many solutions to this problem. At one extreme, you could ask a movie connoisseur to learn the movie watcher's tastes, habits, lifestyle, and more, and then carefully pick out a few movies the watcher would likely enjoy. While this approach can produce incredibly personalized results, it is very time-consuming and does not scale. The solution at the opposite end of the spectrum is collaborative filtering. In collaborative filtering, you take all the existing data on which movie watchers like...
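As a rough illustration of collaborative filtering, here is a minimal user-based sketch in plain Python. It is one common variant and not necessarily the approach the post goes on to describe. The `likes` data, the user and movie names, and the choice of Jaccard similarity are all assumptions made for this example.

```python
def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, top_n=3):
    """Rank movies the user has not liked yet by how similar their fans are."""
    own = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        weight = jaccard(own, items)
        if weight == 0.0:
            continue  # ignore users with no taste in common
        # Every unseen movie earns the similarity of the user who likes it.
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Hypothetical data: which movies each watcher likes.
likes = {
    "alice": {"Alien", "Blade Runner", "Heat"},
    "bob": {"Alien", "Blade Runner", "The Matrix"},
    "carol": {"Amelie", "Notting Hill"},
    "dave": {"Alien", "Heat"},
}

print(recommend(likes, "dave"))
```

Here `dave` shares the most with `alice`, so movies she likes that he has not seen rank first, while `carol`, who has nothing in common with him, contributes nothing.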