Text Mining in Apache Mahout


Posted by Anonymous on 29 Aug 2013 at 18:51

Lately we've been working on text mining, using clustering techniques to group similar documents together. Apache Mahout has proven to be an excellent tool for this. Mahout is an open-source library that implements scalable machine learning algorithms. It is fast and integrates well with other popular open-source Apache projects, such as Hadoop and Lucene. One of Mahout's core capabilities is clustering. To perform text mining, take a collection of text documents, represent each document as a feature vector that records which words the document contains, and apply a clustering algorithm. One possible application is sorting blogs into groups that can then be targeted for ads.
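To make the idea concrete, here is a minimal sketch in plain Python. It does not use Mahout or its API. The four sample documents, the function names, the choice of k-means as the clustering algorithm, and the hand-picked starting centroids are all assumptions made for illustration. Each document becomes a binary vector over the vocabulary (1.0 if the word is present, 0.0 otherwise), and the vectors are then grouped by k-means.

```python
def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_vocabulary(docs):
    """Sorted list of every distinct word across all documents."""
    return sorted({w for d in docs for w in tokenize(d)})


def to_vector(doc, vocab):
    """Binary feature vector: 1.0 if the document contains the word."""
    words = set(tokenize(doc))
    return [1.0 if w in words else 0.0 for w in vocab]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, initial_centroids, max_iter=20):
    """Plain k-means; returns the cluster index assigned to each vector."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assign every vector to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment


# Hypothetical toy corpus: two documents about pets, two about finance.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

# Seed one centroid from each topic so the run is deterministic.
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)  # [0, 0, 1, 1]
```

The two pet documents land in one cluster and the two finance documents in the other. A real corpus would need sparse vectors and usually term weighting rather than dense binary lists, which is where a scalable library earns its keep.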
Here's the basic workflow in Mahout:
1. Start with a dataset, i.e. a...