How to cluster Seinfeld episodes with Mahout

by frank, April 4, 2011

This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. In just a few minutes you can run the Seinfeld demo on your own machine.

Update 24-03-2014: The Seinfeld scripts and demo are no longer available. The original Seinfeld scripts came from stanthecaddy.com, which received a takedown notice for them. I decided to remove the scripts from my GitHub branch as well, so the demo described below no longer works. No soup for you! 🙁

Step 1 Get the Mahout and demo sources from GitHub

First, make sure you have git installed.

Now clone my GitHub repo, enter the mahout directory and switch to the seinfeld_demo branch:

$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout seinfeld_demo

Step 2 Build the source tree

From inside the mahout directory, build the source tree with:

$ mvn clean install -DskipTests=true

Step 3 Cluster Seinfeld episodes

Now run the following two scripts from the mahout directory:

$ examples/bin/seinfeld_vectors.sh
$ examples/bin/seinfeld_kmeans.sh

The first script creates vectors from the plain-text Seinfeld episodes stored in examples/src/main/resources/seinfeld-scripts-preprocessed, and the second script clusters those vectors with K-Means. Finally, Mahout’s clusterdump program prints the clusters to standard output, along with the top 5 labels and episodes.
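
To give a feel for what "create vectors, then cluster them with K-Means" means, here is a minimal sketch in plain Java. It is not Mahout code and not what the demo scripts run. The class name ToyEpisodeClustering, the four one-line "episodes" and the explicit seed indices are all made up for illustration, and it uses raw term counts with squared Euclidean distance, which may differ from the weighting and distance measure the demo scripts used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy illustration only: not Mahout code and not the demo's scripts. */
final class ToyEpisodeClustering {

    /** Lower-cases the text and splits it on anything that is not a letter. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    /** Builds one term-count vector per document over a shared, sorted vocabulary. */
    static double[][] termFrequencyVectors(List<String> docs) {
        List<String> vocabulary = new ArrayList<>(new TreeSet<>(tokenize(String.join(" ", docs))));
        double[][] vectors = new double[docs.size()][vocabulary.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String token : tokenize(docs.get(d))) {
                vectors[d][vocabulary.indexOf(token)]++;
            }
        }
        return vectors;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return sum;
    }

    /** Plain k-means: returns the cluster index assigned to each vector. */
    static int[] kMeans(double[][] vectors, int[] seedIndices, int maxIterations) {
        int k = seedIndices.length;
        int dims = vectors[0].length;
        // Assumption: the caller picks the initial centroids, so the run is reproducible.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = vectors[seedIndices[c]].clone();
        int[] assignment = new int[vectors.length];
        Arrays.fill(assignment, -1);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: attach every vector to its nearest centroid.
            boolean changed = false;
            for (int v = 0; v < vectors.length; v++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(vectors[v], centroids[c])
                            < squaredDistance(vectors[v], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[v] != best) {
                    assignment[v] = best;
                    changed = true;
                }
            }
            if (!changed) break; // converged: no vector moved to another cluster
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int v = 0; v < vectors.length; v++) {
                counts[assignment[v]]++;
                for (int i = 0; i < dims; i++) sums[assignment[v]][i] += vectors[v][i];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep an empty cluster's old centroid
                for (int i = 0; i < dims; i++) centroids[c][i] = sums[c][i] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for episode texts.
        List<String> docs = Arrays.asList(
                "soup soup kitchen line",
                "coffee shop coffee booth",
                "soup line soup order",
                "coffee booth shop talk");
        int[] clusters = kMeans(termFrequencyVectors(docs), new int[] {0, 1}, 10);
        System.out.println(Arrays.toString(clusters)); // prints [0, 1, 0, 1]
    }
}
```

The two "soup" texts end up in one cluster and the two "coffee" texts in the other. With real episodes the vectors have thousands of dimensions, which is why the demo leaves this work to Mahout.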

Step 4 Experiment!

You can tweak some of the command-line arguments passed to the Mahout jobs and see how they affect the clustering process. Additionally, you can extend the SeinfeldAnalyzer that comes with the demo at examples/src/main/java/org/apache/mahout/analysis, or create one yourself.
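
As a rough idea of what changing the analysis step does, the sketch below shows a stand-in "analyzer" in plain Java. It is hypothetical: it does not use the Lucene Analyzer API and it is not the demo's SeinfeldAnalyzer. The class name ToyAnalyzer, the stop word list and the sample line are invented for illustration. The point is only that the tokens you keep decide which terms end up in the vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for an analyzer; it does not use the Lucene API. */
final class ToyAnalyzer {
    private final Set<String> stopWords;
    private final int minTokenLength;

    ToyAnalyzer(Set<String> stopWords, int minTokenLength) {
        this.stopWords = stopWords;
        this.minTokenLength = minTokenLength;
    }

    /** Lower-cases, splits on non-letters, then drops short tokens and stop words. */
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue; // split can yield a leading empty string
            if (token.length() >= minTokenLength && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up line in script format; the speaker label is noise for clustering.
        String line = "JERRY: The soup was the best soup in the city.";
        ToyAnalyzer plain = new ToyAnalyzer(new HashSet<String>(), 1);
        ToyAnalyzer tuned = new ToyAnalyzer(
                new HashSet<>(Arrays.asList("the", "was", "in", "jerry")), 3);
        System.out.println(plain.analyze(line)); // [jerry, the, soup, was, the, best, soup, in, the, city]
        System.out.println(tuned.analyze(line)); // [soup, best, soup, city]
    }
}
```

Filtering out speaker names and common words like this leaves fewer but more distinctive terms, which usually makes the top labels of each cluster easier to read.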

Enjoy!