How to cluster Seinfeld episodes with Mahout
This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. In just a few minutes you can run the Seinfeld demo on your own machine.
Update 24-03-2014: The Seinfeld scripts and demo are no longer available. The original Seinfeld scripts came from stanthecaddy.com, which received a takedown notice for them. I decided to remove the scripts from my GitHub branch as well, so the demo described below no longer works. No soup for you! 🙁
Step 1 Get the Mahout and demo sources from GitHub
First, make sure you have git installed.
Now clone my GitHub repo and switch to the seinfeld_demo branch:
$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout seinfeld_demo
Step 2 Build the source tree
From the mahout directory, build the source tree with:
$ mvn clean install -DskipTests=true
Step 3 Cluster Seinfeld episodes
Now run the following two scripts from the mahout directory:
$ examples/bin/seinfeld_vectors.sh
$ examples/bin/seinfeld_kmeans.sh
The first script creates vectors from the plain-text Seinfeld episodes stored in examples/src/main/resources/seinfeld-scripts-preprocessed, and the second script clusters the vectors with K-Means. Finally, Mahout’s clusterdump program prints the clusters, along with the top 5 labels and episodes, to standard output.
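To give a feel for what the clustering step does, here is a minimal K-Means sketch in plain Java. It is not Mahout's implementation and does not use the Mahout API. The toy term-count vectors, the four-word vocabulary, the squared Euclidean distance, and the choice to seed the centroids with the first k points are all assumptions made for illustration; the real demo may use a different distance measure and seeding strategy.

```java
import java.util.Arrays;

/** Minimal K-Means sketch on dense vectors. Illustration only, not Mahout code. */
final class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Clusters the points into k groups and returns the cluster index of each point.
     * Assumption: centroids are seeded with the first k points.
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1); // -1 means "not assigned yet"

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int i = 0; i < dim; i++) {
                    sums[assignment[p]][i] += points[p][i];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int i = 0; i < dim; i++) {
                        centroids[c][i] = sums[c][i] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical term counts over the vocabulary {soup, nazi, puffy, shirt}.
        double[][] episodes = {
            {5, 4, 0, 0},
            {0, 0, 6, 5},
            {4, 5, 1, 0},
            {0, 1, 5, 6},
        };
        System.out.println(Arrays.toString(cluster(episodes, 2, 10))); // prints [0, 1, 0, 1]
    }
}
```

The two "soup" episodes end up in one cluster and the two "puffy shirt" episodes in the other. Mahout does the same kind of assign-and-update loop, but as distributed jobs over sparse vectors.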
Step 4 Experiment!
You can tweak some of the command line arguments passed to the Mahout jobs and see how they affect the clustering process. Additionally, you can extend the SeinfeldAnalyzer that comes with the demo at examples/src/main/java/org/apache/mahout/analysis, or create one yourself.
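If you have not written an analyzer before, the idea is simply to decide which words from a script make it into the vectors. The sketch below shows that idea in plain Java. It is not the SeinfeldAnalyzer from the demo and it does not use the Lucene Analyzer API that a real Mahout analyzer would presumably build on; the class name SimpleScriptAnalyzer, the stop word list, and the minimum token length are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for an analyzer: lowercases, tokenizes, and filters words. */
final class SimpleScriptAnalyzer {

    private final Set<String> stopWords;
    private final int minLength;

    SimpleScriptAnalyzer(Set<String> stopWords, int minLength) {
        this.stopWords = stopWords;
        this.minLength = minLength;
    }

    /** Returns the tokens that survive lowercasing, length filtering, and stop word removal. */
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a lowercase ASCII letter.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.length() >= minLength && !stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Character names appear in nearly every episode, so they are treated as stop words here.
        Set<String> stopWords = new HashSet<>(Arrays.asList(
            "the", "you", "for", "jerry", "george", "elaine", "kramer"));
        SimpleScriptAnalyzer analyzer = new SimpleScriptAnalyzer(stopWords, 3);
        System.out.println(analyzer.analyze("JERRY: No soup for you!")); // prints [soup]
    }
}
```

Filtering out the character names is the kind of tweak worth trying: words that occur in every episode tell the clustering nothing about what makes an episode distinctive.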
Enjoy!