{"id":3151,"date":"2011-04-04T18:52:18","date_gmt":"2011-04-04T16:52:18","guid":{"rendered":"http:\/\/blog.jteam.nl\/?p=3151"},"modified":"2011-04-04T18:52:18","modified_gmt":"2011-04-04T16:52:18","slug":"how-to-cluster-seinfeld-episodes-with-mahout","status":"publish","type":"post","link":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/","title":{"rendered":"How to cluster Seinfeld episodes with Mahout"},"content":{"rendered":"<p>This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. In just a few minutes you can run the Seinfeld demo on your own machine.<br \/>\n<!--more--><\/p>\n<p><b>Update 24-03-2014<\/b> The Seinfeld scripts and demo are no longer available. The original Seinfeld scripts came from <a href=\"http:\/\/www.stanthecaddy.com\">stanthecaddy.com<\/a>, which received a takedown notice for them. I decided to remove the scripts from my GitHub branch as well, so the demo described below no longer works. No soup for you! 
\ud83d\ude41<\/p>\n<h2>Step 1 Get the Mahout and demo sources from GitHub<\/h2>\n<p>First make sure you have <a href=\"http:\/\/book.git-scm.com\/2_installing_git.html\">git installed<\/a>.<\/p>\n<p>Now clone my GitHub repo, enter the mahout directory, and switch to the <code>seinfeld_demo<\/code> branch:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n$ git clone https:\/\/github.com\/frankscholten\/mahout\n$ cd mahout\n$ git checkout seinfeld_demo\n<\/pre>\n<h2>Step 2 Build the source tree<\/h2>\n<p>From the mahout directory, build the source tree with:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n$ mvn clean install -DskipTests=true\n<\/pre>\n<h2>Step 3 Cluster Seinfeld episodes<\/h2>\n<p>Now run the following two scripts:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n$ examples\/bin\/seinfeld_vectors.sh\n$ examples\/bin\/seinfeld_kmeans.sh\n<\/pre>\n<p>The first script creates vectors from the plain-text Seinfeld episodes stored in <code>examples\/src\/main\/resources\/seinfeld-scripts-preprocessed<\/code>, and the second script clusters the vectors with K-Means. Finally, Mahout&#8217;s <code>clusterdump<\/code> program prints the clusters, along with their top 5 labels and episodes, to standard output.<\/p>\n<h2>Step 4 Experiment!<\/h2>\n<p>You can tweak some of the command-line arguments passed to the Mahout jobs and see how they affect the clustering process. Additionally, you can extend the <code>SeinfeldAnalyzer<\/code> that comes with the demo at <code>examples\/src\/main\/java\/org\/apache\/mahout\/analysis<\/code> or create one yourself.<\/p>\n<p>Enjoy!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. 
In just a few minutes you can run the Seinfeld demo on your own machine.<\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[40,31,10],"tags":[236,237,14],"class_list":["post-3151","post","type-post","status-publish","format-standard","hentry","category-mahout","category-java","category-development","tag-apache-mahout","tag-clustering","tag-conference"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to cluster Seinfeld episodes with Mahout - Trifork Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to cluster Seinfeld episodes with Mahout - Trifork Blog\" \/>\n<meta property=\"og:description\" content=\"This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. 
In just a few minutes you can run the Seinfeld demo on your own machine.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/\" \/>\n<meta property=\"og:site_name\" content=\"Trifork Blog\" \/>\n<meta property=\"article:published_time\" content=\"2011-04-04T16:52:18+00:00\" \/>\n<meta name=\"author\" content=\"frank\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"frank\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/\",\"url\":\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/\",\"name\":\"How to cluster Seinfeld episodes with Mahout - Trifork Blog\",\"isPartOf\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#website\"},\"datePublished\":\"2011-04-04T16:52:18+00:00\",\"author\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f\"},\"breadcrumb\":{\"@id\":\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/trifork.nl\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to cluster Seinfeld episodes with 
Mahout\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/trifork.nl\/blog\/#website\",\"url\":\"https:\/\/trifork.nl\/blog\/\",\"name\":\"Trifork Blog\",\"description\":\"Keep updated on the technical solutions Trifork is working on!\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/trifork.nl\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f\",\"name\":\"frank\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g\",\"caption\":\"frank\"},\"url\":\"https:\/\/trifork.nl\/blog\/author\/frank\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How to cluster Seinfeld episodes with Mahout - Trifork Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/","og_locale":"en_US","og_type":"article","og_title":"How to cluster Seinfeld episodes with Mahout - Trifork Blog","og_description":"This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. 
In just a few minutes you can run the Seinfeld demo on your own machine.","og_url":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/","og_site_name":"Trifork Blog","article_published_time":"2011-04-04T16:52:18+00:00","author":"frank","twitter_card":"summary_large_image","twitter_misc":{"Written by":"frank","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/","url":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/","name":"How to cluster Seinfeld episodes with Mahout - Trifork Blog","isPartOf":{"@id":"https:\/\/trifork.nl\/blog\/#website"},"datePublished":"2011-04-04T16:52:18+00:00","author":{"@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f"},"breadcrumb":{"@id":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/trifork.nl\/blog\/how-to-cluster-seinfeld-episodes-with-mahout\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/trifork.nl\/blog\/"},{"@type":"ListItem","position":2,"name":"How to cluster Seinfeld episodes with Mahout"}]},{"@type":"WebSite","@id":"https:\/\/trifork.nl\/blog\/#website","url":"https:\/\/trifork.nl\/blog\/","name":"Trifork Blog","description":"Keep updated on the technical solutions Trifork is working 
on!","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/trifork.nl\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f","name":"frank","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g","caption":"frank"},"url":"https:\/\/trifork.nl\/blog\/author\/frank\/"}]}},"_links":{"self":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/3151","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/comments?post=3151"}],"version-history":[{"count":0,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/3151\/revisions"}],"wp:attachment":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/media?parent=3151"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/categories?post=3151"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/tags?post=3151"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}