{"id":3225,"date":"2011-06-21T12:27:40","date_gmt":"2011-06-21T10:27:40","guid":{"rendered":"http:\/\/blog.jteam.nl\/?p=3225"},"modified":"2011-06-21T12:27:40","modified_gmt":"2011-06-21T10:27:40","slug":"running-mahout-in-the-cloud-using-apache-whirr","status":"publish","type":"post","link":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/","title":{"rendered":"Running Mahout in the Cloud using Apache Whirr"},"content":{"rendered":"<p>This blog shows you how to run Mahout in the cloud, using <a href=\"http:\/\/incubator.apache.org\/whirr\/\">Apache Whirr<\/a>. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, Zookeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr&#8217;s Java API (version 0.4).<\/p>\n<p><!--more--><\/p>\n<h2>What is Whirr?<\/h2>\n<p>Apache Whirr is a set of libraries and tools that allow you to run different services in the cloud. It currently supports Amazon EC2 and Rackspace Cloud Servers as providers, and Cassandra, Hadoop, Zookeeper, HBase, ElasticSearch and Voldemort as services.<\/p>\n<p>With Whirr you can easily start these cloud services via the command line or a Java API. 
For example, you specify the following property file:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nwhirr.service-name=hadoop\nwhirr.cluster-name=test-cluster\nwhirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode, 2 hadoop-datanode+hadoop-tasktracker\nwhirr.provider=aws-ec2\nwhirr.location-id=eu-west-1\nwhirr.hardware-id=m1.small\nwhirr.image-id=eu-west-1\/ami-1b9fa86f\nwhirr.identity=${env.AWS_ACCESS_KEY}\nwhirr.credential=${env.AWS_SECRET_KEY}\nwhirr.private-key-file=\/home\/frank\/.ssh\/id_rsa_whirr\nwhirr.public-key-file=\/home\/frank\/.ssh\/id_rsa_whirr.pub\nwhirr.cluster-user=frank\n<\/pre>\n<p>and then run the following to start a cluster:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n$ whirr launch-cluster --config &#x5B;path-to-property-file]\n<\/pre>\n<p>When the cluster is started you have to run the proxy script in <code>~\/.whirr\/[clustername]\/hadoop-proxy.sh<\/code>; then you can SSH into the cluster.<\/p>\n<h2>Why use Whirr with Mahout?<\/h2>\n<p>Below are some reasons for using Whirr with Mahout specifically. The first is the principle of &#8216;convention over configuration&#8217;: you specify what kind of cluster you want, not the specifics of how to create it. When using Mahout you mostly want to run Hadoop jobs, so you want to start a Hadoop cluster, which is supported out-of-the-box by Whirr.<\/p>\n<p>The second reason is that Whirr enables transparent job submission. It generates a <code>hadoop-site.xml<\/code> in <code>~\/.whirr\/<\/code> on your local machine at startup. By pointing <code>HADOOP_CONF_DIR<\/code> to this directory you can transparently launch jobs from your local machine to the cluster. 
It&#8217;s almost as if your local machine is the cluster. This transparency is of course a Hadoop feature, but Whirr enables it automatically. Compare this with running Mahout on Amazon&#8217;s Elastic MapReduce, where you have manual steps like uploading the Mahout jar and referencing the jar&#8217;s location in any subsequent command line parameters. With Whirr you can run Mahout jobs from Java without a separate API for launching jobs on Amazon. More on running Mahout jobs from Java later on.<\/p>\n<h2>Getting started with Whirr<\/h2>\n<h3>Step 1 &#8211; Prerequisites<\/h3>\n<p>To get started you first need Amazon Web Services credentials and a Whirr installation. Check out Whirr founder Tom White&#8217;s <a href=\"http:\/\/www.lexemetech.com\/2011\/04\/whirr-in-5-minutes.html\">Whirr in 5 minutes<\/a> to see how to get everything up and running.<\/p>\n<h3>Step 2 &#8211; Overriding Hadoop properties<\/h3>\n<p>Whirr allows you to <a href=\"https:\/\/issues.apache.org\/jira\/browse\/WHIRR-55\">override Hadoop properties<\/a>. Common changes are, for instance, increasing the heap space and ulimit for tasks. To specify this, add the following to Whirr&#8217;s property file:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nhadoop-mapreduce.mapred.child.java.opts=-Xmx1000m\nhadoop-mapreduce.mapred.child.ulimit=1500000\n<\/pre>\n<p>The prefix <code>hadoop-mapreduce<\/code> specifies that these properties will be written to the <code>mapred-site.xml<\/code> file on all machines in the cluster. You can also use the <code>hadoop-common<\/code> and <code>hadoop-hdfs<\/code> prefixes to specify properties belonging to <code>core-site.xml<\/code> and <code>hdfs-site.xml<\/code>, respectively. 
This additional configuration will not be added to the <code>hadoop-site.xml<\/code> on your local machine, but that&#8217;s not a problem.<\/p>\n<h3>Step 3 &#8211; Whirr&#8217;s Java API<\/h3>\n<p>You can also use Whirr&#8217;s Java API to start a cluster. The snippet below shows how to launch and destroy a cluster based on a given Whirr property file.<\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">\nClusterSpec clusterSpec = new ClusterSpec(new PropertiesConfiguration(whirrConfigFile), false);\nService service = new Service();\n\nCluster cluster = service.launchCluster(clusterSpec);\n\nHadoopProxy proxy = new HadoopProxy(clusterSpec, cluster);\nproxy.start();\n\n\/\/ Launch jobs\n\nproxy.stop();\nservice.destroyCluster(clusterSpec);\n<\/pre>\n<p>The snippet loads the specified Whirr property file, launches the cluster and starts the Hadoop proxy. The proxy is needed to access the cluster and the JobTracker and NameNode web UIs at http:\/\/[jobtracker]:50030 and http:\/\/[namenode]:50070, respectively.<\/p>\n<h3>Step 4 &#8211; Building the Mahout job jar<\/h3>\n<p>Before we can run jobs we need to build the Mahout job jar. A job jar is a Hadoop convention: a jar with a lib folder that contains the job-specific dependencies. The Mahout job jar has Lucene as one of its dependencies, for example. When you submit a job, Hadoop will look for a jar on the classpath and submit it to the cluster. If your job jar is not found it might select a normal jar <strong>without<\/strong> the dependencies and you will get <code>ClassNotFoundException<\/code>s. 
To build the Mahout example job jar, run the following commands:<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n$ svn co http:\/\/svn.apache.org\/repos\/asf\/mahout\/trunk mahout\n$ cd mahout\n$ mvn clean install -DskipTests=true\n<\/pre>\n<p>If you want to run a Mahout job from your IDE you need to put the <strong>Mahout job jar<\/strong> on the classpath. In IntelliJ you can add the job jar via &#8216;Project Structure&#8217; &gt; &#8216;Dependencies&#8217; &gt; &#8216;Add Single-Entry Module Library&#8217;. If you don&#8217;t add the correct job jar to the classpath you get the following warning:<\/p>\n<blockquote><p>\n&#8220;No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String)&#8221;<\/p><\/blockquote>\n<h3>Step 5 &#8211; Uploading data<\/h3>\n<p>Use the snippet below to create a directory on HDFS and upload data to the cluster, for instance the Seinfeld dataset from <a href=\"http:\/\/blog.jteam.nl\/2011\/04\/04\/how-to-cluster-seinfeld-episodes-with-mahout\/\">one of my earlier blogs<\/a>.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n$ export HADOOP_CONF_DIR=~\/.whirr\/cluster-name\n$ hadoop fs -mkdir input\n$ hadoop fs -put seinfeld-scripts input\n<\/pre>\n<h3>Step 6 &#8211; Loading Hadoop Configuration in Java<\/h3>\n<p>The next step is to load a Hadoop <code>Configuration<\/code> object that points to your cluster. 
Use the following:<\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">\nPropertiesConfiguration props = new PropertiesConfiguration(whirrConfigFile);\nString clusterName = props.getString(&quot;whirr.cluster-name&quot;);\n\nConfiguration configuration = new Configuration();\nconfiguration.addResource(new Path(System.getProperty(&quot;user.home&quot;), &quot;.whirr\/&quot; + clusterName + &quot;\/hadoop-site.xml&quot;));\n<\/pre>\n<p>This configuration object can now be passed into your Mahout jobs.<\/p>\n<h3>Step 7 &#8211; Run!<\/h3>\n<p>The cluster is running and the job jar is built, so you can now run a job via your IDE or the command line. To run from Java you can use <code>ToolRunner<\/code> to run Mahout&#8217;s <code>Driver<\/code> classes. The arguments to <code>ToolRunner<\/code> are the <code>Configuration<\/code> object loaded with values from Whirr and the parameters of the job in a <code>String[]<\/code>.<\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">\nString&#x5B;] seq2SparseParams = new String&#x5B;] {\n    &quot;--input&quot;, textOutputPath.toString(),\n    &quot;--output&quot;, sparseOutputPath,\n    &quot;--weight&quot;, &quot;TFIDF&quot;,\n    &quot;--norm&quot;, &quot;2&quot;,\n    &quot;--maxNGramSize&quot;, &quot;2&quot;,\n    &quot;--namedVector&quot;,\n    &quot;--maxDFPercent&quot;, &quot;50&quot;,\n};\n\nToolRunner.run(configuration, new SparseVectorsFromSequenceFiles(), seq2SparseParams);\n<\/pre>\n<p>Or you can run Mahout from the command line. Enjoy!<\/p>\n<p>P.S. 
Don&#8217;t forget to shut down your cluster \ud83d\ude09<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\n$ whirr destroy-cluster --config &#x5B;path-to-property-file]\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, Zookeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[40,15],"tags":[42,251,236,115,252,74,158,253],"class_list":["post-3225","post","type-post","status-publish","format-standard","hentry","category-mahout","category-enterprise-search","tag-apache","tag-apache-hadoop","tag-apache-mahout","tag-cloud","tag-cluster","tag-hadoop","tag-recommendations","tag-whirr"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Running Mahout in the Cloud using Apache Whirr - Trifork Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Running Mahout in the Cloud using Apache Whirr - Trifork Blog\" \/>\n<meta property=\"og:description\" content=\"This blog shows you how to run Mahout in the cloud, using Apache Whirr. 
Apache Whirr is a promosing Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, Hbase, Zookeeper and so on. I will show you how to setup a Hadoop cluster and run Mahout jobs both via the command line [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/\" \/>\n<meta property=\"og:site_name\" content=\"Trifork Blog\" \/>\n<meta property=\"article:published_time\" content=\"2011-06-21T10:27:40+00:00\" \/>\n<meta name=\"author\" content=\"frank\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"frank\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/\",\"url\":\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/\",\"name\":\"Running Mahout in the Cloud using Apache Whirr - Trifork 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#website\"},\"datePublished\":\"2011-06-21T10:27:40+00:00\",\"author\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f\"},\"breadcrumb\":{\"@id\":\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/trifork.nl\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Running Mahout in the Cloud using Apache Whirr\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/trifork.nl\/blog\/#website\",\"url\":\"https:\/\/trifork.nl\/blog\/\",\"name\":\"Trifork Blog\",\"description\":\"Keep updated on the technical solutions Trifork is working on!\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/trifork.nl\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f\",\"name\":\"frank\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g\",\"caption\":\"frank\"},\"url\":\"https:\/\/trifork.nl\/blog\/author\/frank\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Running Mahout in the Cloud using Apache Whirr - Trifork Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/","og_locale":"en_US","og_type":"article","og_title":"Running Mahout in the Cloud using Apache Whirr - Trifork Blog","og_description":"This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promosing Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, Hbase, Zookeeper and so on. I will show you how to setup a Hadoop cluster and run Mahout jobs both via the command line [&hellip;]","og_url":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/","og_site_name":"Trifork Blog","article_published_time":"2011-06-21T10:27:40+00:00","author":"frank","twitter_card":"summary_large_image","twitter_misc":{"Written by":"frank","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/","url":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/","name":"Running Mahout in the Cloud using Apache Whirr - Trifork Blog","isPartOf":{"@id":"https:\/\/trifork.nl\/blog\/#website"},"datePublished":"2011-06-21T10:27:40+00:00","author":{"@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f"},"breadcrumb":{"@id":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/trifork.nl\/blog\/running-mahout-in-the-cloud-using-apache-whirr\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/trifork.nl\/blog\/"},{"@type":"ListItem","position":2,"name":"Running Mahout in the Cloud using Apache Whirr"}]},{"@type":"WebSite","@id":"https:\/\/trifork.nl\/blog\/#website","url":"https:\/\/trifork.nl\/blog\/","name":"Trifork Blog","description":"Keep updated on the technical solutions Trifork is working 
on!","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/trifork.nl\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/00fad6c5829f6770345f23ccace2e54f","name":"frank","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/5c39a948f2b70fa900b25dc79cde8643?s=96&d=mm&r=g","caption":"frank"},"url":"https:\/\/trifork.nl\/blog\/author\/frank\/"}]}},"_links":{"self":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/3225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/comments?post=3225"}],"version-history":[{"count":0,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/3225\/revisions"}],"wp:attachment":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/media?parent=3225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/categories?post=3225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/tags?post=3225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}