Running Mahout in the Cloud using Apache Whirr

by frank, June 21, 2011

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, ZooKeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs, both via the command line and via Whirr’s Java API (version 0.4).

What is Whirr?

Apache Whirr is a set of libraries and tools that allow you to run different services in the cloud. It currently has support for Amazon EC2 and Rackspace Cloud Servers. Services it currently supports are: Cassandra, Hadoop, ZooKeeper, HBase, ElasticSearch and Voldemort.

With Whirr you can easily start these cloud services via the command line or a Java API. For example, you specify the following property file:

whirr.service-name=hadoop
whirr.cluster-name=test-cluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode, 2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.location-id=eu-west-1
whirr.hardware-id=m1.small
whirr.image-id=eu-west-1/ami-1b9fa86f
whirr.identity=${env.AWS_ACCESS_KEY}
whirr.credential=${env.AWS_SECRET_KEY}
whirr.private-key-file=/home/frank/.ssh/id_rsa_whirr
whirr.public-key-file=/home/frank/.ssh/id_rsa_whirr.pub
whirr.cluster-user=frank

and then run the following to start a cluster.

$ whirr launch-cluster --config [path-to-property-file]

Once the cluster is started, run the proxy script in ~/.whirr/[clustername]/hadoop-proxy.sh; after that you can SSH into the cluster.

Why use Whirr with Mahout?

Below are some reasons for using Whirr with Mahout specifically. The first is the principle of ‘convention over configuration’. This means that you specify what kind of cluster you want, not the specifics of how to create it. When using Mahout you mostly want to run Hadoop jobs, so you want to start a Hadoop cluster, which is supported out-of-the-box by Whirr.

The second reason is that Whirr enables transparent job submission. It generates a hadoop-site.xml in ~/.whirr/ on your local machine at startup. By pointing HADOOP_CONF_DIR to this directory you can transparently launch jobs from your local machine on the cluster. It’s almost as if your local machine is the cluster. This transparency is of course a Hadoop feature, but Whirr enables it automatically. Compare this with running Mahout on Amazon’s Elastic MapReduce, where you have manual steps like uploading the Mahout jar and referencing the jar’s location in any subsequent command line parameters. With Whirr you can run Mahout jobs from Java without a separate API for launching jobs on Amazon. More on running Mahout jobs from Java later on.

Getting started with Whirr

Step 1 – Prerequisites

To get started you first need Amazon Web Services credentials and a Whirr installation. Check out Whirr founder Tom White’s Whirr in 5 minutes to see how to get everything up and running.

Step 2 – Overriding Hadoop properties

Whirr allows you to override Hadoop properties. Common changes are, for instance, increasing the heap space and ulimit for tasks. To specify these, add the following to Whirr’s property file:

hadoop-mapreduce.mapred.child.java.opts=-Xmx1000m
hadoop-mapreduce.mapred.child.ulimit=1500000

The prefix hadoop-mapreduce specifies that these properties will be written to the mapred-site.xml file on all machines in the cluster. You can also use the hadoop-common and hadoop-hdfs prefixes to specify properties belonging to core-site.xml and hdfs-site.xml, respectively. This additional configuration will not be added to the hadoop-site.xml on your local machine, but that’s not a problem.
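For example, the following lines (the values are only illustrative, not part of the setup above) would end up in core-site.xml and hdfs-site.xml on the cluster machines:

hadoop-common.io.file.buffer.size=65536
hadoop-hdfs.dfs.replication=2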

Step 3 – Whirr’s Java API

You can also use Whirr’s Java API to start a cluster. The snippet below shows how to launch and destroy a cluster based on a given Whirr property file.

ClusterSpec clusterSpec = new ClusterSpec(new PropertiesConfiguration(whirrConfigFile), false);
Service service = new Service();

Cluster cluster = service.launchCluster(clusterSpec);

HadoopProxy proxy = new HadoopProxy(clusterSpec, cluster);
proxy.start();

// Launch jobs

proxy.stop();
service.destroyCluster(clusterSpec);

The ClusterSpec constructor loads the specified Whirr property file; the service then launches the cluster and the Hadoop proxy. The Hadoop proxy is needed to access the cluster and to reach the JobTracker and NameNode web UIs at http://[jobtracker]:50030 and http://[namenode]:50070, respectively.

Step 4 – Building the Mahout job jar

Before we can run jobs we need to build the Mahout job jar. A job jar is a Hadoop convention: a jar that contains a lib folder with job-specific dependencies. The Mahout job jar has Lucene as one of its dependencies, for example. When you submit a job, Hadoop looks for a jar on the classpath and submits it to the cluster. If your job jar is not found, it might select a normal jar without the dependencies and you will get ClassNotFoundExceptions. To build the Mahout example job jar, run the following commands:

$ svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
$ cd mahout
$ mvn clean install -DskipTests=true

If you want to run a Mahout job from your IDE you need to put the Mahout job jar on the classpath. In IntelliJ you can add the job jar via ‘Project Structure’ > ‘Dependencies’ > ‘Add Single-Entry Module Library’. If you don’t add the correct job jar to the classpath you get the following warning:

“No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String)”
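If tweaking the IDE classpath does not solve it, an alternative sketch is to point the job configuration at the job jar explicitly, which should have the same effect as JobConf#setJar(String). The path below is only an example; use the job jar produced by your own Mahout build, and the configuration object from the step on loading the Hadoop configuration below.

// Tell Hadoop explicitly which jar to ship to the cluster; equivalent to JobConf#setJar(String).
// The path is an example, adjust it to your Mahout version and build output.
configuration.set("mapred.jar", "/path/to/mahout/examples/target/mahout-examples-0.5-job.jar");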

Step 5 – Uploading data

Use the snippet below to create a directory on HDFS and upload data to the cluster.

$ export HADOOP_CONF_DIR=~/.whirr/cluster-name
$ hadoop fs -mkdir input
$ hadoop fs -put seinfeld-scripts input

The seinfeld-scripts directory in this example contains the Seinfeld dataset from one of my earlier blogs.

Step 6 – Loading the Hadoop Configuration in Java

The next step is to load a Hadoop Configuration object that points to your cluster. Use the following:

PropertiesConfiguration props = new PropertiesConfiguration(whirrConfigFile);
String clusterName = props.getString("whirr.cluster-name");

Configuration configuration = new Configuration();
configuration.addResource(new Path(System.getProperty("user.home"), ".whirr/" + clusterName + "/" + "hadoop-site.xml"));

This configuration object can now be passed into your Mahout jobs.
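As a quick sanity check you can print the default file system of this configuration; it should point at the NameNode of your cluster rather than at your local file system:

// Prints something like hdfs://[namenode]:8020 when the configuration points at the cluster,
// and file:/// when it falls back to the local defaults.
FileSystem fs = FileSystem.get(configuration);
System.out.println(fs.getUri());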

Step 7 – Run!

The cluster is running and the job jar is built, so you can now run a job from your IDE or from the command line. To run from Java you can use ToolRunner to run Mahout’s driver classes. The arguments to ToolRunner are the Configuration object loaded with values from Whirr and the parameters of the job as a String[]:

String[] seq2SparseParams = new String[] {
"--input", textOutputPath.toString(),
"--output", sparseOutputPath,
"--weight", "TFIDF",
"--norm", "2",
"--maxNGramSize", "2",
"--namedVector",
"--maxDFPercent", "50",
};

ToolRunner.run(configuration, new SparseVectorsFromSequenceFiles(), seq2SparseParams);

Or you can run Mahout from the command line. Enjoy!

P.S. Don’t forget to shut down your cluster 😉

$ whirr destroy-cluster --config [path-to-property-file]