Apache Whirr includes Mahout support

by frank, December 22, 2011

In a previous blog post I showed you how to use Apache Whirr to launch a Hadoop cluster in order to run Mahout jobs. This post shows you how to use the Mahout service from the brand new Whirr 0.7.0 release to automatically install Hadoop and the Mahout binary distribution on a cloud provider such as Amazon.

Introduction

If you are new to Apache Whirr, check out my previous blog post, which covers Whirr 0.4.0. A lot has changed since then. After several new services, bug fixes, and improvements, Whirr became a top-level Apache project, and its new version 0.7.0 was released yesterday! Over the last few weeks I worked on an Apache Mahout service for Whirr, which is included in the latest release. (Thanks to the Whirr community, and Andrei Savu in particular, for reviewing the code and helping to ship this cool feature!)

How to use the Mahout service

The Mahout service in Whirr defines the mahout-client role. This role installs the binary Mahout distribution on a given node. To use this feature, check out the sources from https://svn.apache.org/repos/asf/whirr/trunk or http://svn.apache.org/repos/asf/whirr/tags/release-0.7.0/, or clone the project with Git from http://git.apache.org/whirr.git, and build it with mvn clean install. Let me walk you through an example of how to use this on Amazon AWS.

Step 1 Create a node template

Create a file called mahout-cluster.properties and add the following:

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+mahout-client,2 hadoop-datanode+hadoop-tasktracker

whirr.provider=aws-ec2
whirr.identity=TOP_SECRET
whirr.credential=TOP_SECRET

This setup configures two Hadoop datanode / tasktracker nodes and one Hadoop namenode / jobtracker / mahout-client node. For the mahout-client role, Whirr will:

* Download the binary distribution from Apache and install it under /usr/local/mahout

* Set MAHOUT_HOME to /usr/local/mahout

* Add $MAHOUT_HOME/bin to the PATH
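Once you are logged in to the mahout-client node, the three effects listed above are easy to confirm. The following is a minimal sketch, not part of Whirr: check_mahout_env is a hypothetical helper name, and it only inspects the environment of the current shell.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Whirr): returns 0 when MAHOUT_HOME is
# set and $MAHOUT_HOME/bin appears as an entry on the PATH, 1 otherwise.
check_mahout_env() {
  if [ -z "${MAHOUT_HOME:-}" ]; then
    echo "MAHOUT_HOME is not set" >&2
    return 1
  fi
  # Wrap PATH in colons so the first and last entries match like any other.
  case ":$PATH:" in
    *":$MAHOUT_HOME/bin:"*) ;;
    *) echo "$MAHOUT_HOME/bin is not on the PATH" >&2; return 1 ;;
  esac
  echo "MAHOUT_HOME=$MAHOUT_HOME"
}
```

Calling check_mahout_env on the node should print MAHOUT_HOME=/usr/local/mahout if the role was installed as described; any other outcome suggests the login shell did not pick up the environment.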

(Optional) Configure the Mahout version and / or distribution url

By default, Whirr will download the Mahout distribution from:
http://archive.apache.org/dist/mahout/0.5/mahout-distribution-0.5.tar.gz
You can override the version by adding:
whirr.mahout.version=VERSION
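To see which tarball a version override is likely to point at, here is a small sketch. It assumes that the version is simply substituted into both places of the default URL shown above; that pattern is inferred from the 0.5 URL and is not confirmed against the Whirr sources. mahout_default_url is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the download URL for a Mahout version, ASSUMING
# the default URL pattern holds with the version substituted in both places.
# With no argument it falls back to 0.5, the default mentioned in the post.
mahout_default_url() {
  version=${1:-0.5}
  printf 'http://archive.apache.org/dist/mahout/%s/mahout-distribution-%s.tar.gz\n' \
    "$version" "$version"
}
```

It can be worth checking that the printed URL actually exists (for example with a HEAD request) before launching a cluster with a non-default version.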

You can also change the download URL entirely, which is useful if you want to test your own version of Mahout. To do so, first create a Mahout binary distribution by entering the distribution folder in your checked-out Mahout source tree and running:

$ mvn clean install -Dskip.mahout.distribution=false

Now put the tarball on a server that is reachable from the cluster and add the following line to your mahout-cluster.properties:

whirr.mahout.tarball.url=MAHOUT_TARBALL_URL
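Both overrides come down to one extra line in mahout-cluster.properties. If you switch between builds often, a tiny helper can keep the file free of duplicate keys. This is a sketch under stated assumptions: set_whirr_property is a hypothetical name, and the helper knows nothing about Whirr beyond the key=value format of the file.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a properties file. Any existing line
# containing "KEY=" is dropped first, so repeated runs leave a single entry.
# Usage: set_whirr_property FILE KEY VALUE
set_whirr_property() {
  file=$1 key=$2 value=$3
  tmp="$file.tmp.$$"
  # Copy every line that does not contain "KEY=" (fixed-string match).
  # grep exits non-zero when nothing is copied; that is fine here.
  grep -v -F -- "$key=" "$file" > "$tmp" 2>/dev/null || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Example (the URL is a placeholder, not a real tarball location):
#   set_whirr_property mahout-cluster.properties \
#     whirr.mahout.tarball.url http://example.com/mahout-distribution.tar.gz
```

Writing to a temporary file and moving it into place avoids relying on in-place editing flags, which differ between sed implementations.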

Step 2 Launch the cluster

You can now launch the cluster the regular way by running:

$ whirr launch-cluster --config mahout-cluster.properties
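Since a launch starts paid instances, a quick sanity check of the properties file beforehand can save a failed run. The sketch below only looks for the four keys used in the Step 1 template; it is not a complete list of what Whirr accepts or requires, and check_whirr_config is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if FILE defines every property
# used in the Step 1 template. Each missing key is reported on stderr.
# Usage: check_whirr_config FILE
check_whirr_config() {
  file=$1
  missing=0
  for key in whirr.instance-templates whirr.provider \
             whirr.identity whirr.credential; do
    if ! grep -q "^$key=" "$file" 2>/dev/null; then
      echo "missing property: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_whirr_config mahout-cluster.properties &&
#     whirr launch-cluster --config mahout-cluster.properties
```

The check does not validate the values themselves, so placeholder credentials such as TOP_SECRET will still pass.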

Step 3 Login & run

When the cluster is set up, run the Hadoop proxy, upload some data, SSH into the node, and voilà: you can run Mahout jobs by invoking the command-line script as you normally would, for example:

$ mahout seqdirectory --input input --output output

Enjoy!