
Running Mahout in the Cloud using Apache Whirr

June 21st, 2011 by Frank Scholten

This blog shows you how to run Mahout in the cloud using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, ZooKeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).

What is Whirr?

Apache Whirr is a set of libraries and tools that allow you to run different services in the cloud. It currently supports Amazon EC2 and Rackspace Cloud Servers as providers, and Cassandra, Hadoop, ZooKeeper, HBase, ElasticSearch and Voldemort as services.

With Whirr you can easily start these cloud services via the command line or a Java API. For example, you specify the following property file:

whirr.service-name=hadoop
whirr.cluster-name=test-cluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode, 2 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.location-id=eu-west-1
whirr.hardware-id=m1.small
whirr.image-id=eu-west-1/ami-1b9fa86f
whirr.identity=${env.AWS_ACCESS_KEY}
whirr.credential=${env.AWS_SECRET_KEY}
whirr.private-key-file=/home/frank/.ssh/id_rsa_whirr
whirr.public-key-file=/home/frank/.ssh/id_rsa_whirr.pub
whirr.cluster-user=frank

and then run the following to start a cluster.

$ whirr launch-cluster --config [path-to-property-file]

Once the cluster is started, run the proxy script in ~/.whirr/[clustername]/hadoop-proxy.sh; after that you can SSH into the cluster.

Why use Whirr with Mahout

Below are some reasons for using Whirr with Mahout specifically. The first is the principle of ‘convention over configuration’: you specify what kind of cluster you want, not the specifics of how to create it. When using Mahout you mostly want to run Hadoop jobs, so you want to start a Hadoop cluster, which Whirr supports out of the box.

The second reason is that Whirr enables transparent job submission. At startup it generates a hadoop-site.xml in ~/.whirr/ on your local machine. By pointing HADOOP_CONF_DIR to this directory you can transparently launch jobs from your local machine onto the cluster; it’s almost as if your local machine is the cluster. This transparency is of course a Hadoop feature, but Whirr enables it automatically. Compare this with running Mahout on Amazon’s Elastic MapReduce, where you have manual steps like uploading the Mahout jar and referencing the jar’s location in any subsequent command line parameters. With Whirr you can run Mahout jobs from Java without a separate API for launching jobs on Amazon. More on running Mahout jobs from Java later on.

Getting started with Whirr

Step 1 – Prerequisites

To get started you first need Amazon Web Services credentials and a Whirr installation. Check out Whirr founder Tom White’s Whirr in 5 minutes to see how to get everything up and running.

Step 2 – Overriding Hadoop properties

Whirr allows you to override Hadoop properties. Common changes are, for instance, increasing the heap space and ulimit for tasks. To do this, add the following to Whirr’s property file:

hadoop-mapreduce.mapred.child.java.opts=-Xmx1000m
hadoop-mapreduce.mapred.child.ulimit=1500000

The hadoop-mapreduce prefix specifies that these properties will be written to the mapred-site.xml file on all machines in the cluster. You can also use the hadoop-common and hadoop-hdfs prefixes to specify properties belonging to core-site.xml and hdfs-site.xml, respectively, as shown below. This additional configuration will not be added to the hadoop-site.xml on your local machine, but that’s not a problem.
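For example, properties with the other two prefixes would end up in core-site.xml and hdfs-site.xml (the values below are just illustrative, not part of my cluster setup):

hadoop-common.fs.trash.interval=1440
hadoop-hdfs.dfs.replication=2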

Step 3 – Whirr’s Java API

You can also use Whirr’s Java API to start a cluster. The snippet below shows how to launch and destroy a cluster based on a given Whirr property file.

ClusterSpec clusterSpec = new ClusterSpec(new PropertiesConfiguration(whirrConfigFile), false);
Service service = new Service();

Cluster cluster = service.launchCluster(clusterSpec);

HadoopProxy proxy = new HadoopProxy(clusterSpec, cluster);
proxy.start();

// Launch jobs

proxy.stop();
service.destroyCluster(clusterSpec);

The ClusterSpec constructor loads the specified Whirr property file, launchCluster starts the cluster and proxy.start() launches the Hadoop proxy. The Hadoop proxy is needed to access the cluster and the JobTracker or NameNode UI at http://[jobtracker]:50030 and http://[namenode]:50070, respectively.

Step 4 – Building the Mahout job jar

Before we can run jobs we need to build the Mahout job jar. A job jar is a Hadoop convention: a jar that contains a lib folder with job-specific dependencies. The Mahout job jar has Lucene as one of its dependencies, for example. When you submit a job, Hadoop looks for a jar on the classpath and ships it to the cluster. If your job jar is not found it might select a normal jar without the dependencies, and you will get ClassNotFoundExceptions. To build the Mahout example job jar, run the following commands:

$ svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
$ cd mahout
$ mvn clean install -DskipTests=true

If you want to run a Mahout job from your IDE you need to put the Mahout job jar on the classpath. In IntelliJ you can add the job jar via ‘Project Structure’ > ‘Dependencies’ > ‘Add Single-Entry Module Library’. If you don’t add the correct job jar to the classpath you will get the following warning:

“No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String)”
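If you still see this warning when launching from Java, you can also point Hadoop at the job jar explicitly instead of relying on classpath detection. Below is a minimal sketch; the jar path is hypothetical, and mapred.jar is the property that JobConf#setJar(String) sets under the hood. The configuration variable is the Hadoop Configuration object pointing at the cluster, created in step 6 below.

// Hypothetical location of the job jar built with the commands above
String jobJar = "/home/frank/mahout/examples/target/mahout-examples-0.5-job.jar";

// Tell Hadoop explicitly which jar to ship to the cluster for this job
configuration.set("mapred.jar", jobJar);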

Step 5 – Uploading data

Use the snippet below to create a directory on HDFS and upload data to the cluster.

$ export HADOOP_CONF_DIR=~/.whirr/cluster-name
$ hadoop fs -mkdir input
$ hadoop fs -put seinfeld-scripts input

The commands above upload the Seinfeld dataset from one of my earlier blogs; any other text corpus will work just as well.
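If you prefer to do this from Java instead of the command line, here is a minimal sketch using Hadoop’s FileSystem API; the local directory is hypothetical and configuration is the cluster-aware Configuration object built in the next step.

// FileSystem and Path come from org.apache.hadoop.fs
FileSystem fs = FileSystem.get(configuration);

// Create the input directory on HDFS
fs.mkdirs(new Path("input"));

// Copy the local dataset into the input directory on HDFS
fs.copyFromLocalFile(new Path("/home/frank/seinfeld-scripts"), new Path("input"));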

Step 6 – Loading Hadoop Configuration in Java

The next step is to load a Hadoop Configuration object that points to your cluster. Use the following:

PropertiesConfiguration props = new PropertiesConfiguration(whirrConfigFile);
String clusterName = props.getString("whirr.cluster-name");

Configuration configuration = new Configuration();
configuration.addResource(new Path(System.getProperty("user.home"), ".whirr/" + clusterName + "/" + "hadoop-site.xml"));

This configuration object can now be passed into your Mahout jobs.

Step 7 – Run!

The cluster is running and the job jar is built, so you can now run a job via your IDE or the command line. To run from Java you can use Hadoop’s ToolRunner to run Mahout’s driver classes. The arguments to ToolRunner.run are the Configuration object loaded with values from Whirr, the driver to run, and the parameters of the job as a String[]:

String[] seq2SparseParams = new String[] {
        "--input", textOutputPath.toString(),
        "--output", sparseOutputPath,
        "--weight", "TFIDF",
        "--norm", "2",
        "--maxNGramSize", "2",
        "--namedVector",
        "--maxDFPercent", "50",
};

ToolRunner.run(configuration, new SparseVectorsFromSequenceFiles(), seq2SparseParams);

Or you can run Mahout from the command line. Enjoy!

P.S. Don’t forget to shut down your cluster 😉

$ whirr destroy-cluster --config [path-to-property-file]

4 Responses

  1. August 4, 2011 at 15:34 by Grant

    This would make for a good patch!

  2. August 28, 2011 at 18:07 by Frank Scholten

    Hi Grant,

    What kind of patch did you have in mind exactly? The configuration loading code from step 7?

  3. August 28, 2011 at 22:32 by Grant

    Frank,

    Why not make it one of Whirr’s supported technologies, like Hadoop, ElasticSearch, etc.? Then it would have native support for Mahout.

  4. […] Running Mahout in the Cloud using Apache Whirr […]