Trifork Blog

Category ‘Apache Mahout’

An Introduction To Mahout's Logistic Regression SGD Classifier

February 4th, 2014 by
(http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/)

This blog features classification in Mahout and the underlying concepts. I will explain the basic classification process, training a Logistic Regression model with Stochastic Gradient Descent, and give a walkthrough of classifying the Iris flower dataset with Mahout.

Read the rest of this entry »

Berlin Buzzwords 2012 Recap

June 7th, 2012 by
(http://blog.trifork.com/2012/06/07/berlin-buzzwords-2012-recap/)

This is a recap of Berlin Buzzwords 2012, the two-day conference on everything scale, search and store in the NoSQL world. Martijn van Groningen and I arrive in Berlin Sunday evening. Unfortunately we are too late for the infamous Barcamp, a low-key mix of lightning talks, beer and socializing, so we decide to have a beer at the hotel instead.

Day 1

So let's fast forward to Monday morning. Berlin Buzzwords kicks off with a record of over 700 delegates from all over the world this year. The first keynote session is by Leslie Hawthorn on community building. Using gardening as a metaphor, she talks us through slides of gardening tools, plucking weeds and landscaping as activities for building a healthy community. Her advice: display all the 'paths to success' on the site, and nip unproductive mailing list discussions in the bud using a project 'mission statement'. My mind wanders off as I think about how I can apply this to Apache Whirr. Then I think about the upcoming release that I still want to do testing for. "Community building is hard work", she says. Point taken.

Storm and Hadoop

Ted Dunning gives an entertaining talk on closing the gap between the real-time features of Storm and the batch nature of Hadoop. He also talks about an approach called Bayesian Bandits for doing real-time A/B testing, while casually performing a coin trick to explain the statistics behind it.

Socializing, coding & the geek-play area

During the rest of the day I visit lots of talks, but I also occasionally socialize and hang out with my peers. In the hallway I bump into old friends and meet many new people from the open source community. At other times we wander outside to the geek-and-play area, watching people play table tennis and enjoying the inspiring and laid-back vibe of this pretty unique conference.

Occasionally I feel an urge to do some coding. "You should get a bigger screen", a fellow Buzzworder says as he points at my laptop. I snap out of my coding trance and realize I'm hunched over my laptop, peering at the screen from up close. I'm like most people at the conference, multi-tasking between listening to a talk, intense coding, and sending out a #bbuzz tweet. There are just so many inspiring things grabbing your attention that I can hardly keep up with them.

Mahout at Berlin Buzzwords

Almost every developer from the Mahout development team hangs out at Buzzwords. As I give my talk, Robin Anil and Grant Ingersoll type up last-minute patches and close the final issues for the 0.7 Mahout release. On stage I discuss Whirr's Mahout service, which automatically installs Mahout on a cluster. In hindsight the title didn't fit that well, as I mostly talked about Whirr, not machine learning with Mahout. Furthermore, my talk finished way earlier than expected; bad Frank. Note to self: better planning next time.

Day 2

Elasticsearch

Day two features quite a few talks on ElasticSearch. In one session Shay Banon, ElasticSearch's founder, talks about how his framework handles 'Big Data' and explains design principles like shard overallocation, the performance penalties of splitting your index, and so on. In a different session Lukáš Vlček and Karel Minarik fire up an ElasticSearch cluster during their talk. The big screen updates continuously with stats on every node in the cluster. The crowd cracks up laughing as Clinton Gormley suddenly joins the live cluster using his phone and laptop.

New directions in Mahout

In the early afternoon Simon Willnauer announced that one of the talks had to be cancelled; luckily, Ted Dunning could fill the spot by giving another talk, this time on the future of Mahout. He discussed the upcoming 0.7 release, which is largely a clean-up and refactoring effort, as well as two future contributions: pig-vector and the streaming K-means clustering algorithm.

The idea of pig-vector is to create a glue layer between Pig and Mahout. Currently you run Mahout by first transforming your data, say a directory of text files, to a format that Mahout can read, and then you run the actual Mahout algorithms. The goal of pig-vector is to use Pig to shoehorn your data into a shape Mahout can use. The benefit of using Pig is that it is designed to read data from a lot of sources, so Mahout can focus on the actual machine learning.

The upcoming streaming K-means clustering algorithm looks very promising. "No knobs, the system adapts to the data", he says, referring to the problem that existing Mahout algorithms require a lot of parameter tweaking. On top of that, the algorithm is blazingly fast, using clever tricks like projection search. Even though the original K-means algorithm is easily parallelizable using MapReduce, the downside is that it is iterative and has to make several passes over the data, which is very inefficient for large datasets.

See you next year!

This wraps up my coverage of Berlin Buzzwords. There were of course many more talks; I just covered my personal area of interest, which is largely Mahout. The only downside was the weather in Berlin this year, but other than that I really enjoyed the conference. Many thanks to everyone involved in organizing it, and roll on 2013!

Using your Lucene index as input to your Mahout job - Part I

March 5th, 2012 by
(http://blog.trifork.com/2012/03/05/using-your-lucene-index-as-input-to-your-mahout-job-part-i/)

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program (https://issues.apache.org/jira/browse/MAHOUT-944). This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and a MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia articles.

Introduction

When working with Mahout text clustering or classification you first preprocess your data so Mahout can understand it. Mahout contains input tools such as seqdirectory and seqemailarchives for fetching data from different input sources and transforming it into text sequence files. The resulting sequence files are then fed into seq2sparse to create Mahout vectors. Finally, you can run one of Mahout's algorithms on these vectors to do text clustering.
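As a rough illustration, such a pipeline might look like the following on the command line (a sketch only: the paths and the choice of k-means with 20 clusters are assumptions for the example, and exact flags can differ between Mahout versions):

# turn a directory of text files into SequenceFiles
$ bin/mahout seqdirectory -i /path/to/textfiles -o text-seqfiles

# turn the SequenceFiles into sparse TF-IDF vectors
$ bin/mahout seq2sparse -i text-seqfiles -o text-vectors

# cluster the vectors, here with k-means as an example
$ bin/mahout kmeans -i text-vectors/tfidf-vectors -c text-centroids -o text-kmeans -k 20 -x 10 -cl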

The lucene2seq program

Recently a new input tool has been added, lucene2seq, which allows you to read from the stored fields of a Lucene index to create text sequence files. This is different from the existing lucene.vector program, which reads term vectors from a Lucene index and transforms them into Mahout vectors straight away. When using the original text content you can take full advantage of Mahout's collocation identification algorithm, which improves clustering results.

Let's look at the lucene2seq program in more detail by running

$ bin/mahout lucene2seq --help

This will print out all the program's options.

Job-Specific Options:                                                           
  --output (-o) output       The directory pathname for output.                 
  --dir (-d) dir             The Lucene directory                               
  --idField (-i) idField     The field in the index containing the id           
  --fields (-f) fields       The stored field(s) in the index containing text   
  --query (-q) query         (Optional) Lucene query. Defaults to               
                             MatchAllDocsQuery                                  
  --maxHits (-n) maxHits     (Optional) Max hits. Defaults to 2147483647        
  --method (-xm) method      The execution method to use: sequential or         
                             mapreduce. Default is mapreduce                    
  --help (-h)                Print out help                                     
  --tempDir tempDir          Intermediate output directory                      
  --startPhase startPhase    First phase to run                                 
  --endPhase endPhase        Last phase to run

The required parameters are the Lucene directory path(s), the output path, the id field and the list of stored fields. The tool fetches all documents and creates a key-value pair for each one, where the key equals the value of the id field and the value equals the concatenated values of the stored fields. The optional parameters are a Lucene query, a maximum number of hits and the execution method, sequential or MapReduce. The tool can be run like any other Mahout tool.
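For example, a minimal sequential run against a local index could look like this (the index path and the field names docid, title and body are assumptions for illustration):

$ bin/mahout lucene2seq -d /path/to/index -o /path/to/output -i docid -f title,body -xm sequential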

Converting an index of Wikipedia articles to sequence files

To demonstrate lucene2seq we will convert an index of Wikipedia articles to sequence files. Check out the Lucene 3.x branch, download part of the Wikipedia articles dump and run a benchmark algorithm to create an index of the articles in the dump.

$ svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x lucene_3x
$ cd lucene_3x/lucene/contrib/benchmark
$ mkdir temp work
$ cd temp
$ wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
$ bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2

The next step is to run a benchmark 'algorithm' to index the Wikipedia dump. Contrib benchmark contains several of these algorithms in the conf directory. For this demo we only index a small part of the Wikipedia dump, so edit the conf/wikipediaOneRound.alg file so it points to enwiki-latest-pages-articles1.xml-p000000010p000010000. For an overview of the syntax of these benchmarking algorithms, check out the benchmark.byTask package-summary Javadocs.

Now it's time to create the index

$ cd ..
$ ant run-task -Dtask.alg=conf/wikipediaOneRound.alg -Dtask.mem=2048M

The next step is to run lucene2seq on the generated index under work/index. Check out the lucene2seq branch from GitHub

$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout lucene2seq
$ mvn clean install -DskipTests=true

Change back to the lucene 3x contrib/benchmark work dir and run

$ <path/to>/bin/mahout lucene2seq -d index -o wikipedia-seq -i docid -f title,body -q 'body:java' -xm sequential

This creates sequence files of all documents that contain the term 'java'. From here you can run seq2sparse followed by a clustering algorithm to cluster the text contents of the articles.
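For instance, the follow-up could look something like this (a sketch only: the output paths, k=20 and the 10 iterations are arbitrary picks for illustration, and flags can vary a little between Mahout versions):

# create sparse TF-IDF vectors from the sequence files
$ <path/to>/bin/mahout seq2sparse -i wikipedia-seq -o wikipedia-vectors

# cluster the vectors with k-means; -k seeds 20 random clusters, -x caps the iterations,
# -cl writes the document-to-cluster assignments
$ <path/to>/bin/mahout kmeans -i wikipedia-vectors/tfidf-vectors -c wikipedia-centroids -o wikipedia-kmeans -k 20 -x 10 -cl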

Running the sequential version in Java

The lucene2seq program can also be run from Java. First create a LuceneStorageConfiguration bean and pass in the list of index paths, the sequence files output path, the id field and the list of stored fields in the constructor.

LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(configuration, asList(index), seqFilesOutputPath, "id", asList("title", "description"));

You can then optionally set a Lucene query and max hits via setters

luceneStorageConf.setQuery(new TermQuery(new Term("body", "Java")));
luceneStorageConf.setMaxHits(10000);

Now you can run the tool by calling the run method with the configuration as a parameter

LuceneIndexToSequenceFiles lucene2seq = new LuceneIndexToSequenceFiles();
lucene2seq.run(luceneStorageConf);

Conclusions

In this post I showed you how to use lucene2seq on an index of Wikipedia articles. I hope this tool will make it easier for you to start using Mahout text clustering. In a future blog post I will discuss how to run the MapReduce version on very large indexes. Feel free to post comments or feedback below.

Apache Whirr includes Mahout support

December 22nd, 2011 by
(http://blog.trifork.com/2011/12/22/apache-whirr-includes-mahout-support/)

In a previous blog I showed you how to use Apache Whirr to launch a Hadoop cluster in order to run Mahout jobs. This blog shows you how to use the Mahout service from the brand new Whirr 0.7.0 release to automatically install Hadoop and the Mahout binary distribution on a cloud provider such as Amazon.

Introduction

If you are new to Apache Whirr, check out my previous blog, which covers Whirr 0.4.0. A lot has changed since then. After several new services, bug fixes and improvements, Whirr became a top-level Apache project, with its new version 0.7.0 released yesterday! Over the last weeks I worked on an Apache Mahout service for Whirr, which is included in the latest release. (Thanks to the Whirr community, and Andrei Savu in particular, for reviewing the code and helping to ship this cool feature!)

How to use the Mahout service

The Mahout service in Whirr defines the mahout-client role. This role installs the binary Mahout distribution on a given node. To use this feature, check out the sources from https://svn.apache.org/repos/asf/whirr/trunk or http://svn.apache.org/repos/asf/whirr/tags/release-0.7.0/, or clone the project with Git from http://git.apache.org/whirr.git, and build it with mvn clean install. Let me walk you through an example of how to use this on Amazon AWS.
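For example, building from the 0.7.0 release tag could look like this (the target directory name is just an example):

$ svn checkout http://svn.apache.org/repos/asf/whirr/tags/release-0.7.0/ whirr-0.7.0
$ cd whirr-0.7.0
$ mvn clean install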

Step 1 Create a node template

Create a file called mahout-cluster.properties and add the following

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+mahout-client,2 hadoop-datanode+hadoop-tasktracker

whirr.provider=aws-ec2
whirr.identity=TOP_SECRET
whirr.credential=TOP_SECRET

This setup configures two Hadoop datanode/tasktracker nodes and one Hadoop namenode/jobtracker/mahout-client node. For the mahout-client role, Whirr will:

* Download the binary distribution from Apache and install it under /usr/local/mahout

* Set MAHOUT_HOME to /usr/local/mahout

* Add $MAHOUT_HOME/bin to the PATH

(Optional) Configure the Mahout version and/or distribution URL

By default, Whirr will download the Mahout distribution from http://archive.apache.org/dist/mahout/0.5/mahout-distribution-0.5.tar.gz. You can override the version by adding the following to your mahout-cluster.properties:

whirr.mahout.version=VERSION

You can also change the download URL entirely, which is useful if you want to test your own version of Mahout. To do so, first create a Mahout binary distribution by entering the distribution folder of your checked-out Mahout source tree and running

$ mvn clean install -Dskip.mahout.distribution=false

Now put the tarball on a server that will be accessible by the cluster and add the following line to your mahout-cluster.properties

whirr.mahout.tarball.url=MAHOUT_TARBALL_URL

Step 2 Launch the cluster

You can now launch the cluster the regular way by running:

$ whirr launch-cluster --config mahout-cluster.properties

Step 3 Login & run

When the cluster is set up, run the Hadoop proxy, upload some data, SSH into the node and voilà, you can run Mahout jobs by invoking the command line script like you normally would, such as:

$ mahout seqdirectory --input input --output output
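Spelled out, those steps might look roughly like this (a sketch: the cluster name in the proxy script path, the hostname placeholder and the data paths are assumptions on my part):

# start the Hadoop proxy script that Whirr generated for this cluster
$ sh ~/.whirr/mahout-cluster/hadoop-proxy.sh &

# log in to the node with the mahout-client role and upload some data to HDFS
$ ssh <namenode-public-hostname>
$ hadoop fs -put mydata input

# run a Mahout job as usual
$ mahout seqdirectory --input input --output output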

Enjoy!

Running Mahout in the Cloud using Apache Whirr

June 21st, 2011 by
(http://blog.trifork.com/2011/06/21/running-mahout-in-the-cloud-using-apache-whirr/)

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, ZooKeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr's Java API (version 0.4).

Read the rest of this entry »

How to cluster Seinfeld episodes with Mahout

April 4th, 2011 by
(http://blog.trifork.com/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/)

This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog about it. In just a few minutes you can run the Seinfeld demo on your own machine.
Read the rest of this entry »

Mahout at FOSDEM 2011 DataDevRoom

February 10th, 2011 by
(http://blog.trifork.com/2011/02/10/mahout-at-fosdem-2011-datadevroom/)

Last Saturday, February 5th, FOSDEM 2011 hosted the DataDevRoom, where talks were given on topics surrounding data analysis with free and open source software. I was there and gave an introductory talk on clustering with Apache Mahout. In case you missed the conference, read on to learn about some of the talks, or check out the slides or demo code from my Mahout talk.

Read the rest of this entry »

Announcement: Lucene-NL Mahout meetup with Isabel Drost - Feb 7

January 13th, 2011 by
(http://blog.trifork.com/2011/01/13/announcement-lucene-nl-mahout-meetup-with-isabel-drost-feb-7/)

On February 7th, the Dutch Lucene user group, Lucene-NL, organizes a meetup on the topic of scalable machine learning using Apache Mahout. Isabel Drost, Mahout PMC member and co-founder, will visit us from Berlin and give a talk on Mahout classification. Frank Scholten from JTeam will introduce Mahout clustering. Curious about Mahout, Hadoop, MapReduce and scalable data? Join us at the meetup to learn about and discuss these exciting technologies!

Attendance is free, but registration is required. Read on for details.

Read the rest of this entry »

Mahout – Taste :: Part Three – Estimators

July 8th, 2010 by
(http://blog.trifork.com/2010/07/08/mahout-%e2%80%93-taste-part-three-%e2%80%93-estimators/)

In Taste, estimators are the bridge between the generic item- or user-based recommendation logic and the specific similarity algorithm. Estimators are mainly used as part of the recommendation process; however, they are also used for evaluating recommenders. Additionally, the 'recommended because' feature is powered by an estimator. This blog covers some Taste internals and shows you how estimators are used within Taste via a few code samples.

Read the rest of this entry »

Mahout - Taste at Lucene Eurocon and Berlin Buzzwords

July 1st, 2010 by
(http://blog.trifork.com/2010/07/01/mahout-taste-at-lucene-eurocon-and-berlin-buzzwords/)

A little while ago, I was delighted to present two introductory Mahout - Taste talks at Lucene Eurocon and Berlin Buzzwords. I received quite a lot of good feedback about the presentations and have been asked by a few attendees to post them.

If you're one of those attendees or you missed the presentation, you can download the slides here:

At Lucene Eurocon, the first European conference on Lucene and Solr, there were interesting presentations, ranging from practical relevance to language analysis. For me it was fun to give a practical presentation about recommendations as a complementary feature to search applications. I hope you find the presentation useful if you're trying to work out how to build a recommender - I used the MovieLens dataset as an example in the presentation and based the code on my earlier 'getting started' blog.

I also really enjoyed doing the Berlin Buzzwords presentation and meeting up with people from the Mahout community and other attendees. This conference focused mainly on NoSQL, scalability and Hadoop. However, from my talks with people there I sense that there's growing interest in Mahout. You should find the presentation useful if you want to know more about different algorithms and how to evaluate them. I will blog about this topic in more detail soon.

Until then, I'd love to hear some feedback on what you think of the presentations!