Trifork Blog

Posts Tagged ‘Hadoop’

GOTO Update – GOTO Night Docker & Conference update

January 23rd, 2015 by

Docker – An Introduction & Its Uses

On Monday 19th of January we held the first GOTO Night of 2015. The evening was about Docker and was hosted by ABN AMRO.

ABN AMRO welcomes everybody

Over 110 registered users listened to Adrian Mouat explain why Docker is such an exciting technology. After the break we had an interactive session between the audience and the panel, which consisted of Jamie Dobson and Adrian Mouat. This question-and-answer session was led by Mark Coleman.

Read the rest of this entry »

NoSQL Roadshow in Amsterdam

October 10th, 2012 by

Are you frustrated by growing data requirements and interested in how non-relational databases could help? Curious about where and how NoSQL systems are being deployed? Want to build a “real” Highly Scalable System? Well, if you answered yes to any of these questions, you’re not the only one who wants to know more. For this very reason we’ve decided to team up with some of the best NoSQL experts and bring you the NoSQL Roadshow Amsterdam. It’s been a great success in Stockholm, Copenhagen, Zurich & Basel, so the road trip continues in Amsterdam and thereafter in London.

This informative and intensive one-day session with 10 presentations is designed to give you an overview of the changing landscape around data management, highlighted by concepts like “Big Data” and NoSQL databases, which are based on non-traditional, sometimes non-relational, other times very relational data models. The NoSQL Roadshow is aimed at IT professionals who are interested in faster and cheaper solutions for managing fast-growing amounts of data.

After introducing the landscape and the business problems, you will learn how to attack these growing issues and hear first-hand how organisations were able to solve their modern data problems with innovative solutions like Neo4j, Riak, MongoDB, Cassandra, and many more.

NoSQL Roadshow Amsterdam is also an excellent opportunity for CIOs, CTOs, developers and NoSQL ambassadors to meet and discuss their own experiences with NoSQL across industries. The venue, Pakhuis de Zwijger, is centrally located in the centre of the city.

We have limited capacity, so grab your seats today for Thursday 29th November and sign up for the early bird 50% discount rate of only 150 EUR.

For more information visit the website. If you cannot make it on 29th November, there is also an opportunity to sign up for the roadshow in London a week later, on 6th December.

We hope to see you there and help you get started with your big data decisions!

Frank Scholten joins Apache Whirr development team

March 8th, 2012 by

I am pleased to announce that I have been voted in as a committer on Apache Whirr! Whirr is a Java library for quickly setting up services in the cloud. For example, using Whirr you can start a Hadoop cluster on Amazon in 5 minutes by configuring a simple property file and running the whirr command-line tool. See the quick start guide for more information.

Hadoop is only one of the supported services, however. Whirr also supports several NoSQL databases, distributed computing platforms and tools. Currently Whirr supports HBase, Hama, Ganglia, Zookeeper, ElasticSearch, Mahout, Puppet, Chef and Voldemort.

One of my contributions was the Mahout service, which installs the Mahout binary distribution on a given node. When used in conjunction with Hadoop, you can have a fully operational Mahout cluster in minutes. For more information about using the Mahout service, check out this blog on Mahout support in Whirr on the community site.

More services are continuously being added to Whirr. For instance, the Solr and MongoDB services are planned for the upcoming 0.8.0 release. If you would like to know more about Whirr and keep up to date, check out the project page or subscribe to the mailing list.

Using your Lucene index as input to your Mahout job – Part I

March 5th, 2012 by

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and a MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.


When working with Mahout text clustering or classification you preprocess your data so it can be understood by Mahout. Mahout contains input tools such as seqdirectory and seqemailarchives for fetching data from different input sources and transforming them into text sequence files. The resulting sequence files are then fed into seq2sparse to create Mahout vectors. Finally you can run one of Mahout’s algorithms on these vectors to do text clustering.
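To make the idea of "turning text into vectors" concrete, here is a minimal plain-Java sketch. It is not Mahout code: the class name SparseVectorSketch and its method are hypothetical, and seq2sparse does considerably more than this (for example term weighting). The sketch only shows how a shared dictionary maps each term to an index, so that a document becomes a sparse map from term index to term count.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Mahout.
final class SparseVectorSketch {

    // Shared dictionary: term -> index, filled in order of first appearance.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Converts a text into a sparse term-frequency vector (term index -> count).
    Map<Integer, Integer> vectorize(String text) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty first token
            }
            Integer index = dictionary.get(token);
            if (index == null) {
                index = dictionary.size(); // next free index
                dictionary.put(token, index);
            }
            vector.merge(index, 1, Integer::sum); // count occurrences
        }
        return vector;
    }

    public static void main(String[] args) {
        SparseVectorSketch sketch = new SparseVectorSketch();
        System.out.println(sketch.vectorize("to be or not to be"));
        System.out.println(sketch.vectorize("be quick"));
    }
}
```

Because the dictionary is shared between documents, the same term always lands on the same index, which is what lets a clustering algorithm compare documents with each other.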

The lucene2seq program

Recently a new input tool has been added, lucene2seq, which allows you to read from stored fields of a Lucene index to create text sequence files. This is different from the existing lucene.vector program, which reads term vectors from a Lucene index and transforms them into Mahout vectors straight away. When using the original text content you can take full advantage of Mahout’s collocation identification algorithm, which improves clustering results.
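Collocations are sequences of words that occur together, so finding them requires the tokens in their original order, which the stored text preserves. The plain-Java sketch below only generates candidate bigrams from a piece of text. It is not Mahout's implementation, the class name BigramSketch is hypothetical, and the scoring step that decides which candidates are kept is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of Mahout.
final class BigramSketch {

    // Returns all pairs of adjacent tokens, in order of appearance.
    static List<String> bigrams(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            result.add(tokens[i] + " " + tokens[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [new york, york city]
        System.out.println(bigrams("New York City"));
    }
}
```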

Let’s look at the lucene2seq program in more detail by running

$ bin/mahout lucene2seq --help

This will print out all the program’s options.

Job-Specific Options:                                                           
  --output (-o) output       The directory pathname for output.                 
  --dir (-d) dir             The Lucene directory                               
  --idField (-i) idField     The field in the index containing the id           
  --fields (-f) fields       The stored field(s) in the index containing text   
  --query (-q) query         (Optional) Lucene query. Defaults to               
  --maxHits (-n) maxHits     (Optional) Max hits. Defaults to 2147483647        
  --method (-xm) method      The execution method to use: sequential or         
                             mapreduce. Default is mapreduce                    
  --help (-h)                Print out help                                     
  --tempDir tempDir          Intermediate output directory                      
  --startPhase startPhase    First phase to run                                 
  --endPhase endPhase        Last phase to run

The required parameters are lucene directory path(s), output path, id field and list of stored fields. The tool will fetch all documents and create a key value pair where the key equals the value of the id field and the value equals the concatenated values of the stored fields. The optional parameters are a Lucene query, a maximum number of hits and the execution method, sequential or MapReduce. The tool can be run like any other Mahout tool.
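The key/value construction described above can be sketched in plain Java. This is an illustration under stated assumptions, not the tool's actual code: the class StoredFieldsToPair is hypothetical, a document is modelled as a simple map from field name to stored value, and joining the field values with a single space is an assumption made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout or Lucene.
final class StoredFieldsToPair {

    // Key: the value of the id field.
    // Value: the values of the requested stored fields, joined by a space (assumed separator).
    static Map.Entry<String, String> toPair(Map<String, String> storedFields,
                                            String idField,
                                            List<String> fields) {
        String key = storedFields.get(idField);
        List<String> parts = new ArrayList<>();
        for (String field : fields) {
            String value = storedFields.get(field);
            if (value != null) { // skip fields this document does not have
                parts.add(value);
            }
        }
        return new AbstractMap.SimpleImmutableEntry<>(key, String.join(" ", parts));
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("docid", "42");
        doc.put("title", "Java");
        doc.put("body", "Java is a programming language.");

        Map.Entry<String, String> pair = toPair(doc, "docid", Arrays.asList("title", "body"));
        System.out.println(pair.getKey() + " -> " + pair.getValue());
    }
}
```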

Converting an index of Wikipedia articles to sequence files

To demonstrate lucene2seq we will convert an index of Wikipedia articles to sequence files. Check out the Lucene 3x branch, download a part of the Wikipedia articles dump and run a benchmark algorithm to create an index of the articles in the dump.

$ svn checkout lucene_3x
$ cd lucene_3x/lucene/contrib/benchmark
$ mkdir temp work
$ cd temp
$ wget
$ bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2

The next step is to run a benchmark ‘algorithm’ to index the Wikipedia dump. Contrib benchmark contains several of these algorithms in the conf directory. For this demo we only index a small part of Wikipedia, so edit the conf/wikipediaOneRound.alg file so that it points to enwiki-latest-pages-articles1.xml-p000000010p000010000. For an overview of the syntax of these benchmarking algorithms, check out the benchmark.byTask package-summary Javadocs.

Now it’s time to create the index

$ cd ..
$ ant run-task -Dtask.alg=conf/wikipediaOneRound.alg -Dtask.mem=2048M

The next step is to run lucene2seq on the generated index under work/index. Check out the lucene2seq branch from GitHub

$ git clone
$ git checkout lucene2seq
$ mvn clean install -DskipTests=true

Change back to the lucene 3x contrib/benchmark work dir and run

$ <path/to>/bin/mahout lucene2seq -d index -o wikipedia-seq -i docid -f title,body -q 'body:java' -xm sequential

This creates sequence files of all documents that contain the term ‘java’. From here you can run seq2sparse followed by a clustering algorithm to cluster the text contents of the articles.

Running the sequential version in Java

The lucene2seq program can also be run from Java. First create a LuceneStorageConfiguration bean and pass in the list of index paths, the sequence files output path, the id field and the list of stored fields in the constructor.

LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(configuration, asList(index), seqFilesOutputPath, "id", asList("title", "description"));

You can then optionally set a Lucene query and max hits via setters

luceneStorageConf.setQuery(new TermQuery(new Term("body", "Java")));

Now you can run the tool by calling the run method with the configuration as a parameter

LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
lucene2Seq.run(luceneStorageConf);


In this post I showed you how to use lucene2seq on an index of Wikipedia articles. I hope this tool will make it easier for you to start using Mahout text clustering. In a future blog post I will discuss how to run the MapReduce version on very large indexes. Feel free to post comments or feedback below.

Berlin Buzzwords 2012

January 11th, 2012 by

Yes, Berlin Buzzwords is back on the 4th & 5th June 2012! This really is the only conference for developers and users of open source software projects focusing on the issues of scalable search, data analysis in the cloud and NoSQL databases. All the talks and presentations are specific to three tags: “search”, “store” and “scale”.

Looking back to last year, this event had a great turnout. There were well over 440 attendees, of whom 130 were international (from all over, including Israel, US, UK, NL, Italy, Spain, Austria and more), and an impressive line-up of 48 speakers. It was a 2-day event covering 3 tracks with high-quality talks, surrounded by 5 days of workshops, 10 evening events for attendees to mingle with locals and specialized training opportunities, and these are just a few of the activities that were on offer!

What was the outcome? Well let the feedback from some of the delegates tell the story:

“Buzzwords was awesome. A lot of great technical speakers, plenty of interesting attendees and friends, lots of food and fun beer gardens in the evening. I can’t wait until next year!”

“I can’t recommend this conference enough. Top industry speakers, top developers and fantastic organization. Mark this event on your sponsoring calendar!”

“Berlin Buzzwords is by far one of the best conferences around if you care about search, distributed systems, and nosql…”

“Thanks for organizing. My goal was to learn and I learned a lot!”

So to get the ball rolling for this year the call for papers has now officially opened via the website.

You can submit talks on the following topics:

  •  IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  •  NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  •  Hadoop – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are also more than welcome, I’ve been told. The organisers are looking for presentations on the implementation of the systems themselves, technical talks, real-world applications and case studies.

What’s more, this year there is once again an impressive Program Committee, consisting of:

  • Isabel Drost (Nokia, Apache Mahout)
  • Jan Lehnardt (CouchBase, Apache CouchDB)
  • Simon Willnauer (SearchWorkings, Apache Lucene)
  • Grant Ingersoll (Lucid Imagination, Apache Lucene)
  • Owen O’Malley (Hortonworks Inc., Apache Hadoop)
  • Jim Webber (Neo Technology, Neo4j)
  • Sean Treadway (Soundcloud)

For more information, submission details and deadlines visit the conference website.

I am truly looking forward to this event, hope to see you there too!

Running Mahout in the Cloud using Apache Whirr

June 21st, 2011 by

This blog shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, Zookeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line and Whirr’s Java API (version 0.4).

Read the rest of this entry »

Announcing Dutch Lucene User Group

August 26th, 2009 by

In the last 3 years we’ve witnessed the rise of open source enterprise search. Of course it was always there, and Apache Lucene in particular has been around since, well… the previous century. But in the last 3 years the interest in this area has grown dramatically and the install/user base of the different Lucene related projects (Lucene Java and Solr in particular) has grown at an amazing rate. Today, the Lucene ecosystem is booming – there’s a high demand for expertise in this field, yet there is still relatively low supply. The Lucene / Solr mailing lists are flooded with hundreds of questions each week and the need to share knowledge is evident.

Read the rest of this entry »

Introduction to Hadoop

August 4th, 2009 by

Recently I was playing around with Hadoop, and after a while I realized that this is a great technology. Hadoop allows you to write and run your application in a distributed manner and process large amounts of data with it. It consists of a MapReduce implementation and a distributed file system. Personally I did not have any experience with distributed computing beforehand, but I found MapReduce quite easy to understand.

In this blog post I will give an introduction to Hadoop by showing a relatively simple MapReduce application. This application will count the unique tokens inside text files. With this example I will try to explain how Hadoop works. Before we start creating our example application we need to know the basics of MapReduce itself.
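As a rough illustration of the MapReduce idea, here is a token count written in plain Java without any Hadoop classes. It is a sketch only, not the application from the full post: the class TokenCountSketch and its methods are hypothetical, and everything runs in a single process. The map step emits a (token, 1) pair per token, the pairs are grouped by token, and the reduce step sums the values for each token.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; it does not use the Hadoop API.
final class TokenCountSketch {

    // Map step: emit a (token, 1) pair for every token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum all the counts emitted for one token.
    static int reduce(String token, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs map over every line, groups the pairs by token, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {a=2, b=2, c=1}
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real cluster the map and reduce steps run on different machines and the grouping is done by the framework; the sketch keeps them in one method only to show the data flow.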

Read the rest of this entry »