This is a recap of Berlin Buzzwords 2012, the two-day conference on everything scale, search and store in the NoSQL world. Martijn van Groningen and I arrive in Berlin on Sunday evening. Unfortunately we are too late for the infamous Barcamp, a low-key mix of lightning talks, beer and socializing, so we decide to have a beer at the hotel instead.
So let’s fast forward to Monday morning. Berlin Buzzwords kicks off with a record of over 700 delegates from all over the world this year. The first keynote session is by Leslie Hawthorn on community building. Using gardening as a metaphor, she talks us through slides of gardening tools, plucking weeds and landscaping as activities for building a healthy community. Her advice: display all the ‘paths to success’ on the site, and nip unproductive mailing list discussions in the bud by pointing to a project ‘mission statement’. My mind wanders off as I think about how I can apply this to Apache Whirr. Then I think about the upcoming release that I still want to do testing for. “Community building is hard work”, she says. Point taken.
Storm and Hadoop
Ted Dunning gives an entertaining talk on closing the gap between the realtime features of Storm and the batch nature of Hadoop. He also talks about an approach called Bayesian Bandits for doing real-time A/B testing, casually performing a coin trick to explain the statistics behind it.
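To give a feel for the Bayesian Bandit idea, here is a minimal sketch of my own, not code from the talk. It assumes each variant either converts or not, keeps a Beta posterior per variant, and sends the next visitor to whichever variant draws the highest sample (often called Thompson sampling). The names `choose_variant` and `simulate` and the conversion rates are made up for illustration.

```python
import random


def choose_variant(stats, rng):
    """Draw one sample per variant from its Beta posterior; pick the largest."""
    best_name, best_draw = None, -1.0
    for name, (successes, failures) in stats.items():
        # Beta(1, 1) is a uniform prior; observed counts sharpen it over time.
        draw = rng.betavariate(successes + 1, failures + 1)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name


def simulate(true_rates, rounds, seed=42):
    """Run a simulated A/B test against hypothetical true conversion rates."""
    rng = random.Random(seed)
    stats = {name: [0, 0] for name in true_rates}  # [successes, failures]
    for _ in range(rounds):
        name = choose_variant(stats, rng)
        if rng.random() < true_rates[name]:
            stats[name][0] += 1
        else:
            stats[name][1] += 1
    return stats


if __name__ == "__main__":
    # Hypothetical rates: variant B converts twice as often as variant A.
    for name, (wins, losses) in simulate({"A": 0.05, "B": 0.10}, 5000).items():
        print(name, "shown", wins + losses, "times,", wins, "conversions")
```

Unlike a classic A/B test with a fixed 50/50 split, traffic shifts towards the better variant while the test is still running, which is what makes the approach attractive for real-time use.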
Socializing, coding & the geek-play area
During the rest of the day I visit lots of talks, but I also socialize and hang out with my peers. In the hallway I bump into old friends and I meet many new people from the open source community. At other times we wander outside to the geek-and-play area to watch people play table tennis and enjoy the inspiring, laid-back vibe of this pretty unique conference.
Occasionally I feel an urge to do some coding. “You should get a bigger screen”, a fellow Buzzworder says as he points at my laptop. I snap out of my coding trance and realize I’m hunched over my laptop, peering at the screen from up close. I’m like most people at the conference, multi-tasking between listening to a talk, intense coding, and sending out a #bbuzz Tweet. There are just so many inspiring things grabbing your attention that I can hardly keep up with them.
Mahout at Berlin Buzzwords
Almost every developer from the Mahout development team hangs out at Buzzwords. As I give my talk, Robin Anil and Grant Ingersoll type up last-minute patches and close the final issues for the Mahout 0.7 release. On stage I discuss Whirr’s Mahout service, which automatically installs Mahout on a cluster. In hindsight the title didn’t fit that well, as I mostly talk about Whirr, not Machine Learning with Mahout. Furthermore, my talk finishes way earlier than expected; bad Frank. Note to self: better planning next time.
Day two features quite a few talks on ElasticSearch. In one session Shay Banon, ElasticSearch’s founder, talks about how his framework handles ‘Big Data’ and explains design principles like shard overallocation, the performance penalties of splitting your index, and so on. In a different session Lukáš Vlček and Karel Minarik fire up an ElasticSearch cluster during their talk. The big screen updates continuously with stats on every node in the cluster. The crowd cracks up laughing as Clinton Gormley suddenly joins the live cluster using his phone and laptop.
New directions in Mahout
In the early afternoon Simon Willnauer announces that one of the talks has to be cancelled. Luckily Ted Dunning can fill the slot with another talk, this time on the future of Mahout. He discusses the upcoming 0.7 release, which is largely a clean-up and refactoring effort, and two future contributions: pig-vector and the streaming K-means clustering algorithm.
The idea of pig-vector is to create a glue layer between Pig and Mahout. Currently you run Mahout by first transforming your data, say a directory of text files, into a format that Mahout can read, and then you run the actual Mahout algorithms. The goal of pig-vector is to use Pig to shoehorn your data into shape so Mahout can use it. The benefit of using Pig is that it is designed to read data from a lot of sources, so Mahout can focus on the actual machine learning.
The upcoming streaming K-means clustering algorithm looks very promising. “No knobs, the system adapts to the data”, he says. Ted refers to the problem that existing Mahout algorithms require a lot of parameter tweaking. On top of that the algorithm is blazingly fast, using clever tricks like projection search. Although the original K-means algorithm is easily parallelized with MapReduce, it is iterative and has to make several passes over the data, which is very inefficient for large datasets.
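To illustrate the single-pass idea, here is a rough sketch of my own. It is not Mahout's implementation, and it uses a brute-force nearest-centroid lookup where the talk mentioned projection search. Each point is either merged into its nearest centroid or becomes a new centroid, with a probability that grows with its distance. When there are too many centroids, the distance threshold is raised and the centroids are collapsed, so the threshold adapts to the data instead of being tuned by hand. The function names, the growth factor of 1.5 and the starting threshold are all assumptions.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def absorb(centroids, vector, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid, or start a new one."""
    if centroids:
        nearest = min(centroids, key=lambda c: distance(c[0], vector))
        d = distance(nearest[0], vector)
        # Far-away points are likely to become new centroids.
        if rng.random() >= min(1.0, d / threshold):
            total = nearest[1] + weight
            nearest[0] = [(c * nearest[1] + x * weight) / total
                          for c, x in zip(nearest[0], vector)]
            nearest[1] = total
            return
    centroids.append([list(vector), weight])


def streaming_sketch(points, max_centroids, threshold=1.0, seed=0):
    """One pass over points; returns at most max_centroids weighted centroids."""
    if max_centroids < 1:
        raise ValueError("max_centroids must be at least 1")
    rng = random.Random(seed)
    centroids = []  # each entry is [vector, weight]
    for point in points:
        absorb(centroids, point, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: loosen the threshold and re-cluster them.
            threshold *= 1.5
            old, centroids = centroids, []
            for vector, weight in old:
                absorb(centroids, vector, weight, threshold, rng)
    return centroids


if __name__ == "__main__":
    rng = random.Random(1)
    # Two hypothetical blobs around (0, 0) and (10, 10).
    data = [(rng.gauss(m, 1.0), rng.gauss(m, 1.0))
            for m in (0.0, 10.0) for _ in range(500)]
    for vector, weight in streaming_sketch(data, max_centroids=20):
        print([round(v, 2) for v in vector], weight)
```

The output is a small weighted summary of the data that fits in memory, so a regular K-means can then run on those centroids without touching the full dataset again.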
See you next year!
This wraps up my coverage of Berlin Buzzwords. There were of course many more talks, so I just covered my personal area of interest, which largely comes down to Mahout. The only downside was the weather in Berlin this year, but other than that I really enjoyed the conference. Many thanks to everyone involved in organizing the conference and roll on 2013!