Last Saturday, February 5th, FOSDEM 2011 hosted the DataDevRoom, where talks were given on topics surrounding data analysis with free and open source software. I was there and gave an introductory talk on clustering with Apache Mahout. In case you missed the conference, read on to learn about some of the talks, or check out the slides and demo code from my Mahout talk.
The DataDevRoom was packed! There were 60 chairs available, but I estimate that around 100 people were in the room for most of the day. Everyone had really interesting stuff to talk about, with subjects ranging from NoSQL database benchmarks to case studies to introductions to all sorts of free and open source tools for processing, analyzing and visualizing data.
I found the Seeks talk by Emmanuel Benazera and the Lucene and S4 talk by Michaël Figuière particularly interesting. Seeks is a social search engine where users’ queries and search results are shared to improve the search experience. It also has nice features such as clustering and recommendation of search results. The Lucene and S4 talk discussed how to create a search engine by combining Lucene with S4, Yahoo’s recently released distributed stream processing platform. S4’s stream processing was presented as an alternative to large-scale batch processing and looks promising.
Below are links to the DataDevRoom program as well as my slides and demo code. My demo consisted of using Apache Mahout to cluster the transcripts of Seinfeld episodes.
- Check out the seinfeld_demo branch at my GitHub repository.
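
For readers who have not seen text clustering before, here is a minimal sketch of the general idea in plain Python. It is not the demo code and does not use Mahout's API. It assumes k-means as the algorithm (one common choice; the demo may have used something else), and the four short "documents" are made-up placeholders, not actual transcripts. The function names and the fixed seed indices are also just illustrative.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also strip
    # punctuation and stop words.
    return text.lower().split()


def vectorize(docs):
    # Build one term-frequency vector per document over a shared vocabulary.
    vocab = sorted({term for doc in docs for term in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([float(counts[term]) for term in vocab])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    # Component-wise average of a non-empty list of vectors.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, seeds, max_iter=20):
    # Start from the documents at the given seed indices, then alternate
    # between assigning points to the nearest centroid and recomputing
    # each centroid as the mean of its members.
    centroids = [list(vectors[i]) for i in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment


# Hypothetical placeholder documents, standing in for episode transcripts.
docs = [
    "soup soup line order soup",
    "soup line wait order",
    "parking garage car lost car",
    "car parking lost garage",
]
vocab, vectors = vectorize(docs)
labels = kmeans(vectors, k=2, seeds=[0, 2])
print(labels)
```

With these toy inputs, the two soup documents end up in one cluster and the two parking documents in the other. Real transcripts would need better tokenization and term weighting (for example TF-IDF), and a distributed implementation is what a library like Mahout brings to the table.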
To wrap up, I want to thank the organizers of the DataDevRoom: Olivier Grisel, Nicolas Maillot and Isabel Drost for arranging everything. I had a good time and learned a lot!
Your slides don’t give a lot of detail on the Seinfeld example … it’s worth pointing out the screencast file in https://github.com/frankscholten/mahout/tree/seinfeld_demo/examples gives a lot more detail.