Last week was the week of the Berlin Buzzwords conference. For those of you that do not know it, Buzzwords is one of the best conferences about search, scale and store (read: big data). Next to having good speakers and nice content there is also a very important social aspect during the conference. There is a good reason the first session starts at 10 o’clock.
On Sunday evening Byron, Bram and I took the plane to Berlin. We stayed in a nice hotel close to the hotel where most of our friends from elasticsearch were. Before going to sleep we had a beer and after a good breakfast we were ready for day one.
The opening of the conference as well as the keynote were in the Kino. Simon Willnauer and Isabel Drost-Fromm both recently got a fresh youngster in their family, and Isabel even brought the baby along. They introduced us to the sponsors, and Trifork was one of them. They also introduced the first speaker for the keynote.
The keynote was given by Ralf Herbrich, who is working at Amazon right now. He gave a presentation about Technology Transfer in Practice. He used to work at Microsoft on the Halo player ranking system. He talked about the improvements they made in ranking players based on the matches they played and the results they got during the game. That way they could match players with similar skills against each other for a better gaming experience. It was a tough presentation with some nice parts in it. A few too many hard formulas for a keynote if you ask me. But the ideas they had were clear and interesting for a gamer.
The takeaway for this session was a remark Ralf made about doing research to create something really new and refreshing. When creating something really new, don’t start by reading literature. Only after you have created something without the bias of other people’s ideas, check out the existing literature. Often you will find that someone else has already done it and had similar ideas. But you will never find a breakthrough in literature.
Next up, coffee break. Time to inspect the Trifork booth and talk to some people and of course take a coffee. Nice to have all the beverages for free. Also nice to enjoy the sun.
The next talk I attended was from Clinton Gormley about the Elasticsearch Query DSL. Clinton played a big part in the excellent Elasticsearch: The Definitive Guide. During his presentation he explained the query DSL basics but also some more advanced aspects. Some things I wrote down for this session are:
- The filter cache is smart, you do not have to rebuild it on each change. It will be adjusted on every change in the index. The filter cache is what makes using filters extremely fast.
- Bool filters can be used to create very complicated and very powerful filters. The bool filter itself is not cached, but the filters that compose it are. That way reusing them is very efficient.
- You can add fuzziness to a query to overcome typos. You have to provide an edit distance for the level of fuzziness. The edit distance is the number of character changes allowed between the found term and the term you are looking for. If you have small words, a larger edit distance can give strange results. Setting it to auto will determine the edit distance based on the word length.
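To make the remark about small words concrete, here is a minimal sketch in Python. It uses the plain Levenshtein distance, which is a simplification of what a search engine does internally. The word list is hypothetical, and the thresholds in `auto_fuzziness` are an assumption based on the rule as I understand it from the Elasticsearch documentation (0 to 2 characters exact, 3 to 5 characters one edit, longer terms two edits), so check the docs for your version.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def auto_fuzziness(term: str) -> int:
    """Assumed 'auto' rule: the allowed edits grow with the term length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(term, candidates, max_edits):
    """Return the candidates within max_edits of the term."""
    return [c for c in candidates if edit_distance(term, c) <= max_edits]


WORDS = ["cat", "car", "bar", "dog", "coat"]  # hypothetical indexed terms

# A fixed distance of 2 on a three letter word also matches "bar".
print(fuzzy_matches("cat", WORDS, 2))
# The length based rule allows only one edit for "cat".
print(fuzzy_matches("cat", WORDS, auto_fuzziness("cat")))
```

With a fixed edit distance of 2, two out of three characters of "cat" may change, which is why an unrelated word like "bar" shows up. The length based rule keeps short terms strict and only loosens up for longer ones.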
This was one of the better presentations of the conference if you ask me. Next up, elasticsearch aggregations by Adrien Grand. This was just a short talk of 20 minutes about one of the most important features of elasticsearch. Adrien showed the power of composed aggregations. This means that you can use an aggregation within an aggregation and yes, within an aggregation and so on. He also explained why this is so cool with examples. The best example is not yet released but is committed to master. This is the top hits aggregation. Using this you can create buckets of documents using any aggregation. Then you can calculate the score of the different documents based on your query. This way you can return the most read articles for each day, but also the best matching restaurants in your surroundings by category of restaurants.
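The idea behind top hits can be sketched in a few lines of plain Python. This is not the elasticsearch API, just an illustration of the concept: first bucket the hits, then keep the best scoring documents per bucket. The restaurants, categories and scores are made up for the example.

```python
from collections import defaultdict

# Hypothetical search hits: (name, category, relevance score for the query).
HITS = [
    ("Luigi", "italian", 2.4),
    ("Mario", "italian", 3.1),
    ("Sakura", "japanese", 1.7),
    ("Burger Bar", "american", 0.9),
    ("Tokyo Grill", "japanese", 2.2),
]


def top_hits_per_bucket(hits, size=1):
    """Bucket hits by category, then keep the best scoring hits per bucket."""
    buckets = defaultdict(list)
    for name, category, score in hits:
        buckets[category].append((score, name))
    return {
        category: [name for _, name in sorted(docs, reverse=True)[:size]]
        for category, docs in buckets.items()
    }


print(top_hits_per_bucket(HITS))
```

The outer step plays the role of the bucketing aggregation (here a grouping on category), the inner step the role of top hits. Swap the grouping for a day based one and you get the most read articles per day.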
After lunch I attended the talk by Alan Woodward about Turning search upside down. The idea behind this talk was not to look for documents but to look for queries. An interesting talk about a framework called luwak. Luwak gives lucene/solr a bit of what the percolator is for elasticsearch. I liked the part about the presearcher. This makes it possible to rule out large numbers of queries based on easy and therefore fast filters.
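To show what turning search upside down means, here is a toy sketch in Python. It is not Luwak's API. The class, the query ids and the matching rule (all terms of a query must be in the document) are assumptions for illustration. The `candidates` method stands in for the cheap filtering step that rules out most stored queries before the real matching runs.

```python
def tokenize(text):
    """Very naive analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


class QueryRegistry:
    """Stores queries and finds the ones that match an incoming document."""

    def __init__(self):
        self.queries = {}     # query id -> set of required terms
        self.term_index = {}  # term -> ids of the queries that contain it

    def register(self, query_id, required_terms):
        terms = {t.lower() for t in required_terms}
        self.queries[query_id] = terms
        for term in terms:
            self.term_index.setdefault(term, set()).add(query_id)

    def candidates(self, doc_terms):
        """Cheap pre-filter: queries sharing at least one term with the document."""
        found = set()
        for term in doc_terms:
            found |= self.term_index.get(term, set())
        return found

    def match(self, text):
        """Run the full check only on the queries that survived the pre-filter."""
        doc_terms = tokenize(text)
        return sorted(qid for qid in self.candidates(doc_terms)
                      if self.queries[qid] <= doc_terms)


registry = QueryRegistry()
registry.register("es-scaling", ["elasticsearch", "scaling"])
registry.register("lucene", ["lucene"])
registry.register("berlin-beer", ["berlin", "beer"])

print(registry.match("Scaling elasticsearch in Berlin"))
```

For this document the "lucene" query is never looked at, "berlin-beer" survives the pre-filter but fails the full check, and only "es-scaling" matches. With thousands of stored queries that first cheap step is where most of the time is saved.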
The next session was an interesting session about randomising tests by Dawid Weiss. Too bad Dawid only had 20 minutes. Too short to introduce a nice concept like this. Some of the remarks I wrote down during his talk:
- Randomized tests do not replace unit tests. They also do not add code coverage to your tests. I do think by going through more corner cases you can reach code you did not reach before.
- You should keep your tests running, not just on checkin, but really continuously.
- What can be randomized? Input values, iteration counts, arguments.
- If you have multiple implementations in your code base, be sure to randomize their usage.
- There is a randomized runner for JUnit. The interesting thing about this runner is that it uses a seed to randomise. The seed is reported, so in case the test breaks you have the seed (the part that randomises your test). That way you can replay the test with the values that break it.
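The seed idea from the last remark can be sketched without any framework. The example below is plain Python, not the JUnit runner from the talk. The function names and the deliberately broken `buggy_max` are made up for illustration. Each test case gets its own seed, and the seed of the first failing case is returned so the exact same inputs can be replayed.

```python
import random


def buggy_max(values):
    """Deliberately broken: ignores the last element of longer lists."""
    return max(values[:-1]) if len(values) > 1 else values[0]


def check_max(rng):
    """One randomized test case: random length, random input values."""
    values = [rng.randint(-100, 100) for _ in range(rng.randint(1, 10))]
    assert buggy_max(values) == sorted(values)[-1]


def run_randomized(check, seed=None, iterations=200):
    """Run check with fresh seeds; return the failing seed, or None."""
    master_seed = seed if seed is not None else random.randrange(2 ** 32)
    master = random.Random(master_seed)
    for _ in range(iterations):
        case_seed = master.getrandbits(64)
        try:
            check(random.Random(case_seed))
        except AssertionError:
            return case_seed  # enough information to replay the failure
    return None


failing_seed = run_randomized(check_max, seed=42)
print("failing seed:", failing_seed)

# Replaying with the reported seed regenerates exactly the same input values.
if failing_seed is not None:
    rng = random.Random(failing_seed)
    print([rng.randint(-100, 100) for _ in range(rng.randint(1, 10))])
```

Because everything random in a test case is derived from one seed, a red build on the continuous run is not a one-off: you take the seed from the report and debug the same values locally.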
You can find more information here: http://labs.carrotsearch.com/randomizedtesting.html. The next session was about the elasticsearch percolator by Martijn van Groningen. Martijn is one of those guys that went from Trifork to elasticsearch when the company got started. Always nice to see (ex)colleagues on the stage. The percolator can be compared to Luwak, which I discussed earlier. I like Martijn’s idea of using the percolator to classify documents before they are added to the index. Also nice to see you can do scoring for the best matched query, highlighting (cannot see why yet) and filtering to make the lookup extremely fast. A very nice piece of technology that I want to use more in the future.
After some good presentations it was time to relax, drink coffee or water and of course play table soccer. Not by the strict Dutch (or international) rules, but by the German rules. If the ball is in the pocket, it is a goal. A lot easier to understand, but also a lot harder for defensive play. Elasticsearch provided a nice branded table next to their smoothie bicycle.
After the break the final two sessions of the day. I attended the session from Michael Kaisser about Geospatial analysis of social media. A very entertaining talk. He had a nice demo that is available online as well.
Then, at the end of the day Boaz Leskes gave us a nice presentation about scaling elasticsearch. As always a nice presentation by Boaz showing the options you have for scaling out.
After a long day with lots of good information it was time for beer. Daphne joined us for some wine and I can tell you that we had a very good party. Especially the burgers were great (not, sorry). Still the beer was good, and we loved the coins we needed to give back together with a glass as Pfand. Luckily there were no pictures from the party :-).
Some of us started with a nice breakfast, while others used as much time as possible to stay in bed. The day started with a keynote by Katherine Daniels, Devops is Dead: Long Live Devops. She delivered a nice keynote with several nice statements that require repeating:
Tools can’t make people talk to each other, tools also cannot make people listen to each other.
Don’t create devops teams, devops is a culture that the complete company needs to carry forward.
Don’t use the term Devops, use Getting things done without being a bastard
The next presentation I attended was by Michael Busch about Search at Twitter. Especially interesting in this talk was the mechanism they use to provide real time search. So not the near real time search as provided by elasticsearch, for instance. This technology is called the earlybird project.
During the second day I wrote down fewer notes than on the first day. That does not mean the talks were less interesting in general. But for me, most of them were less inspiring. The final presentation I want to mention is the best one of the conference for me, by Britta Weber, about Scoring for human beings. She explained the mechanism of using vectors to calculate the distance between terms in a query and in the found documents. She also showed the impact of tf/idf on the vector and in the end the impact on the score for a document.
Britta also explained why the default tf/idf similarity is not always good enough. A good example showing that scoring based on text is not always what you need can be found here: http://colors.qbox.io. Another example I liked was about problems when scoring CVs of people based on the languages they can program in. If you know tf/idf, you know that longer fields have a negative impact on the score. So imagine you are looking for a person that can program java. One person only learned java and therefore only has java on his CV. Another person has java, iOS, Android, Ruby, Python, c++ on his CV. In this case the person with only java on his CV has a better score. I am not sure, but this might not be the person you want to find on top of the other one.
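The CV example is easy to reproduce with a simplified version of the classic Lucene style formula. The sketch below is Python and an assumption on my side, not what Britta showed: it uses the square root of the term frequency, an idf based on the document counts, and a field length norm of one over the square root of the number of terms. The real scoring has more factors (query normalisation, boosts, lossy storage of norms), which are left out here.

```python
import math


def classic_score(term, field_terms, doc_freq, num_docs):
    """Simplified tf/idf score for a single term query on one field."""
    freq = field_terms.count(term)
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)                             # term frequency
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # inverse document frequency
    norm = 1.0 / math.sqrt(len(field_terms))         # field length normalisation
    return tf * idf * norm


# The two hypothetical CVs from the example above.
cv_specialist = ["java"]
cv_polyglot = ["java", "ios", "android", "ruby", "python", "c++"]

# Both CVs contain "java", so the idf part is identical for both.
score_specialist = classic_score("java", cv_specialist, doc_freq=2, num_docs=2)
score_polyglot = classic_score("java", cv_polyglot, doc_freq=2, num_docs=2)

print(score_specialist > score_polyglot)
print(score_specialist / score_polyglot)
```

Term frequency and idf are the same for both CVs, so the only difference is the length norm. In this simplified model the specialist scores the square root of six (roughly 2.4) times higher, purely because the field is shorter.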
Another thing she showed is using scripts with elasticsearch. Especially the interface to your _index from within scripts is very interesting and something I need to have a good look at.
That was my take on two days in Berlin. In the evening we were invited by elasticsearch to have dinner at a nice Italian restaurant. There I had some nice and interesting conversations about technology but also other things. I tried to get to bed before eleven since we had to take a cab to the airport at 04:00 in the morning. All in all, it was a great conference with many interesting people. We are looking forward to next year!