Trifork Blog

Posts Tagged ‘Apache Solr’

Migrating Apache Solr to Elasticsearch

January 29th, 2013 by
(http://blog.trifork.com/2013/01/29/migrating-apache-solr-to-elasticsearch/)

Elasticsearch is an innovative and advanced open source distributed search engine, based on Apache Lucene. Over the past several years at Trifork we have been doing a lot of search implementations. Driven by the fact that every other customer wanted the ‘Google-experience’ (just a text box, type some text and get relevant results) as part of their application, we started by building our own solutions on top of Apache Lucene. That worked quite well, as Lucene is the de facto standard when it comes to information retrieval. But soon enough, spurred on by sites like Amazon, CNet and Funda in the Netherlands, people wanted to offer their users more ways to drill down into the search results by using facets. We briefly ran our own (now discontinued) open source project, FacetSearch, but Solr quickly started getting traction and we decided to jump on that bandwagon.

Starting with Solr

So it was then that we started using Solr for our projects and became vocal about our capabilities, which led to even more (international) Solr consultancy and training work. And as Trifork is not in the game to just use open source, but also to contribute back to the community, this has led to several contributions (spatial, grouping, etc.) and eventually to having several committers on the Lucene (now including Solr) project.

We go back a long way…

At the same time we were well into Solr, Shay Banon, who we knew from our SpringSource days, started creating his own scalable search solution, Elasticsearch. Although from a technical perspective it was a better choice for building scalable search solutions, we didn’t adopt it from the beginning. The main reason was that it was basically a one-man show (a very good one at that, I might add!), and we didn’t feel comfortable recommending Elasticsearch to our customers: if Shay got hit by a bus, it would mean the end of the project. Luckily, all this changed when Shay and some of the old crew from JTeam (the rest of JTeam is now Trifork Amsterdam) decided to join forces and launch Elasticsearch.com, the commercial company behind Elasticsearch. Now it’s all systems go: what was then our main hurdle has been removed, and we can use Elasticsearch and moreover guarantee continuity for the project.

Switching from Solr to Elasticsearch

Obviously we are not alone in the world and not that unique in our opinions, so we were not the only ones to change our strategy around search solutions. Many others started considering Elasticsearch, doing comparisons and eventually switching from Solr to Elasticsearch. We still regularly get requests to help companies make the comparison. And although there are still reasons why you may want to go for Solr, in the majority of cases (especially when scalability and real-time search are important) the balance more often than not tips in favor of Elasticsearch.

This is why Luca Cavanna from Trifork has written a plugin (river) for Elasticsearch that will help you migrate from your existing Solr to Elasticsearch. Basically, it runs inside Elasticsearch, pulls the content from an existing Solr cluster and indexes it in Elasticsearch. Using this plugin allows you to easily set up an Elasticsearch cluster next to your existing Solr. This will help you get up to speed quickly and therefore enables a smooth transition. Obviously, this tool is mostly used for that purpose, to help you get started. When you decide to switch to Elasticsearch permanently, you would switch your indexing to index content from your sources directly into Elasticsearch; keeping Solr in the middle is not a recommended setup.
The following description on how to use it is taken from the README.md file of the Solr to Elasticsearch river / plugin.

Getting started

The first thing you need to do is download the plugin.

Then create a directory called solr-river in the plugins folder of Elasticsearch (create the plugins folder in the Elasticsearch home directory if it does not exist yet). Next, unzip the downloaded archive and put its contents (all the JAR files) in the newly created folder.

Configure the river

The Solr River allows you to query a running Solr instance and index the returned documents in elasticsearch. It uses the SolrJ library to communicate with Solr.

It’s recommended to use the same SolrJ version as the Solr version installed on the server that the river is querying. The SolrJ version in use and distributed with the plugin is 3.6.1. It is still possible to query other Solr versions though: the default response format is javabin, but you can work around compatibility issues by simply switching to the xml format using the wt parameter.

All the common query parameters are supported.

The Solr river is not meant to keep Solr and elasticsearch in sync; that’s why it automatically deletes itself on completion, so that the river doesn’t start up again at every node restart. This is the default behaviour, which can be disabled through the close_on_completion parameter.

Installation

Here is how you can easily create the river and index data from Solr, just providing the solr url and the query to execute:

curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8080/solr/",
        "q" : "*:*"
    }
}'

All supported parameters are optional. The following example request contains all the parameters that are supported together with the corresponding default values applied when not present.

{
    "type" : "solr",
    "close_on_completion" : "true",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "fq" : "",
        "fl" : "",
        "wt" : "javabin",
        "qt" : "",
        "uniqueKey" : "id",
        "rows" : 10
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "bulk_size" : 100,
        "max_concurrent_bulk" : 10,
        "mapping" : "",
        "settings": ""
    }
}

The fq and fl parameters can be provided as either an array or a single value.

You can provide your own mapping while creating the river, as well as the index settings, which will be used when creating the new index if needed.

The index is created if it does not already exist; otherwise the documents are added to the existing index with the configured name.

The documents are indexed using the bulk api. You can control the size of each bulk (default 100) and the maximum number of concurrent bulk operations (default is 10). Once the limit is reached the indexing will slow down, waiting for one of the bulk operations to finish its work; no documents will be lost.
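
For illustration only, here is a rough sketch of the pull-and-bulk-index cycle the river performs. This is not the plugin's actual source code; it assumes SolrJ 3.x, the Elasticsearch Java transport client on the default transport port 9300, and it reuses the river's default index name (solr), type (import), uniqueKey (id), rows and bulk_size values:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class SolrToElasticsearchSketch {

    public static void main(String[] args) throws Exception {
        // Source: a running Solr instance, queried through SolrJ
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/");
        // Target: an Elasticsearch cluster, reached through the Java transport client
        Client es = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        int rows = 10;      // mirrors the river's "rows" parameter
        int bulkSize = 100; // mirrors the river's "bulk_size" parameter

        BulkRequestBuilder bulk = es.prepareBulk();
        for (int start = 0; ; start += rows) {
            // Page through the Solr results for the configured query
            QueryResponse rsp = solr.query(new SolrQuery("*:*").setStart(start).setRows(rows));
            if (rsp.getResults().isEmpty()) {
                break; // no more documents to import
            }
            for (SolrDocument doc : rsp.getResults()) {
                // Only stored fields come back from Solr (see the limitations below)
                Map<String, Object> source = new HashMap<String, Object>();
                for (String field : doc.getFieldNames()) {
                    source.put(field, doc.getFieldValue(field));
                }
                bulk.add(es.prepareIndex("solr", "import", String.valueOf(doc.getFieldValue("id")))
                        .setSource(source));
                if (bulk.numberOfActions() >= bulkSize) {
                    bulk.execute().actionGet(); // send one bulk of documents to elasticsearch
                    bulk = es.prepareBulk();
                }
            }
        }
        if (bulk.numberOfActions() > 0) {
            bulk.execute().actionGet(); // send the remaining documents
        }
        es.close();
    }
}

The real river additionally throttles the number of concurrent bulk requests (max_concurrent_bulk) and deletes itself when the import is done; the sketch only shows the core idea.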

Limitations

  • only stored fields can be retrieved from Solr and therefore indexed in elasticsearch
  • the river is not meant to keep elasticsearch in sync with Solr, but only to import data once. It’s possible to register the river multiple times in order to import different sets of documents though, even from different Solr instances.
  • it’s recommended to create the mapping based on the existing Solr schema in order to apply the correct text analysis while importing the documents. In the future there might be an option to auto-generate it from the Solr schema.

We hope the tool helps. Do share your feedback with us; we’re always interested to hear how it worked out for you, and shout if we can help further with training or consultancy.

How to write an elasticsearch river plugin

January 10th, 2013 by
(http://blog.trifork.com/2013/01/10/how-to-write-an-elasticsearch-river-plugin/)

Up until now I have told you why I think elasticsearch is so cool and how you can use it combined with Spring. It’s now time to get to something a little more technical. For example, once you have a search engine running you need to index data, and when it comes to indexing data you usually need to choose between the push and the pull approach. This blog entry will detail these approaches and go into writing a river plugin for elasticsearch.


University of Amsterdam website goes live

September 11th, 2012 by
(http://blog.trifork.com/2012/09/11/university-of-amsterdam-website-goes-live/)

Congratulations to our client the University of Amsterdam, who today launched their new website. With a new look & feel and a complete update of all the existing content, it’s a showcase of how educational institutions can really provide relevant information to varied target audiences. The site, built by us (Orange11) with Hippo CMS, not only makes it easier and simpler for individual departments to update web content, it also enables the University of Amsterdam (UvA) to increasingly engage, inform and empower its students and teachers.

The CMS is not only the foundation for the main UvA website, it also supports around 100 other subsites within the university. These can be sites for different faculties, but also for initiatives like Spui 25, a project website affiliated with the UvA. The total repository contains around 25,000 documents of various document types, and with this solution in place these documents can be shared across the repository as well as placed in sub-site-specific sections. This makes the content model very flexible, especially for the numerous content editors.

Furthermore, the content in the repository is not only created by editors, but also imported from LDAP and SAP. Besides importing content, the system can now also export it. This large amount of content is all kept in sync with our advanced Solr-based search solution, and Solr is also what we have integrated within the site for searching the content. The search index also includes content from other sources, such as the StudieGids.

A blog on the complete search solution will hopefully follow sometime soon…but for now, take a look around the new UvA website for yourself and let us know if you have any questions.

Summer time…

September 4th, 2012 by
(http://blog.trifork.com/2012/09/04/summer-time/)

For those of you who may have missed our newsletter last week, I’d like to take this opportunity to give you a quick lowdown of what we’ve been up to. The summer months have been far from quiet, and I’m pretty excited to share lots of news on projects, products & upcoming events in this month’s edition.

Hippo & Orange11

The countdown has begun for the launch of the University of Amsterdam online platform. Built by Orange11 with Hippo CMS and developed with multi-platform functionality in mind, the website is a masterpiece of technology all woven together. We’ll keep you posted about the tips & tricks we implemented.

If you can’t wait until then and want more information, contact Jan-Willem van Roekel.

Mobile Apps; just part of the service!

We mentioned in our last newsletter the launch of Learn to write with Tracy. Since then we’ve been working on apps for many customers, including for example The New Motion, a company dedicated to the use of electric vehicles. Orange11 has developed an iPhone app that allows users to view or search charging locations (in list or map form) and even check their real-time availability.

Another example is the app for GeriMedica: Ysis Mobiel, a mobile addition to their existing Electronic Health Record database used largely in geriatric care (also an Orange11 project). The mobile app supports specific work processes and allows registered users to document (in line with the strict regulations) all patient-related interaction through a simple 3-step logging process. A registration overview screen also shows the latest activities registered, which prevents co-workers from accidentally registering the same activity twice.

Visit our website for more on our mobile expertise.

Orange11 & MongoDB 


We’ve got tons of exciting things going on with MongoDB as a trusted implementation partner, so here are a few highlights:

Brown bag sessions

Since the launch of our brown bag sessions we’re excited that so many companies are interested in finding out more about this innovative open source document database. What we offer is a 60-minute slot with an Orange11 & MongoDB expert, who can educate & demonstrate MongoDB best practices & cover how it can be used in practice. It’s our sneak preview of the host of opportunities there are with MongoDB.

Sign up now!

Tech meeting / User Group Meeting

As a partner we’re also proud to host the next user group session on Thursday 6th September, where Chief Technical Director Alvin Richards will be here to cover all the product ins & outs and share some use cases.

Don’t miss out & join us; as always it’s free, with pizza and cold beer on the house!

Coffee, Cookies, Conversation & Customers

Last week we invited some of our customers to a brainstorm session around the new Cookie Law in the Netherlands. Together with Eric Verhelst, a lawyer specialized in the IT industry, intellectual property, internet and privacy, we provided our customers with legal insight and discussed their concerns & ideas around solutions. If you have any questions about the new cookie law and are looking for advice, answers & solutions, contact Peter Meijer.

ElasticSearch has just got bigger

Congratulations to our former CEO, Steven Schuurman, who announced his new venture: ElasticSearch, the company. The company’s product, elasticsearch, is an innovative and advanced open source distributed search engine. With Steven joining forces with elasticsearch founder & originator Shay Banon, and given his background as co-founder of SpringSource, the company behind the popular Spring Framework (also close to our heart at Orange11), it’s bound to be a great success. The company offers users and potential users of elasticsearch a definitive source for support, education and guidance with respect to developing, deploying and running elasticsearch in production environments. As search remains a key focus area for Orange11, with our experience in both Solr and elasticsearch our customers are guaranteed the best search solution available. For more info contact Bram Smeets.

Our team is getting bigger & better

We’re happy to welcome Michel Vermeulen to the team this month. Michel is an experienced Project Manager and will further professionalize our agile development organization. We also have new talent starting next month, BUT there is room for more.

So if you’re a developer and want to work on great projects with a fun team (left: snapshot from our company beach event), call Bram Smeets now.

That’s all for now folks….

There’s More Lucene in Solr than You Think!

April 11th, 2012 by
(http://blog.trifork.com/2012/04/11/theres-more-lucene-in-solr-than-you-think/)

We’ve been providing Lucene & Solr consultancy and training services for quite a few years now and it’s always interesting to see how these two technologies are perceived by different companies and their technical people. More precisely, I find it interesting how little Solr users know about Lucene and, more so, how unaware they are of how important it is to know about it. A quite recurring pattern we notice is that companies looking for a cheap and good search solution hear about Solr and decide to download and play around with it a bit. This is usually done within the context of a small PoC to eliminate initial investment risks. So one or two technical people are responsible for that: they download the Solr distribution and start following the Solr tutorial that is published on the Solr website. They realize that it’s quite easy to get things up and running using the examples Solr ships with and very quickly decide that this is the right way to go. So what do they do next? They take their PoC codebase (including all Solr configurations), slightly modify and extend it just to support their real systems, and in no time they get to the point where Solr can index all the data and then serve search requests. And that’s it… they roll out with it, and very often just put this in production. It is then often the case that after a couple of weeks we get a phone call from them asking for help. And why is that?

Examples are what they are – Just examples

I have always argued that the examples bundled in the Solr distribution serve as a double-edged sword. On one hand, they can be very useful just to showcase how Solr can work and to provide a good reference for the different setups it can have. On the other hand, they give a false sense of security: if the example configurations are good enough for the examples, they’ll be good enough for other systems in production as well. In reality, this is of course far from being the case. The examples are just what they are – examples. It’s most likely that they are far from anything you’d need to support your search requirements. Take the Solr schema, for example: this is one of the most important configuration files in Solr and it contributes many of the factors that influence search quality. Sure, there are certain field types which you can probably always use (the primitive types), but when it comes to text fields and the text analysis process, this is something you need to look closer at and in most cases customize to your needs. Beyond that, it’s also important to understand how different fields behave with respect to the different search functionality you need. What roles (if any) can a field play in the context of these functionalities? For some functionality (e.g. free text search) you need the fields to be analyzed, for other functionality (e.g. faceting) you don’t. You need to have a very clear idea of the search functionality you want to support and, based on that, define which normal/dynamic/copy fields should be configured. The example configurations don’t provide you with this insight, as they target the dummy data and the example functionality they are meant to showcase – not yours! And it’s not just about the schema: the solrconfig.xml in the examples is also far more verbose than you actually need or want it to be. Far too many companies just use these example configurations in their production environment and I find that a pity. Personally, I like to view these configuration files as also serving as a sort of documentation for your search solution – but when they are kept in a mess, full of useless information and redundant configuration, they obviously cannot.

It’s Lucene – not Solr

One of the greater misconceptions with Solr is that it’s a product on its own and that after reading the user manual (which is an overstatement for a semi-structured and messy collection of wiki pages), one can just set it up and put it in production. What people fail to realize is that Solr is essentially just a service wrapper around Lucene, and that the quality of the search solution you’re building largely depends on it. Yeah, sure… Solr provides important additions on top of Lucene like caching and a few enhanced query features (e.g. function queries and the dismax query parser), but the bottom line is that the most influential factors of search quality lie deep down in the schema definition, which essentially determines how Lucene will work under the hood. This obviously requires a proper understanding of Lucene… there’s just no way around it! But honestly, I can’t really “blame” users for getting this wrong. If you look at the public (open and commercial) resources that companies are selling to users, they actually promote this ignorance by presenting Solr as a “stands on its own” product. Books, public trainings and open documentation all hardly discuss Lucene in detail and instead focus more on “how you get Solr to do X, Y, Z”. I find that quite a shame and actually quite misleading. You know what? I truly believe that users are smart enough to figure out – on their own – which parameters they should send Solr to enable faceting on a specific field… come on… these are just request parameters, so let them figure these things out. Instead, I find it much more informative and important to explain to them how faceting actually works under the hood. This way they understand the impact of their actions and configurations and are not left disoriented in the dark once things don’t work as they’d hoped. For this reason we designed our Solr training to incorporate a relatively large portion of Lucene introduction in it. And take it from me… our feedback clearly indicates that users really appreciate it!

So…

There you have it… let it sink in: when downloading Solr, you’re also downloading Lucene. When configuring Solr, you’re also configuring Lucene. And if there are issues with Solr, they are often related to Lucene as well. So to really know Solr, do yourself a favor and start getting to know Lucene! And you don’t need to be a Java developer for that; it’s not the code itself that you need to master. Understanding how Lucene works internally, on a detailed yet conceptual level, should be more than enough for most users.

Faceting & result grouping

April 10th, 2012 by
(http://blog.trifork.com/2012/04/10/faceting-result-grouping/)

Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents that share a common property together and places these documents under a group. These groups are used as the hits in the search result. Usually result grouping and faceting are used together, and a lot of the time the results get misunderstood.

The main reason is that when using grouping, people expect that a hit is represented by a group. Faceting isn’t aware of groups and thus the computed counts represent documents and not groups. This difference in behaviour can be very confusing. A lot of questions on the Solr user mailing list are about this exact confusion.

When result grouping is used with faceting, users expect grouped facet counts. What does this mean? It means that when counting the number of matches for a specific field value, grouped faceting should check whether the group a document belongs to has already been counted. This is best illustrated with some example documents.

item_id  product_id  product_name     product_color  product_size
1        1           The blue jacket  DarkBlue       S
2        1           The blue jacket  DarkBlue       M
3        1           The blue jacket  DarkBlue       L
4        2           The blue blouse  RegularBlue    S
5        2           The blue blouse  RegularBlue    M
6        2           The blue blouse  DarkBlue       L

Let’s say we query for all documents, facet on the product_color field and group by product_id. Using faceting as it is, we would get the following facet counts:

  • DarkBlue – 4
  • RegularBlue – 2

If we were to use grouped faceting instead, we would get the following counts:

  • DarkBlue – 2
  • RegularBlue – 1

The facet counts computed by grouped faceting are actually what most end users expect. The good news is that support for grouped faceting was recently added to Solr and Lucene and will be included in their 4.0 releases. Unfortunately, grouped facets are more expensive to compute than normal facets, due to the fact that they need to keep track of which groups have already been counted for a specific facet value.

Grouped facets in Solr

In Solr, grouped faceting builds on the existing faceting parameters and can simply be enabled with the following parameter, as described on the Solr wiki:
group.facet=true
When enabled, all the field facets that have already been specified (facet.field parameters) will be computed as grouped facets. Both single-valued and multi-valued field facets are supported. Other facet types, like range facets, aren’t supported yet.
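
If you use SolrJ from Java, this boils down to the standard grouping and faceting parameters plus group.facet. Here is a minimal sketch, assuming Solr 4.0 (where group.facet is available); the Solr URL is illustrative and the field names follow the product example above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class GroupedFacetQuery {

    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/");

        SolrQuery query = new SolrQuery("*:*");
        query.set("group", true);                // enable result grouping
        query.set("group.field", "product_id");  // group hits by product
        query.setFacet(true);
        query.addFacetField("product_color");    // a regular field facet
        query.set("group.facet", true);          // count per group instead of per document

        for (FacetField.Count count : solr.query(query).getFacetField("product_color").getValues()) {
            System.out.println(count.getName() + " - " + count.getCount());
        }
    }
}

With the example documents above this prints DarkBlue - 2 and RegularBlue - 1 instead of the plain document counts.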

Grouped facets in Lucene

Grouped facets are implemented as a Lucene collector in the Lucene grouping module. The following code example shows how grouped facets can be used:

boolean facetFieldMultivalued = false;
BytesRef facetPrefix = null;
AbstractGroupFacetCollector groupedAirportFacetCollector = TermGroupFacetCollector.createTermGroupFacetCollector(
    groupField, facetField, facetFieldMultivalued, facetPrefix, 128);
searcher.search(query, groupedAirportFacetCollector); // computes the grouped facet counts
boolean orderFacetEntriesByCount = true;
TermGroupFacetCollector.GroupedFacetResult airportResult = groupedAirportFacetCollector.mergeSegmentResults(
    offset + limit, minCount, orderFacetEntriesByCount);
System.out.println("Total facet hit count: " + airportResult.getTotalCount());
System.out.println("Total facet hit missing count: " + airportResult.getTotalMissingCount());
List<AbstractGroupFacetCollector.FacetEntry> facetEntries = airportResult.getFacetEntries(offset, limit);
for (AbstractGroupFacetCollector.FacetEntry facetEntry : facetEntries) {
  // render facet entries
}

As you can see in the above code sample there are a number of options that can be specified:

  • groupField – The field to group by.
  • facetField – The field to count grouped facets for.
  • facetFieldMultivalued – Whether the facetField has multiple values per document. Computing facet counts for fields with at most one value per document is faster than for fields with more than one value per document.
  • facetPrefix – Count only values that start with the prefix. If the prefix is null, all values that match the query are counted.
  • offset – The offset to start to include facet entries.
  • limit – The number of facet entries to include from the offset.
  • minCount – The minimum count a facet entry needs to have to be included in the facet entries.
  • orderFacetEntriesByCount – Whether to order the facet entries by count.

Not all options are required. There is also a doc values based implementation for grouped facets included in the grouping module; this implementation isn’t used by Solr.

As you can see, it is quite easy to use grouped faceting from both Solr and Lucene. Did you try out this new feature? If so, let us know how grouped faceting is working in your Lucene app or Solr setup by posting a comment!

March newsletter

March 14th, 2012 by
(http://blog.trifork.com/2012/03/14/march-newsletter/)

This month our newsletter is packed full of news and event highlights so happy reading…

Spring Special offer


The sun is shining and spring is in the air, and for that very reason we have launched a special offer for onsite Solr & Lucene training. Our Spring Sale offers 25% off a 2-day training given by our own active and leading Lucene & Solr committers and contributors. The training first covers how the Apache Lucene engine operates and thereafter introduces Solr in detail. It also covers advanced search functionality, data indexing and, last but not least, performance tuning and scalability. For more information, terms & registration visit our website.

Digital assessments using the QTI delivery engine

Perhaps you read in one of our recent blog entries that we are innovating the world of digital assessments. For many working in digital assessments / examinations, the QTI standard may not be a new phenomenon; it’s been around for a while. The interesting part is how it can be used. Orange11 is currently implementing a QTI assessment delivery engine that is opening new possibilities in digital examinations, assessments, publishing & many more areas. We’re currently busy preparing an interesting demo that will be available online in the coming weeks. However, in the meantime if you want to know more about the standard and technology and how we have implemented it, just drop us a note with your contact details and we can set up a meeting or send you more information.

GOTO Amsterdam


Come on, sign up! We’re already very excited and busy preparing for the event, and we anticipate that this year is going to be even bigger & better than last year. The new location of the Beurs van Berlage (Stock Exchange) building is a highlight in itself. As for the top-notch speakers, they include some well-known names in the industry, including Trisha Gee from LMAX & Peter Hilton from Lunatech. Our keynote sessions also look very promising and include sessions by John-Henry Harris from LEGO and Peter-Paul Koch covering The Future of the Mobile Web.

Registration is open and prices go up every day so don’t miss out and sign up now.

Our Apache Whirr wizard

Frank Scholten, one of our Java developers, has been voted in as a committer on Apache Whirr. Whirr is a Java library for quickly setting up services in the cloud. For example, using Whirr you can start a Hadoop cluster on Amazon EC2 in 5 minutes via the whirr command-line tool or its Java API. Whirr can also be used in combination with Puppet to automatically install and configure servers.

Frank has been active in using Apache Whirr and has also contributed his insights to the community site SearchWorkings.org where he has most recently written the blog Mahout support in Whirr. We are proud of his contributions and if you have any specific Apache Whirr question let us know.

Tech meeting 5th April (Amsterdam)

This month:

– Apache HTTP: Even though this project doesn’t need an introduction anymore, to celebrate its 17th birthday (and the recently released version 2.4) we would like to invite you to a presentation of the Apache HTTP server and some of its most used modules.

– Insight into Clojure, including syntax & data structures; sequences, a common interface to rule them all; code as data for a programmable programming language (macros); and much more.

Sign up now!
Don’t worry if you’re not in or around Amsterdam: the slides will be available afterwards via our website.

Join our search specialists at…

Berlin Buzzwords: the event that focuses on scalable search, data-analysis in the cloud and NoSQL-databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers, specific to the three tags “search”, “store” and “scale”. The early bird tickets are available until 20th March, so sign up now to benefit from the special discounted prices. Our own search specialists, together with many of the contributors from the community site SearchWorkings, will be present.

So that’s all for now folks, hope you have enjoyed the update.

Berlin Buzzwords 2012

January 11th, 2012 by
(http://blog.trifork.com/2012/01/11/berlin-buzzwords-2012/)

Yes, Berlin Buzzwords is back on the 4th & 5th of June 2012! This really is the only conference for developers and users of open source software projects focusing on the issues of scalable search, data-analysis in the cloud and NoSQL-databases. All the talks and presentations are specific to three tags: “search”, “store” and “scale”.

Looking back to last year, this event had a great turnout. There were well over 440 attendees, of which 130 were international (from all over, including Israel, the US, the UK, the Netherlands, Italy, Spain, Austria and more), and an impressive line-up of 48 speakers. It was a 2-day event covering 3 tracks of high-quality talks, surrounded by 5 days of workshops, 10 evening events for attendees to mingle with locals, specialized training opportunities, and these are just a few of the activities that were on offer!

What was the outcome? Well let the feedback from some of the delegates tell the story:

“Buzzwords was awesome. A lot of great technical speakers, plenty of interesting attendees and friends, lots of food and fun beer gardens in the evening. I can’t wait until next year!“

“I can’t recommend this conference enough. Top industry speakers, top developers and fantastic organization. Mark this event on your sponsoring calendar!“

“Berlin Buzzwords is by far one of the best conferences around if you care about search, distributed systems, and nosql…“

“Thanks for organizing. My goal was to learn and I learned a lot!“

So to get the ball rolling for this year the call for papers has now officially opened via the website.

You can submit talks on the following topics:

  •  IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  •  NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  •  Hadoop – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are also more than welcome, I’ve been told. The call is for presentations on the implementation of the systems themselves, technical talks, real-world applications and case studies.

What’s more this year there is once again an impressive Program Committee consisting of:

  • Isabel Drost (Nokia, Apache Mahout)
  • Jan Lehnardt (CouchBase, Apache CouchDB)
  • Simon Willnauer (SearchWorkings, Apache Lucene)
  • Grant Ingersoll (Lucid Imagination, Apache Lucene)
  • Owen O’Malley (Hortonworks Inc., Apache Hadoop)
  • Jim Webber (Neo Technology, Neo4j)
  • Sean Treadway (Soundcloud)

For more information, submission details and deadlines visit the conference website.

I am truly looking forward to this event, hope to see you there too!

Compromise is hard

November 22nd, 2011 by
(http://blog.trifork.com/2011/11/22/compromise-is-hard/)

Whenever I talk about my job with friends who are also IT professionals, the aspect they most often envy is that I get to work in a community where everybody has a voice.  Apache Software Foundation projects like Solr and Lucene tend to work from the motto that if it didn’t happen on the mailing list, it didn’t happen.  This means that no matter how experienced you are, how many years you’ve been working on a project, or how knowledgeable you are about an issue, you can always chime in with your 5c worth.  I do have to agree with my friends: this, probably beyond anything else, is what I love most about my job.

Yet this freedom doesn’t come without its downsides.  Discussions on simple issues can quickly snowball into impassioned debates or, worse still, flame-wars.  In Solr and Lucene, where we prefer to come to a consensus on an issue or code change rather than railroading it through, this often means compromises must be made in order to appease all parties.  Let me tell you, as someone who often plays the role of peacemaker, compromise is hard.

To understand what I mean, let’s quickly examine what happens in a corporate environment when an issue arises about, let’s say, a product.  Generally a group of people who know about the product will be brought together to discuss how to address the issue.  Best case scenario, consensus is immediately reached about how best to go forward.  Worst case, the group fragments into sub-groups with different opinions and a debate breaks out.  However, even with different opinions, the sub-groups are motivated to compromise by the fact that:

  • Without agreement, the issue will stall, the product be ruined, the company potentially doomed and jobs lost. People like keeping their jobs.
  • Developing the product is what they’re paid to do
  • Management will most likely make its own decision if agreement is not reached

Compromise will be reached, even if it means some individuals having to relent entirely.

In comparison, none of these motivations exist in the discussion about an issue in Solr or Lucene.  All committers are equal, there is no management which will make a decision for the community and a single committer can veto a change if they have a valid technical reason.  Committers come from a wide variety of backgrounds and cultures.  Some are employed solely to work on the projects, some work for organisations that use the project artifacts in their own projects, some may very well be users or hobbyists who have invested some of their own time.  There is no overriding corporate entity, no community agreed strategy.  Communication is done via mailing lists and JIRA issues which provide a degree of anonymity and physical distance.

Consequently, when a committer finds themselves disagreeing with a change being suggested, there is not necessarily any reason for them to compromise.  Assuming they have a valid technical reason for disagreeing, they can be as stubborn and unrelenting as they like.  Their livelihood will most likely be unaffected if another’s issue doesn’t go forward.  Other issues will continue to be developed and there is very little likelihood that the projects will stall.  Trying to find a compromise, or to encourage others to, in this sort of environment can be very challenging.

Yet it is exactly this kind of environment which has led to many successes in Solr and Lucene.  Just at that moment where all parties involved have thrown their toys from the cot and hair is being pulled, time and time again someone has stepped forward with a new idea or a new direction which all parties agree on.  Whether it be the usage of a certain design pattern, the same pattern again, or the naming of core functionality, compromises have eventually been made and agreements found.  Although I so often hate it at the time, I sometimes do wonder whether this is actually what I love most about my job.

We need you

Hopefully, in a roundabout way, I’ve shown that being part of the Solr and Lucene community, albeit challenging, is also very rewarding.  If you’re either tired of having your voice ignored, passionate about open source search, or just like being part of a lively debate, I encourage you to get involved in the community, become part of mailing list discussions and contribute to issues.

Simon says: optimize is bad for you….

November 21st, 2011 by
(http://blog.trifork.com/2011/11/21/simon-says-optimize-is-bad-for-you/)

In the upcoming Apache Lucene 3.5 release we deprecated an old and long-standing method on IndexWriter: IndexWriter#optimize(), which almost everyone who has ever used Lucene knows. I expect a lot of users to ask why we did this; well, that is one of the reasons I wrote this blog. Let me go back a couple of steps and first try to explain what optimize did and, even more importantly, why previous versions of Lucene had this option at all.

Lucene writes segments?

One of the principles in Lucene since day one is the write-once policy: we never write a file twice. When you add a document via IndexWriter it gets indexed into memory, and once we have reached a certain threshold (max buffered documents or RAM buffer size) we write all the documents from main memory to disk; you can find out more about this here and here. Writing documents to disk produces an entirely new index called a segment. Now, when you index a bunch of documents or run incremental indexing in production, you can see the number of segments changing frequently. However, once you call commit, Lucene flushes its entire RAM buffer into segments, syncs them and writes pointers to all segments belonging to this commit into the SEGMENTS file.

So far so good. But since Lucene never changes files, how does it update documents? The truth is it doesn’t. In Lucene an update is just an atomic add & delete, meaning that Lucene adds the updated document to the index and marks all previous versions as deleted. Alright, but how do we get rid of deleted documents then?
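
To make the flush threshold and the add & delete semantics concrete, here is a minimal sketch against the Lucene 3.5 API (the in-memory directory and the field names are just illustrations, not something from this post):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class UpdateIsAddPlusDelete {

    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
                new StandardAnalyzer(Version.LUCENE_35));
        config.setRAMBufferSizeMB(32.0); // flush a new segment once 32MB of documents are buffered
        IndexWriter writer = new IndexWriter(dir, config);

        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", "first version", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit(); // flush, sync and record the segments belonging to this commit

        // An "update" is really an add plus a delete: the new version is added,
        // the old one is only marked as deleted until a merge expunges it
        Document newVersion = new Document();
        newVersion.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        newVersion.add(new Field("title", "second version", Field.Store.YES, Field.Index.ANALYZED));
        writer.updateDocument(new Term("id", "42"), newVersion);
        writer.commit();

        writer.close();
        dir.close();
    }
}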

Housekeeping & Per-Segment Search?

Obviously, Lucene needs to do some housekeeping here. What happens under the hood is that from time to time segments are merged into (usually) bigger segments to:

  • reduce the number of segments to be searched
  • expunge deleted documents (influences scoring due to their contribution to Document Frequency)
  • keep the number of file handles small (Lucene tends to have a fair few files open)
  • reduce disk space

All this happens in the background, controlled by a configurable MergePolicy.  The MergePolicy takes care of keeping the number of segments balanced and merges them together when needed. I don’t want to go into the details of merging here, which is clearly way out of scope for this post – maybe I or someone else will come back to this another time.  Yet there is another way of forcing merges to happen: you can call IndexWriter#optimize(), which merges all existing segments into one large segment.

Optimize sounds like a very powerful operation, right? It certainly is powerful, but “if all you have is a hammer, everything looks like a nail”.  Back in earlier versions of Lucene (before 2.9), Lucene treated the underlying index as one big index, and reopening the IndexReader invalidated all data structures & caches. This has changed quite a lot towards a per-segment orientation. Almost all structures in Lucene now work on a per-segment basis, which means that we only load or reopen what has changed instead of the entire index. To a user it might still look like one big index, but once you look a little under the hood you see everything works per-segment, like this IndexSearcher snippet:

 public void search(Weight weight, Filter filter, Collector collector)
      throws IOException {
    // iterate through all segment readers & execute the search
    for (int i = 0; i < subReaders.length; i++) {
      // pass the reader to the collector 
      collector.setNextReader(subReaders[i], docBase + docStarts[i]);
      final Scorer scorer = ...;
      if (scorer != null) { // score documents on this segment
        scorer.score(collector);
      }
    }
  }
Figure 1. Executing searches across segments in IndexSearcher

Each search you execute in Lucene runs on each segment in the index sequentially, unless you have an optimized index. Well, that sounds like you should optimize all the time, right? Wrong! Think about it again: optimizing your index will build one large segment out of your maybe 5 or 10 or N segments, and this has several side-effects:
  • enormous amount of IO when merging into one big segment.
  • it can take hours when your index is large
  • reopen can cause unexpected memory peaks

You say this doesn’t sound that bad? Well, if you run this in production with a large index, optimizing can have a large impact on your system’s performance. Lucene’s bread and butter is the filesystem cache it uses for searching. During a merge you invalidate lots of the disk cache, which is in turn not available for the segments currently being searched.  Once you reopen the index, all data needs to be loaded into the disk cache again, field caches need to be created, term dictionaries loaded, etc., and last but not least you are likely doubling the disk space required to hold your index, as old segments are still referenced while optimize is running.

Death to optimize, here comes forceMerge(1)

As I mentioned above, IndexWriter#optimize() is deprecated and on its way out as of Lucene 3.5. If you still want to explicitly invoke an optimize-like merge, you can use IndexWriter#forceMerge(int), where you specify the maximum number of segments left after the merge finishes. The functionality is still there, but we hope that fewer people feel like calling this together with each commit. If you use optimize extensively, think about it again and give your disks a break.
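
To make that concrete, here is a minimal sketch of the replacement call in Lucene 3.5 (the index path is illustrative); only use it when you really need a fully merged index:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ForceMergeExample {

    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

        // Previously: writer.optimize() -- now an explicit merge down to at most one segment
        writer.forceMerge(1);
        writer.commit();

        writer.close();
        dir.close();
    }
}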