Trifork Blog

Axon Framework, DDD, Microservices

Posts Tagged ‘Apache Lucene’

Migrating Apache Solr to Elasticsearch

January 29th, 2013
(http://blog.trifork.com/2013/01/29/migrating-apache-solr-to-elasticsearch/)

Elasticsearch is an innovative and advanced open source distributed search engine, based on Apache Lucene. Over the past several years, we at Trifork have been doing a lot of search implementations. Driven by the fact that every other customer wanted the ‘Google experience’ (just a text box: type some text and get relevant results) as part of their application, we started out building our own solutions on top of Apache Lucene. That worked quite well, as Lucene is the de facto standard when it comes to information retrieval. But soon enough, inspired by the likes of Amazon, CNet and Funda in the Netherlands, people wanted to offer their users more ways to drill down into the search results by using facets. We briefly started our own (now discontinued) open source project, FacetSearch, but Solr quickly started getting traction and we decided to jump on that bandwagon.

Starting with Solr

So we started using Solr for our projects and became vocal about our capabilities, which led to even more (international) Solr consultancy and training work. And as Trifork is not in the game to just use open source, but also to contribute back to the community, this has led to several contributions (spatial search, grouping, etc.) and eventually to having several committers on the Lucene (now including Solr) project.

We go back a long way…

Around the time we were getting well into Solr, Shay Banon, whom we knew from our SpringSource days, started creating his own scalable search solution, Elasticsearch. Although it was, from a technical perspective, a better choice for building scalable search solutions, we didn’t adopt it from the beginning. The main reason was that it was basically a one-man show (a very good one at that, I might add!), and we didn’t feel comfortable recommending Elasticsearch to our customers: if Shay got hit by a bus, it would mean the end of the project. Luckily all this changed when Shay and some of the old crew from JTeam (the rest of JTeam is now Trifork Amsterdam) decided to join forces and launch Elasticsearch.com, the commercial company behind Elasticsearch. Now it’s all systems go: what was then our main hurdle has been removed, so we can use Elasticsearch and, moreover, guarantee continuity for the project.

Switching from Solr to Elasticsearch

Obviously we are not alone in the world and not that unique in our opinions, so we were not the only ones to change our strategy around search solutions. Many others started considering Elasticsearch, doing comparisons and eventually switching from Solr to Elasticsearch. We still regularly get requests to help companies make the comparison, and although there are still reasons why you may want to go for Solr, in the majority of cases (especially when scalability and real-time search are important) the balance more often than not tips in favor of Elasticsearch.

This is why Luca Cavanna from Trifork has written a plugin (a river) for Elasticsearch that will help you migrate from your existing Solr setup to Elasticsearch. It basically pulls the content from an existing Solr cluster and indexes it in Elasticsearch. Using this plugin allows you to easily set up an Elasticsearch cluster next to your existing Solr, which helps you get up to speed quickly and therefore enables a smooth transition. Obviously, this tool is mostly meant for that purpose, to help you get started. When you decide to switch to Elasticsearch permanently, you would of course change your indexing pipeline to index content from your sources directly into Elasticsearch; keeping Solr in the middle is not a recommended setup.
The following description of how to use it is taken from the README.md file of the Solr to Elasticsearch river / plugin.

Getting started

The first thing you need to do is download the plugin.

Then create a directory called solr-river in the plugins folder of Elasticsearch (create the plugins folder in the Elasticsearch home directory first, if it does not exist yet). Next, unzip the download and put its contents (all the JAR files) in the newly created folder.

Configure the river

The Solr River allows you to query a running Solr instance and index the returned documents in elasticsearch. It uses the Solrj library to communicate with Solr.

It’s recommended that the Solrj version used matches the Solr version installed on the server that the river is querying. The Solrj version distributed with the plugin is 3.6.1. It is still possible to query other Solr versions: the default format used is javabin, but you can work around compatibility issues by simply switching to the xml format using the wt parameter.

All the common query parameters are supported.

The Solr river is not meant to keep Solr and elasticsearch in sync; that’s why it automatically deletes itself on completion, so that the river doesn’t start up again at every node restart. This is the default behaviour, which can be disabled through the close_on_completion parameter.

Installation

Here is how you can easily create the river and index data from Solr, providing just the Solr url and the query to execute:

curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8080/solr/",
        "q" : "*:*"
    }
}'

All supported parameters are optional. The following example request contains all the supported parameters, together with the corresponding default values that are applied when a parameter is not present.

{
    "type" : "solr",
    "close_on_completion" : "true",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "fq" : "",
        "fl" : "",
        "wt" : "javabin",
        "qt" : "",
        "uniqueKey" : "id",
        "rows" : 10
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "bulk_size" : 100,
        "max_concurrent_bulk" : 10,
        "mapping" : "",
        "settings": ""
    }
}

The fq and fl parameters can be provided as either an array or a single value.

You can provide your own mapping while creating the river, as well as the index settings, which will be used when creating the new index if needed.

The index is created if it does not already exist; otherwise the documents are added to the existing index with the configured name.

The documents are indexed using the bulk API. You can control the size of each bulk (default 100) and the maximum number of concurrent bulk operations (default 10). Once the limit is reached, indexing will slow down, waiting for one of the bulk operations to finish; no documents will be lost.

Limitations

  • only stored fields can be retrieved from Solr, and therefore only those are indexed in elasticsearch
  • the river is not meant to keep elasticsearch in sync with Solr, but only to import data once; it’s possible to register the river multiple times in order to import different sets of documents though, even from different Solr instances
  • it’s recommended to create the mapping based on the existing Solr schema in order to apply the correct text analysis while importing the documents; in the future there might be an option to auto-generate it from the Solr schema

We hope the tool helps; do share your feedback with us, as we’re always interested to hear how it worked out for you, and shout if we can help further with training or consultancy.

How to write an elasticsearch river plugin

January 10th, 2013
(http://blog.trifork.com/2013/01/10/how-to-write-an-elasticsearch-river-plugin/)

Up until now I have told you why I think elasticsearch is so cool and how you can use it combined with Spring. It’s now time to get to something a little more technical. For example, once you have a search engine running you need to index data, and when it comes to indexing data you usually need to choose between the push and the pull approach. This blog entry will detail these approaches and go into writing a river plugin for elasticsearch.

Read the rest of this entry »

What’s so cool about elasticsearch?

September 25th, 2012
(http://blog.trifork.com/2012/09/25/whats-so-cool-about-elasticsearch/)

Whenever there’s a new product out there and you start using it, or suggest it to customers or colleagues, you need to be prepared to answer the question: “Why should I use it?”. Well, the answer could be as simple as “Because it’s cool!”, which of course is the case with elasticsearch, but at some point you may need to explain why. I recently had to answer the question “So what’s so cool about elasticsearch?”, which is why I thought it might be worthwhile sharing my own answer in this blog.

Read the rest of this entry »

Elasticsearch beyond “Big Data” – running elasticsearch embedded

September 13th, 2012
(http://blog.trifork.com/2012/09/13/elasticsearch-beyond-big-data-running-elasticsearch-embedded/)

Trifork has a long track record of doing projects, training and consulting around open source search technologies. Currently we are working on several interesting search projects using elasticsearch. Elasticsearch is an open source, distributed, RESTful search engine built on top of Apache Lucene. In contrast to, for instance, Apache Solr, elasticsearch is built as a highly scalable distributed system from the ground up, allowing you to shard and replicate multiple indices over a large number of nodes. This architecture makes scaling from one server to several hundred a breeze. But it turns out elasticsearch is not only good for what everyone calls “Big Data”; it is also very well suited for indexing only small amounts of documents, and even for running elasticsearch embedded within an application, while still providing the flexibility to scale up later when needed.
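
To give a feel for what running embedded means, here is a minimal sketch that starts a local elasticsearch node inside the JVM and indexes a single document. It assumes an elasticsearch version from around the time of this post (the NodeBuilder API); the index name, type and document are made up for illustration.

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class EmbeddedElasticsearchExample {

    public static void main(String[] args) {
        // Start a node that lives inside this JVM and is not exposed to the network
        Node node = nodeBuilder().local(true).node();
        Client client = node.client();

        // Index a single document into a (hypothetical) "articles" index
        client.prepareIndex("articles", "article", "1")
              .setSource("{\"title\": \"Running elasticsearch embedded\"}")
              .execute()
              .actionGet();

        client.close();
        node.close();
    }
}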

Read the rest of this entry »

There’s More Lucene in Solr than You Think!

April 11th, 2012
(http://blog.trifork.com/2012/04/11/theres-more-lucene-in-solr-than-you-think/)

We’ve been providing Lucene & Solr consultancy and training services for quite a few years now and it’s always interesting to see how these two technologies are perceived by different companies and their technical people. More precisely, I find it interesting how little Solr users know about Lucene and, more so, how unaware they are of how important it is to know about it. A quite recurring pattern we notice is that companies, looking for a cheap and good search solution, hear about Solr and decide to download and play around with it a bit. This is usually done within the context of a small PoC to eliminate initial investment risks. So one or two technical people are made responsible for that: they download the Solr distribution and start following the Solr tutorial that is published on the Solr website. They realize that it’s quite easy to get things up and running using the examples Solr ships with and very quickly decide that this is the right way to go. So what do they do next? They take their PoC codebase (including all Solr configurations), slightly modify and extend it, just to support their real systems, and in no time they get to the point where Solr can index all the data and then serve search requests. And that’s it… they roll out with it, and very often just put this in production. It is then often the case that after a couple of weeks we get a phone call from them asking for help. And why is that?

Examples are what they are – Just examples

I have always argued that the examples bundled in the Solr distribution are a double-edged sword. On one hand, they can be very useful to showcase how Solr can work and provide a good reference for the different setups it can have. On the other hand, they give a false sense of security: if the example configurations are good enough for the examples, they’ll be good enough for other systems in production as well. In reality, this is of course far from being the case. The examples are just what they are – examples. It’s most likely that they are far from anything you’d need to support your search requirements. Take the Solr schema for example. This is one of the most important configuration files in Solr, which contributes many of the factors that will influence the search quality. Sure, there are certain field types which you can probably always use (the primitive types), but when it comes to text fields and the text analysis process – this is something you need to look at more closely and in most cases customize to your needs. Beyond that, it’s also important to understand how different fields behave with respect to the different search functionality you need. What roles (if any) can a field play in the context of these functionalities? For some functionalities (e.g. free text search) you need the fields to be analyzed, for others (e.g. faceting) you don’t. You need to have a very clear idea of the search functionalities you want to support, and based on that, define what normal/dynamic/copy fields should be configured. The example configurations don’t provide you with this insight, as they target the dummy data and the example functionality they are meant to showcase – not yours! And it’s not just about the schema; the solrconfig.xml in the examples is also far more verbose than you actually need/want it to be. Far too many companies just use these example configurations in their production environment and I just find it a pity. Personally, I like to view these configuration files as also serving as some sort of documentation for your search solution – but when they are kept in a mess, full of useless information and redundant configuration, they obviously cannot serve that purpose.

It’s Lucene – not Solr

One of the greater misconceptions about Solr is that it’s a product on its own and that by reading the user manual (which is an overstatement for a semi-structured and messy collection of wiki pages), one can just set it up and put it in production. What people fail to realize is that Solr is essentially just a service wrapper around Lucene, and that the quality of the search solution you’re building largely depends on it. Yeah, sure… Solr provides important additions on top of Lucene like caching and a few enhanced query features (e.g. function queries and the dismax query parser), but the bottom line is that the most influential factors of the search quality lie deep down in the schema definition, which essentially determines how Lucene will work under the hood. This obviously requires a proper understanding of Lucene… there’s just no way around it! But honestly, I can’t really “blame” users for getting this wrong. If you look at the public (open and commercial) resources that companies are selling to the users, they actually promote this ignorance by presenting Solr as a product that stands on its own. Books, public trainings and open documentation all hardly discuss Lucene in detail and instead focus more on “how you get Solr to do X, Y, Z”. I find it quite a shame and actually quite misleading. You know what? I truly believe that users are smart enough to understand – on their own – what parameters they should send Solr to enable faceting on a specific field… come on… these are just request parameters, so let them figure these things out. Instead, I find it much more informative and important to explain to them how faceting actually works under the hood. This way they understand the impact of their actions and configurations and are not left disoriented in the dark once things don’t work as they’d hoped. For this reason we actually designed our Solr training to incorporate a relatively large portion of Lucene introduction in it. And take it from me… our feedback clearly indicates that users really appreciate it!

So…

There you have it… let it sink in: when downloading Solr, you’re also downloading Lucene. When configuring Solr, you’re also configuring Lucene. And if there are issues with Solr, they are often related to Lucene as well. So to really know Solr, do yourself a favor and start getting to know Lucene! And you don’t need to be a Java developer for that; it’s not the code itself that you need to master. Knowing how Lucene works internally, on a detailed yet conceptual level, should be more than enough for most users.

Faceting & result grouping

April 10th, 2012
(http://blog.trifork.com/2012/04/10/faceting-result-grouping/)

Result grouping and faceting are in essence two different search features. Faceting counts the number of hits for specific field values matching the current query. Result grouping groups documents that share a common property together and places these documents under a group; these groups are then used as the hits in the search result. Usually result grouping and faceting are used together, and a lot of the time the results get misunderstood.

The main reason is that when using grouping, people expect a hit to be represented by a group. Faceting isn’t aware of groups, and thus the computed counts represent documents, not groups. This difference in behaviour can be very confusing, and a lot of questions on the Solr user mailing list are about exactly this confusion.

When result grouping is used with faceting, users expect grouped facet counts. What does this mean? It means that when counting the number of matches for a specific field value, grouped faceting should check whether the group a document belongs to hasn’t already been counted. This is best illustrated with some example documents.

item_id  product_id  product_name     product_color  product_size
1        1           The blue jacket  DarkBlue       S
2        1           The blue jacket  DarkBlue       M
3        1           The blue jacket  DarkBlue       L
4        2           The blue blouse  RegularBlue    S
5        2           The blue blouse  RegularBlue    M
6        2           The blue blouse  DarkBlue       L

Let’s say we query for all documents, facet on the product_color field and group by product_id. Using faceting as it is, we would get the following facet counts:

  • DarkBlue – 4
  • RegularBlue – 2

If we were to use grouped faceting instead, we would get the following counts:

  • DarkBlue – 2
  • RegularBlue – 1

The facet counts computed by grouped faceting are actually what most end users expect. The good news is that support for grouped faceting was recently added to Solr and Lucene and will be included in their 4.0 releases. Unfortunately, grouped facets are more expensive to compute than normal facets, because the implementation needs to keep track of which groups have already been counted for a specific facet value.
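
To make that bookkeeping concrete, here is a small conceptual sketch in plain Java (not the actual Lucene implementation) that counts facet values per group rather than per document, using the example documents above:

import java.util.*;

public class GroupedFacetSketch {

    public static void main(String[] args) {
        // item_id, product_id (the group) and product_color (the facet value) from the example table
        String[][] docs = {
            {"1", "1", "DarkBlue"}, {"2", "1", "DarkBlue"}, {"3", "1", "DarkBlue"},
            {"4", "2", "RegularBlue"}, {"5", "2", "RegularBlue"}, {"6", "2", "DarkBlue"}
        };

        // For each facet value, remember which groups have already been counted
        Map<String, Set<String>> groupsPerFacetValue = new HashMap<String, Set<String>>();
        for (String[] doc : docs) {
            String group = doc[1];
            String facetValue = doc[2];
            Set<String> seenGroups = groupsPerFacetValue.get(facetValue);
            if (seenGroups == null) {
                seenGroups = new HashSet<String>();
                groupsPerFacetValue.put(facetValue, seenGroups);
            }
            seenGroups.add(group); // a group is counted at most once per facet value
        }

        // Prints DarkBlue - 2 and RegularBlue - 1 (in no particular order)
        for (Map.Entry<String, Set<String>> entry : groupsPerFacetValue.entrySet()) {
            System.out.println(entry.getKey() + " - " + entry.getValue().size());
        }
    }
}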

Grouped facets in Solr

In Solr, grouped faceting builds on the existing faceting parameters and can simply be enabled with the following parameter, as described on the Solr wiki:
group.facet=true
When enabled, all the field facets that are already specified (facet.field parameters) will be computed as grouped facets. Both single and multivalued field facets are supported. Other facet types like range facets aren’t supported yet.
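
Putting this together with the example documents above, a grouped faceting request to Solr could look roughly like this (a sketch; the field names are the hypothetical ones from the example table):

http://localhost:8983/solr/select?q=*:*&group=true&group.field=product_id&facet=true&facet.field=product_color&group.facet=true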

Grouped facets in Lucene

Grouped facets are implemented as a Lucene collector in the Lucene grouping module. The following code example shows how grouped facets can be used:

boolean facetFieldMultivalued = false;
BytesRef facetPrefix = null;
// groupField, facetField, offset, limit and minCount are assumed to be defined elsewhere
AbstractGroupFacetCollector groupedAirportFacetCollector = TermGroupFacetCollector.createTermGroupFacetCollector(groupField, facetField, facetFieldMultivalued, facetPrefix, 128);
searcher.search(query, groupedAirportFacetCollector); // Computing the grouped facet counts
boolean orderFacetEntriesByCount = true;
TermGroupFacetCollector.GroupedFacetResult airportResult = groupedAirportFacetCollector.mergeSegmentResults(offset + limit, minCount, orderFacetEntriesByCount);
System.out.println("Total facet hit count: " + airportResult.getTotalCount());
System.out.println("Total facet hit missing count: " + airportResult.getTotalMissingCount());
List<AbstractGroupFacetCollector.FacetEntry> facetEntries = airportResult.getFacetEntries(offset, limit);
for (AbstractGroupFacetCollector.FacetEntry facetEntry : facetEntries) {
  // render facet entries
}

As you can see in the above code sample there are a number of options that can be specified:

  • groupField – The field to group by.
  • facetField – The field to count grouped facets for.
  • facetFieldMultivalued – Whether the facetField has multiple values per document. Computing facet counts for fields with at most one value per document is faster than for fields with more than one value per document.
  • facetPrefix – Count only values that start with the prefix. If the prefix is null, all values that match the query are counted.
  • offset – The offset to start to include facet entries.
  • limit – The number of facet entries to include from the offset.
  • minCount – The minimum count a facet entry needs to have to be included in the facet entries.
  • orderFacetEntriesByCount – Whether to order the facet entries by count.

Not all options are required. There is also a doc values based implementation for grouped facets included in the grouping module; this implementation isn’t used by Solr.

As you can see, it is quite easy to use grouped faceting from both Solr and Lucene. Did you try out this new feature? If so, let us know how grouped faceting is working in your Lucene app or Solr setup by posting a comment!

Result grouping made easier

March 26th, 2012
(http://blog.trifork.com/2012/03/26/result-grouping-made-easier/)

Lucene has had result grouping for a while now, as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors. As a Lucene user you need to use several of these Collectors in your searches. However, these Collectors have many constructor arguments, so using grouping in pure Lucene apps can become quite cumbersome. The example below illustrates this.

Result grouping using the grouping collectors directly
TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset + topNGroups);

boolean cacheScores = true;
double maxCacheRAMMB = 4.0;
CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

boolean fillFields = true;
Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);

if (topGroups == null) {
  // No groups matched
  return;
}

boolean getScores = true;
boolean getMaxScores = true;
TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort,
    docOffset + docsPerGroup, getScores, getMaxScores, fillFields);

TermAllGroupsCollector allGroupsCollector = new TermAllGroupsCollector("author");
Collector secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);

if (cachedCollector.isCached()) {
  // Cache fit within maxCacheRAMMB, so we can replay it:
  cachedCollector.replay(secondPassCollector);
} else {
  // Cache was too large; must re-execute query:
  s.search(new TermQuery(new Term("content", searchTerm)), secondPassCollector);
}

TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
groupsResult = new TopGroups<BytesRef>(groupsResult, allGroupsCollector.getGroupCount());

// Render groupsResult...

In the above example basic grouping with caching is used and the group count is also retrieved. As you can see, there is quite a lot of coding involved. Recently a grouping convenience utility was added to the Lucene grouping module to alleviate this problem. As the code example below illustrates, using the GroupingSearch utility is much easier than interacting with the actual grouping collectors.

Normally the document count is returned as the hit count. However, in the situation where groups rather than documents are used as hits, the document count will not work with pagination. For this reason the group count can be used to get correct pagination. The group count is the number of unique groups matching the query, and it can in this case be used as the hit count since the individual hits are groups.

Result grouping using the GroupingSearch utility
GroupingSearch groupingSearch = new GroupingSearch("author");
groupingSearch.setGroupSort(groupSort);
groupingSearch.setFillSortFields(fillFields);
groupingSearch.setCachingInMB(4.0, true);
groupingSearch.setAllGroups(true);
TermQuery query = new TermQuery(new Term("content", searchTerm));
TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
Integer totalGroupCount = result.totalGroupCount; // The group count if setAllGroups is set to true, otherwise this is null
// Render result...

The GroupingSearch utility has only been added to trunk, meaning that it will be released with the Lucene 4.0 release. If you can’t wait you can always use a nightly build or check out the trunk yourself. It is important to keep in mind that the GroupingSearch utility uses the already existing grouping collectors to perform the actual grouping. The GroupingSearch utility has a different constructor for each of the four grouping types: grouping by indexed terms, by function, by doc values and by doc block. The first one is used in the example above; the rest are described below.

Result grouping by function
 FloatFieldSource field1 = new FloatFieldSource("field1");
FloatFieldSource field2 = new FloatFieldSource("field2");
SumFloatFunction sumFloatFunction = new SumFloatFunction(new ValueSource[]{field1, field2});
GroupingSearch groupingSearch = new GroupingSearch(sumFloatFunction, new HashMap<Object, Object>());
...
TopGroups<MutableValue> result = groupingSearch.search(searcher, query, 0, 10);

Grouping by function uses the ValueSource abstraction from the Lucene queries module; consequently the grouping module depends on the queries module. In the above example grouping is performed on the sum of field1 and field2. The group value in the result is always of type MutableValue when grouping by a function.

Result grouping by doc values
boolean diskResident = true;
DocValues.Type docValuesType = DocValues.Type.BYTES_VAR_SORTED;
GroupingSearch groupingSearch = new GroupingSearch("author", docValuesType, diskResident);
...
TopGroups<BytesRef> result1 = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

// Grouping by var int doc values
DocValues.Type intDocValuesType = DocValues.Type.VAR_INTS;
GroupingSearch intGroupingSearch = new GroupingSearch("author", intDocValuesType, diskResident);
...
TopGroups<Long> result2 = intGroupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

Grouping by doc values requires you to specify a DocValues.Type up front, as well as whether the doc values should be read disk resident. It is important that the DocValues.Type is the same as the one used when indexing the data. A different DocValues.Type leads to a different group type in the result, as you can see in the above code sample: DocValues.Type.VAR_INTS results in a Long type and DocValues.Type.BYTES_VAR_SORTED in a BytesRef type.

Result grouping by doc block
Filter lastDocInBlock = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupend", "x"))));
GroupingSearch groupingSearch = new GroupingSearch(lastDocInBlock);
Query query = new TermQuery(new Term("content", "random"));
TopGroups<?> result = groupingSearch.search(indexSearcher, query, 0, 10);
// Render result...

Grouping by doc block requires you to specify a Filter that marks the last document of each block. Obviously, your data has to be indexed as a block using the IndexWriter’s addDocuments(…) method, as sketched below.
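
As an illustration, here is a minimal sketch of indexing one such block, where the last document of the block carries the groupend marker that the filter above looks for (the Lucene 3.x-style Field API is assumed here; on trunk/4.0 the newer field types would be used instead, and indexWriter is an already opened IndexWriter):

List<Document> block = new ArrayList<Document>();

Document doc1 = new Document();
doc1.add(new Field("content", "some random content", Field.Store.YES, Field.Index.ANALYZED));
block.add(doc1);

Document doc2 = new Document();
doc2.add(new Field("content", "more random content", Field.Store.YES, Field.Index.ANALYZED));
// The last document of the block gets the marker field the Filter matches on
doc2.add(new Field("groupend", "x", Field.Store.NO, Field.Index.NOT_ANALYZED));
block.add(doc2);

// The documents are added as one contiguous block
indexWriter.addDocuments(block);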

The GroupingSearch utility class doesn’t cover all use cases yet. It only works locally, meaning it doesn’t help you with distributed grouping, and it also lacks a few features like grouped facets. I think this utility class is a good start to make result grouping a bit easier to use than it was before. The work on making result grouping easier to use for pure Lucene apps hasn’t finished yet, and features like distributed grouping will be made easier to use as well.

Have you used result grouping in your search solution either directly with Lucene or via Solr? Adding result grouping to your search solution on a large scale can be challenging! Let us know how you solved your requirements with result grouping by posting a comment.

March newsletter

March 14th, 2012
(http://blog.trifork.com/2012/03/14/march-newsletter/)

This month our newsletter is packed full of news and event highlights so happy reading…

Spring Special offer

The sun is shining and spring is in the air, and for that very reason we have launched a special offer for onsite Solr & Lucene training. Our Spring Sale offers 25% off a 2-day training given by our own active and leading Lucene & Solr committers and contributors. The training first covers how the Apache Lucene engine operates and then introduces Solr in detail. It also covers advanced search functionality, data indexing and, last but not least, performance tuning and scalability. For more information, terms & registration, visit our website.

Digital assessments using the QTI delivery engine

Perhaps you read in one of our recent blog entries that we are innovating the world of digital assessments. For many working in digital assessments / examinations, the QTI standard may not be a new phenomenon; it’s been around for a while. The interesting part is how it can be used. Orange11 is currently implementing a QTI assessment delivery engine that is opening new possibilities in digital examinations, assessments, publishing & many more areas. We’re currently busy preparing an interesting demo that will be available online in the coming weeks. However, in the meantime if you want to know more about the standard and technology and how we have implemented it, just drop us a note with your contact details and we can set up a meeting or send you more information.

GOTO Amsterdam

Come on, sign up! We’re already very excited and busy preparing for the event, and we anticipate that this year is going to be even bigger & better than last year. The new location, the Beurs van Berlage (Stock Exchange) building, is a highlight in itself. As for the top-notch speakers, they include some well-known names in the industry, including Trisha Gee from LMAX & Peter Hilton from Lunatech. Our keynote sessions also look very promising and include sessions by John-Henry Harris from LEGO and Peter-Paul Koch covering The Future of the Mobile Web.

Registration is open and prices go up every day so don’t miss out and sign up now.

Our Apache Whirr wizard

Frank Scholten, one of our Java developers, has been voted in as a committer on Apache Whirr. Whirr is a Java library for quickly setting up services in the cloud. For example, using Whirr you can start a Hadoop cluster on Amazon EC2 in 5 minutes via the whirr command-line tool or its Java API. Whirr can also be used in combination with Puppet to automatically install and configure servers.

Frank has been active in using Apache Whirr and has also contributed his insights to the community site SearchWorkings.org, where he most recently wrote the blog Mahout support in Whirr. We are proud of his contributions, and if you have any specific Apache Whirr questions, let us know.

Tech meeting 5th April (Amsterdam)

This month:

– Apache HTTP: even though this project doesn’t need an introduction anymore, to celebrate its 17th birthday (and the recently released version 2.4) we would like to invite you to a presentation of the Apache HTTP server and some of its most used modules.

– Insight into Clojure, including syntax & data structures, a common interface to rule them all: sequences, code as data for a programmable programming language (macros) and much more.

Sign up now!
And don’t worry if you’re not in or around Amsterdam: the slides will be available afterwards via our website.

Join our search specialists at…

Berlin Buzzwords, the event that focuses on scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers, specific to the three tags “search”, “store” and “scale”. The early bird tickets are available until the 20th of March, so sign up now to benefit from the special discounted prices. Our own search specialists, together with many of the contributors from the community site SearchWorkings, will be present.

So that’s all for now folks, we hope you have enjoyed the update.

Query time joining in Lucene

January 22nd, 2012
(http://blog.trifork.com/2012/01/22/query-time-joining-in-lucene/)

Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.

Let’s say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.

[Image: article_table]

In a relational database it would look something like the image above.
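
To make this concrete, here is a minimal sketch of how an article and one of its comments could be indexed as separate documents (the Lucene 3.x-style Field API is assumed; the field names and id values follow the join example later in this post, the text values are made up, and articleWriter and commentWriter are already opened IndexWriters, possibly on two different indexes):

Document article = new Document();
article.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
article.add(new Field("title", "byte norms are gone", Field.Store.YES, Field.Index.ANALYZED));
articleWriter.addDocument(article);

Document comment = new Document();
comment.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
// The join field: a comment points to the article it belongs to
comment.add(new Field("article_id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
comment.add(new Field("content", "Nice article!", Field.Store.YES, Field.Index.ANALYZED));
commentWriter.addDocument(comment);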

Query time joining has been around in Solr for quite a while. It’s a really useful feature if you want to search with a relational flavor. Prior to query time joining your index needed to be prepared in a specific way in order to search across different types of data: you could either use Lucene’s index time block join or merge your domain objects into one Lucene document. With the join query, however, you can store different entities as separate documents, which gives you more flexibility but comes with a runtime cost.

Query time joining in Lucene is pretty straightforward, and entirely encapsulated in JoinUtil.createJoinQuery. It requires the following arguments:

  1. fromField. The from field to join from.
  2. toField. The to field to join to.
  3. fromQuery. The query executed to collect the from terms. This is usually the user specified query.
  4. fromSearcher. The searcher on which the fromQuery is executed.
  5. multipleValuesPerDocument. Whether the fromField contains more than one value per document (multivalued field). If this option is set to false the from terms can be collected in a more efficient manner.

The static join method returns a query that can be executed on an IndexSearcher to retrieve all documents that have terms in the toField matching the collected from terms. Only the entry point for joining is exposed to the user; the actual implementation is completely hidden, allowing Lucene committers to change the implementation without breaking API backwards compatibility.

Query time joining is based on indexed terms and is currently implemented as a two-pass search. The first pass collects all the terms from the fromField (in our case the article identifier field) that match the fromQuery. The second pass returns all documents that have terms in the toField (in our case the article identifier field in a comment document) matching the terms collected in the first pass.

The query that is returned from the static join method can also be executed on a different IndexSearcher than the one used as an argument in the static join method. This flexibility allows you to join data from different indexes, provided that the toField exists in that index. In our example this means that the article and comment data can reside in two different indices. The article index might not change very often, but the comment index might; this allows you to fine-tune each index to its own needs.

Let’s see how one can use query time joining! Assuming that we have indexed the content shown in the image above, let’s search for the comments whose article has ‘byte norms’ in its title:

 IndexSearcher articleSearcher = ...
IndexSearcher commentSearcher = ...
String fromField = "id";
boolean multipleValuesPerDocument = false;
String toField = "article_id";
// This query should yield article with id 2 as result
BooleanQuery fromQuery = new BooleanQuery();
fromQuery.add(new TermQuery(new Term("title", "byte")), BooleanClause.Occur.MUST);
fromQuery.add(new TermQuery(new Term("title", "norms")), BooleanClause.Occur.MUST);
Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, articleSearcher);
TopDocs topDocs = commentSearcher.search(joinQuery, 10);

If you were to run the above code snippet, topDocs would contain one hit. This hit would refer to the Lucene id of the comment that has value 1 in the field named “id”. Instead of seeing the article as a result, you get the comment that matches the article that matches the user’s query.

You could also turn the example around and retrieve all articles that match a certain comment query. In this example multipleValuesPerDocument is set to false and the fromField (the id field) only contains one value per document. The example would still work if the multipleValuesPerDocument variable were set to true, but it would then work in a less efficient manner.

Query time joining isn’t finished yet. There is still work to do and we encourage you to help out!

  1. Query time joining that uses doc values instead of the terms in the index. During text analysis the original text is in many cases changed; it might happen that your id is omitted or modified before it is added to the index. As you might expect, this can result in unexpected behaviour during searching. A common work-around is to add an extra field to your index that doesn’t do text analysis, but this just adds a logical field that doesn’t actually add meaning to your index. With doc values you wouldn’t need an extra logical field and the values wouldn’t be analysed.
  2. More sophisticated caching. Currently not much caching happens. Documents that are frequently joined, because the fromTerm is hit often, aren’t cached at all.

Query time joining is quite straightforward to use and provides a solution for searching through relational data. As described, there are other ways of achieving this. How did you solve your relational requirements in your Lucene based search solution? Let us know and share your experiences and approaches!

Berlin Buzzwords 2012

January 11th, 2012
(http://blog.trifork.com/2012/01/11/berlin-buzzwords-2012/)

Yes, Berlin Buzzwords is back on the 4th & 5th of June 2012! This really is the only conference for developers and users of open source software projects focusing on the issues of scalable search, data analysis in the cloud and NoSQL databases. All the talks and presentations are specific to three tags: “search”, “store” and “scale”.

Looking back at last year, the event had a great turnout. There were well over 440 attendees, of which 130 were internationals (from all over, including Israel, the US, the UK, the Netherlands, Italy, Spain, Austria and more), and an impressive line-up of 48 speakers. It was a 2-day event covering 3 tracks of high quality talks, surrounded by 5 days of workshops, 10 evening events for attendees to mingle with locals, and specialized training opportunities; and these are just a few of the activities that were on offer!

What was the outcome? Well let the feedback from some of the delegates tell the story:

“Buzzwords was awesome. A lot of great technical speakers, plenty of interesting attendees and friends, lots of food and fun beer gardens in the evening. I can’t wait until next year!“

“I can’t recommend this conference enough. Top industry speakers, top developers and fantastic organization. Mark this event on your sponsoring calendar!“

“Berlin Buzzwords is by far one of the best conferences around if you care about search, distributed systems, and nosql…“

“Thanks for organizing. My goal was to learn and I learned a lot!“

So to get the ball rolling for this year the call for papers has now officially opened via the website.

You can submit talks on the following topics:

  •  IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  •  NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  •  Hadoop – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are also more than welcome, I’ve been told. What they are looking for are presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.

What’s more, this year there is once again an impressive Program Committee, consisting of:

  • Isabel Drost (Nokia, Apache Mahout)
  • Jan Lehnardt (CouchBase, Apache CouchDB)
  • Simon Willnauer (SearchWorkings, Apache Lucene)
  • Grant Ingersoll (Lucid Imagination, Apache Lucene)
  • Owen O’Malley (Hortonworks Inc., Apache Hadoop)
  • Jim Webber (Neo Technology, Neo4j)
  • Sean Treadway (Soundcloud)

For more information, submission details and deadlines visit the conference website.

I am truly looking forward to this event, hope to see you there too!