Trifork Blog


Posts Tagged ‘Search’

Goodbye SearchWorkings.org

December 3rd, 2013 (http://blog.trifork.com/2013/12/03/goodbye-searchworkings-org/)

In 2011 we launched SearchWorkings.org, a community website that aimed to bring search professionals together, mostly around open source search technologies like Apache Lucene and Apache Solr. At the time, the number of resources providing high-value content around those technologies was limited. Therefore, we created the SearchWorkings portal, providing blog entries, white papers and a forum. Next to JTeam's own search experts (Simon Willnauer, Uri Boness, Martijn van Groningen, Chris Male, Luca Cavanna and Frank Scholten), we also managed to get several external contributors onboard (Isabel Drost, Chris Mattmann, Mike McCandless, Uwe Schindler, Marc Sturlese, Anne Veling, Dawid Weiss and Karl Wright).


Migrating Apache Solr to Elasticsearch

January 29th, 2013 (http://blog.trifork.com/2013/01/29/migrating-apache-solr-to-elasticsearch/)

Elasticsearch is the innovative and advanced open source distributed search engine, based on Apache Lucene. Over the past several years, at Trifork we have been doing a lot of search implementations. Driven by the fact that every other customer wanted the 'Google experience' (just a text box, type some text and get relevant results) as part of their application, we started by building our own solutions on top of Apache Lucene. That worked quite well, as Lucene is the de facto standard when it comes to information retrieval. But soon enough, inspired by the likes of Amazon, CNET and Funda in the Netherlands, people wanted to offer their users more ways to drill down into the search results using facets. We briefly started our own (since discontinued) open source project, FacetSearch, but Solr quickly started getting traction and we decided to jump on that bandwagon.

Starting with Solr

So it was then that we started using Solr for our projects and became vocal about our capabilities, which led to even more (international) Solr consultancy and training work. And as Trifork is not in the game to just use open source but also to contribute back to the community, this led to several contributions (spatial search, result grouping, etc.) and eventually to having several committers on the Lucene (now including Solr) project.

We go back a long way…

At the same time we were well into Solr, Shay Banon, who we knew from our SpringSource days, started creating his own scalable search solution, Elasticsearch. Although it was, from a technical perspective, a better choice for building scalable search solutions, we didn't adopt it from the beginning. The main reason was that it was basically a one-man show (a very good one at that, I might add!), and we didn't feel comfortable recommending Elasticsearch to our customers: if Shay got hit by a bus, it would mean the end of the project. Luckily, all this changed when Shay and some of the old crew from JTeam (the rest of JTeam is now Trifork Amsterdam) decided to join forces and launch Elasticsearch.com, the commercial company behind Elasticsearch. Now it's all systems go: what was then our main hurdle has been removed, so we can use Elasticsearch and, moreover, guarantee continuity for the project.

Switching from Solr to Elasticsearch

Obviously we are not alone in the world and not that unique in our opinions, so we were not the only ones to change our strategy around search solutions. Many others started considering Elasticsearch, doing comparisons and eventually switching from Solr to Elasticsearch. We still regularly get requests to help companies make the comparison. And although there are still reasons why you may want to go for Solr, in the majority of cases (especially when scalability and real-time search are important) the balance more often than not tips in favor of Elasticsearch.

This is why Luca Cavanna from Trifork has written a plugin (a river) for Elasticsearch that will help you migrate from your existing Solr to Elasticsearch. Basically, it pulls the content from an existing Solr cluster and indexes it in Elasticsearch. Using this plugin allows you to easily set up an Elasticsearch cluster next to your existing Solr, which helps you get up to speed quickly and enables a smooth transition. Obviously, this tool is mostly meant for that purpose: to help you get started. When you decide to switch to Elasticsearch permanently, you would obviously index content from your sources directly into Elasticsearch; keeping Solr in the middle is not a recommended setup.
The following description of how to use it is taken from the README.md file of the Solr to Elasticsearch river / plugin.

Getting started

The first thing you need to do is download the plugin.

Then create a directory called solr-river in the plugins folder of Elasticsearch (creating the plugins folder in the Elasticsearch home directory first, if it does not exist yet). Next, unzip the ZIP file and put its contents (all the JAR files) into the newly created folder.
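For example, assuming Elasticsearch lives in $ES_HOME and the downloaded archive is called elasticsearch-river-solr.zip (both names are illustrative, not taken from the README), the installation boils down to something like:

# paths and archive name are illustrative
mkdir -p $ES_HOME/plugins/solr-river
unzip elasticsearch-river-solr.zip -d $ES_HOME/plugins/solr-river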

Configure the river

The Solr river allows you to query a running Solr instance and index the returned documents in Elasticsearch. It uses the SolrJ library to communicate with Solr.

It's recommended that the SolrJ version used matches the Solr version installed on the server that the river is querying. The SolrJ version distributed with the plugin is 3.6.1. It is still possible to query other Solr versions, though: the default response format is javabin, but you can work around compatibility issues by simply switching to the xml format using the wt parameter.

All the common query parameters are supported.

The Solr river is not meant to keep Solr and Elasticsearch in sync; that's why it automatically deletes itself on completion, so that it doesn't start up again at every node restart. This is the default behaviour, which can be disabled through the close_on_completion parameter.

Installation

Here is how you can easily create the river and index data from Solr, providing just the Solr URL and the query to execute:

curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8080/solr/",
        "q" : "*:*"
    }
}'

All supported parameters are optional. The following example request contains all the parameters that are supported together with the corresponding default values applied when not present.

{
    "type" : "solr",
    "close_on_completion" : "true",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "fq" : "",
        "fl" : "",
        "wt" : "javabin",
        "qt" : "",
        "uniqueKey" : "id",
        "rows" : 10
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "bulk_size" : 100,
        "max_concurrent_bulk" : 10,
        "mapping" : "",
        "settings": ""
    }
}

The fq and fl parameters can be provided as either an array or a single value.
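For example, the following request (the field names in fq are made up for illustration) registers a river that imports only a filtered subset of documents from a Solr instance, switches to the xml response format for compatibility with older Solr versions, and uses larger pages and bulk requests:

curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "fq" : ["published:true", "language:en"],
        "wt" : "xml",
        "rows" : 500
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "bulk_size" : 500
    }
}'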

You can provide your own mapping while creating the river, as well as the index settings, which will be used when creating the new index if needed.

The index is created if it does not already exist; otherwise, the documents are added to the existing index with the configured name.

The documents are indexed using the bulk API. You can control the size of each bulk request (default 100) and the maximum number of concurrent bulk requests (default 10). Once that limit is reached, indexing slows down, waiting for one of the outstanding bulk requests to finish; no documents will be lost.

Limitations

  • only stored fields can be retrieved from Solr, and therefore only those get indexed in elasticsearch
  • the river is not meant to keep elasticsearch in sync with Solr, but only to import data once. It's possible to register the river multiple times in order to import different sets of documents though, even from different Solr instances.
  • it's recommended to create the mapping based on the existing Solr schema in order to apply the correct text analysis while importing the documents (see the sketch below). In the future there might be an option to auto-generate it from the Solr schema.
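One way to do that, sketched here with the standard Elasticsearch index-creation API (the title field and the snowball analyzer are just examples, not part of the plugin), is to create the target index with the desired mapping before registering the river; since the river only creates the index when it does not exist yet, it will reuse yours:

curl -XPUT localhost:9200/solr -d '
{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "import" : {
            "properties" : {
                "title" : { "type" : "string", "analyzer" : "snowball" }
            }
        }
    }
}'

The index name solr and type import match the river defaults shown above, so the imported documents end up in this pre-created index.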

We hope the tool helps; do share your feedback with us, as we're always interested to hear how it worked out for you, and shout if we can help further with training or consultancy.

Summer time…

September 4th, 2012 (http://blog.trifork.com/2012/09/04/summer-time/)

For those of you who may have missed our newsletter last week, I'd like to take this opportunity to give you a quick lowdown of what we've been up to. The summer months have been far from quiet and I'm pretty excited to share in this month's edition lots of news on projects, products & upcoming events.

Hippo & Orange11

The countdown has begun for the launch of the University of Amsterdam online platform. Built by Orange11 using Hippo CMS, the website, developed with multi-platform functionality in mind, is a masterpiece of technology all woven together. We'll keep you posted about the tips & tricks we implemented.

If you can’t wait until then and want more information, contact Jan-Willem van Roekel.

Mobile Apps; just part of the service!

We mentioned in our last newsletter the launch of Learn to write with Tracy. Well, since then we've been working on apps for many customers, including for example The New Motion, a company dedicated to the use of electric vehicles. Orange11 has developed an iPhone app that allows users to view or search charging locations (in list or map form) and even check their real-time availability.

Another example is the app for GeriMedica: Ysis Mobiel, a mobile addition to their existing Electronic Health Record system used largely in geriatric care (also an Orange11 project). The mobile app supports specific work processes and allows registered users to document (in line with the strict regulations) all patient-related interaction through a simple 3-step logging process. A registration overview screen also shows the latest activities registered, which prevents co-workers from accidentally registering the same activity twice.

Visit our website for more on our mobile expertise.

Orange11 & MongoDB 


We've got tons of exciting things going on with MongoDB as a trusted implementation partner, so here are a few highlights:

Brown bag sessions

Since the launch of our brown bag sessions we're excited that so many companies are interested in finding out more about this innovative open source document database. What we offer is a 60-minute slot with an Orange11 & MongoDB expert, who can explain and demonstrate MongoDB best practices & cover how it can be used in practice. It's our sneak preview of the host of opportunities there are with MongoDB.

Sign up now!

Tech meeting / User Group Meeting

As a partner we're also proud to host the next user group session on Thursday 6th September, where Chief Technical Director Alvin Richards will be here to cover all the product ins & outs and share some use cases.

Don't miss out & join us; as always it's free & pizza and cold beer are on the house!

Coffee, Cookies, Conversation & Customers

Last week we invited some of our customers to a brainstorm session around the new cookie law in the Netherlands. Together with Eric Verhelst, a lawyer specialized in the IT industry, intellectual property, internet and privacy, we provided our customers with legal insight and discussed their concerns & ideas around solutions. If you have any questions around the new cookie law and are looking for advice, answers & solutions, contact Peter Meijer.

ElasticSearch has just got bigger

Congratulations to our former CEO, Steven Schuurman, who announced his new venture: ElasticSearch, the company. The company's product, elasticsearch, is an innovative and advanced open source distributed search engine. With Steven joining forces with elasticsearch founder & originator Shay Banon, and given his background as co-founder of SpringSource, the company behind the popular Spring Framework (also close to our heart at Orange11), it's bound to be a great success. The company offers users and potential users of elasticsearch a definitive source for support, education and guidance with respect to developing, deploying and running elasticsearch in production environments. As search remains a key focus area for Orange11, with our experience in both Solr and elasticsearch, our customers are guaranteed the best search solution available. For more info contact Bram Smeets.

Our team is getting bigger & better

We're happy to welcome Michel Vermeulen to the team this month. Michel is an experienced Project Manager and will further professionalize our agile development organization. We also have new talent starting next month, BUT there is room for more.

So if you're a developer and want to work on great projects with a fun team (left: a snapshot from our company beach event), then call Bram Smeets now.

That’s all for now folks….

There’s More Lucene in Solr than You Think!

April 11th, 2012 (http://blog.trifork.com/2012/04/11/theres-more-lucene-in-solr-than-you-think/)

We've been providing Lucene & Solr consultancy and training services for quite a few years now and it's always interesting to see how these two technologies are perceived by different companies and their technical people. More precisely, I find it interesting how little Solr users know about Lucene and, more so, how unaware they are of how important it is to know about it. A quite recurring pattern we notice is that companies, looking for a cheap and good search solution, hear about Solr and decide to download and play around with it a bit. This is usually done within the context of a small PoC to eliminate initial investment risks. So one or two technical people are made responsible for that; they download the Solr distribution and start following the Solr tutorial that is published on the Solr website. They realize that it's quite easy to get things up and running using the examples Solr ships with and very quickly decide that this is the right way to go. So what do they do next? They take their PoC codebase (including all Solr configurations), slightly modify and extend it just to support their real systems, and in no time they get to the point where Solr can index all the data and serve search requests. And that's it… they roll out with it, and very often just put this in production. It is then often the case that after a couple of weeks we get a phone call from them asking for help. And why is that?

Examples are what they are – Just examples

I have always argued that the examples bundled in the Solr distribution are a double-edged sword. On the one hand, they can be very useful to showcase how Solr can work and provide a good reference for the different setups it can have. On the other hand, they give a false sense of security: if the example configurations are good enough for the examples, they'll be good enough for other systems in production as well. In reality, this is of course far from being the case. The examples are just what they are – examples. It's most likely that they are far from anything you'd need to support your search requirements. Take the Solr schema for example: this is one of the most important configuration files in Solr and contributes many of the factors that will influence search quality. Sure, there are certain field types you can probably always use (the primitive types), but when it comes to text fields and the text analysis process – this is something you need to look at more closely and in most cases customize to your needs. Beyond that, it's also important to understand how different fields behave with respect to the different search functionality you need. What roles (if any) can a field play in the context of these functionalities? For some functionality (e.g. free text search) you need the fields to be analyzed, for others (e.g. faceting) you don't. You need to have a very clear idea of the search functionality you want to support, and based on that, define what normal/dynamic/copy fields should be configured. The example configurations don't provide you with this insight, as they target the dummy data and the example functionality they are meant to showcase – not yours! And it's not just about the schema: the solrconfig.xml in the examples is also far more verbose than you actually need or want it to be. Far too many companies just use these example configurations in their production environment and I find that a pity. Personally, I like to view these configuration files as also serving as a sort of documentation for your search solution – but by keeping them in a mess, full of useless information and redundant configuration, they obviously cannot.

It’s Lucene – not Solr

One of the greater misconceptions about Solr is that it's a product on its own and that by reading the user manual (which is an overstatement for a semi-structured and messy collection of wiki pages), one can just set it up and put it in production. What people fail to realize is that Solr is essentially just a service wrapper around Lucene, and that the quality of the search solution you're building largely depends on it. Yeah, sure… Solr provides important additions on top of Lucene, like caching and a few enhanced query features (e.g. function queries and the dismax query parser), but the bottom line is that the most influential factors of search quality lie deep down in the schema definition, which essentially determines how Lucene will work under the hood. This obviously requires a proper understanding of Lucene… there's just no way around it! But honestly, I can't really “blame” users for getting this wrong. If you look at the public (open and commercial) resources that companies are selling to users, they actually promote this ignorance by presenting Solr as a product that stands on its own. Books, public trainings and open documentation hardly discuss Lucene in detail and instead focus on “how you get Solr to do X, Y, Z”. I find it quite a shame and actually quite misleading. You know what? I truly believe that users are smart enough to figure out – on their own – what parameters they should send Solr to enable faceting on a specific field… come on… these are just request parameters, so let them figure these things out. Instead, I find it much more informative and important to explain to them how faceting actually works under the hood. This way they understand the impact of their actions and configurations and are not left disoriented in the dark once things don't work as they'd hoped. For this reason, we designed our Solr training to incorporate a relatively large portion of Lucene introduction in it. And take it from me… our feedback clearly indicates that users really appreciate it!

So…

There you have it… let it sink in: when downloading Solr, you're also downloading Lucene. When configuring Solr, you're also configuring Lucene. And if there are issues with Solr, they are often related to Lucene as well. So to really know Solr, do yourself a favor and start getting to know Lucene! And you don't need to be a Java developer for that; it's not the code itself that you need to master. Understanding how Lucene works internally, on a detailed yet conceptual level, should be more than enough for most users.

April Newsletter

April 4th, 2012 (http://blog.trifork.com/2012/04/04/april-newsletter/)

Spring is here and hopefully the longer days mean we can pack them full of great things to do in work & play! This month’s issue is a quick news flash on some things we have planned & on offer for the coming days, weeks & months and hopefully you can join us at some of these events.

2 days to go…

…until our monthly (free) Tech Meeting which is on Thursday 5th April 2012, served as always with ice cold beer & pizza. On the program this month are the following sessions:

– Apache HTTP: even if this project doesn't need an introduction anymore, to celebrate its 17th birthday (and the recently released version 2.4) we would like to invite you to a presentation of the Apache HTTP server and some of its most used modules.

– Insight into Clojure, including syntax & data structures, a common interface to rule them all: sequences, code as data for a programmable programming language (macros) and much more.

Sign up here.

ElasticSearch at GOTO night

On April 19th Orange11 & Trifork will host yet another GOTO night at Pakhuis de Zwijger. Our last event was a great success; it attracted over 60 attendees and the feedback was very positive.

This time we hope for just as much interest, especially since we are lucky to have lined up Shay Banon, founder of ElasticSearch, who will host a full hands-on session, no slides, driven by real-life usage of ElasticSearch.

Our second speaker will be announced later this week. There are limited spaces so make sure you register your interest.

Sign up now!

Training session at Berlin Buzzwords

Berlin Buzzwords is the event that focuses on scalable search, data analysis in the cloud and NoSQL databases. With more than 30 talks and presentations by international speakers specific to the three tags “search”, “store” and “scale”, registrations are storming in this year.

Once again this year we will offer training opportunities in the two days following the event (6th & 7th June). By popular demand we will host a special Lucene & Solr training in a location very close to Urania in Berlin.

For all Berlin Buzzwords delegates we offer a EUR 300 discount, so for more information & registration check out our website now. Discount code: berlinbuzzwordsvip.

GOTO discount for Orange11 blog readers


The GOTO event this year promises to be even bigger & better; the new location of the Beurs van Berlage (Stock Exchange) is a highlight in itself. As for the top-notch speakers, they include some well-known names in the industry, including Simon Brown, the founder of Coding the Architecture, and Greg Young, co-founder and CTO of IMIS, a stock market analytics firm in Vancouver BC.

Readers that sign up now can enjoy a further EUR 75 discount off the conference price. Use the discount code orange11vip when signing up.

Also don't forget the price goes up every day, but you can freeze the price the moment you register your interest.

Special training session prior to GOTO Amsterdam

The same Lucene & Solr training we offer above in Berlin will also be available in Amsterdam prior to GOTO Amsterdam. GOTO delegates can also enjoy a EUR 300 discount so for more information & registration check out our website now. Discount code gotovip.

This is a PUBLIC TRAINING, so it is also open to non-GOTO attendees.

Click here for more information.

Don’t miss out on our Spring special offer


We mentioned last month that we have launched a special offer for onsite Solr & Lucene training. The Spring Sale offers 25% off a 2-day training given by our own active and leading Lucene & Solr committers and contributors. The training first covers how the Apache Lucene engine operates and thereafter introduces Solr in detail. It also includes advanced search functionality, data indexing and, last but not least, performance tuning and scalability.

It’s already proved to be very popular, but remember the offer is limited to the month of April so make sure you sign up now via our website.

Interesting reads…

So if you have any time left over after all the events, our earlier blog posts have also proved a popular read; they covered:

Using the spring-data project and the mongodb adapter specifically

Vaadin portlets with add-ons in Liferay

Spring Insight

So that's all for now folks, more in the month of May.

March newsletter

March 14th, 2012 (http://blog.trifork.com/2012/03/14/march-newsletter/)

This month our newsletter is packed full of news and event highlights so happy reading…

Spring Special offer


The sun is shining and spring is in the air, and for that very reason we have launched a special offer for onsite Solr & Lucene training. Our Spring Sale offers 25% off a 2-day training given by our own active and leading Lucene & Solr committers and contributors. The training first covers how the Apache Lucene engine operates and thereafter introduces Solr in detail. It also covers advanced search functionality, data indexing and, last but not least, performance tuning and scalability. For more information, terms & registration visit our website.

Digital assessments using the QTI delivery engine

Perhaps you read in one of our recent blog entries that we are innovating the world of digital assessments. For many working in digital assessments / examinations, the QTI standard may not be a new phenomenon; it’s been around for a while. The interesting part is how it can be used. Orange11 is currently implementing a QTI assessment delivery engine that is opening new possibilities in digital examinations, assessments, publishing & many more areas. We’re currently busy preparing an interesting demo that will be available online in the coming weeks. However, in the meantime if you want to know more about the standard and technology and how we have implemented it, just drop us a note with your contact details and we can set up a meeting or send you more information.

GOTO Amsterdam


Come on, sign up. We're already very excited and busy preparing for the event, and we anticipate that this year is going to be even bigger & better than last year. The new location of the Beurs van Berlage (Stock Exchange) building is a highlight in itself. As for the top-notch speakers, they include some well-known names in the industry, including Trisha Gee from LMAX & Peter Hilton from Lunatech. Our keynote sessions also look very promising and include sessions by John-Henry Harris from LEGO and Peter-Paul Koch covering The Future of the Mobile Web.

Registration is open and prices go up every day so don’t miss out and sign up now.

Our Apache Whirr wizard

Frank Scholten, one of our Java developers, has been voted in as a committer on Apache Whirr. Whirr is a Java library for quickly setting up services in the cloud. For example, using Whirr you can start a Hadoop cluster on Amazon EC2 in 5 minutes via the whirr command-line tool or its Java API. Whirr can also be used in combination with Puppet to automatically install and configure servers.

Frank has been active in using Apache Whirr and has also contributed his insights to the community site SearchWorkings.org where he has most recently written the blog Mahout support in Whirr. We are proud of his contributions and if you have any specific Apache Whirr question let us know.

Tech meeting 5th April (Amsterdam)

This month:

– Apache HTTP: even if this project doesn't need an introduction anymore, to celebrate its 17th birthday (and the recently released version 2.4) we would like to invite you to a presentation of the Apache HTTP server and some of its most used modules.

– Insight into Clojure, including syntax & data structures, a common interface to rule them all: sequences, code as data for a programmable programming language (macros) and much more.

Sign up now!
Don't worry for those not in & around Amsterdam: slides will be available afterwards via our website.

Join our search specialists at…

Berlin Buzzwords: the event that focuses on scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers specific to the three tags “search”, “store” and “scale”. The early bird tickets are available until 20th March, so sign up now to benefit from the special discounted prices. Our own search specialists, together with many of the contributors from the community site SearchWorkings, will be present.

So that's all for now folks, hope you have enjoyed the update.

Query time joining in Lucene

January 22nd, 2012 (http://blog.trifork.com/2012/01/22/query-time-joining-in-lucene/)

Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.

Let's say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment has a field containing the article identifier.

[Image: article_table – articles and comments as relational database tables]

In a relational database it would look something like the image above.

Query time joining has been around in Solr for quite a while. It's a really useful feature if you want to search with a relational flavor. Prior to the query time join, your index needed to be prepared in a specific way in order to search across different types of data. You could either use Lucene's index time block join or merge your domain objects into one Lucene document. However, with the join query you can store different entities as separate documents, which gives you more flexibility but comes with a runtime cost.

Query time joining in Lucene is pretty straightforward, and entirely encapsulated in JoinUtil.createJoinQuery. It requires the following arguments:

  1. fromField. The field to join from.
  2. toField. The field to join to.
  3. fromQuery. The query executed to collect the from terms. This is usually the user-specified query.
  4. fromSearcher. The searcher on which the fromQuery is executed.
  5. multipleValuesPerDocument. Whether the fromField contains more than one value per document (a multivalued field). If this option is set to false, the from terms can be collected in a more efficient manner.

The static join method returns a query that can be executed on an IndexSearcher to retrieve all documents that have terms in the toField matching the collected from terms. Only the entry point for joining is exposed to the user; the actual implementation is completely hidden, allowing Lucene committers to change the implementation without breaking backwards compatibility of the API.

Query time joining is based on indexed terms and is currently implemented as a two-pass search. The first pass collects all the terms from the fromField (in our case the article identifier field) that match the fromQuery. The second pass returns all documents that have terms in the toField (in our case the article identifier field in a comment document) matching the terms collected in the first pass.

The query that is returned from the static join method can also be executed on a different IndexSearcher than the one passed as an argument to the static join method. This flexibility allows you to join data from different indexes, provided that the toField exists in that index. In our example this means the article and comment data can reside in two different indices. The article index might not change very often, but the comment index might. This allows you to fine-tune these indexes to their specific needs.

Let's see how one can use query time joining! Assuming we have indexed the content shown in the image above, let's search for the comments whose article has 'byte norms' in its title:

 IndexSearcher articleSearcher = ...
IndexSearcher commentSearcher = ...
String fromField = "id";
boolean multipleValuesPerDocument = false;
String toField = "article_id";
// This query should yield article with id 2 as result
BooleanQuery fromQuery = new BooleanQuery();
fromQuery.add(new TermQuery(new Term("title", "byte")), BooleanClause.Occur.MUST);
fromQuery.add(new TermQuery(new Term("title", "norms")), BooleanClause.Occur.MUST);
Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, articleSearcher);
TopDocs topDocs = commentSearcher.search(joinQuery, 10);

If you run the above code snippet, topDocs contains one hit. This hit refers to the Lucene document of the comment that has value 1 in the field named “id”. Instead of seeing the article as the result, you get the comment that matches the article that matches the user's query.

You could also reverse the example and retrieve all articles that match a certain comment query, as sketched below. In this example multipleValuesPerDocument is set to false, since the fromField (the id field) only contains one value per document. The example would still work if the multipleValuesPerDocument variable were set to true, but it would then run in a less efficient manner.
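As a sketch of that reversed setup (reusing the article and comment searchers from above; the comment field queried here, “body”, is made up for illustration), the from and to fields simply swap roles:

// Find all articles that have at least one comment mentioning "rocks".
// "article_id" lives on the comment documents, "id" on the article documents.
String fromField = "article_id";
boolean multipleValuesPerDocument = false; // each comment refers to exactly one article
String toField = "id";
Query fromQuery = new TermQuery(new Term("body", "rocks")); // the comment query
Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, commentSearcher);
TopDocs matchingArticles = articleSearcher.search(joinQuery, 10);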

The query time joining isn't finished yet. There is still work to do and we encourage you to help out with it!

  1. Query time joining that uses doc values instead of the terms in the index. During text analysis the original text is in many cases changed. It might happen that your id is omitted or modified before it is added to the index. As you might expect, this can result in unexpected behaviour during searching. A common workaround is to add an extra field to your index on which no text analysis is performed. However, this just adds a logical field that doesn't actually add meaning to your index. With doc values you wouldn't need an extra logical field and the values aren't analysed.
  2. More sophisticated caching. Currently not much caching happens. Documents that are frequently joined, because the fromTerm is hit often, aren't cached at all.

Query time joining is quite straightforward to use and provides a solution for searching through relational data. As described, there are other ways of achieving this. How did you solve your relational requirements in your Lucene-based search solution? Let us know and share your experiences and approaches!

Berlin Buzzwords 2012

January 11th, 2012 (http://blog.trifork.com/2012/01/11/berlin-buzzwords-2012/)

Yes, Berlin Buzzwords is back on the 4th & 5th June 2012! This really is the only conference for developers and users of open source software projects focusing on the issues of scalable search, data analysis in the cloud and NoSQL databases. All the talks and presentations are specific to three tags: “search”, “store” and “scale”.

Looking back to last year, this event had a great turnout. There were well over 440 attendees, of which 130 were international (from all over, including Israel, the US, the UK, the Netherlands, Italy, Spain, Austria and more), and an impressive line-up of 48 speakers. It was a 2-day event covering 3 tracks of high quality talks, surrounded by 5 days of workshops, 10 evening events for attendees to mingle with locals and specialized training opportunities, and these are just a few of the activities that were on offer!

What was the outcome? Well let the feedback from some of the delegates tell the story:

“Buzzwords was awesome. A lot of great technical speakers, plenty of interesting attendees and friends, lots of food and fun beer gardens in the evening. I can’t wait until next year!“

“I can’t recommend this conference enough. Top industry speakers, top developers and fantastic organization. Mark this event on your sponsoring calendar!“

“Berlin Buzzwords is by far one of the best conferences around if you care about search, distributed systems, and nosql…“

“Thanks for organizing. My goal was to learn and I learned a lot!“

So to get the ball rolling for this year the call for papers has now officially opened via the website.

You can submit talks on the following topics:

  •  IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  •  NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  •  Hadoop – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are also more than welcome, I've been told. The requirements are presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.

What’s more this year there is once again an impressive Program Committee consisting of:

  • Isabel Drost (Nokia, Apache Mahout)
  • Jan Lehnardt (CouchBase, Apache CouchDB)
  • Simon Willnauer (SearchWorkings, Apache Lucene)
  • Grant Ingersoll (Lucid Imagination, Apache Lucene)
  • Owen O’Malley (Hortonworks Inc., Apache Hadoop)
  • Jim Webber (Neo Technology, Neo4j)
  • Sean Treadway (Soundcloud)

For more information, submission details and deadlines visit the conference website.

I am truly looking forward to this event, hope to see you there too!

The Lucene Sandbox

September 15th, 2011 (http://blog.trifork.com/2011/09/15/the-lucene-sandbox/)

Few people are aware that Apache Lucene has been part of the ASF since 2001, becoming a Top Level Project in 2005. 10 years is an eternity in IT, where ideas tend to evolve in leaps and bounds. Over that 10-year period many users, contributors and committers have come and gone from Lucene, each shaping in their own way what has become the de facto open source search library. But of course good ideas from 10 years ago are not necessarily so good today.

Contribs and Module Consolidation

Before I talk about the sandboxing process in more detail, it's perhaps best to understand what the Lucene codebase has been like until recently and, moreover, where we want to go.
For some time now, Lucene has had a section in its codebase known as contrib, which is home to code that is related to Lucene but, for whatever reason, has been deemed as not belonging in the core code. Examples of commonly used code that has been or remains part of the contrib are Lucene's Highlighters, its large array of analysis tools and the flexible QueryParser framework.
One of the unfortunate features of the contrib is that code quality and algorithmic efficiency vary greatly and can go unaddressed for many years. I recently saw a class where the last SVN commit was in 2005. I can assure you that many parts of Lucene have changed since then.
With Lucene 4 and the merger of Lucene and Solr's development, the idea was put forward that the analysis tools in the contrib and from around the rest of the codebase should be consolidated into a module alongside Lucene and Solr. Once consolidated, the code would be held to a high standard comparable to that of Lucene's core. With the success of this consolidation, it was decided that other concepts in Lucene should also be pulled together into modules. Two concepts that were immediately thrown around were Lucene's wide variety of QueryParsers and its many exotic Query implementations.

It works but it’s kind of sandy

Consolidating Lucene's QueryParsers was not such a problem. While they do suffer from their fair share of flaws, because they are so commonly used they are well tested and operate efficiently. The same cannot be said, however, about all the exotic Query implementations.
The problem confronting the consolidation of the Query implementations was not that they didn't work. If that had been the problem, then it would have been fine to remove them. The problem was that they did work, but were poorly written, documented or tested, or operated in inefficient ways. To put it bluntly, the code was not necessarily up to the standard that we expected for the new queries module.
Deleting code which exists in such a grey area would be a bold decision and could prove a mistake. It was obviously added for a reason and no doubt has some users. Some of the code could, with some effort, probably make its way to being module-worthy. Therefore the idea of a sandbox for Lucene was floated: a place of lesser standards where 'sandy' code which is not deemed module-worthy can go, so that it can continue to be used and maybe even improved, before the decision about its ultimate fate is made.

Sandy Code and The Future

So what code in Lucene is sandy? It's very hard to say right now and is definitely subjective. Long-deprecated or poorly written code is an obvious sign, but what about a lack of testing? As such, only the code which is in the sandbox can be called sandy for sure. However, as we continue to consolidate the various parts of Lucene into modules and slowly shut down the contrib, the sandbox will no doubt grow further.
Do you have any thoughts on what perhaps belongs in the sandbox? Do share them with us, it would be good to hear from you.

Hotspotting Lucene With Query Specialization

September 12th, 2011 (http://blog.trifork.com/2011/09/12/hotspotting-lucene-with-query-specialization/)

Most Java developers probably take the technology behind HotSpot™ for granted and assume it's best suited to optimize their code. While they might know that choosing the best implementation of List will have a big impact on their program's performance, they probably shy away from worrying about the cost of virtual method calls, believing HotSpot knows best.

With most applications and in most situations this belief is well-founded, and what I'm going to discuss is not trying to change the world view on HotSpot or encourage people to 'optimize prematurely'. Rather, I want to explore how we might aid HotSpot in producing the most optimal code for executing Lucene Querys and, in the process, increase query performance by over 100%.

Red Hot FunctionQuery Execution

One of the most popular Querys in Apache Solr and Lucene is FunctionQuery, which applies a function (known as a ValueSource) to every document in the index, returning the result of the function as the document’s score.  Rarely in Lucene’s Querys is logic applied to every document since Querys are responsible for matching (finding a relevant subset of documents) as well as scoring. Consequently, scoring logic is usually only applied to the matched subset. FunctionQuerys are not focused on matching, instead they manipulate the scores of documents in extensible ways.

To understand the problems this might cause, imagine you have an index with 100 million documents containing two fields which you wish to sum together and use as document scores. Even though the logic required to implement this is very simple, the fact that it will be executed 100 million times for every query means that it will glow hot – red hot.

As mentioned, FunctionQuery is designed to be extensible. It accepts implementations of ValueSource of which there are many in Lucene’s codebase. One such abstract implementation is MultiFloatFunction, itself accepting a list of ValueSources. MultiFloatFunction computes its function result by applying the logic implemented in the method func() to each of the results taken from its ValueSources. SumFloatFunction, a concrete implementation of MultiFloatFunction, adds sum logic to func().

The following UML class diagram illustrates the hierarchy of ValueSource classes necessary to add the values of two integer fields together.

Note: FieldValueSource is a ValueSource implementation that returns the value of a field for a document; IntFieldSource assumes the value is an integer.

Although this seems convoluted already, in practice this example is not too bad. The code boils down to two values being read from two different arrays and added together. However, if you wanted to add two field values together and then multiply the result by a third field value, you would need to add another MultiFloatFunction implementation (ProductFloatFunction) and another IntFieldSource.  Suddenly we have a large stack of objects with many layers of virtual calls that must be executed per document per Query just to execute the formula (A + B) * C.
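To make that concrete, here is roughly what assembling (A + B) * C out of ValueSources looks like (a sketch against the Solr 3.x function query classes; the field names a, b and c are made up, and exact package locations and constructors may differ per version):

// (A + B) * C expressed as a tree of ValueSources
ValueSource a = new IntFieldSource("a");
ValueSource b = new IntFieldSource("b");
ValueSource c = new IntFieldSource("c");
ValueSource aPlusB = new SumFloatFunction(new ValueSource[] { a, b });
ValueSource formula = new ProductFloatFunction(new ValueSource[] { aPlusB, c });
Query query = new FunctionQuery(formula); // scores every document as (a + b) * c

Evaluating the outer ProductFloatFunction for a document cascades down through this object tree, which is exactly the stack of virtual calls described above.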

Specialized FunctionQuerys

In a simple benchmark, I indexed 100,000 documents containing random integer values for 3 fields.  I then assembled the ValueSource required to execute A + B and computed the time required to execute 100,000 FunctionQuerys using the ValueSource. The result, on my underpowered laptop, was an average execution time of 3ms – not bad, not bad at all. However when I repeated the same test, this time using the ValueSource for (A + B) * C, the execution time jumped up to 25ms per query – still not bad, especially given the benchmarks were not on a dedicated server.

However, I wondered if part of the increase in cost was due to the extra levels of virtual method calls that needed to be made. To test this, I created a simple interface called Function (shown below) and created an implementation of FunctionQuery which used Functions instead of ValueSources.

public interface Function {
    float f(int doc);
    void setNextReader(IndexReader indexReader) throws IOException;
}

I then created an implementation to support the formula (A + B) * C as follows:

public final class ABCFunction implements Function {
    private int[] aVals, bVals, cVals;

    public float f(int doc) { return (aVals[doc] + bVals[doc]) * cVals[doc]; }

    public void setNextReader(IndexReader indexReader) throws IOException {
        // load the aVals, bVals and cVals arrays from the FieldCache for this reader
    }
}

Executed against the same index, the time per FunctionQuery dropped to just 13ms – a huge improvement.

What's the difference? Well, the above implementation does not use object composition. The only virtual method call executed is the one from FunctionQuery to the Function implementation. Even though the logic is the same as that executed using the ValueSources (3 values loaded from arrays, arithmetic applied), HotSpot is able to execute the simplified code, with its fewer method invocations, much more quickly.

Extensibility and Performance

While hopefully you are impressed by the performance improvement, you're also probably thinking “Well that's great Chris, but this is not extensible or reusable” and you would be right. ValueSource's composition architecture allows us to quickly assemble a wide variety of functions while hiding away much of the complexity (some ValueSources are very complex). They also lend themselves to being easily created while parsing a query string.

Yet it could be argued that the above code illustrates that if your search application uses a small set of known functions regularly, there is a considerable performance benefit to be gained by providing specialised implementations.  From my experience, most Solr applications use only a couple of functions, primarily composed of the rudimentary arithmetic variety.

ASTs and Executables

While toying with ValueSource and comparing it against my own ABCFunction, I couldn't help but notice the similarity between ValueSources and ASTs, and between ABCFunction and Java bytecode. When compiling source code, most compilers create an Abstract Syntax Tree (AST) representation of the code. ASTs are a very close representation of the original source code, but use typed objects and composition, instead of text, to represent code statements. The following is a UML diagram showing the AST javac builds to represent the statement int a = 10:

Having constructed an AST and after applying some optimizations, most compilers then generate the executable form of the source code. For Javac, this is Java Bytecode.

An assembled ValueSource is very much like an AST. It describes the function that is to be executed and, as mentioned, can be very easily constructed during query string parsing. But like an AST, it perhaps isn't the most efficient form for execution. Instead, it needs to be optimized (a topic for a future article) and then converted to an optimal execution format. ABCFunction can be seen as an example of such a format.

So how do we get from ValueSource to our optimal Function implementation?
Well, we could generate source code and use Javac to compile it for us, or we could generate bytecode ourselves, or we could even generate native code. However we do it, we are creating the foundations for a Query Compiler.

Query Compilation and Beyond

Hopefully I’ve tickled your interest in how we might go about using code specialisation to increase the performance of Hotspot’s executions of Query code. In a future article, I will extend my explorations to the execution of all Lucene’s Querys and explain in more detail how we might go about building a Query Compiler for Lucene.