So, as promised, here is the sequel to my previous post, Introducing the elasticshell. Let's start exactly where we left off...
What about search?
We of course need to search against the index we created. We can provide queries either as json documents or as the Java QueryBuilders that come with the elasticsearch Java API, which are exposed to the shell as-is.
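As a sketch of what the QueryBuilders look like outside the shell, the following Java fragment runs the same term query both ways; the index name, field and node address are made up for illustration, and it assumes the pre-1.x Java API of that era:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class QueryBuildersExample {
    public static void main(String[] args) {
        // Hypothetical node address; adjust to your setup
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        // The json form of this query would be: {"term": {"user": "kimchy"}}
        SearchResponse response = client.prepareSearch("blog")
                .setQuery(QueryBuilders.termQuery("user", "kimchy"))
                .execute().actionGet();
        System.out.println("Total hits: " + response.getHits().getTotalHits());
        client.close();
    }
}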
A few days ago I released the first beta version of the elasticshell, a shell for elasticsearch. The idea I had was to create a command line tool that allows you to easily interact with elasticsearch.
Isn't elasticsearch easy enough already?
I really do think elasticsearch is already great and really easy to use. On the other hand, there is quite a lot of API available and quite a lot of json involved too. Also, interacting with REST APIs requires a tool other than the browser in order to use the proper http methods and so on. There are different solutions available: some of them are generic, like curl or browser plugins, while others are elasticsearch plugins like head or sense, which you can use to send json requests and see the result, still in json format. What was missing is a command line tool, something that plays the role of the mongo shell in the elasticsearch world. That's ambitious, isn't it?
In the meantime the es2unix tool has been released by Drew, a member of the elasticsearch team. The interesting approach taken there is to hide all the json and show only text in a nice tabular format, providing an executable command whose output can be piped to other unix commands like grep, sort and awk. That's a great idea, and an even greater result I must say.
A json-friendly environment
At a recent Hippo meetup I gave a presentation about enterprise search. Being able to index and search your content, both in the Hippo CMS and in other sources, is of interest to many Hippo users. The presentation does not go into any Hippo specifics; it provides a brief introduction to search, Apache Lucene and concepts like the inverted index, then quickly moves on to the two main open source enterprise search servers: Apache Solr and Elasticsearch.
Check out my slides for this talk on SlideShare:
Up until now I have told you why I think elasticsearch is so cool and how you can use it combined with Spring. It's now time to get to something a little more technical. Once you have a search engine running you need to index data, and when it comes to indexing data you usually need to choose between the push and the pull approach. This blog entry will detail these approaches and go into writing a river plugin for elasticsearch.
Whenever there's a new product out there and you start using it, suggest it to customers or colleagues, you need to be prepared to answer this question: “Why should I use it?”. Well, the answer could be as simple as “Because it's cool!”, which of course is the case with elasticsearch, but at some point you may need to explain why. I recently had to answer the question “So what's so cool about elasticsearch?”, which is why I thought it might be worthwhile sharing my own answer in this blog.
Trifork has a long track record in doing projects, training and consulting around open source search technologies. Currently we are working on several interesting search projects using elasticsearch. Elasticsearch is an open source, distributed, RESTful search engine built on top of Apache Lucene. In contrast to, for instance, Apache Solr, elasticsearch was built as a highly scalable distributed system from the ground up, allowing you to shard and replicate multiple indices over a large number of nodes. This architecture makes scaling from one server to several hundred a breeze. It turns out elasticsearch is not only good for what everyone calls “Big Data”: it is also very well suited for indexing small amounts of documents, and even for running embedded within an application, while still providing the flexibility to scale up later when needed.
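As a minimal sketch of the embedded scenario, assuming the Java API of that era (where nodeBuilder was the standard entry point), a local node can be started inside the JVM like this; the index and type names are made up:

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class EmbeddedExample {
    public static void main(String[] args) {
        // Start a node confined to this JVM (no network discovery)
        Node node = nodeBuilder().local(true).node();
        Client client = node.client();
        // Index a trivial document; "articles"/"article" are hypothetical names
        client.prepareIndex("articles", "article", "1")
              .setSource("{\"title\": \"hello\"}")
              .execute().actionGet();
        node.close();
    }
}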
Nowadays almost every website has a full text search box, as well as an auto-suggestion feature that helps users find what they are looking for by typing as few characters as possible. Google's search box is the classic example of this feature: it progressively suggests how to complete the current word and/or phrase, and corrects typos. That's a meaningful example, combining multi-term suggestions based on the most popular queries with spelling correction.
There are different ways to implement auto-complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is to compare the available options in order to identify the best solution for your needs, rather than describe any one specific approach in depth.
It's common practice to make auto-suggestions based on the indexed data. A user is usually looking for something that can be found within the index, which is why we'd like to show the words that are similar to the current query and at the same time relevant within the index. It can also be useful to provide query suggestions: we can, for example, capture all the user queries that return more than zero results, index them on a dedicated Solr core, and use that information to make auto-suggestions as well. What actually matters is that we are going to make suggestions based on what's inside an index; for this purpose it doesn't matter whether the index contains user queries or “normal data”, since the solutions we are going to consider can be applied in both cases.
Some questions before starting
In order to make the right choice you should first of all ask yourself some questions:
- Which Solr version are you working with? If you're working with an old version (1.x for example) it is worth an upgrade. If you can't upgrade you'll probably have fewer options to choose from, unless you're willing to manually apply some patches.
- Do you want to make single-term or multi-term suggestions? You basically need to decide whether you want to suggest single words that complete the word the user has partially typed, or whole sentences.
- Do you want to filter the suggestions based on the actual search? The user may have previously selected a facet entry, narrowing the results to a specific subset. Every subsequent search should match that specific context, so it is common practice to have the auto-suggestions reflect the user's filters. Unfortunately, some of the available solutions don't support filtering at all.
- How do you want to sort the auto-suggestions? It's important to show the best suggestions on top, and each solution you are going to explore offers different sorting options.
- Do you want to make auto-suggestions based on multivalued fields? Multivalued fields are commonly used for tags, for example, since every document can have more than one tag, and you may want to suggest a tag while the user is typing it.
- Do you want to make auto-suggestions based on prefix queries only, or on infix queries too? While it's always possible to suggest words starting with a given prefix, not all the solutions are able to suggest words that merely contain the typed text.
- What's the impact of each solution in terms of performance and index size? The answer depends on the index you're working with and needs to take into account that some solutions can increase the index size, while all of them will affect performance.
Faceting using the prefix parameter
The first option has been available since Solr 1.2 and is based on faceting: using the facet.prefix parameter, the facet includes only the terms starting with the prefix the user has partially typed. This solution works only for single-term suggestions starting with a particular prefix (not infix), and you can sort results only alphabetically or by count. It works with multivalued fields too, and it is possible to apply filter queries so that the suggestions reflect the current context of the search.
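As a sketch of such a request through Solrj (the field name, filter query and prefix are made up, and the server class is the Solr 3.x-era one):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetPrefixSuggest {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);                        // we only need facets, not documents
        query.addFilterQuery("category:books");  // hypothetical filter reflecting the current search context
        query.setFacet(true);
        query.addFacetField("title");            // hypothetical field to suggest from
        query.setFacetPrefix("ela");             // what the user has typed so far
        query.setFacetLimit(10);
        QueryResponse response = server.query(query);
        System.out.println(response.getFacetField("title").getValues());
    }
}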
Use of NGrams as part of the analysis chain
The second solution is available from Solr 1.3 and relies on the use of the NGramFilterFactory or EdgeNGramFilterFactory as part of the analysis chain. It means you'll have a specific field which can be searched by typing word fragments. Every word in the index will be split into several ngrams; you can reduce the number of ngrams (and the size of the index) by increasing the minGramSize parameter, or by switching to the EdgeNGramFilterFactory, which works in only one direction, by default from the front edge of each input token. With the NGramFilterFactory you can use both infix and prefix queries, while with the EdgeNGramFilterFactory only prefix queries. This looks like a really flexible way to make auto-suggestions, since it relies on a specific field with its own configurable analysis chain. You can easily filter your results and have them sorted by relevance, also using boosting and the eDisMax query parser. Furthermore, this solution is faster than the previous one. On the other hand, if we want to make auto-suggestions based on a field which contains many terms, we should consider that the index size will increase considerably, since for each term we are indexing a number of grams equal to the term length minus minGramSize (using edge ngrams). This option works with multivalued fields too, but the index size would obviously increase even more.
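A minimal sketch of such a field type in schema.xml (the type name and gram sizes are just an example) could look like this, applying the EdgeNGramFilterFactory at index time only so that the query text is matched against the indexed grams as-is:

<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>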
Use of the TermsComponent
One more solution, available from Solr 1.4, is based on the TermsComponent, which provides access to the indexed terms in a field and the number of documents that match each term. This option is even faster than the previous one; you can make prefix queries using the terms.prefix parameter, or infix queries using the terms.regex parameter available from Solr 3.1. Only single-term suggestions are possible, and unfortunately you can't apply any filter. Furthermore, user queries will not be analyzed in any way; you'll have access to the raw indexed data, which means you could have problems with whitespace or case-sensitive queries, since you'll be searching directly through the indexed terms.
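As a sketch (the field name is made up, and it assumes a /terms request handler exposing the component, as in the Solr example configuration), a prefix request through Solrj could look like this:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsSuggest {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery();
        query.setQueryType("/terms");       // the request handler exposing the TermsComponent
        query.set("terms", true);
        query.set("terms.fl", "title");     // hypothetical field to suggest from
        query.set("terms.prefix", "ela");   // raw, unanalyzed prefix typed by the user
        query.set("terms.limit", "10");
        QueryResponse response = server.query(query);
        System.out.println(response.getResponse().get("terms"));
    }
}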
Use of the Suggester
Due to the limitations of the above solutions, the Solr developers have worked on a new component created exactly for this task. This option is the most recent and recommended one, available since Solr 3.1 and based on the SpellCheckComponent, the same component you can use for spelling correction. What's new is the SolrSpellChecker implementation used to make suggestions, called Suggester, which makes use of the lucene suggest module. It all started with the SOLR-1316 issue, from which the Suggester was created. The collate functionality was then improved through the SOLR-2010 issue. After that, the task was finalized with LUCENE-3135, which backported the lucene suggest module, the one the Solr Suggester class actually uses, to the 3.x branch. This solution has its own separate index, which you can automatically rebuild on every commit. Using collation you can get multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes this solution even more flexible.
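A minimal sketch of a Suggester configuration in solrconfig.xml (the field name and the TSTLookup implementation are just one possible choice) could look like this:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">title</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>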
The following summary lists the pros and cons of each solution mentioned above, from the slowest to the fastest one:
- Faceting with facet.prefix (Solr 1.2): works out of the box, supports filtering and multivalued fields; single-term prefix suggestions only, sorted alphabetically or by count; the slowest option.
- NGrams in the analysis chain (Solr 1.3): supports prefix and infix queries, filtering, relevance sorting and boosting; can considerably increase the index size.
- TermsComponent (Solr 1.4): very fast, supports prefix and (from Solr 3.1) regex-based infix queries; single-term suggestions only, no filtering, and queries are not analyzed.
- Suggester (Solr 3.1): its own separate index rebuilt on commit, multi-term suggestions through collation, optional custom dictionaries; the most recent and most flexible option.
Even though the last option is the most flexible, it requires more tuning. Of course more power also means more responsibility, so if your requirements are just single-term suggestions with filtering and you don't have particular performance problems, the old-fashioned facet approach works perfectly out of the box.
This blog entry has hopefully shown you some of the ways you can implement auto-suggestions with Solr, along with their pros and cons. I hope it will help you make the right choices from the beginning, tailored to your requirements. Please do share any additional considerations I may not have covered, as well as your experiences. We're also intrigued to hear how you deal with the same problems in your search applications. Leave a comment or ask a question if you have any doubts!
The Data Import Handler is a popular method to import data into a Solr instance. It provides out-of-the-box integration with databases, xml sources, e-mails and documents. A Solr instance often has multiple sources, and the process of importing data is usually expensive in terms of time and resources. Meanwhile, if you make some schema changes you will probably find you need to reindex all your data; the same happens when you want to upgrade to a Solr version that is not backward compatible with your indexes. We can call it the "re-index bottleneck": once you've done the first data import involving all your external sources, you will never want to do it the same way again, especially on large indexes and complex systems.
Retrieving stored fields from a running Solr
An easier solution is to query your existing Solr instance, retrieve all its stored fields, and reindex them on a new instance. Everyone can write their own script to achieve this, but wouldn't it be useful to have this functionality available out of the box inside Solr? This is the reason why the SOLR-1499 issue was created about two years ago. The idea was to have a new EntityProcessor which retrieves data from another Solr instance using Solrj. Recently, effort has been put into getting this feature committed to Solr's dataimport contrib module. Bugs have been fixed and test coverage has been increased. Hopefully this issue will be released with Solr 3.5.
Let's give it a try ourselves!
First of all we need to set up the Solr instance which will act as the data source: we can use the standard Solr example as explained here. Then we need to set up the Solr instance into which we will import the data: until the feature is committed, we need to check out trunk or the 3.x branch. In this example we will check out the latter and apply the latest SOLR-1499 patch to it.
Let's run the new Solr instance through ant:
ant run-example -Dexample.solr.home=example/example-DIH/solr/ -Dexample.jetty.port=8888
The patch itself contains a new example of Data Import Handler based on the new functionality, here is the configuration fragment for the request handler (solrconfig.xml):
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>
Note that the default value for the clean parameter is true, which means that the index will be cleaned up before every import; this is annoying, especially if you need to import data from multiple cores in different phases. The default values for the commit and optimize parameters are also true, so you might want to avoid committing or optimizing after every import if your index is big and you rely on the autocommit behaviour. The following fragment may be useful to specify our own values and lock them down:
<lst name="invariants">
  <str name="clean">false</str>
  <str name="commit">false</str>
  <str name="optimize">false</str>
</lst>
This is an example of solr-data-config.xml containing the basic parameters: the url of the Solr instance acting as the source, and the query to be executed.
<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/" query="*:*"/>
  </document>
</dataConfig>
The following are some of the additional parameters you can configure:
- rows: number of rows to retrieve for every query, the default is 50;
- fields: the fl parameter to retrieve a certain subset of fields;
- timeout: the timeout for each query, the default is 5 minutes.
You can also specify one or more field elements inside the entity element if you need to transform fields, for example to change field names to match a different schema, as shown in the sketch below. In the example above the fields in the source index are the same as in the target index, so we didn't need to specify any field elements.
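For instance, a hypothetical mapping that renames the source field title to name in the target schema, while also setting some of the optional parameters described above (following the attribute names used by the patch), could look like this:

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/" query="*:*"
            rows="100" fields="id,title" timeout="300">
      <field column="title" name="name"/>
    </entity>
  </document>
</dataConfig>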
This blog entry has shown you how to reindex your data without involving all your external sources. You can proceed this way as long as all your fields are configured as stored. We are still working on this feature; hopefully the SolrEntityProcessor will be included in the Solr 3.5 release.