Trifork Blog

Posts Tagged ‘Enterprise Search’

Migrating Verity to Elasticsearch at Beeld & Geluid

July 9th, 2013
(http://blog.trifork.com/2013/07/09/migrating-verity-to-elasticsearch-at-beeld-geluid/)

Nederlands Instituut voor Beeld & Geluid: Beeld & Geluid is not only the very interesting museum of media and television located in the colorful building next to the Hilversum Noord train station, but is also responsible for archiving all the audio-visual content of the Dutch radio and television broadcasters. Around 800,000 hours of material is available in the Beeld & Geluid archives – and this grows every day as new programs are broadcast.

This blog entry describes the project Trifork Amsterdam is currently doing at Beeld & Geluid, replacing the current Verity search solution with one that is based on Elasticsearch.


Migrating Apache Solr to Elasticsearch

January 29th, 2013
(http://blog.trifork.com/2013/01/29/migrating-apache-solr-to-elasticsearch/)

Elasticsearch is the innovative and advanced open source distributed search engine, based on Apache Lucene. Over the past several years, we at Trifork have been doing a lot of search implementations. Driven by the fact that every other customer wanted the ‘Google experience’ (just a text box: type some text and get relevant results) as part of their application, we started by building our own solutions on top of Apache Lucene. That worked quite well, as Lucene is the de facto standard when it comes to information retrieval. But soon enough, prompted by examples like Amazon, CNet and Funda in The Netherlands, people wanted to offer their users more ways to drill down into the search results using facets. We briefly ran our own (currently discontinued) open source project, FacetSearch, but Solr quickly started getting traction and we decided to jump on that bandwagon.

Starting with Solr

So we started using Solr for our projects and became vocal about our capabilities, which led to even more (international) Solr consultancy and training work. And as Trifork is not in the game to just use open source but also to contribute back to the community, this has led to several contributions (spatial, grouping, etc.) and eventually to having several committers on the Lucene (now including Solr) project.

We go back a long way…

Around the time we were well into Solr, Shay Banon, whom we knew from our SpringSource days, started creating his own scalable search solution, Elasticsearch. Although from a technical perspective it was a better choice for building scalable search solutions, we didn’t adopt it from the beginning. The main reason was that it was basically a one-man show (a very good one at that, I might add!), and we didn’t feel comfortable recommending Elasticsearch to our customers: if Shay got hit by a bus, it would mean the end of the project. Luckily, all this changed when Shay and some of the old crew from JTeam (the rest of JTeam is now Trifork Amsterdam) decided to join forces and launch Elasticsearch.com, the commercial company behind Elasticsearch. Now it’s all systems go: what was then our main hurdle has been removed, and we can use Elasticsearch and, moreover, guarantee continuity for the project.

Switching from Solr to Elasticsearch

Obviously we are not alone in the world and not that unique in our opinions, so we were not the only ones to change our strategy around search solutions. Many others started considering Elasticsearch, doing comparisons and eventually switching from Solr to Elasticsearch. We still regularly get requests to help companies make that comparison. And although there are still reasons why you may want to go for Solr, in the majority of cases (especially when scalability and real-time search are important) the balance more often than not tips in favor of Elasticsearch.

This is why Luca Cavanna from Trifork has written a plugin (a river) for Elasticsearch that helps you migrate from your existing Solr to Elasticsearch. It basically lets Elasticsearch pull the content from an existing Solr cluster and index it in Elasticsearch. Using this plugin allows you to easily set up an Elasticsearch cluster next to your existing Solr, which helps you get up to speed quickly and enables a smooth transition. Obviously, the tool is mostly meant for that purpose, to help you get started. When you decide to switch to Elasticsearch permanently, you would of course change your indexing pipeline to index content from your sources directly into Elasticsearch; keeping Solr in the middle is not a recommended setup.
The following description on how to use it is taken from the README.md file of the Solr to Elasticsearch river / plugin.

Getting started

The first thing you need to do is download the plugin.

Then create a directory called solr-river in the plugins folder of Elasticsearch (create the plugins folder in the Elasticsearch home directory first if it does not exist yet). Next, unzip the ZIP file and put its contents (all the JAR files) into the newly created folder.

Configure the river

The Solr River allows you to query a running Solr instance and index the returned documents in Elasticsearch. It uses the SolrJ library to communicate with Solr.

It’s recommended that the SolrJ version used matches the Solr version installed on the server that the river is querying. The SolrJ version in use and distributed with the plugin is 3.6.1. It is still possible to query other Solr versions, though: the default response format is javabin, but you can work around compatibility issues by simply switching to the xml format using the wt parameter.
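
For reference, this is roughly what such a query looks like when done by hand with SolrJ 3.6 (a sketch of the kind of request the river issues, not the river’s actual code; the host and query are just the defaults used elsewhere in this post):

// sketch only: a plain SolrJ query, for reference; not taken from the river's source
HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/");
SolrQuery query = new SolrQuery("*:*");
query.setRows(10);
QueryResponse response = solr.query(query);
SolrDocumentList documents = response.getResults(); // the documents that end up being indexed in Elasticsearch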

All the common query parameters are supported.

The Solr river is not meant to keep Solr and Elasticsearch in sync; that’s why it automatically deletes itself on completion, so that it doesn’t start up again at every node restart. This is the default behaviour, which can be disabled through the close_on_completion parameter.

Installation

Here is how you can easily create the river and index data from Solr, just by providing the Solr URL and the query to execute:

curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8080/solr/",
        "q" : "*:*"
    }
}'

All supported parameters are optional. The following example request contains all the parameters that are supported together with the corresponding default values applied when not present.

{
    "type" : "solr",
    "close_on_completion" : "true",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "fq" : "",
        "fl" : "",
        "wt" : "javabin",
        "qt" : "",
        "uniqueKey" : "id",
        "rows" : 10
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "bulk_size" : 100,
        "max_concurrent_bulk" : 10,
        "mapping" : "",
        "settings": ""
    }
}

The fq and fl parameters can be provided as either an array or a single value.

You can provide your own mapping while creating the river, as well as the index settings, which will be used when creating the new index if needed.

The index is created if it does not already exist; otherwise, the documents are added to the existing index with the configured name.

The documents are indexed using the bulk API. You can control the size of each bulk request (default 100) and the maximum number of concurrent bulk requests (default 10). Once that limit is reached, indexing slows down while waiting for one of the ongoing bulk operations to finish; no documents will be lost.

Limitations

  • only stored fields can be retrieved from Solr, and therefore only those are indexed in Elasticsearch
  • the river is not meant to keep Elasticsearch in sync with Solr, but only to import data once. It is possible to register the river multiple times in order to import different sets of documents, though, even from different Solr instances.
  • it’s recommended to create the mapping based on the existing Solr schema, in order to apply the correct text analysis while importing the documents. In the future there might be an option to auto-generate it from the Solr schema.

We hope the tool helps. Do share your feedback with us, as we’re always interested to hear how it worked out for you, and shout if we can help further with training or consultancy.

There’s More Lucene in Solr than You Think!

April 11th, 2012
(http://blog.trifork.com/2012/04/11/theres-more-lucene-in-solr-than-you-think/)

We’ve been providing Lucene & Solr consultancy and training services for quite a few years now, and it’s always interesting to see how these two technologies are perceived by different companies and their technical people. More precisely, I find it interesting how little Solr users know about Lucene and, more so, how unaware they are of how important it is to know about it. A quite recurring pattern we notice is that companies looking for a cheap and good search solution hear about Solr and decide to download it and play around with it a bit. This is usually done within the context of a small PoC to eliminate initial investment risks. So one or two technical people are made responsible for it; they download the Solr distribution and start following the Solr tutorial that is published on the Solr website. They realize that it’s quite easy to get things up and running using the examples Solr ships with, and very quickly decide that this is the right way to go. So what do they do next? They take their PoC codebase (including all Solr configurations), slightly modify and extend it to support their real systems, and in no time they get to the point where Solr can index all the data and then serve search requests. And that’s it… they roll out with it, and very often just put this in production. It is then often the case that after a couple of weeks we get a phone call from them asking for help. And why is that?

Examples are what they are – Just examples

I have always argued that the examples bundled with the Solr distribution are a double-edged sword. On the one hand, they can be very useful to showcase how Solr can work and provide a good reference for the different setups it can have. On the other hand, they give a false sense of security: if the example configurations are good enough for the examples, they’ll be good enough for other systems in production as well. In reality, this is of course far from being the case. The examples are just what they are – examples. Most likely they are far from anything you’d need to support your search requirements. Take the Solr schema, for example: this is one of the most important configuration files in Solr and it contributes many of the factors that influence search quality. Sure, there are certain field types you can probably always use (the primitive types), but when it comes to text fields and the text analysis process, this is something you need to look at more closely and in most cases customize to your needs. Beyond that, it’s also important to understand how different fields behave with respect to the different search functionality you need, and what roles (if any) a field can play in the context of that functionality. For some functionality (e.g. free text search) you need the fields to be analyzed, for other functionality (e.g. faceting) you don’t. You need to have a very clear idea of the search functionality you want to support and, based on that, define what normal/dynamic/copy fields should be configured. The example configurations don’t provide you with this insight, as they target the dummy data and the example functionality they are meant to showcase – not yours! And it’s not just the schema; the solrconfig.xml in the examples is also far more verbose than you actually need or want it to be. Far too many companies just use these example configurations in their production environment, and I find that a pity. Personally, I like to view these configuration files as also serving as a sort of documentation for your search solution – but by keeping them in a mess, full of useless information and redundant configuration, they obviously cannot serve that purpose.

It’s Lucene – not Solr

One of the greater misconceptions about Solr is that it’s a product on its own and that, by reading the user manual (which is an overstatement for a semi-structured and messy collection of wiki pages), one can just set it up and put it in production. What people fail to realize is that Solr is essentially just a service wrapper around Lucene, and that the quality of the search solution you’re building largely depends on it. Yeah, sure… Solr provides important additions on top of Lucene, like caching and a few enhanced query features (e.g. function queries and the dismax query parser), but the bottom line is that the most influential factors of search quality lie deep down in the schema definition, which essentially determines how Lucene will work under the hood. This obviously requires a proper understanding of Lucene… there’s just no way around it! But honestly, I can’t really “blame” users for getting this wrong. If you look at the public (open and commercial) resources that companies are selling to users, they actually promote this ignorance by presenting Solr as a product that stands on its own. Books, public trainings and open documentation all hardly discuss Lucene in detail and instead focus more on “how you get Solr to do X, Y, Z”. I find it quite a shame and actually quite misleading. You know what? I truly believe that users are smart enough to understand – on their own – what parameters they should send Solr to enable faceting on a specific field… come on… these are just request parameters, so let them figure these things out. Instead, I find it much more informative and important to explain to them how faceting actually works under the hood. This way they understand the impact of their actions and configurations and are not left disoriented in the dark once things don’t work as they’d hoped. For this reason we actually designed our Solr training to incorporate a relatively large introduction to Lucene. And take it from me… our feedback clearly indicates that users really appreciate it!

So…

There you have it… let it sink in: when downloading Solr, you’re also downloading Lucene. When configuring Solr, you’re also configuring Lucene. And if there are issues with Solr, they are often related to Lucene as well. So to really know Solr, do yourself a favor and start getting to know Lucene! You don’t need to be a Java developer for that; it’s not the code itself that you need to master. Understanding how Lucene works internally, on a detailed yet conceptual level, should be more than enough for most users.

March newsletter

March 14th, 2012
(http://blog.trifork.com/2012/03/14/march-newsletter/)

This month our newsletter is packed full of news and event highlights so happy reading…

Spring Special offer


The sun is shining and spring is in the air, and for that very reason we have launched a special offer for onsite Solr & Lucene training. Our Spring Sale offers 25% off a 2-day training given by our own active and leading Lucene & Solr committers and contributors. The training first covers how the Apache Lucene engine operates and then introduces Solr in detail. It also covers advanced search functionality, data indexing and, last but not least, performance tuning and scalability. For more information, terms & registration visit our website.

Digital assessments using the QTI delivery engine

Perhaps you read in one of our recent blog entries that we are innovating the world of digital assessments. For many working in digital assessments / examinations, the QTI standard may not be a new phenomenon; it’s been around for a while. The interesting part is how it can be used. Orange11 is currently implementing a QTI assessment delivery engine that is opening new possibilities in digital examinations, assessments, publishing & many more areas. We’re currently busy preparing an interesting demo that will be available online in the coming weeks. However, in the meantime if you want to know more about the standard and technology and how we have implemented it, just drop us a note with your contact details and we can set up a meeting or send you more information.

GOTO Amsterdam


Come on, sign up! We’re already very excited and busy preparing for the event, and we anticipate that this year is going to be even bigger & better than last year. The new location, the Beurs van Berlage (Stock Exchange) building, is a highlight in itself. As for the top-notch speakers, they include some well-known names in the industry, such as Trisha Gee from LMAX & Peter Hilton from Lunatech. Our keynote sessions also look very promising and include talks by John-Henry Harris from LEGO and Peter-Paul Koch covering The Future of the Mobile Web.

Registration is open and prices go up every day so don’t miss out and sign up now.

Our Apache Whirr wizard

Frank Scholten, one of our Java developers, has been voted in as a committer on Apache Whirr. Whirr is a Java library for quickly setting up services in the cloud. For example, using Whirr you can start a Hadoop cluster on Amazon EC2 in 5 minutes via the whirr command-line tool or its Java API. Whirr can also be used in combination with Puppet to automatically install and configure servers.

Frank has been active in using Apache Whirr and has also contributed his insights to the community site SearchWorkings.org, where he most recently wrote the blog post Mahout support in Whirr. We are proud of his contributions, and if you have any specific Apache Whirr questions, let us know.

Tech meeting 5th April (Amsterdam)

This month:

– Apache HTTP: even though this project doesn’t need an introduction anymore, to celebrate its 17th birthday (and the recently released version 2.4) we would like to invite you to a presentation of the Apache HTTP server and some of its most used modules.

– Insight into Clojure, including syntax & data structures, one common interface to rule them all (sequences), code as data for a programmable programming language (macros) and much more.

Sign up now!
For those not in or around Amsterdam, don’t worry: the slides will be available afterwards via our website.

Join our search specialists at…

Berlin Buzzwords. The event that focuses on scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers, specific to the three tags “search”, “store” and “scale”. The early bird tickets are available until 20th March, so sign up now to benefit from the special discounted prices. Our own search specialists, together with many of the contributors from the community site SearchWorkings, will be present.

So that’s all for now folks, we hope you have enjoyed the update.

Apache Lucene FlexibleScoring with IndexDocValues

November 16th, 2011
(http://blog.trifork.com/2011/11/16/apache-lucene-flexiblescoring-with-indexdocvalues/)

During Google Summer of Code 2011, David Nemeskey, a PhD student, proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models on top of the new framework. Prior to this, and in all Lucene versions released so far, the Vector Space Model was tightly bound into Lucene. If you found yourself in a situation where another scoring model worked better for your use case, you basically had two choices: either override all existing Scorers in Queries and implement your own model, provided you had all the statistics available, or switch to some other search engine that provides alternative models or extension points.

With Lucene 4.0 this is history! David Nemeskey and Robert Muir added an extensible API as well as index based statistics like Sum of Total Term Frequency or Sum of Document Frequency per Field to support multiple scoring models. Out of the box, Lucene 4.0 comes with several ranking models besides the classic vector space model, among them Okapi BM25, Divergence from Randomness (DFR), Information-Based (IB) models and Language Models with Dirichlet and Jelinek-Mercer smoothing.

Lucene’s central scoring class Similarity has been extended to return dedicated Scorers like ExactDocScorer and SloppyDocScorer to calculate the actual score. This refactoring basically moved the actual score calculation out of the QueryScorer into a Similarity to allow implementing alternative scoring within a single method. Lucene 4.0 also comes with a new SimilarityProvider which lets you define a Similarity per field. Each field could use a slightly different similarity or incorporate additional scoring factors like IndexDocValues.

Boosting Similarity with IndexDocValues

Now that we have a selection of scoring models and the freedom to extend them, we can tailor the scoring function exactly to our needs. Let’s look at a specific use case – custom boosting. Imagine you have indexed websites and calculated a PageRank, but Lucene’s index-time boosting mechanism is not flexible enough for you; you could use IndexDocValues to store the PageRank instead. First of all you need to get your data into Lucene, i.e. store the PageRank in an IndexDocValues field. Figure 1 shows an example.


IndexWriter writer = ...;
float pageRank = ...;
Document doc = new Document();
// add a standalone IndexDocValues field
IndexDocValuesField valuesField = new IndexDocValuesField("pageRank");
valuesField.setFloat(pageRank);
doc.add(valuesField);
doc.add(...); // add your title etc.
writer.addDocument(doc);
writer.commit();
Figure 1. Adding custom boost / score values as IndexDocValues

Once we have indexed our documents, we can proceed to implement our custom Similarity to incorporate the PageRank into the document score. However, most of us won’t be in a situation where we can or want to come up with an entirely new scoring model, so we are likely to use one of the scoring models already available in Lucene. But even if we are not yet sure which one we are eventually going to use, we can already implement the PageRankSimilarity (see Figure 2).

public class PageRankSimilarity extends Similarity {

private final Similarity sim;

  public PageRankSimilarity(Similarity sim) {
    this.sim = sim; // wrap another similarity
  }

  @Override
  public ExactDocScorer exactDocScorer(Stats stats, String fieldName,
      AtomicReaderContext context) throws IOException {
    final ExactDocScorer sub = sim.exactDocScorer(stats, fieldName, context);
    // simply pull a IndexDocValues Source for the pageRank field
    final Source values = context.reader.docValues("pageRank").getSource();

    return new ExactDocScorer() {
      @Override
      public float score(int doc, int freq) {
        // multiply the pagerank into your score
        return (float) values.getFloat(doc) * sub.score(doc, freq);
      }
      @Override
      public Explanation explain(int doc, Explanation freq) {
        // delegate for brevity; a real implementation would also explain the pageRank factor
        return sub.explain(doc, freq);
      }
    };
  }
  @Override
  public byte computeNorm(FieldInvertState state) {
    return sim.computeNorm(state);
  }

  @Override
  public Stats computeStats(CollectionStatistics collectionStats,
                float queryBoost,TermStatistics... termStats) {
    return sim.computeStats(collectionStats, queryBoost, termStats);
  }
}
Figure 2. Custom Similarity delegate using IndexDocValues

With most calls delegated to some other Similarity of your choice, boosting documents by PageRank is as simple as it gets. All you need to do is pull a Source from the IndexReader passed in via AtomicReaderContext (atomic in this context means it is a leaf reader in the Lucene IndexReader hierarchy, also referred to as a SegmentReader). The IndexDocValues#getSource() method will load the values for this field atomically on the first request and buffer them in memory until the reader goes out of scope (or until you manually unload them; I might cover that in a different post). Make sure you don’t use IndexDocValues#load(), which will pull in the values on each invocation.
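
To actually search with it, the wrapped similarity has to be handed to the searcher per field through a SimilarityProvider. The snippet below is only a rough sketch against the fast-moving 4.0 trunk: the DefaultSimilarityProvider base class, the setSimilarityProvider setter and the "body" field are assumptions on my part, so double-check them against your checkout.

// rough sketch; DefaultSimilarityProvider and setSimilarityProvider are assumptions based on the 4.0 trunk
IndexReader reader = ...;
SimilarityProvider provider = new DefaultSimilarityProvider() {
  @Override
  public Similarity get(String field) {
    // only wrap the field(s) whose score should be multiplied by the page rank
    return "body".equals(field) ? new PageRankSimilarity(super.get(field)) : super.get(field);
  }
};
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setSimilarityProvider(provider);
TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);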

Can I use this in Apache Solr?

Apache Solr already lets you define custom similarities in its schema.xml file. Inside the <types> section you can define a custom similarity per <fieldType>, as shown in Figure 3 below.


<fieldType name="text" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.76</float>
  </similarity>
</fieldType>
Figure 3. Using BM25 Scoring Model in Solr

Unfortunately, IndexDocValues are not yet exposed in Solr. There is an open issue aiming to add support for them, but without any progress yet. If you feel you can benefit from IndexDocValues and all their features, and you want to get involved in Apache Lucene & Solr, feel free to comment on the issue. I’d be delighted to help you work towards IndexDocValues support in Solr!

What is next?

I haven’t decided what is next in this series of posts, but it will likely be either yet another use case for IndexDocValues, like grouping and sorting, or a closer look at how IndexDocValues are integrated into Lucene’s flexible indexing.

IndexDocValues – their applications

November 14th, 2011
(http://blog.trifork.com/2011/11/14/indexdocvalues-their-applications/)

From a user’s perspective Lucene’s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues’ values can be retrieved quickly and efficiently as Simon Willnauer describes in his first IndexDocValues blog post. There are many applications that can benefit from using IndexDocValues for search functionality like flexible scoring, faceting, sorting, joining and result grouping.

In general, when you have functionality like sorting, faceting and result grouping, you usually need an extra dedicated field. Let’s take an author field as an example: you may tokenize its values for full-text search, which can cause faceting or result grouping to behave unexpectedly. However, simply adding an untokenized field with a prefix, such as facet_author, fixes this problem.

IndexDocValues, however, are not tokenized or otherwise manipulated, so you can simply add an IndexDocValuesField with the same name to achieve the same effect as with an additional field. Using IndexDocValues you end up with fewer logical fields, and in this case less is more.
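
As a rough sketch of what that could look like at indexing time (the Field constructor follows the pre-4.0 style, and the setBytes call and the BYTES_VAR_SORTED value type are assumptions on my part, mirroring the setFloat call from the flexible scoring post):

IndexWriter writer = ...;
Document doc = new Document();
// regular, tokenized field used for full text search
doc.add(new Field("author", "Douglas Adams", Field.Store.YES, Field.Index.ANALYZED));
// doc values field with the same name: untokenized, usable for faceting, grouping and sorting
IndexDocValuesField authorValues = new IndexDocValuesField("author");
authorValues.setBytes(new BytesRef("Douglas Adams"), ValueType.BYTES_VAR_SORTED); // assumption: bytes setter variant
doc.add(authorValues);
writer.addDocument(doc);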

Some applications that use Lucene run in a memory-limited environment; consequently, functionality like grouping and sorting cannot be used with a medium to large index. IndexDocValues can be retrieved in a disk-resident manner, meaning that document values aren’t loaded into the Java heap space. Applications using hundreds of fields, each of which could be used for sorting, can quickly run out of memory when FieldCaches are pulled. Disk-resident IndexDocValues help overcome this limitation, at the price of a search performance regression of between 40% and 80%.

At this stage a lot of Lucene / Solr features are still catching up with IndexDocValues, but some of them are already there. Let’s take a closer look at some of the features that already support them.

Solr Schema

As mentioned, using IndexDocValues you end up with fewer logical fields. In Solr this should be even simpler: you should just be able to mark a field to use IndexDocValues, but unfortunately this has not yet been exposed in Solr’s schema.xml. There is, however, an open issue (SOLR-2753) aiming to address this limitation. If you need IndexDocValues and you are using the Solr trunk, it’s a great opportunity to get involved in the Solr development community. If you feel like it, give it a go and code something up.

Result grouping

Result grouping by IndexDocValues is still a work in progress, but will be committed within the next week or two. Making use of IndexDocValues for result grouping is as straightforward as grouping by regular indexed values via FieldCache. If you are in Lucene land you can simply use the IndexDocValues-based grouping collectors and get started (see Listing 1).

Query query = new TermQuery(new Term("description", "lucene"));
Sort groupSort = new Sort(new SortField("name", SortField.Type.STRING));
Sort sortWithinGroup = new Sort(new SortField("age", SortField.Type.INT));

int maxGroupsToCollect = 10;
int maxDocsPerGroup = 3;
String groupField = "groupField";
ValueType docValuesType = ValueType.VAR_INTS;
boolean diskResident = true;
int groupOffset = 0;
int offsetInGroup = 0;

AbstractFirstPassGroupingCollector first = DVFirstPassGroupingCollector.create(
        groupSort, maxGroupsToCollect, groupField, docValuesType, diskResident
);

searcher.search(query, first);

Collection<SearchGroup<BytesRef>> topGroups = first.getTopGroups(groupOffset, false);
AbstractSecondPassGroupingCollector<BytesRef> second = DVSecondPassGroupingCollector.create(
        groupField, diskResident, docValuesType, topGroups, groupSort, sortWithinGroup, maxDocsPerGroup, true, true, false
);

searcher.search(query, second); // run the second pass to collect the actual documents per group
TopGroups<BytesRef> result = second.getTopGroups(offsetInGroup);
Listing 1: Use IndexDocValues with ResultGrouping

Tip: If you’re dealing with IndexSearcher instances use Lucene’s SearcherManager

Note that the code is the same for all other types of grouping collectors, as described on the grouping contrib page. The only new options are diskResident and valueType. The diskResident option controls whether the values are not loaded into the Java heap space but read directly from disk, utilizing the filesystem cache. The valueType defines the type of the value, like FIXED_INT_32, FLOAT_32 or BYTES_FIXED_STRAIGHT. The ValueType variants also provide flexibility in how those values are stored. When indexing IndexDocValues you have to define “what” and “how” you are storing, and this should be consistent throughout your indexing process. However, Lucene provides some under-the-hood type promotion if you index different types on the same field across different indexing sessions.

Tip: The bytes type has a few variants, one of which is sorted. Using this variant you will gain much more performance when grouping.

Sorting

Sorting by float, int and bytes typed doc values is already committed to the trunk and can be used in Lucene land. Making use of this feature for sorting follows the same path as FieldCache-based sorting: by default Lucene uses the FieldCache, unless you specify SortField#setUseIndexValues(boolean) (see Listing 2).

IndexSearcher searcher = new IndexSearcher(...);
SortField field = new SortField("myField", SortField.Type.STRING);
field.setUseIndexValues(true);
Sort sort = new Sort(field);
TopDocs topDocs = searcher.search(query, sort);
Listing 2: Use IndexDocValues with sorting

In general, using IndexDocValues instead of FieldCache is pretty straightforward. No additional API changes are required.

Conclusion

Besides sorting and result grouping there is more IndexDocValues work to do. Other features such as faceting and joining should have doc value based implementations as well.

I think that features which currently use indexed-term-based implementations will slowly move towards doc-values-based implementations. In my opinion this is the way forward; then the inverted index is really used for what it was built for in the first place: free text search.

Introducing Lucene Index Doc Values

October 27th, 2011
(http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/)

From day one Apache Lucene provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case the inverted index is used to retrieve & score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index, like term & document frequency, boosts and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk reads, meaning they perform best if you load all their data, while during document scoring we need more fine-grained, per-document access.

Lucene provides a RAM-resident FieldCache, built from the inverted index the first time the FieldCache for a specific field is requested, or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping and FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene’s internal document ID. When the FieldCache is loaded, Lucene iterates all terms in a field, parses the term values and fills the array slots based on the document IDs associated with each term. Figure 1 illustrates the process.

Figure 1. Un-inverting a field to FieldCache

FieldCache serves its purpose very well, since accessing a value is basically a constant-time array lookup. However, there are special cases where other data structures are used in FieldCache, but those are out of scope for this post.
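
As a minimal illustration of that access pattern (Lucene 3.x style; the price field is just a made-up example):

IndexReader reader = ...;
int docId = ...; // Lucene's internal document ID
// un-inverts the "price" field on first access and keeps the resulting int[] in RAM
int[] prices = FieldCache.DEFAULT.getInts(reader, "price");
int price = prices[docId]; // constant-time array lookup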

So if you need to score based on custom factors, or you need to access per-document values, FieldCache provides very efficient access to them. Yet there is no such thing as a free lunch. Un-inverting the field is very time consuming, since we first need to walk a data structure which is basically the inverse of what we need, and then parse each value, which is typically a String (until Lucene 4) or a UTF-8 encoded byte array. If you are in a frequently changing environment this might turn into a serious bottleneck.

Moreover, with FieldCache you get all or nothing: either you have enough RAM to keep all your data within the Java heap space, or you can’t use FieldCache at all.

IndexDocValues – move FieldCache to the index

A reasonably new feature in Lucene trunk (4.0) tries to overcome the limitations of FieldCache by providing a document-to-value mapping built at index time. IndexDocValues allows you to do all the work during document indexing, with a lot more control over your data. Each ordinary Lucene field accepts a typed value (long, double or byte array) which is stored in a column-based fashion.

This doesn’t sound like a lot more control yet, right? Besides “what” is stored, IndexDocValues also exposes “how” those values are stored in the index. The main reason for exposing the internal data structures is that users usually know way more about their data, so why hide it; Lucene is a low-level library. I will only scratch the surface of all the variants, so see the ValueType javadocs for details.

For integer types IndexDocValues provides byte-aligned variants for 8, 16, 32 and 64 bits, as well as compressed PackedInts. For floating point values we currently only provide float 32 and float 64. For byte array values, however, IndexDocValues offers a lot of flexibility: you can specify whether your values have fixed or variable length, and whether they should be stored straight or in a dereferenced fashion to get good compression when the number of distinct values is low.

Cool, the first limitation of FieldCache is fixed! There is no need to either un-invert or parse the values from the inverted index, so loading should be pretty fast – and it actually is. I ran several benchmarks loading IndexDocValues for the float 32 variants vs. loading a FieldCache from the same data, and the results are compelling – IndexDocValues loads 80 to 100 times faster than building a FieldCache.

So let’s look at the second limitation – all or nothing. FieldCache is entirely RAM resident, which might not be possible or desirable in certain scenarios, but since we need to un-invert there is not much of a choice.

IndexDocValues provides the best of both worlds, whatever the user’s choice at runtime. The API provides a very simple interface called Source, which can either be entirely RAM resident (a single shared instance per field and segment) or disk resident (an instance per thread). With both the RAM-resident and the on-disk variant, the same random-access interface is used to retrieve a single value per field for a given document ID.

The performance-conscious among you might ask whether the lookup performance is comparable to FieldCache, since we now need an additional method call per value lookup vs. a single array lookup. The answer is: “it depends”! For instance, if you choose to use PackedInts to compress your integers you certainly pay a price, but if you choose a 32-bit aligned variant you can actually access the underlying array via Source#getArray(), just like with Java NIO ByteBuffers.
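
In code that could look roughly like the sketch below, reusing the calls from the scoring example; hasArray() and the cast to float[] are assumptions on my part, so treat them as such:

IndexReader reader = ...;
int docId = ...;
IndexDocValues docValues = reader.perDocValues().docValues("pageRank");
Source source = docValues.getSource();       // RAM resident; getDirectSource() gives the disk-resident variant
double oneValue = source.getFloat(docId);    // one method call per value
if (source.hasArray()) {                     // assumption: only array-backed (e.g. 32-bit aligned) variants
  float[] raw = (float[]) source.getArray(); // direct access to the underlying array, no call per value
}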

You want it sorted?

By default Lucene returns search results sorted by the individual document score. However, one of the most commonly used features (especially in the enterprise sector) is sorting by individual fields. This is yet another use case where FieldCache is used for performance reasons. FieldCache can load the already sorted values from the terms dictionary, providing lookup by key & ordinal. IndexDocValues provides the same functionality for fixed and variable length byte[] variants through a SortedSource interface. Figure 2 illustrates obtaining a SortedSource instance.

PerDocValues perDocValues = reader.perDocValues();
IndexDocValues docValues = perDocValues.docValues("sortField");
// Source source = docValues.getDirectSource() for disk-resident version
Source source = docValues.getSource(); 
SortedSource sortedSource = source.asSortedSource();
BytesRef spare = new BytesRef();
sortedSource.getByOrd(2, spare);
int compare = spare.compareTo(new BytesRef("fooBar"));
Figure 2. Obtaining a SortedSource from a loaded Source instance

Recent benchmarks using SortedSource have shown performance equal to FieldCache, and even with the on-disk version the performance hit is between 30% and 50%. These properties can be extremely helpful, especially for users with a large number of fields they need to sort on.

Wrapping up…

This introduction is the first post in a series of posts on IndexDocValues. In the upcoming weeks we are going to publish more detailed posts on how to use IndexDocValues for sorting and result grouping. Since Lucene 4 also provides flexible scoring, using IndexDocValues with Lucene’s new Similarity deserves yet another post.

For the folks more interested in the technical background, in how to extend IndexDocValues, how the internal type promotion works, or even how to write your own low-level implementation, I’m planning to publish a low-level codec post too. So stay tuned.

Indexing your Samba/Windows network shares using Solr

May 5th, 2011
(http://blog.trifork.com/2011/05/05/indexing-your-samba-windows-network-shares-using-solr/)

Many of JTeam’s clients want to search the content of their existing network shares as part of their Enterprise Search infrastructure. Over the last couple of years, more and more people have been switching to Apache Lucene / Solr as their preferred open source search solution. However, many still have the misconception that it is not possible to index the content of other enterprise content systems, like Microsoft SharePoint and Samba / Windows shares, using Solr. This blog entry will show you how to easily index the content of your network shares by reusing some of the components JTeam built to make this really easy to do.


Lucene indexing gains concurrency

May 3rd, 2011
(http://blog.trifork.com/2011/05/03/lucene-indexing-gains-concurrency/)

Imagine you are a Kindergarten teacher and a whole bunch of kids are playing with lego. Suddenly it’s almost 4pm and the big mess needs to be cleaned up, so you ask each kid to pick up one lego brick and put it in your hands. They all run around, bringing bricks to you one by one concurrently. Once your hands are full, you turn around, sort them by color and carefully put them in the right colored piles.

While you are sorting the bricks, kids are still arriving and must wait for you to turn back around to accept more bricks. You turn around, and the process repeats until all the bricks are piled up behind you. You then merge them together into one pile and put them into the box where they belong – closing time!


SSP 1.0 Video Tutorial

April 13th, 2011
(http://blog.trifork.com/2011/04/13/ssp-1-0-video-tutorial/)

Although SSP v1.0 has been replaced by the simpler 2.0 version, some of you out there are probably still using the 1.0 version. Because we like to provide as much assistance as we can to our users, we’ve decided to publish a video tutorial I created on how to configure and use SSP v1.0. It walks you through the general concepts of spatial search, how to configure the plugin in both your schema.xml and solrconfig.xml, and then shows some examples of issuing search requests.
