Trifork Blog

Posts by Chris Male

On Schemas and Lucene

April 17th, 2012 by
(http://blog.trifork.com/2012/04/17/on-schemas-and-lucene/)

One of the very first things users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define, amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.

To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists, or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, by contrast, the optimal strategies for faceting and grouping different fields can be chosen based on field metadata retrieved from its schema.

Consequently, as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going to ask you a lot of questions and I hope you can clarify this issue for me.

So what is a schema anyway?

Before examining the role of a schema, it’s worthwhile first defining what a schema is. So to you, what is a schema? And what makes something schemaless?

According to Wikipedia, a schema in regards to a database is “a set of formulas called integrity constraints imposed on a database”. This of course can be seen in Solr. A Solr schema defines constraints on what fields a Document can contain and how the data for those fields must be analyzed. Lucene, being schemaless, doesn’t have those constraints. Nothing in Lucene constrains what fields can be indexed and a field could be analyzed in different ways in different Documents.

Yet there is something in this definition that troubles me. Must a schema constrain, or can it simply be informative? Put another way, if I index a field that doesn’t exist in my schema, must I get an error? If a schema doesn’t constrain, is it even a schema at all?

Field Name Driven vs. Data Type Driven

Assuming we have a schema, whether it constrains or not, how should it be oriented? Should it follow the style of databases, where you state the definition of each field per field name, or should it instead be driven by datatypes, where you configure a definition per datatype (say, one for all numeric fields)?

The advantage of being field name driven is that it gives you fine grained control over each field. Maybe field X is text but should be handled differently to another text field Y. If you only have a single text datatype then you wouldn’t be able to handle the fields differently. It also simplifies the interaction with the schema. Anything needing access to how a field should be handled can look up the information directly using the field’s name.

The disadvantage of the field name driven approach is that it is the biggest step away from the schemaless world. A definition must be provided for every field and that can be cumbersome for indexes containing hundreds of fields, when the schema must be defined upfront (see below) or when new fields need to be constantly defined.

The datatype driven approach is more of a middle ground. Yes, each datatype must still be defined, but it wouldn’t matter how many actual fields were indexed as long as they mapped to a datatype in the schema. At the same time this could increase the difficulty of using the schema. There wouldn’t be any list of field names stored in the schema. Instead, users of the schema would need to infer the datatype of a field before they could access how the field should be handled. Note, work on adding something along these lines to Solr has begun in SOLR-3250.
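
To make the difference concrete, here is a purely hypothetical sketch of how lookups would differ between the two orientations. None of these interfaces exist in Lucene or Solr; the names FieldSchema, DataTypeSchema and FieldDefinition are invented purely for illustration.

// Hypothetical interfaces for illustration only; not part of Lucene or Solr.

/** Field name driven: every field has its own explicit definition. */
interface FieldSchema {
  /** Look up the handling of a field directly by its name. */
  FieldDefinition definitionFor(String fieldName);
}

/** Datatype driven: only datatypes are defined and fields map onto them. */
interface DataTypeSchema {
  /** Infer the datatype of a field, e.g. from a naming convention such as "*_i". */
  String dataTypeOf(String fieldName);

  /** Look up the handling for that datatype. */
  FieldDefinition definitionForType(String dataType);
}

/** Placeholder for whatever metadata a definition carries (analyzer, numeric type, etc). */
interface FieldDefinition {
}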

What do you think is best? Do you have other ideas how a schema could be structured?

Upfront vs. Incremental

Again assuming we have a schema, whether it be field name or datatype driven, should we expect the schema to be defined upfront before it’s used, or should it be able to be built incrementally over time?

The advantage of the schema being upfront is that it considerably reduces the complexity of the schema implementation. There is no need to support multi-threaded updates or incompatible changes. However it is also very inflexible, requiring that every field that will ever be used in the index be known before any indexing begins.

An incrementally created schema is the opposite of course since you can start from a blank slate and add definitions when you know them. This means a schema can evolve along with an index. Yet as mentioned above, it can be more complex to implement. Issues of how to handle multiple threads updating the schema and incompatible changes to the schema arise. Furthermore, where with an upfront schema you could ensure that when a field is used it will have a definition in the schema, with an incremental schema it may be that a field is accidentally used before its definition is added. Should this result in an error?

It may seem as though Solr requires its schemas to be defined upfront. In reality, however, Solr only requires a schema be defined when it brings a core online, and it prevents any changes while the core is online. When the core is taken offline, its schema can be edited. In SOLR-3251 ideas on how to add full incremental schema support to Solr are being discussed.

Storage: External vs. Index

No matter whether a schema is defined upfront or incrementally built, it will need to be stored somewhere. Solr stores its schema externally in its schema.xml file. This decouples the schema from the index itself, since the same schema could, in theory, be used for multiple indexes. Changes to an external schema do not necessarily have to impact an index (and vice versa), and an external schema doesn’t impact the loading of an index.

At the same time, the disconnect between an externally stored schema and an index means that they could fall out of sync. An index could be opened and new fields added without the schema being notified. Removing the definition of a field in the schema wouldn’t necessarily mean that field would be removed from the index.

One way to address this is to store the schema inside the index itself. Lucene already partially does this, having a very simple notion of FieldInfo. There has been considerable reluctance to increasing what’s stored in FieldInfo since it will slow down the incredibly efficient loading of indexes; how much slower it would become would depend on how much data was stored in the schema. Yet this would ensure that the schema and the index were synchronized: any changes to one would be reflected in the other.

Given how controversial storing a schema in an index would be, do you think it’s worthwhile? Have you encountered synchronisation issues between your indexes and your schemas? Would you prefer control over where your schema is stored, allowing you to choose Cloud storage or maybe another database?

Does any of this even matter?

You might be thinking that I’ve totally wasted my time here and that actually there is no need for a schema module in Lucene. It could be argued that while having a schema is one of Solr’s major features, being schemaless is one of Lucene’s, and that it should stay that way. Maybe it’s best left up to Lucene users to create their own custom schema solutions if they need one. What do you think? Do you have some sort of schema notion in your Lucene applications? If so, how does it work? If you use Solr, do you like how its schema works? If you could change anything, what would you change? I’d love to hear your thoughts.

Lucene Versions – Stable, Development, 3.x and 4.0

March 25th, 2012 by
(http://blog.trifork.com/2012/03/25/lucene-versions-stable-development-3-x-and-4-0/)

With Solr and Lucene 3.6 soon becoming the last featureful 3.x release and the release of 4.0 slowly drawing near, I thought it might be useful just to recap what all the various versions mean to you the user and why two very different versions are soon going to be made available.

A Brief History of Time

Prior to Solr and Lucene 3.1 and the merger of the development of both projects, both were developed along a single path. This meant that all development was done on trunk and all releases were made from trunk. Although a simple development model, it had two major problems:

  • In order to establish a stable codebase to create a release from, trunk would need to go into a code freeze for several weeks. For fast paced projects like Solr and Lucene, this only served to stall development and frustrate contributors and users alike.
  • With backwards compatibility between minor versions compulsory, it was near impossible for large backwards incompatible, but highly desirable, changes to be made. Any large scale changes generally got bogged down in attempts to provide a compatibility layer. One way round this was to make regular major version releases. However this meant major releases were often rushed out before they were ready in order to get some popular feature to users.

The end result was that releases were very infrequent, sometimes rushed, and generally a mess.

With the merger of the development of Solr and Lucene and a reassessment of the development model, it was decided that the projects should follow other open source projects and use a multi-path model consisting of stable and development versions. Trunk at that stage was branched to create the stable 3.x branch, and trunk itself became the development 4.x version.

This decision has had the following benefits:

  • Development of the 3.x branch could focus on stable backwards compatible changes and bug fixes.
  • Since 3.x is stable, releases could be made much more regularly and without having to stop the more wild development of trunk.
  • Development of trunk can be unimpeded by stable releases and doesn’t need to focus on maintaining backwards compatibility.

So what does this mean for you the user?

Solr and Lucene 3.6 will be the last major release from the stable 3.x branch. This means it will be the last major release backwards compatible with all previous 3.x versions. Any future 3.x release will be merely a bug fix release and, given how successful the 3.x versions have been, I doubt there will be many, if any, of those. Sometime after the release of 3.6, trunk will be branched to create the new stable 4.x branch and a stable 4.0 release will be made. Trunk will then move on to being the next development version, 5.x.

Consequently, if you’re using a 3.x version I recommend that you take the opportunity to upgrade to 3.6 when it is released, so that you can make the most of being able to just drop the libraries into your application. If however you’re either using a pre-3.x version and are looking to upgrade, or using a 3.x version but needing one or more of the amazing new features currently in trunk, I recommend that you hold on a little while longer until 4.0 becomes the new stable release.

Conclusion

The change in development model has meant that Solr and Lucene are able to make more regular and cleaner releases. With the end of the 3.x era drawing near, users will soon be able to get their hands on the arguably more powerful 4.x stable releases. In preparation for the (hopefully) imminent release of 4.0, I will be giving you an introduction to some of its new and exciting features.

If you’ve already begun upgrading to 4.x, share your experiences with us. Or if you’re thinking of upgrading and need some assistance, drop us a line and we’ll see how we can help.

Document Frequency Limited MultiTermQuerys

March 19th, 2012 by
(http://blog.trifork.com/2012/03/19/document-frequency-limited-multitermquerys/)

If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed that there are many variations in the spelling of words. In some cases they are intentional, such as omissions of vowels to reduce message length; in other cases they are unintentional typos and spelling mistakes.

Querying this kind of data by matching only the traditional spelling of a word can lead to many valid results being missed. One way to include matches on variations of a word is to use Lucene’s MultiTermQuerys such as FuzzyQuery or WildcardQuery. For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”. Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.
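
For reference, the two example queries can be built programmatically as shown below. This is a minimal sketch assuming Lucene 3.x; the field name "body" is just a placeholder for whatever field your application queries.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class VariationQueries {

  // Matches "hotel" plus close variations such as "hot3l" or "hotl"
  public static Query fuzzy() {
    return new FuzzyQuery(new Term("body", "hotel"));
  }

  // Matches any term fitting the pattern h*t*l, e.g. "hotel" or "hostel"
  public static Query wildcard() {
    return new WildcardQuery(new Term("body", "h*t*l"));
  }
}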

You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.

Let’s explore how you can focus your MultiTermQuerys on the most common terms in your index.

MultiTermQuery Rewriting & FilteredTermsEnum

Before getting into any code, let’s quickly discuss how a MultiTermQuery like FuzzyQuery works. MultiTermQuerys can be thought of as high-level Queries; they aren’t directly executable but instead must be rewritten into an executable Query such as BooleanQuery using Query.rewrite(IndexReader). Internally, MultiTermQuery.rewrite(...) retrieves a FilteredTermsEnum from its concrete implementation and iterates over its terms, adding them to the executable Query.

For FuzzyQuery, the FilteredTermsEnum will return all the terms in the index that are within the fuzzy margin of the target term.  For “hotel~” this could be “hotel”, “hot3l”, “hotl” or “hotal”.

It is this FilteredTermsEnum that we can manipulate to weed out those terms which are infrequent.

DocFreqFilteredTermsEnum

The following code is an implementation of FilteredTermsEnum which wraps another and filters out those terms which are found in less than a defined percentage of Documents in the index:

Lucene 3.x

(Note, in Lucene 3.x FilteredTermsEnum is called FilteredTermEnum)

 public class DocFreqFilteredTermEnum extends FilteredTermEnum {

  private final float rewrittenTermDocFreqPercent;
  private final int maxDoc;

  public DocFreqFilteredTermEnum(
            FilteredTermEnum termEnum,
            float rewrittenTermDocFreqPercent,
            int maxDoc) throws IOException {
    this.rewrittenTermDocFreqPercent = rewrittenTermDocFreqPercent;
    this.maxDoc = maxDoc;
    setEnum(termEnum);
  }

  @Override
  protected boolean termCompare(Term term) {
    return ((float) actualEnum.docFreq() / maxDoc) >= rewrittenTermDocFreqPercent;
  }

  @Override
  public float difference() {
    return ((FilteredTermEnum) actualEnum).difference();
  }

  @Override
  protected boolean endEnum() {
    // Keep going and let the wrapped FilteredTermEnum decide when to stop
    return false;
  }

}

This code can be used with a FuzzyQuery to filter out those terms which occur in less than 1% of Documents as follows:

 FuzzyQuery fuzzyQuery = new FuzzyQuery(term) {
  @Override
  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    return new DocFreqFilteredTermEnum(super.getEnum(reader), 0.01f, reader.maxDoc());
  }
};

Or alternatively, as of Lucene 3.6 you can override RewriteMethod.getTermsEnum(IndexReader, MultiTermQuery) to return a DocFreqFilteredTermEnum, so that you don’t need to subclass each MultiTermQuery implementation.

Lucene 4.x

 public class DocFreqFilteredTermsEnum extends FilteredTermsEnum {

  private final float rewrittenTermDocFreqPercent;
  private final int maxDoc;

  public DocFreqFilteredTermsEnum(
            TermsEnum termsEnum,
            float rewrittenTermDocFreqPercent,
            int maxDoc) throws IOException {
    super(termsEnum);
    this.rewrittenTermDocFreqPercent = rewrittenTermDocFreqPercent;
    this.maxDoc = maxDoc;
  }

  @Override
  protected AcceptStatus accept(BytesRef term) throws IOException {
    // Accept the term only if it occurs in at least the required percentage of Documents
    return ((float) tenum.docFreq() / maxDoc) >= rewrittenTermDocFreqPercent ?
        AcceptStatus.YES : AcceptStatus.NO_AND_SEEK;
  }
}

Again, to use this code:

 FuzzyQuery fuzzyQuery = new FuzzyQuery(term) {
  @Override
  protected TermsEnum getTermsEnum(IndexReader reader) throws IOException {
    return new DocFreqFilteredTermsEnum(super.getTermsEnum(reader), 0.01f, reader.maxDoc());
  }
};

And again, like Lucene 3.6, you can alternatively override RewriteMethod.getTermsEnum(IndexReader, MultiTermQuery).

Conclusion

Having control over the terms used in a MultiTermQuery can considerably improve your search performance while still allowing you to overcome spelling variations. Have you encountered a scenario where you’ve been executing MultiTermQuerys over user generated data? Did they impact the performance of your searches? Was that due to the large number of terms being queried? If so, we’d love to hear about your experiences.

Apache Lucene & Solr 3.5.0

December 14th, 2011 by
(http://blog.trifork.com/2011/12/14/apache-lucene-solr-3-5-0/)

Just a little over two weeks ago Apache Lucene and Solr 3.5.0 were released. The released artifacts can be found here and here respectively. As part of the Lucene project’s effort to do regular releases, 3.5.0 is another solid release providing a handful of new features and bug fixes. The following is a review of the release, focusing on some changes which I found particularly interesting.

Apache Lucene 3.5.0

Lucene 3.5.0 has a number of very important fixes and changes to both its core index management and userland APIs:

  • LUCENE-2205: Internally Lucene manages a dictionary of terms in its index which is heavily optimized for quick access. However the dictionary can consume a lot of memory, especially when the index holds millions/billions of unique terms. LUCENE-2205 considerably reduces (3-5x) this memory consumption through a rewrite of the datastructures and classes used to maintain and interact with the dictionary.
  • LUCENE-2215: Strangely enough, despite deep paging being a common use case, Lucene has never provided an easy and efficient API for it. Instead users have had to use the existing TopDocs driven API, which is very inefficient when used with large offsets, or have had to roll their own Collector. LUCENE-2215 addresses this limitation by adding searchAfter methods to IndexSearcher which efficiently find results that come ‘after’ a provided document in the result set (see the sketch after this list).
  • LUCENE-3454: As discussed here, Lucene’s optimize index management operation has been renamed to forceMerge to counter the common misunderstanding that the operation is vital. Some users had considered it so vital that they optimized after each document was added. Since 3.5.0 is a minor release, IndexWriter.optimize() has only been deprecated; however it has been removed from Lucene’s trunk, so it is recommended that users move over to forceMerge where appropriate.
  • LUCENE-3445, LUCENE-3486: As part of the effort to provide userland classes with easy to use APIs for managing and interacting with Lucene indexes, LUCENE-3445 adds a SearcherManager which handles the boilerplate code so often written to manage IndexSearchers across threads and across reopens of the underlying IndexReaders. LUCENE-3486 goes one step further by adding a SearcherLifetimeManager which provides an easy-to-use API for ensuring that a user keeps using the same IndexSearcher as they ‘drill down’ or page through results. Interacting with a new IndexSearcher during paging can mean the order of results changes, resulting in a confusing user experience.
  • LUCENE-3426: When using NGrams (for the term “ABCD”, the NGrams could be “AB”, “BC”, “CD”) and PhraseQuerys, the Queries can be optimized by removing any redundant terms (the PhraseQuery “AB BC CD” can be reduced to “AB CD”). LUCENE-3426 provides a new NGramPhraseQuery which does such optimizations, where possible, on Query rewrite. The benefit: a 30-50% performance improvement in some cases, which is especially valuable for CJK users, where NGrams are prevalent.
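
As a quick illustration of the LUCENE-2215 API, deep paging with searchAfter looks roughly like the sketch below. It assumes Lucene 3.5; the searcher and query are whatever your application already has.

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class DeepPaging {

  public static TopDocs nextPage(IndexSearcher searcher, Query query,
                                 TopDocs previousPage, int pageSize) throws IOException {
    // Continue collecting results 'after' the last hit of the previous page,
    // instead of collecting and then discarding a large offset
    ScoreDoc lastHit = previousPage.scoreDocs[previousPage.scoreDocs.length - 1];
    return searcher.searchAfter(lastHit, query, pageSize);
  }
}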

Lucene 3.5.0 of course contains many smaller changes and bug fixes.  See here for full information about the release.

Apache Solr 3.5.0

Benefiting considerably from Lucene 3.5.0, Solr 3.5.0 also contains a handful of useful changes:

  • SOLR-2066: Result grouping continues to be one of Solr’s most sought after features, and SOLR-2066 extends its power and flexibility by adding distributed grouping support. Although coming at the cost of 3 round trips to each shard, SOLR-2066 all but closes the book on what was once considered an extremely difficult feature to add to Solr, and sets Solr apart from alternative search systems.
  • SOLR-1979: When creating a multi-lingual search system, it is often useful to be able to identify the language of a document as it comes into the system. SOLR-1979 adds out-of-the-box support for this by adding a langid Solr module containing a LanguageIdentifierUpdateProcessor which leverages Apache Tika’s language detection abilities. In addition to identifying which language a document is in, the UpdateProcessor can map data into language specific fields, a common way of supporting documents of different languages in a multi-lingual search system.
  • SOLR-2881: With support for sorting Documents that have missing values in a field (known as sortMissingLast) improved in Lucene’s trunk and 3.x branch, support for using sortMissingLast with Solr’s Trie fields has been added. Consequently it is now possible to control whether Documents with no value in a Trie field appear first or last when sorted.
  • SOLR-2769: Solr users are now able to use Hunspell for Lucene through the HunspellStemFilterFactory. The factory allows the affix file and multiple dictionary files to be specified, giving Solr users access to some of the over 100 Hunspell dictionaries used in projects like OpenOffice and Mozilla Firefox in their analysis chains. Very useful for users having to support rarely used languages.

Solr 3.5.0 also contains many smaller fixes and changes.  See the CHANGES.txt for full information about the release.

Lucene & Solr 3.6.0?

With changes still being made to the 3.x branch of both Lucene and Solr, and the release of Lucene and Solr 4.0 still some way off, it is very likely that 3.6.0 will be released in a couple of months’ time.

Analysing European Languages With Lucene

December 7th, 2011 by
(http://blog.trifork.com/2011/12/07/analysing-european-languages-with-lucene/)

It seems more and more often these days that search applications must support a large array of European languages.  Part of supporting a language is analysing words to find their stem or root form.  An example of stemming is the reduction of the words  “run”, “running”, “runs” and “ran”  to their stem “run”.  In the past all this required in Lucene was use of the Analyzers for the desired languages.  Yet nowadays developers are presented with a plethora of TokenFilter alternatives, from the traditional Snowball based filters, through the recently added Hunspell filter, to the vaguely named Light and Minimal filters.

In this blog I want to compare and contrast each of these options, focusing on their algorithms and how they achieve their goals, and hopefully giving you enough information to make an informed decision about which option fits your use case.

Snowball – De facto no more?

For most of Lucene’s history, the de facto analysis framework for European languages has been driven by Snowball. Snowball is a domain specific language for defining stemming algorithms for European languages, from which ANSI C and Java implementations can be generated. Started by Martin Porter, the creator of the Porter algorithm for stemming English, Snowball provides Stemmers for the commonly used European languages such as French, Dutch, German, Spanish and Italian.

Snowball Stemmers use an algorithm which allows them to stem any word in a language, whether the word would appear in an official dictionary or not. Although the algorithms for each language vary, Snowball Stemmers are considered very aggressive, often removing large suffixes to reduce words to a stem. For usage with Lucene, this can have the positive effect of reducing the number of unique terms in the index. Yet it can also heavily impact the quality of your search results. Consider the query string “international”. The Snowball Stemmer for English will remove the common suffix “ational” from the word, creating the stem “intern”. As you can see, this could lead to search results being found that have no relation to the original query.

Despite their potentially negative impact, Snowball Stemmers continue to be popular due to reliability, long lifespan and their development by academic linguistic communities.

SnowballFilter is the primary entry point for Snowball Stemmers in Lucene.
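
As a minimal sketch of how SnowballFilter is wired into an analysis chain (assuming Lucene 3.x and the English stemmer; stopword handling and other field-specific concerns are left out):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class EnglishSnowballAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
    stream = new LowerCaseFilter(Version.LUCENE_35, stream);
    // Reduce each token to its English Snowball stem
    return new SnowballFilter(stream, "English");
  }
}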

Hunspell for Lucene – Beyond just Hungarian

Hunspell for Lucene, a pure Java port of the Hunspell analytical system found in major applications like OpenOffice and Firefox, is a new stemming system for Lucene which supports every European language you’ve ever heard of and many you haven’t.

Unlike Snowball, Hunspell is not a purely algorithmic system.  Instead Hunspell uses a file containing the grammatical rules for a language encoded in a standardised format, and a dictionary file containing the language’s valid stems.  During the analysis of a word, Hunspell loops over the grammatical rules, applying those applicable, and then checking if a valid stem is found.  The advantage of this process is that very complicated grammatical rules can be applied such as the removal of multiple suffixes and prefixes, and the addition of new characters.  Furthermore, Hunspell will continue to stem a word until all valid stems have been found, rather than just one.  However, unlike Snowball which can analyse any word, Hunspell will only find stems for those words which match its rules or have stems found in its dictionary.

Because the linguistic logic of Hunspell is separated from the programmatic algorithm, adding support for new languages is remarkably easy.  As such, hundreds of Hunspell dictionaries have been created.  Unfortunately the quality of the dictionaries varies considerably.  Therefore although Hunspell can be used to analyse Maltese for example, the quality of the analysis will most likely be much less than the analysis of English.  As such, the impact on your Lucene index and search results are hard to predict.

Hunspell for Lucene has only recently been developed and added to the Lucene codebase.  As such I recommend thorough testing before putting it into production.  Equally, there are some dictionary formats which are not yet supported.

HunspellStemFilter is the primary entry point for Hunspell for Lucene.
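
The sketch below shows roughly how HunspellStemFilter can be used directly (assuming Lucene 3.5+; the en_GB.aff and en_GB.dic file names are placeholders, and the exact HunspellDictionary constructor may differ slightly between releases):

import java.io.FileInputStream;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.hunspell.HunspellDictionary;
import org.apache.lucene.analysis.hunspell.HunspellStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class HunspellAnalyzer extends Analyzer {

  private final HunspellDictionary dictionary;

  public HunspellAnalyzer() throws Exception {
    // The affix file holds the grammatical rules, the dictionary file the valid stems
    this.dictionary = new HunspellDictionary(
        new FileInputStream("en_GB.aff"),
        new FileInputStream("en_GB.dic"),
        Version.LUCENE_35);
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
    // Emits every valid stem Hunspell finds for each token
    return new HunspellStemFilter(stream, dictionary);
  }
}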

Light & Minimal – Analysis made easy

While both Snowball and Hunspell for Lucene aggressively try to unravel complex grammatical rules to find ‘true’ stems of words, the rather vaguely named Light and Minimal Stemmers found in Lucene take a very different direction. Developed from the work done by Jacques Savoy and discussed in detail here, both Stemmer forms apply few grammatical rules, using incredibly simple and efficient algorithms. Most of the rules applied are the removal of plurals and common suffixes from nouns and the normalisation of special characters. The effect on a Lucene index, when compared against both Snowball and Hunspell for Lucene, is many more unique terms but a considerably reduced likelihood of unrelated words being reduced to the same stem.

The difference between the Light and Minimal form of a Stemmer for say French, comes down to how many rules are applied.  The FrenchLightStemmer, a mere 203 lines of code vs the Snowball derived FrenchStemmer’s 608 lines, removes both plurals and some common suffixes, as well as doing character normalisation.  In comparison, the FrenchMinimalStemmer, coming in at a tiny 18 lines of code, just removes plurals.

Due to their less aggressive analysis and incredible efficiency, the Light and Minimal Stemmers provide a very compelling alternative to the comparatively slow and monolithic Snowball and Hunspell stemmers.  Unfortunately, Light and Minimal Stemmers only exist for a few European languages.

Light/MinimalStemFilter implementations for Lucene can be found under the appropriate language packages in org.apache.lucene.analysis.*.

Conclusion

With multiple options for analysing European languages, deciding which to use can seem quite daunting. However, with each approaching the same problem in very different ways and the range of languages supported growing every day, it’s well worth experimenting before settling on the one that fits your use case.

Compromise is hard

November 22nd, 2011 by
(http://blog.trifork.com/2011/11/22/compromise-is-hard/)

Whenever I talk about my job with friends who are also IT professionals, the most commonly envied aspect is that I get to work in a community where everybody has a voice. Apache Software Foundation projects like Solr and Lucene tend to work from the motto that if it didn’t happen on the mailing list, it didn’t happen. This means that no matter how experienced you are, how many years you’ve been working on a project, or how knowledgeable you are about an issue, you can always chime in with your 5c worth. I do have to agree with my friends: this, probably beyond anything else, is what I love most about my job.

Yet this freedom doesn’t come without its downsides. Discussions on simple issues can quickly snowball into impassioned debates or, worse still, flame wars. In Solr and Lucene, where we prefer to come to a consensus on an issue or code change rather than railroading it through, this often means compromises must be made in order to appease all parties. Let me tell you, as someone who often plays the role of peacemaker, compromise is hard.

To understand what I mean, let’s quickly examine what happens in a corporate environment when an issue arises about, let’s say, a product. Generally a group of people who know about the product will be brought together to discuss how to address the issue. Best case scenario, consensus is immediately reached about how best to go forward. Worst case, the group fragments into sub-groups with different opinions and a debate breaks out. However even with different opinions, the sub-groups are motivated to compromise by the fact that:

  • Without agreement, the issue will be stalled, the product ruined, the company potentially doomed and jobs lost. People like keeping their jobs.
  • Developing the product is what they’re paid to do
  • Management will most likely make its own decision if agreement is not reached

Compromise will be reached, even if it means some individuals having to relent entirely.

In comparison, none of these motivations exist in the discussion about an issue in Solr or Lucene.  All committers are equal, there is no management which will make a decision for the community and a single committer can veto a change if they have a valid technical reason.  Committers come from a wide variety of backgrounds and cultures.  Some are employed solely to work on the projects, some work for organisations that use the project artifacts in their own projects, some may very well be users or hobbyists who have invested some of their own time.  There is no overriding corporate entity, no community agreed strategy.  Communication is done via mailing lists and JIRA issues which provide a degree of anonymity and physical distance.

Consequently, when a committer finds themselves disagreeing with a change being suggested, there is not necessarily any reason for them to compromise. Assuming they have a valid technical reason for disagreeing, they can be as stubborn and unrelenting as they like. Their livelihood will most likely be unaffected if another’s issue doesn’t go forward. Other issues will continue to be developed and there is very little likelihood that the projects will stall. Trying to find a compromise, or to encourage others to, in this sort of environment can be very challenging.

Yet it is exactly this kind of environment which has led to many successes in Solr and Lucene. Just at that moment where all parties involved have thrown their toys from the cot and hair is being pulled, time and time again someone has stepped forward with a new idea or a new direction which all parties agree on. Whether it be the usage of a certain design pattern, the same pattern again, or the naming of core functionality, compromises have eventually been made and agreements found. Although I so often hate it at the time, I sometimes do wonder whether this is actually what I love most about my job.

We need you

Hopefully, in a roundabout way, I’ve shown that being part of the Solr and Lucene community, albeit challenging, is also very rewarding.  If you’re either tired of having your voice ignored, passionate about open source search, or just like being part of a lively debate, I encourage you to get involved in the community, become part of mailing list discussions and contribute to issues.

The Lucene Sandbox

September 15th, 2011 by
(http://blog.trifork.com/2011/09/15/the-lucene-sandbox/)

Few people are aware that Apache Lucene has been part of the ASF since 2001, becoming a Top Level Project in 2005. 10 years is an eternity in IT, where ideas tend to evolve in leaps and bounds. Over that 10 year period many users, contributors and committers have come and gone from Lucene, each shaping in their own way what has become the de facto open source search library. But of course good ideas from 10 years ago are not necessarily so good today.

Contribs and Module Consolidation

Before I talk about the Sandboxing process in more detail, it’s perhaps best to understand what the Lucene codebase has been like until recently and, moreover, where we want to go.

For some time now, Lucene has had a section in its codebase known as contrib, which is home to code that is related to Lucene but, for whatever reason, has been deemed as not belonging in the core code. Examples of commonly used code that has been or remains part of the contrib are Lucene’s Highlighters, its large array of analysis tools and the flexible QueryParser framework.

One of the unfortunate features of the contrib is that code quality and algorithmic efficiency vary greatly and can go unaddressed for many years. I recently saw a class whose last SVN commit was in 2005. I can assure you that many parts of Lucene have changed since then.

With Lucene 4 and the merger of Lucene and Solr’s development, the idea was put forward that the analysis tools in the contrib and from around the rest of the codebase should be consolidated together as a module alongside Lucene and Solr. Once consolidated, the code would be held to a high standard comparable to that of Lucene’s core. With the success of this consolidation, it was decided that other concepts in Lucene should also be pulled together into modules. Two concepts that were immediately thrown around were Lucene’s wide variety of QueryParsers and its many exotic Query implementations.

It works but it’s kind of sandy

Consolidating Lucene’s QueryParsers was not such a problem. While they do suffer from their fair share of flaws, because they are so commonly used they are well tested and operate efficiently. The same cannot be said, however, about all the exotic Query implementations.

The problem confronting the consolidation of the Query implementations was not that they didn’t work. If that had been the problem then it would have been fine to remove them. The problem was that they did work, but were poorly written, documented or tested, or operated in inefficient ways. To put it bluntly, the code was not necessarily up to the standard we expected for the new queries module.

Deleting code which exists in such a grey area would be a bold decision and could prove a mistake. It was obviously added for a reason and no doubt has some users. Some code, with some effort, could probably make its way to being module worthy. Therefore the idea of a sandbox for Lucene was floated: a place of lesser standards where ‘sandy’ code which is not deemed module worthy can go so that it can continue to be used and maybe even improved, before the decision about its ultimate fate is made.

Sandy Code and The Future

So what code in Lucene is sandy? It’s very hard to say right now and is definitely subjective. Long deprecated or poorly written code is an obvious sign, but what about a lack of testing? As such, only the code which is already in the sandbox can be called sandy for sure. However, as we continue to consolidate the various parts of Lucene into modules and slowly shut down the contrib, the sandbox will no doubt grow further.

Do you have any thoughts on what perhaps belongs in the sandbox? Do share them with us; it would be good to hear from you.

Hotspotting Lucene With Query Specialization

September 12th, 2011 by
(http://blog.trifork.com/2011/09/12/hotspotting-lucene-with-query-specialization/)

Most Java developers probably take the technology behind Hotspot™ for granted and assume it’s best suited to optimize their code. While they might know that choosing the best implementation of List will have a big impact on their program’s performance, they probably shy away from worrying about the cost of virtual method calls, believing Hotspot knows best.

With most applications and in most situations this belief is well-founded, and what I’m going to discuss is not trying to change the world view on Hotspot or encourage people to ‘optimize prematurely’. Rather, I want to explore how we might aid Hotspot to produce the most optimal code for executing Lucene Querys and, in the process, increase query performance by over 100%.

Red Hot FunctionQuery Execution

One of the most popular Querys in Apache Solr and Lucene is FunctionQuery, which applies a function (known as a ValueSource) to every document in the index, returning the result of the function as the document’s score.  Rarely in Lucene’s Querys is logic applied to every document since Querys are responsible for matching (finding a relevant subset of documents) as well as scoring. Consequently, scoring logic is usually only applied to the matched subset. FunctionQuerys are not focused on matching, instead they manipulate the scores of documents in extensible ways.

To understand the problems this might cause, imagine you have an index with 100 million documents containing two fields which you wish to sum together and use as document scores. Even though the logic required to implement this is very simple, the fact that it will be executed 100 million times for every query means that it will glow hot – red hot.

As mentioned, FunctionQuery is designed to be extensible. It accepts implementations of ValueSource of which there are many in Lucene’s codebase. One such abstract implementation is MultiFloatFunction, itself accepting a list of ValueSources. MultiFloatFunction computes its function result by applying the logic implemented in the method func() to each of the results taken from its ValueSources. SumFloatFunction, a concrete implementation of MultiFloatFunction, adds sum logic to func().

The following UML class diagram illustrates the hierarchy of ValueSource classes necessary to add the values of two integer fields together.

Note, FieldValueSource is a ValueSource implementation that returns the value of a field for a document. IntFieldValueSource assumes the value is an integer.

Although this seems convoluted already, in practice this example is not too bad. The code boils down to two values being read from two different arrays and added together. However, if you wanted to add two field values together and then multiply the result by a third field value, you would need to add another MultiFloatFunction implementation (ProductFloatFunction) and another IntFieldSource.  Suddenly we have a large stack of objects with many layers of virtual calls that must be executed per document per Query just to execute the formula (A + B) * C.
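
To make that object stack concrete, here is roughly how the (A + B) * C function would be assembled. This is a sketch assuming the Solr 3.x classes in org.apache.solr.search.function (in Lucene/Solr 4.x the equivalent classes live in the queries module under org.apache.lucene.queries.function); the field names "a", "b" and "c" are placeholders.

import org.apache.solr.search.function.FunctionQuery;
import org.apache.solr.search.function.IntFieldSource;
import org.apache.solr.search.function.ProductFloatFunction;
import org.apache.solr.search.function.SumFloatFunction;
import org.apache.solr.search.function.ValueSource;

public class AbcFunctionQueryFactory {

  public static FunctionQuery build() {
    // A + B
    ValueSource sum = new SumFloatFunction(new ValueSource[] {
        new IntFieldSource("a"), new IntFieldSource("b") });
    // (A + B) * C
    ValueSource product = new ProductFloatFunction(new ValueSource[] {
        sum, new IntFieldSource("c") });
    // Every document's score becomes the value of the assembled function
    return new FunctionQuery(product);
  }
}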

Specialized FunctionQuerys

In a simple benchmark, I indexed 100,000 documents containing random integer values for 3 fields.  I then assembled the ValueSource required to execute A + B and computed the time required to execute 100,000 FunctionQuerys using the ValueSource. The result, on my underpowered laptop, was an average execution time of 3ms – not bad, not bad at all. However when I repeated the same test, this time using the ValueSource for (A + B) * C, the execution time jumped up to 25ms per query – still not bad, especially given the benchmarks were not on a dedicated server.

However, I wondered if part of the increased cost was the extra levels of virtual method calls that needed to be made. To test this, I created a simple interface called Function (shown below) and created an implementation of FunctionQuery which used Functions instead of ValueSources.

public interface Function {

  /** Computes the function value (used as the score) for the given document. */
  float f(int doc);

  /** Gives the Function a chance to load per-reader data, e.g. FieldCache arrays. */
  void setNextReader(IndexReader indexReader) throws IOException;
}

I then created an implementation supporting the formula (A + B) * C as follows:

public final class ABCFunction implements Function {

  private int[] aVals, bVals, cVals;

  public float f(int doc) {
    return (aVals[doc] + bVals[doc]) * cVals[doc];
  }

  public void setNextReader(IndexReader indexReader) throws IOException {
    // load the per-reader value arrays for fields A, B and C from the FieldCache
  }
}

Executed against the same index, the time per FunctionQuery dropped to just 13ms – a huge improvement.

What’s the difference? Well, the above implementation does not use object composition. The only virtual method call executed is the one from FunctionQuery to the Function implementation. Even though the logic is the same as that executed using the ValueSources (3 values loaded from arrays, arithmetic applied), Hotspot is able to execute the simplified code, with its fewer method invocations, much more quickly.

Extensibility and Performance

While hopefully you are impressed by the performance improvement, you’re also probably thinking “Well that’s great Chris, but this is not extensible or reusable”, and you would be right. ValueSource’s composition architecture allows us to quickly assemble a wide variety of functions while hiding away much of the complexity (some ValueSources are very complex). They also lend themselves to being easily created while parsing a query string.

Yet it could be argued that the above code illustrates that if your search application uses a small set of known functions regularly, there is a considerable performance benefit to be gained by providing specialised implementations.  From my experience, most Solr applications use only a couple of functions, primarily composed of the rudimentary arithmetic variety.

ASTs and Executables

While toying with ValueSource and comparing it against my own ABCFunction, I couldn’t help but notice the similarity between ValueSources and ASTs, and ABCFunction and Java Bytecode. When compiling source code, most compilers create an Abstract Syntax Tree (AST) representation of the code.  ASTs are a very close representation of the original source code, but use typed objects and composition, instead of text, to represent code statements. The following is a UML diagram showing the AST javac builds to represent the statement int a = 10:

Having constructed an AST and after applying some optimizations, most compilers then generate the executable form of the source code. For Javac, this is Java Bytecode.

An assembled ValueSource is very much like an AST. It describes the function that is to be executed and, as mentioned, can be very easily constructed during query string parsing. But like an AST, it perhaps isn’t the most efficient form for execution. Instead, it needs to be optimized (a topic for a future article) and then converted to an optimal execution format. ABCFunction can be seen as an example of such a format.

So how do we get from ValueSource to our optimal Function implementation? Well, we could generate source code and use javac to compile it for us, we could generate bytecode ourselves, or we could even generate native code. However we do it, we are creating the foundations for a Query Compiler.

Query Compilation and Beyond

Hopefully I’ve tickled your interest in how we might go about using code specialisation to increase the performance of Hotspot’s execution of Query code. In a future article, I will extend my explorations to the execution of all Lucene’s Querys and explain in more detail how we might go about building a Query Compiler for Lucene.

The State and Future of Spatial Search

May 11th, 2011 by
(http://blog.trifork.com/2011/05/11/the-state-and-future-of-spatial-search/)

The release of Solr 3.1, containing Solr’s official spatial search support, has coincided with a new debate about the future of spatial search in Solr and Lucene. JTeam has been involved in the development of spatial search support for a number of years and we maintain our own spatial search plugin for Solr. Consequently this debate is of great interest to us. As it seems unclear at this stage whether the development of spatial search should stay within the Lucene project, or even the ASF at all, I feel it’s important to get some opinions on the matter, and who better to turn to than our readers. So let’s take a quick journey through the history of spatial search in Solr and Lucene, talk about what currently is supported and what’s not, and discuss the issues confronting its future.

Read the rest of this entry »

SSP 1.0 Video Tutorial

April 13th, 2011 by
(http://blog.trifork.com/2011/04/13/ssp-1-0-video-tutorial/)

Although SSP v1.0 has been replaced by the simpler 2.0 version, some of you out there are probably still using the 1.0 version. Because we like to provide as much assistance as we can to our users, we’ve decided to publish a video tutorial I created on how to configure and use SSP v1.0. It walks you through the general concepts of spatial search, how to configure the plugin in both your schema.xml and solrconfig.xml, and then shows some examples of issuing search requests.

Read the rest of this entry »