Trifork Blog

Posts Tagged ‘release’

AFAS’ CIO Rolf de Jong hosts Panel Discussions at Axon Seminar

February 12th, 2013 by


On February 28th 2013, Trifork will organize the Axon 2 launch seminar. During this seminar, visitors will be introduced to CQRS and Axon Framework, version 2 of which was released just a few weeks ago. The seminar will be an afternoon packed with technical insight, case studies and panel discussions, and we look forward to a number of informative and interactive sessions. Rolf de Jong, CIO of AFAS ERP Software, has accepted our invitation to host the panel discussions, in which a team of experts will share their thoughts on CQRS, Axon Framework and software development in general.

Axon Framework 2.0 Released!

January 22nd, 2013 by

After laying the groundwork for this release about a year ago, we now proudly announce the next major release of Axon Framework! The 2.0 release is a big step forward compared to the previous version, as it contains a thorough restructuring of components, making it even easier to build scalable and extensible applications.

Apache Lucene & Solr 3.5.0

December 14th, 2011 by

Just a little over two weeks ago Apache Lucene and Solr 3.5.0 were released. The released artifacts can be found here and here respectively. As part of the Lucene project’s effort to do regular releases, 3.5.0 is another solid release providing a handful of new features and bug fixes. The following is a review of the release, focusing on some changes which I found of particular interest.

Apache Lucene 3.5.0

Lucene 3.5.0 has a number of very important fixes and changes to both its core index management and userland APIs:

  • LUCENE-2205: Internally Lucene manages a dictionary of terms in its index which is heavily optimized for quick access. However, the dictionary can consume a lot of memory, especially when the index holds millions or billions of unique terms. LUCENE-2205 considerably reduces this memory consumption (3-5x) through a rewrite of the data structures and classes used to maintain and interact with the dictionary.
  • LUCENE-2215: Strangely enough, despite deep paging being a common use case, Lucene has never provided an easy and efficient API for it. Instead users have had to use the existing TopDocs driven API, which is very inefficient when used with large offsets, or have had to roll their own Collector. LUCENE-2215 addresses this limitation by adding searchAfter methods to IndexSearcher which efficiently find results that come ‘after’ a provided document in result sets.
  • LUCENE-3454: As discussed here, Lucene’s optimize index management operation has been renamed to forceMerge to clear up the common misunderstanding that the operation is vital. Some users had considered it so vital that they optimized after each document was added. Since 3.5.0 is a minor release, IndexWriter.optimize() has only been deprecated. However, it has been removed from Lucene’s trunk, so it is recommended that users move over to forceMerge where appropriate.
  • LUCENE-3445, LUCENE-3486: As part of the effort to provide userland classes with easy-to-use APIs for managing and interacting with Lucene indexes, LUCENE-3445 adds a SearcherManager which handles the boilerplate code so often written to manage IndexSearchers across threads and reopens of underlying IndexReaders. LUCENE-3486 goes one step further by adding a SearcherLifetimeManager which provides an easy-to-use API for ensuring that users use the same IndexSearcher as they ‘drill down’ or page through results. Interacting with a new IndexSearcher during paging can mean the order of results changes, resulting in a confusing user experience.
  • LUCENE-3426: When using NGrams (for the term “ABCD”, the NGrams could be “AB”, “BC”, “CD”) and PhraseQuerys, the Queries can be optimized by removing any redundant terms (the PhraseQuery “AB BC CD” can be reduced to “AB CD”). LUCENE-3426 provides a new NGramPhraseQuery which does such optimizations, where possible, on Query rewrite. The benefit is a 30-50% performance improvement in some cases, which is especially valuable for CJK users, where NGrams are prevalent.
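
To make the NGram reduction from LUCENE-3426 a little more concrete, here is a minimal sketch of the idea in plain Java. It is a simplified model and not the NGramPhraseQuery class itself: the class NGramPhraseSketch and its reduce method are names made up for this illustration, and it assumes the grams are consecutive, each shifted by one character. The kept terms retain their original positions so the phrase still lines up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the n-gram phrase reduction; not the Lucene class itself.
class NGramPhraseSketch {

    // A phrase term together with its position inside the phrase.
    static final class PositionedTerm {
        final String term;
        final int position;

        PositionedTerm(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    // Keeps every n-th gram plus the final gram. The grams in between are fully
    // covered by their kept neighbours, so matching them adds no information.
    static List<PositionedTerm> reduce(List<String> grams, int n) {
        List<PositionedTerm> kept = new ArrayList<>();
        int last = grams.size() - 1;
        for (int pos = 0; pos <= last; pos++) {
            if (pos % n == 0 || pos == last) {
                kept.add(new PositionedTerm(grams.get(pos), pos));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The bigrams of "ABCD" from the example above.
        System.out.println(reduce(Arrays.asList("AB", "BC", "CD"), 2)); // prints [AB@0, CD@2]
    }
}
```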

Lucene 3.5.0 of course contains many smaller changes and bug fixes.  See here for full information about the release.
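
Coming back to deep paging (LUCENE-2215) for a moment: the sketch below is a toy model, in plain Java, of why a ‘search after’ style of paging is cheaper than paging with large offsets. The names SearchAfterSketch, Hit and searchAfter are invented for this illustration and are not Lucene’s API, and the ordering used (highest score first, ties broken by lower document id) is an assumption. The point is only that the queue never needs to hold more than one page of hits, however deep the page is.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of "search after" paging; the names are illustrative, not Lucene's API.
class SearchAfterSketch {

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Assumed result order: highest score first, ties broken by lower doc id.
    static final Comparator<Hit> RESULT_ORDER = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
    };

    // Returns the next n hits that sort after 'after' (or the first n if 'after' is null).
    static List<Hit> searchAfter(Hit after, float[] scores, int n) {
        // The head of the queue is the worst hit kept so far.
        PriorityQueue<Hit> queue = new PriorityQueue<>(RESULT_ORDER.reversed());
        for (int doc = 0; doc < scores.length; doc++) {
            Hit hit = new Hit(doc, scores[doc]);
            if (after != null && RESULT_ORDER.compare(hit, after) <= 0) {
                continue; // already shown on an earlier page
            }
            queue.add(hit);
            if (queue.size() > n) {
                queue.poll(); // drop the worst hit, keeping at most n
            }
        }
        List<Hit> page = new ArrayList<>(queue);
        page.sort(RESULT_ORDER);
        return page;
    }
}
```

To fetch page two, pass the last hit of page one as the after argument. With an offset-based approach the queue would have to hold offset + n hits instead.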

Apache Solr 3.5.0

Benefiting considerably from Lucene 3.5.0, Solr 3.5.0 also contains a handful of useful changes:

  • SOLR-2066: Result grouping continues to be one of Solr’s most sought-after features, and SOLR-2066 extends its power and flexibility by adding distributed grouping support. Although it comes at a cost of 3 round trips to each shard, SOLR-2066 all but closes the book on what was once considered an extremely difficult feature to add to Solr and sets Solr apart from search system alternatives.
  • SOLR-1979: When creating a multi-lingual search system, it is often useful to be able to identify the language of a document as it comes into the system. SOLR-1979 adds out-of-the-box support for this to Solr by adding a langid Solr module containing a LanguageIdentifierUpdateProcessor which leverages Apache Tika’s language detection abilities. In addition to being able to identify which language a document is written in, the UpdateProcessor can map data into language-specific fields, a common way of supporting documents of different languages in a multi-lingual search system.
  • SOLR-2881: With the sorting of Documents with missing values in a field (known as sortMissingLast) improved in Lucene’s trunk and 3x branch, support for using sortMissingLast with Solr’s Trie fields has been added. Consequently it is now possible to control whether those Documents with no value in a Trie field appear first or last when sorted.
  • SOLR-2769: Solr users are now able to use Hunspell for Lucene through the HunspellStemFilterFactory. The factory allows the affix file and multiple dictionary files to be specified, allowing Solr users to use some of the over 100 Hunspell dictionaries used in projects like OpenOffice and Mozilla Firefox in their analysis chain. This is very useful for users who have to support rarely used languages.
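
The effect of the missing-value sorting described under SOLR-2881 can be illustrated with a few lines of plain Java. This is only a sketch of the behaviour, not Solr’s implementation: the class SortMissingSketch and its sort method are hypothetical, and a null entry stands in for a Document that has no value in the sorted field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrates the sorting behaviour only; this is not how Solr implements it.
class SortMissingSketch {

    // A null entry models a document without a value in the sort field.
    static List<Integer> sort(List<Integer> values, boolean missingLast) {
        Comparator<Integer> natural = Comparator.naturalOrder();
        Comparator<Integer> order = missingLast
                ? Comparator.nullsLast(natural)
                : Comparator.nullsFirst(natural);
        List<Integer> sorted = new ArrayList<>(values); // leave the input untouched
        sorted.sort(order);
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(30, null, 10);
        System.out.println(sort(prices, true));  // prints [10, 30, null]
        System.out.println(sort(prices, false)); // prints [null, 10, 30]
    }
}
```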

Solr 3.5.0 also contains many smaller fixes and changes.  See the CHANGES.txt for full information about the release.

Lucene & Solr 3.6.0?

With changes still being made to the 3x branch of both Lucene and Solr, and Lucene and Solr 4 not yet released, it is very likely that 3.6.0 will be released in a couple of months’ time.

Simon says: optimize is bad for you….

November 21st, 2011 by

In the upcoming Apache Lucene 3.5 release we deprecated an old and long-standing method on the IndexWriter, one that almost everyone who has ever used Lucene knows: IndexWriter#optimize(). I expect a lot of users to ask why we did this, which is one of the reasons I wrote this blog post. Let me go back a couple of steps and first explain what optimize did and, even more importantly, why previous versions of Lucene actually had this option.

Lucene writes segments?

One of the principles in Lucene since day one is the write-once policy: we never write a file twice. When you add a document via IndexWriter it gets indexed in memory, and once we have reached a certain threshold (max buffered documents or RAM buffer size) we write all the documents from main memory to disk; you can find out more about this here and here. Writing documents to disk produces an entirely new index called a segment. So when you index a bunch of documents, or run incremental indexing in production, you can see the number of segments changing frequently. However, once you call commit, Lucene flushes its entire RAM buffer into segments, syncs them and writes pointers to all segments belonging to this commit into the SEGMENTS file.

So far so good. Since Lucene never changes files, how does it update documents? The truth is it doesn’t. In Lucene an update is just an atomic add & delete, meaning that Lucene adds the updated document to the index and marks all previous versions as deleted. Alright, but how do we get rid of deleted documents then?
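
If it helps to see this in code, the following toy model walks through the write-once idea: documents are buffered, flushed into a brand new segment once a threshold is reached, and an update only marks older versions as deleted before adding the new one. Everything in it (WriteOnceIndexSketch, Segment, the in-memory maps) is invented for illustration and says nothing about how Lucene actually lays out its files.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of write-once segments. All names are illustrative;
// this is not how Lucene stores its index on disk.
class WriteOnceIndexSketch {

    static final class Segment {
        // id -> content, copied once at flush time and never modified afterwards
        final Map<String, String> docs;
        // ids marked as deleted; the documents themselves stay where they are
        final Set<String> deleted = new HashSet<>();

        Segment(Map<String, String> docs) {
            this.docs = new LinkedHashMap<>(docs);
        }
    }

    final List<Segment> segments = new ArrayList<>();
    private final Map<String, String> buffer = new LinkedHashMap<>();
    private final int maxBufferedDocs;

    WriteOnceIndexSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String id, String content) {
        buffer.put(id, content);
        if (buffer.size() >= maxBufferedDocs) {
            flush(); // threshold reached: write a new segment
        }
    }

    // An update marks every previous version as deleted, then adds the new one.
    void updateDocument(String id, String content) {
        buffer.remove(id);
        for (Segment segment : segments) {
            if (segment.docs.containsKey(id)) {
                segment.deleted.add(id);
            }
        }
        addDocument(id, content);
    }

    // Buffered documents always go into a brand new segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(new Segment(buffer));
            buffer.clear();
        }
    }

    // A commit flushes whatever is still buffered.
    void commit() {
        flush();
    }

    // Documents physically held in segments, including deleted ones.
    int storedDocs() {
        int count = 0;
        for (Segment segment : segments) {
            count += segment.docs.size();
        }
        return count;
    }

    // Documents that are still visible to searches.
    int liveDocs() {
        int count = 0;
        for (Segment segment : segments) {
            count += segment.docs.size() - segment.deleted.size();
        }
        return count;
    }
}
```

After two adds, one update and a commit with maxBufferedDocs set to 2, this model holds two segments with three stored documents, of which only two are live. The deleted one keeps taking up space until something cleans it up.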

Housekeeping & Per-Segment Search?

Obviously, Lucene needs to do some housekeeping here. What happens under the hood is that from time to time segments are merged into (usually) bigger segments to:

  • reduce the number of segments to be searched
  • expunge deleted documents (influences scoring due to their contribution to Document Frequency)
  • keep the number of file handles small (Lucene tends to have a fair few files open)
  • reduce disk space

All this happens in the background, controlled by a configurable MergePolicy. The MergePolicy takes care of keeping the number of segments balanced and merges them together when needed. I don’t want to go into details on merging here, as that is clearly way out of scope for this post; maybe I or someone else will come back to it another time. Yet there is another way of forcing merges to happen: you can call IndexWriter#optimize(), which merges all existing segments into one large segment.
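
As a rough illustration of what a merge achieves, here is a small self-contained sketch in which merging copies only the live documents of several segments into one new segment. The class MergeSketch and its Doc type are made up for this example; a real merge works on index files and is driven by the MergePolicy, none of which is modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of a segment merge; names are illustrative, not Lucene's API.
class MergeSketch {

    static final class Doc {
        final String id;
        final boolean deleted;

        Doc(String id, boolean deleted) {
            this.id = id;
            this.deleted = deleted;
        }
    }

    // Copies the live documents of all source segments into one new segment.
    // Deleted documents are simply not copied, which is how they get expunged.
    static List<Doc> merge(List<List<Doc>> segments) {
        List<Doc> merged = new ArrayList<>();
        for (List<Doc> segment : segments) {
            for (Doc doc : segment) {
                if (!doc.deleted) {
                    merged.add(doc);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Document "1" was updated: the old version in the first segment is marked deleted.
        List<Doc> first = Arrays.asList(new Doc("1", true), new Doc("2", false));
        List<Doc> second = Arrays.asList(new Doc("1", false));
        List<Doc> merged = merge(Arrays.asList(first, second));
        System.out.println(merged.size() + " live documents in one segment");
    }
}
```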

Optimize sounds like a very powerful operation, right? It certainly is powerful, but “if all you have is a hammer, everything looks like a nail”. Back in earlier versions of Lucene (before 2.9), Lucene treated the underlying index as one big index, and reopening the IndexReader invalidated all data structures & caches. This has changed quite a lot towards a per-segment orientation. Almost all structures in Lucene now work on a per-segment basis, which means that on reopen we only load what has changed instead of the entire index. As a user it still might look like one big index, but once you look a little under the hood you see everything works per-segment, like this IndexSearcher snippet:

 public void search(Weight weight, Filter filter, Collector collector)
      throws IOException {
    // iterate through all segment readers & execute the search
    for (int i = 0; i < subReaders.length; i++) {
      // pass the reader to the collector
      collector.setNextReader(subReaders[i], docBase + docStarts[i]);
      final Scorer scorer = ...;
      if (scorer != null) { // score documents on this segment
        scorer.score(collector);
      }
    }
  }
Figure 1. Executing searches across segments in IndexSearcher

Each search you execute in Lucene runs on each segment in the index sequentially, unless you have an optimized index. Well, this sounds like you should optimize all the time? Wrong! Think about it again: optimizing your index will build one large segment out of your maybe 5 or 10 or N segments, and this has several side-effects:

  • an enormous amount of IO when merging into one big segment
  • it can take hours when your index is large
  • reopen can cause unexpected memory peaks

You say this doesn’t sound that bad? Well, if you run this in production with a large index, optimizing can have a large impact on your system performance. Lucene’s bread and butter is the filesystem cache it uses for searching. During a merge you invalidate lots of disk cache, which is in turn not available for the segments currently being searched. Once you reopen the index, all data needs to be loaded into the disk cache again, field caches need to be created, term dictionaries loaded etc. Last but not least, you are likely doubling the disk space required to hold your index, as old segments are still referenced while optimize is running.

Death to optimize, here comes forceMerge(1)

As I mentioned above, IndexWriter#optimize() is deprecated in Lucene 3.5. If you still want to explicitly invoke an optimize-like merge, you can use IndexWriter#forceMerge(int), where you specify the maximum number of segments left after the merge finishes. The functionality is still there, but we hope that fewer people will feel like calling it together with each commit. If you use optimize extensively, think about it again and give your disks a break.
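
To give a feel for what the maximum-segments argument means, here is a deliberately naive sketch. Each number stands for the document count of one segment, and the two smallest segments are merged repeatedly until no more than the requested number remain. The class ForceMergeSketch and the smallest-first selection are assumptions made for this example; which segments Lucene actually merges is decided by the MergePolicy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Naive model of merging down to a maximum number of segments.
// The smallest-first selection is an assumption, not Lucene's MergePolicy.
class ForceMergeSketch {

    // Each entry is the document count of one segment.
    static List<Integer> forceMerge(List<Integer> segmentSizes, int maxNumSegments) {
        if (maxNumSegments < 1) {
            throw new IllegalArgumentException("maxNumSegments must be >= 1");
        }
        List<Integer> sizes = new ArrayList<>(segmentSizes);
        Collections.sort(sizes); // smallest segments first
        while (sizes.size() > maxNumSegments) {
            // Merge the two smallest segments into one.
            int merged = sizes.remove(0) + sizes.remove(0);
            // Re-insert the merged segment so the list stays sorted.
            int at = Collections.binarySearch(sizes, merged);
            sizes.add(at < 0 ? -at - 1 : at, merged);
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<Integer> segments = Arrays.asList(100, 5, 10, 40);
        System.out.println(forceMerge(segments, 2)); // prints [55, 100]
        System.out.println(forceMerge(segments, 1)); // prints [155]
    }
}
```

Merging down to a single segment is the equivalent of the old optimize. A larger value leaves the big segments alone and only folds the small ones together, which means far less IO.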