Exploiting full IO and CPU concurrency when indexing with Apache Lucene
During the last year Apache Lucene has improved tremendously, with outstanding changes such as 100-times-faster FuzzyQueries, a new Term-Dictionary implementation, enhanced Segment-Merging and the famous Flexible-Indexing API. Recently I started working on another fundamental change referred to as DocumentsWriterPerThread, an extensive refactoring of IndexWriter: code that defines indexing performance, even though most Lucene users never touch it directly. Let me provide you with a brief introduction:
During indexing Lucene builds an in-memory index before it flushes the index structures to persistent storage. Internally, the indexer builds up several smaller index segments in memory and merges them together once a flush is needed. These smaller segments are built concurrently, provided the indexer is used by multiple threads in parallel. The IndexWriter uses a DocumentsWriter that selects a thread-private data structure for each incoming indexing thread, which in turn inverts the document into the in-memory segment. (see Fig. 1)
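The thread-affinity idea can be sketched in a few lines. This is a toy model, not Lucene's actual code; the class and method names (`ThreadBoundIndexer`, `addDocument`, `segmentCount`) are made up for illustration. The point is simply that each indexing thread is bound to its own private in-memory structure, so threads invert documents without contending on shared state:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of per-thread indexing state: each calling thread gets its own
// private in-memory "segment" (here just a list of documents).
public class ThreadBoundIndexer {

    // One private in-memory segment per indexing thread.
    private final Map<Thread, List<String>> perThreadSegments = new ConcurrentHashMap<>();

    public void addDocument(String doc) {
        // Select the calling thread's private structure, creating it on first
        // use; the list itself is only ever touched by its owning thread.
        perThreadSegments
            .computeIfAbsent(Thread.currentThread(), t -> new ArrayList<>())
            .add(doc);
    }

    public int segmentCount() {
        return perThreadSegments.size();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadBoundIndexer indexer = new ThreadBoundIndexer();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                for (int d = 0; d < 100; d++) {
                    indexer.addDocument("thread-" + id + "-doc-" + d);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // Four indexing threads yield four private in-memory segments.
        System.out.println(indexer.segmentCount()); // prints 4
    }
}
```

Because no two threads ever share a segment, document inversion needs no locking at all; synchronization is only needed when segments are handed off for flushing.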
This model allows full CPU concurrency until the IndexWriter must flush the in-memory segments to the Directory (a low-level abstraction on top of Java's file system API). Once Lucene starts flushing, all indexing threads must stop and wait until the flushing thread has finished writing the segment to disk. The implementation in Lucene 3.0 is essentially a stop-the-world model that prevents any indexing thread from making progress during a flush. Especially on slow IO systems, or when indexing tremendous amounts of data, this limitation can become a serious bottleneck.
The DocumentsWriterPerThread (DWPT) refactoring, currently developed in a branch, tries to remove this limitation and exploit full CPU and IO concurrency during indexing. Instead of merging in memory data structures during flush and writing a single segment, each DocumentsWriterPerThread writes its own private segments. This allows us to flush DWPT concurrently without preventing concurrent indexing threads from making progress.
The idea for this refactoring has been around for a while. It was initially brought up by Michael Busch (Twitter) in a Realtime-Search context and has been developed in a branch since June 2010. I recently worked together with Mike McCandless on adding one of the missing pieces needed to start benchmarking and eventually merge with trunk. Earlier this week I committed a new FlushPolicy that controls when a DocumentsWriterPerThread starts flushing its segment to disk. The global DocumentsWriter visits the policy on every change and tries to pull a flush-pending DWPT once the policy returns. If a DWPT must flush, the DocumentsWriter swaps the pending DWPT with a fresh one and starts flushing the segment.
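The swap-and-flush handoff can be illustrated with a small sketch. Again, this is not the branch's actual code; `SwapAndFlush`, `DWPT` and `indexAndMaybeFlush` are illustrative names, and the sketch ignores some races the real implementation must handle. It shows the core move: the pending DWPT is atomically replaced by a fresh one, so concurrent threads immediately index into the new DWPT while the current thread is hijacked to do the flush:

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of swapping a flush-pending DWPT for a fresh one, so that the
// flush never blocks other indexing threads. Simplified: the real code must
// also handle threads racing with the swap itself.
public class SwapAndFlush {

    // Stand-in for a DocumentsWriterPerThread holding buffered documents.
    static class DWPT {
        final StringBuilder buffered = new StringBuilder();
        void add(String doc) { buffered.append(doc).append('\n'); }
    }

    private final AtomicReference<DWPT> active = new AtomicReference<>(new DWPT());

    // Returns the number of buffered chars flushed, or -1 if no flush happened.
    long indexAndMaybeFlush(String doc, boolean mustFlush) {
        active.get().add(doc);
        if (mustFlush) {
            // Atomically swap in a fresh DWPT; other threads see it at once
            // and keep indexing while we flush the old one below.
            DWPT pending = active.getAndSet(new DWPT());
            return flush(pending); // this thread is hijacked to do the IO
        }
        return -1;
    }

    long flush(DWPT dwpt) {
        // In Lucene this would write a private segment to the Directory;
        // here we just report how much was buffered.
        return dwpt.buffered.length();
    }

    public static void main(String[] args) {
        SwapAndFlush writer = new SwapAndFlush();
        writer.indexAndMaybeFlush("aa", false);
        long flushed = writer.indexAndMaybeFlush("bb", true);
        System.out.println(flushed); // "aa\n" + "bb\n" = 6 chars flushed
    }
}
```

The key design choice is that flushing costs only the hijacked thread; every other thread continues against the freshly swapped-in DWPT.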
For the initial version of Lucene's default FlushPolicy we decided to mark only the largest DWPT as flush-pending once the globally active indexing memory exceeds the configured RAM buffer size, and to take a DWPT out of the active indexing memory as soon as it is marked pending.
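In pseudocode-like Java, the policy boils down to a few lines. This is an illustrative sketch of the rule described above, not the branch's actual FlushPolicy class; the names `LargestDwptFlushPolicy` and `onChange` are invented:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the default policy: once the summed RAM of all active DWPTs
// exceeds the configured buffer, mark the largest DWPT flush-pending and
// remove its bytes from the active pool.
public class LargestDwptFlushPolicy {

    static class DWPT {
        final String name;
        final long ramBytesUsed;
        boolean flushPending;
        DWPT(String name, long ramBytesUsed) {
            this.name = name;
            this.ramBytesUsed = ramBytesUsed;
        }
    }

    private final long ramBufferBytes;

    LargestDwptFlushPolicy(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    // Visited on every change; returns the DWPT to flush, or null if the
    // active memory is still within the configured budget.
    DWPT onChange(List<DWPT> dwpts) {
        // Pending DWPTs no longer count against the active indexing memory.
        long activeBytes = dwpts.stream()
            .filter(d -> !d.flushPending)
            .mapToLong(d -> d.ramBytesUsed)
            .sum();
        if (activeBytes <= ramBufferBytes) {
            return null;
        }
        // Mark the largest active DWPT as pending, taking it out of the pool.
        DWPT largest = dwpts.stream()
            .filter(d -> !d.flushPending)
            .max(Comparator.comparingLong(d -> d.ramBytesUsed))
            .orElse(null);
        if (largest != null) {
            largest.flushPending = true;
        }
        return largest;
    }

    public static void main(String[] args) {
        LargestDwptFlushPolicy policy = new LargestDwptFlushPolicy(100);
        List<DWPT> dwpts = Arrays.asList(
            new DWPT("a", 30), new DWPT("b", 50), new DWPT("c", 40));
        // 30 + 50 + 40 = 120 > 100, so the largest DWPT ("b") goes pending.
        System.out.println(policy.onChange(dwpts).name); // prints b
        // Remaining active memory is 70 <= 100, so no further flush.
        System.out.println(policy.onChange(dwpts)); // prints null
    }
}
```

Flushing only the largest DWPT frees the most memory per flush while leaving the other DWPTs in place to keep accepting documents.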
With this model we guarantee that there are always enough DWPTs available for indexing, while IO resources are utilized without blocking indexing throughput. This sounds very exciting in theory, but we hadn't yet seen how it performs at all, neither worse nor better, so it was time to give it a shot in an experiment.
Benchmarking Lucene Indexing
Among Lucene devs we use a tool hosted on Apache Extras called LuceneUtil to run all kinds of search-performance benchmarks to ensure that our patches don't have any negative impact on search performance. To benchmark the indexing process efficiently, I added some statistics like ingest rate (documents per second) and flush throughput to show how the DWPT refactoring behaves.
We traditionally use a Wikipedia English XML export which uncompresses to 21 GB of plain text. I pointed LuceneUtil to a clean Lucene trunk checkout as the baseline and used the Realtime branch as its competitor. I kicked off a 10M-document indexing run on a 2x 6-core Xeon box with 24GB RAM and a 500GB Hitachi HDD, which blasted those 10M documents into a Lucene index in 13 minutes and 40 seconds, not bad!
A snapshot of trunk's ingest rate between 50 and 200 seconds after start shows a nice peak performance of about 40k documents per second and confirms the theory that we make zero progress while flushing to disk. Flushing takes quite a bit of time and keeps your IO system busy.
Yet, I knew what to expect from trunk, so the exciting benchmark run was still left to do. As soon as I looked at the charts I knew that all the effort had paid off and we were actually faster, but the real question was how well we were doing in terms of resource utilization. The run finished in 6 minutes and 15 seconds - WOW, this is not even 50% of the time trunk needed!
These are amazing results: we have threads constantly adding documents to the index while flushes are running in the background. The peaks still seem to be a little lower than on trunk, which is most likely due to a thread flushing in the background, since we simply swap DWPTs once they must flush and hijack the indexing thread to do the flush.
Looking at the flush throughput (Fig. 5) shows that flushing concurrently pays off very well in terms of IO utilization. The DWPTs are constantly using IO, with only small overlaps, which seems to indicate that the disk cannot keep up with the flushes. This also explains the differences in documents per second in Figure 4: if flushes overlap, the hijacked indexing threads cannot make progress, so the ingest rate drops.
The overall results for DWPT seem amazing, and given that we haven't really started optimizing it yet, they promise even further improvements along those lines. That said, I'm curious how this change can improve indexing speed on Hadoop together with our new AppendingCodec, which allows writing to HDFS directly. Concurrent flushing should provide good improvements there too. Stay tuned!