Imagine you are a Kindergarten teacher and a whole bunch of kids are playing with lego. Suddenly it's almost 4pm and the big mess needs to be cleaned up, so you ask each kid to pick up one lego brick and put it in your hands. They all run around, bringing bricks to you one by one concurrently. Once your hands are full, you turn around, sort them by color and carefully put them in the right colored piles.
While you are sorting the bricks, kids are still arriving and they must waitfor you to turn back around to accept more bricks. You turn around, and theprocess repeats, until all the bricks are piled up behind you merge themtogether into one pile and put them in to the box where they belong- Closingtime!
This is roughly how Lucene indexing worked until yesterday: several threadscoming in carrying documents that are indexed in a in-memory segment and oncewe reach a certain threshold (maxBufferedDocs or ramMaxBufferSizeMB) we mergethose in memory segments together and write them to disk. Eventually those'flushed' segments are merged together in the background.
Back to the Kindergarten... your algorithm doesn't really scale up if you suddenly have twice as many kids to pick up bricks since most of them will wait for you sorting the bricks and putting them on piles . What could you do to make it more efficient not just from an educational point of view?
Instead of letting kids wait for you to sort, you could tell each kid to pile the bricks up in their hands and once they have enough bricks collected they can sort them themselves and put them into a yet smaller pile next to you on the ground. While kids arrive dropping their piles on the ground you pick those piles and make bigger ones out of them while your kids go off collecting more lego. Voila! - this procedure scales up very well especially if your co-worker helps you merging the small piles into bigger ones and your kids can leave on time!
Now this is how Lucene indexing works as of yesterday. Instead of writing several in-memory segments and merge them together on flush Lucene now uses a DocumentsWriterPerThread that can write and flush its own private segment to persistent storage. For the interested reader I explained this in a more technical manner in my previous post on this blog.
Unfortunately this was one part of the game which worked well for the task of cleaning up all lego bricks or indexing a bunch of documents. The problem here is that you could end up with duplicate documents in the index or multiple bricks of the same size and color. Lucene like other index engines offers an update procedure where you have to specify some kind of identifier so that IndexWriter knows which documents need updating.
Lockless concurrent updates with DocumentsWriterPerThread
Let me quickly explain how updates work in Apache Lucene.... Internally, an update is an atomic operation that simply adds a document and deletes all previously added documents with the same key (delete term) . This guarantees that updating a document either in memory or already flushed and/or committed to disk at the current point in time, see the corresponding delete. To make that work, Lucene maintains all delete terms and their total order with respect to the corresponding add. OK let's go back to the Kindergarten and see how that worked until yesterday...
Next morning in the Kindergarten all kids are back and keen to play some games. Some of the little ones already pulled out all the lego bricks and spread them all over the room. For your game you need to have a set of distinct lego bricks one of each color and shape. You ask the kids to go off and collect all the lego bricks and put them in your hands, again like the first example above.
Each time a kid arrives you check if such a brick already exists in your hand and if so you drop all existing ones. Hence, you need to make sure that while you are checking the existing bricks no other kid puts a brick in your hand until you are done. Once you are done checking you keep a list of bricks that must be removed from the piles behind you when you put your current pile down to the ground. Eventually, you merge your piles together and end up with a set of distinct lego bricks in your box. Done, now you can start playing the game and the floor also looks pretty cleaned up already - Nice side-effect!
Let's apply this for our new, more efficient procedure. Now each kid collects the bricks in their hands and checks for duplicates themselves. But hey each kid must also take care of the bricks in all the other kids hands as well as the bricks that already have been piled up on the ground. Phew, that already seems like a bottleneck! We somehow need to guarantee that all kids collecting bricks communicate that new picked up brick to all other kids in the room atomically so that no other kid misses the picked up brick. This is well known as the critical section in this scenario.
Applied to Lucene's update mechanism this means that once a DocumentsWriterPerThread updates a document it needs to tell all other DocumentsWriterPerThread the delete term for this update and it needs to make sure that no other DocumentsWriterPerThread enters the critical section until it is done. The simplest solution would be to stop all threads, push the delete term to all other writers and continue but that is not ideal in an concurrent environment, right? So let's see how we solve this in the Kindergarten.
Since each kid only needs to check which bricks have been collected by others when she picks up a new one we only need to maintain all bricks added for each child. To make this little easier for us we tell every child to maintain that information themselves. Well since kids can easily lose track of this information we need to make it easy for them to remember so we tell them to draw the shape and the color of each brick they pick up on a little paper and put it in a row on the ground one after another, and remember where they put the piece of paper.
Once they pick up a new brick they put down a new piece of paper and check every drawing on the ground from the one they put down the last time until the one they just put down and remove all bricks from their hands that match the drawings they looked at. Once this is done they only need to remember their last drawing again. Essentially each kid can constantly collect bricks without blocking others. In a concurrent environment constant throughput is what you want, right?
In Lucene we use a single linked list of delete terms for this purpose, where each DocumentsWriterPerThread only maintains its private slice. Once we update a document the writer updates its slice of deletes by atomically picking the tail of the linked list and assigning it to the tail of the delete slice. Once the slice is updated it iterates through the deletes slice by following the next pointer of each element in the slice and applies the delete until the end of the slice is reached.
Figure 1 shows two DocumentsWriterPerThread (DWPT) where DWPT A already updated a document with delete term A. DWPT B is currently indexing a document with delete term B. Both have initialized their delete slices and see delete term A. Once DWPT B finishes its document and updates its delete slice the head points to delete term A while the tail has moved to delete term B. (Figure 2) DWPT B can now safely iterate through the delete slice until it reaches the tail even if the delete term B's next pointer has been advanced by another thread.
The neat part about this is that this DeleteQueue is entirely composed of CAS operations which will neither block any thread when updating the private delete slice nor when updating the tail of the DeleteQueue borrowed from Michael-Scott nonblocking queue algorithm (here is a nice article explaining the insert operation).
Kindergarten again: now we only have to take care of cleaning up all the small papers on the ground so we ask our co-worker to help us and pick all drawings that are not needed anymore. In Java we have a garbage collector that will free all unused memory for us we only need to make sure that we don't reference unneeded object anymore. The good new here is that for our update mechanism in Lucene we are only interested in the tail of the queue an not in the head since each DocumentsWriterPerThread maintains its own head. The actual DeleteQueue doesn't need to hold a reference to the head, which allows us to simply rely on the garbage collection to clean up any unnecessary delete queue items.
DocumentsWriterPerThread is a big step towards Lucene 4.0 and opens up lots of possibilities for future improvements. One of them will be Realtime-Search with Apache Lucene as Michael Busch describes here and here. We'll all work hard to fold those into Lucene one day!
Mike McCandless recently added nightly performance benchmarks for Apache Lucene to catch performance regression early. You can see the nice jump yesterday when we committed the DocumentsWriterPerThread refactoring to trunk. 186 GB plain text / hour is nice!