Apache Lucene turned 10 last year with a limitation that has bugged many users from day one. You may know that Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Besides the pure TF/IDF factors, Similarity also provides a norm value per document that is, […]
Simon says: optimize is bad for you….
In the upcoming Apache Lucene 3.5 release we deprecated an old and long-standing method on IndexWriter. Almost everyone who has ever used Lucene knows IndexWriter#optimize(). I expect a lot of users to ask why we did this, which is one of the reasons I wrote this blog post. Let me go back a […]
Apache Lucene FlexibleScoring with IndexDocValues
During Google Summer of Code 2011, David Nemeskey, a PhD student, proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework. Prior to this, and in all Lucene versions released so far, the Vector Space Model was tightly bound into Lucene. If you found yourself in a situation where another scoring model worked better for your […]
Introducing Lucene Index Doc Values
From day one, Apache Lucene has provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case, the inverted index is used to retrieve and score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N […]
Lucene PMC Otis Gospodnetić at Berlin Buzzwords 2011
Some of you might have attended Berlin Buzzwords 2011, yet again an awesome conference for people interested in topics around Search, Store and Scale. Besides awesome talks, we also had some volunteer students who interviewed some of the speakers. We have published these interviews along with the videos to give them the visibility they deserve. So […]
Apache Lucene in Google Summer of Code – The Apache Way
In 2011 Google invited open source projects around the globe for its 7th Google Summer of Code: “Google Summer of Code is a global program that offers student developers stipends to write code for various open source software projects. We have worked with several open source, free software, and technology-related groups to identify and fund several […]
Lucene indexing gains concurrency
Imagine you are a kindergarten teacher and a whole bunch of kids are playing with Lego. Suddenly it’s almost 4pm and the big mess needs to be cleaned up, so you ask each kid to pick up one Lego brick and put it in your hands. They all run around, bringing bricks to you one […]
Gimme all resources you have – I can use them!
Exploiting full IO and CPU concurrency when indexing with Apache Lucene. During the last year, Apache Lucene has been improved enormously, with outstanding changes such as 100 times faster FuzzyQueries, a new Term-Dictionary implementation, enhanced Segment-Merging and the famous Flexible-Indexing API. Recently I started working on another fundamental change referred to as DocumentsWriterPerThread, an […]