Nowadays almost every website has a full-text search box with an auto-suggestion feature that helps users find what they are looking for by typing as few characters as possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete […]
Query time joining in Lucene
Query time joining was recently added to the Lucene join module in the Lucene svn trunk. It will be included in the Lucene 4.0 release, and there is a possibility that it will also be included in Lucene 3.6. Let’s say we have articles and comments. With the query time join […]
Simon says: Single Byte Norms are Dead!
Apache Lucene turned 10 last year with a limitation that has bugged many, many users from day one. You may know that Lucene’s core scoring model is based on TF/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Besides the pure TF/IDF factors, Similarity also provides a norm value per document that is, […]
Berlin Buzzwords 2012
Yes, Berlin Buzzwords is back on the 4th & 5th June 2012! This really is the only conference for developers and users of open source software projects that focuses on the issues of scalable search, data analysis in the cloud and NoSQL databases. All the talks and presentations are specific to three tags: “search”, “store” and “scale”. Looking back […]
Apache Lucene & Solr 3.5.0
Just a little over two weeks ago Apache Lucene and Solr 3.5.0 were released. The released artifacts can be found here and here respectively. As part of the Lucene project’s effort to do regular releases, 3.5.0 is another solid release providing a handful of new features and bug fixes. The following is a review of the […]
Analysing European Languages With Lucene
It seems more and more often these days that search applications must support a large array of European languages. Part of supporting a language is analysing words to find their stem or root form. An example of stemming is the reduction of the words “run”, “running”, “runs” and “ran” to their stem “run”. In the […]
Compromise is hard
Whenever I talk about my job with friends who are also IT professionals, the most commonly desired aspect is that I get to work in a community where everybody has a voice. Apache Software Foundation projects like Solr and Lucene tend to work from the motto that if it didn’t happen on the mailing list, it […]
Simon says: optimize is bad for you….
In the upcoming Apache Lucene 3.5 release we deprecated an old and long-standing method on the IndexWriter, one that almost everyone who has ever used Lucene knows: IndexWriter#optimize(). I expect a lot of users to ask why we did this, and that is one of the reasons I wrote this blog post. Let me go back a […]
Apache Lucene FlexibleScoring with IndexDocValues
During Google Summer of Code 2011, David Nemeskey, a PhD student, proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework. Prior to this, and in all Lucene versions released so far, the Vector Space Model was tightly bound into Lucene. If you found yourself in a situation where another scoring model worked better for your […]
IndexDocValues – their applications
From a user’s perspective, Lucene’s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues’ values can be retrieved quickly and efficiently, as Simon Willnauer describes in his first IndexDocValues blog post. There are many applications that can benefit from using IndexDocValues for search functionality like flexible scoring, faceting, sorting, […]