Trifork Blog

Category ‘Apache Lucene’

What’s so cool about elasticsearch?

September 25th, 2012 by

Whenever there’s a new product out there and you start using it or suggest it to customers or colleagues, you need to be prepared to answer this question: “Why should I use it?” Well, the answer could be as simple as “Because it’s cool!”, which of course is the case with elasticsearch, but at some point you may need to explain why. I recently had to answer the question “So what’s so cool about elasticsearch?”, which is why I thought it might be worthwhile sharing my own answer in this blog.


Elasticsearch beyond “Big Data” – running elasticsearch embedded

September 13th, 2012 by

Trifork has a long track record in doing projects, training and consulting around open source search technologies. Currently we are working on several interesting search projects using elasticsearch. Elasticsearch is an open source, distributed, RESTful search engine built on top of Apache Lucene. In contrast to, for instance, Apache Solr, elasticsearch is built as a highly scalable distributed system from the ground up, allowing you to shard and replicate multiple indices over a large number of nodes. This architecture makes scaling from one server to several hundred a breeze. But it turns out elasticsearch is not only good for what everyone calls “Big Data”; it is also very well suited for indexing only small numbers of documents and even for running embedded within an application, while still providing the flexibility to scale up later when needed.


Berlin Buzzwords 2012 Recap

June 7th, 2012 by

This is a recap of Berlin Buzzwords 2012, the two-day conference on everything scale, search and store in the NoSQL world. Martijn van Groningen and I arrive in Berlin on Sunday evening. Unfortunately we are too late for the infamous Barcamp, a low-key mix of lightning talks, beer and socializing, so we decide to have a beer at the hotel instead.

Day 1

So let’s fast forward to Monday morning. Berlin Buzzwords kicks off this year with a record of over 700 delegates from all over the world. The first keynote session is by Leslie Hawthorn on community building. Using gardening as a metaphor, she talks us through slides of gardening tools, plucking weeds and landscaping as activities for building a healthy community. Her advice: display all the ‘paths to success’ on the site, and nip unproductive mailing list discussions in the bud using a project ‘mission statement’. My mind wanders off as I think about how I can apply this to Apache Whirr. Then I think about the upcoming release that I still want to do testing for. “Community building is hard work”, she says. Point taken.

Storm and Hadoop

Ted Dunning gave an entertaining talk on closing the gap between the realtime features of Storm and the batch nature of Hadoop. He also talked about an approach called Bayesian Bandits for doing real-time A/B testing, casually performing a coin trick to explain the statistics behind it.

Socializing, coding & the geek-play area

During the rest of the day I visit lots of talks, but I also occasionally socialize and hang out with my peers. In the hallway I bump into old friends and I meet many new people from the open source community. At other times we wander outside to the geek-and-play area, watching people play table tennis and enjoying the inspiring and laid-back vibe of this pretty unique conference.

Occasionally I feel an urge to do some coding. “You should get a bigger screen”, a fellow Buzzworder says as he points at my laptop. I snap out of my coding trance and realize I’m hunched over my laptop, peering at the screen from up close. I’m like most people at the conference, multi-tasking between listening to a talk, intense coding, and sending out a #bbuzz Tweet. There are just so many inspiring things that grab your attention, I can hardly keep up with them.

Mahout at Berlin Buzzwords

Almost every developer from the Mahout development team hangs out at Buzzwords. As I give my talk, Robin Anil and Grant Ingersoll type up last-minute patches and close the final issues for the 0.7 Mahout release. On stage I discuss Whirr’s Mahout service, which automatically installs Mahout on a cluster. In hindsight the title didn’t fit that well, as I mostly talked about Whirr, not machine learning with Mahout. Furthermore, my talk finished way earlier than expected; bad Frank. Note to self: better planning next time.

Day 2


Day two features quite a few talks on ElasticSearch. In one session Shay Banon, ElasticSearch’s founder, talks about how his framework handles ‘Big Data’ and explains design principles such as shard overallocation, the performance penalties of splitting your index, and so on. In a different session Lukáš Vlček and Karel Minarik fire up an ElasticSearch cluster during their talk. The big screen updates continuously with stats on every node in the cluster. The crowd cracks up laughing as Clinton Gormley suddenly joins the live cluster using his phone and laptop.

New directions in Mahout

In the early afternoon Simon Willnauer announced that one of the talks had to be cancelled. Luckily Ted Dunning could fill the spot by giving another talk. This time the subject was the future of Mahout. He discussed the upcoming 0.7 release, which is largely a clean-up and refactoring effort. Additionally he discussed two future contributions: pig-vector and the streaming K-means clustering algorithm.

The idea of pig-vector is to create a glue layer between Pig and Mahout. Currently you run Mahout by first transforming your data, say a directory of text files, to a format that Mahout can read, and then you run the actual Mahout algorithms. The goal of pig-vector is to use Pig to shoehorn your data into a form Mahout can use. The benefit of using Pig is that it is designed to read data from a lot of sources, so Mahout can focus on the actual machine learning.

The upcoming streaming K-means clustering algorithm looks very promising. “No knobs, the system adapts to the data”, he says. Ted is referring to the problem that existing Mahout algorithms require a lot of parameter tweaking. On top of that, the algorithm is blazingly fast, using clever tricks like projection search. Even though the original K-means algorithm is easily parallelizable using MapReduce, the downside is that it is iterative and has to make several passes over the data, which is very inefficient for large datasets.

See you next year!

This wraps up my coverage of Berlin Buzzwords. There were of course many more talks, so I just covered my personal area of interest, which is largely Mahout. The only downside was the weather in Berlin this year, but other than that I really enjoyed the conference. Many thanks to everyone involved in organizing the conference, and roll on 2013!

On Schemas and Lucene

April 17th, 2012 by

One of the very first things users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define, amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.

To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists, or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, the optimal strategies for faceting and grouping different fields can be chosen based on field metadata retrieved from its schema.

Consequently, as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going to ask you a lot of questions and I hope you can clarify this issue for me.

So what is a schema anyway?

Before examining the role of a schema, it’s worthwhile first defining what a schema is. So to you, what is a schema? And what makes something schemaless?

According to Wikipedia, a schema in regards to a database is “a set of formulas called integrity constraints imposed on a database”. This of course can be seen in Solr. A Solr schema defines constraints on what fields a Document can contain and how the data for those fields must be analyzed. Lucene, being schemaless, doesn’t have those constraints. Nothing in Lucene constrains what fields can be indexed and a field could be analyzed in different ways in different Documents.

Yet there is something in this definition that troubles me. Must a schema constrain, or can it simply be informative? Or put another way, if I index a field that doesn’t exist in my schema, must I get an error? If a schema doesn’t constrain, is it even a schema at all?

Field Name Driven vs. Data Type Driven

Assuming we have a schema, whether it constrains or not, how should it be oriented? Should it follow the style of databases where you state per field name the definition of that field, or should it use datatypes instead where you configure for, say, numeric fields, their definition?

The advantage of being field name driven is that it gives you fine grained control over each field. Maybe field X is text but should be handled differently to another text field Y. If you only have a single text datatype then you wouldn’t be able to handle the fields differently. It also simplifies the interaction with the schema. Anything needing access to how a field should be handled can look up the information directly using the field’s name.

The disadvantage of the field name driven approach is that it is the biggest step away from the schemaless world. A definition must be provided for every field and that can be cumbersome for indexes containing hundreds of fields, when the schema must be defined upfront (see below) or when new fields need to be constantly defined.

The datatype driven approach is more of a middle ground. Yes the definition for each datatype must be defined, but it wouldn’t matter how many actual fields were indexed as long as they mapped to a datatype in the schema. At the same time this could increase the difficulty of using the schema. There wouldn’t be any list of field names stored in the schema. Instead users of the schema would need to infer the datatype of a field before they could access how the field should be handled. Note, work on adding something along these lines to Solr has begun in SOLR-3250.
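To make the contrast concrete, here is a minimal sketch of the two lookup styles in plain Java. Everything in it (SchemaLookupSketch, FieldDefinition, FieldNameSchema, DatatypeSchema and the analyzer names) is hypothetical and invented for illustration; none of it is an existing Lucene or Solr API, and a real datatype driven schema would need a far smarter way of inferring types than looking at the Java class of a value.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: none of these types exist in Lucene or Solr. */
final class SchemaLookupSketch {

    /** How a field should be handled; reduced to an analyzer name for brevity. */
    static final class FieldDefinition {
        final String analyzer;

        FieldDefinition(String analyzer) {
            this.analyzer = analyzer;
        }
    }

    /** Field name driven: one definition per field, looked up directly by name. */
    static final class FieldNameSchema {
        private final Map<String, FieldDefinition> byName = new HashMap<>();

        void define(String fieldName, FieldDefinition definition) {
            byName.put(fieldName, definition);
        }

        /** Returns null when the field was never defined. */
        FieldDefinition lookup(String fieldName) {
            return byName.get(fieldName);
        }
    }

    /** Datatype driven: one definition per datatype, inferred here from the value's class. */
    static final class DatatypeSchema {
        private final Map<Class<?>, FieldDefinition> byType = new HashMap<>();

        void define(Class<?> datatype, FieldDefinition definition) {
            byType.put(datatype, definition);
        }

        /** The field name plays no role; only the runtime type of the value matters. */
        FieldDefinition lookup(Object value) {
            return byType.get(value.getClass());
        }
    }

    public static void main(String[] args) {
        FieldNameSchema byName = new FieldNameSchema();
        // Two text fields that are handled differently.
        byName.define("title", new FieldDefinition("english"));
        byName.define("sku", new FieldDefinition("keyword"));
        System.out.println(byName.lookup("title").analyzer);
        System.out.println(byName.lookup("sku").analyzer);

        DatatypeSchema byType = new DatatypeSchema();
        // Any number of fields map onto the two configured datatypes.
        byType.define(String.class, new FieldDefinition("english"));
        byType.define(Integer.class, new FieldDefinition("numeric"));
        System.out.println(byType.lookup("some text").analyzer);
        System.out.println(byType.lookup(42).analyzer);
    }
}
```

Note how the field name driven variant can treat title and sku differently even though both are text, while the datatype driven variant never needs to hear about a new field as long as its values map onto a configured type.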

What do you think is best? Do you have other ideas how a schema could be structured?

Upfront vs. Incremental

Again assuming we have a schema, whether it be field name or datatype driven, should we expect the schema to be defined upfront before it’s used, or should it be able to be built incrementally over time?

The advantage of the schema being upfront is that it considerably reduces the complexity of the schema implementation. There is no need to support multi-threaded updates or incompatible changes. However it is also very inflexible, requiring all the fields ever to be used in the index to be known before any indexing begins.

An incrementally created schema is the opposite of course since you can start from a blank slate and add definitions when you know them. This means a schema can evolve along with an index. Yet as mentioned above, it can be more complex to implement. Issues of how to handle multiple threads updating the schema and incompatible changes to the schema arise. Furthermore, where with an upfront schema you could ensure that when a field is used it will have a definition in the schema, with an incremental schema it may be that a field is accidentally used before its definition is added. Should this result in an error?
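As a rough illustration of those two issues, the sketch below shows one way an incremental schema could cope with concurrent updates and incompatible changes using only the JDK. The class IncrementalSchemaSketch and its methods are hypothetical, field types are reduced to plain strings, and throwing an error for an undefined field is just one possible answer to the question above, not a recommendation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of an incrementally built schema; not a Lucene or Solr API. */
final class IncrementalSchemaSketch {

    private final ConcurrentMap<String, String> typeByField = new ConcurrentHashMap<>();

    /**
     * Adds a field definition, or accepts a repeat of the same definition.
     * putIfAbsent is atomic, so two threads racing to define the same field
     * cannot both succeed with different types.
     */
    void define(String field, String type) {
        String existing = typeByField.putIfAbsent(field, type);
        if (existing != null && !existing.equals(type)) {
            throw new IllegalStateException(
                "Incompatible change for field '" + field + "': " + existing + " -> " + type);
        }
    }

    /** One possible policy for a field that is used before it is defined: fail loudly. */
    String typeOf(String field) {
        String type = typeByField.get(field);
        if (type == null) {
            throw new IllegalArgumentException("Field '" + field + "' has no definition yet");
        }
        return type;
    }

    public static void main(String[] args) {
        IncrementalSchemaSketch schema = new IncrementalSchemaSketch();
        schema.define("price", "numeric");
        schema.define("price", "numeric"); // same definition again: accepted
        System.out.println(schema.typeOf("price"));
        try {
            schema.define("price", "text"); // incompatible redefinition: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A more lenient variant could return a default definition from typeOf instead of throwing, which would make the schema informative rather than constraining.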

It may seem as though Solr requires its schemas to be defined upfront. However in reality, Solr only requires a schema to be defined when it brings a core online, and prevents any changes while it’s online. When the core is taken offline, its schema can be edited. In SOLR-3251 ideas on how to add full incremental schema support to Solr are being discussed.

Storage: External vs. Index

No matter whether a schema is defined upfront or incrementally built, it will need to be stored somewhere. Solr stores its schema externally in its schema.xml file. This decouples the schema from the index itself, since the same schema could, in theory, be used for multiple indexes. Changes to an external schema do not necessarily have to impact an index (and vice versa), and an external schema doesn’t impact the loading of an index.

At the same time, the disconnect between an externally stored schema and an index means that they could fall out of sync. An index could be opened and new fields added without the schema being notified. Removing the definition of a field in the schema wouldn’t necessarily mean that field would be removed from the index.
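The drift described here boils down to two set differences. The sketch below assumes you can somehow obtain the field names known to the schema and the field names present in the index as plain sets of strings; how you get them is left out, and SchemaDriftSketch and the field names are made up for the example.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: reports fields on which a schema and an index disagree. */
final class SchemaDriftSketch {

    /** Returns the elements of a that are not in b, sorted for stable output. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up field names standing in for a real schema and a real index.
        Set<String> schemaFields = new TreeSet<>(Arrays.asList("id", "title", "author"));
        Set<String> indexFields = new TreeSet<>(Arrays.asList("id", "title", "body"));

        // Fields that were indexed without the schema being notified.
        System.out.println("In index, not in schema: " + difference(indexFields, schemaFields));
        // Fields whose definition exists but which the index does not contain.
        System.out.println("In schema, not in index: " + difference(schemaFields, indexFields));
    }
}
```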

One way to address this is to store the schema inside the index itself. Lucene already partially does this, having a very simple notion of FieldInfo. There has been considerable reluctance to increasing what’s stored in FieldInfo since it will slow down the incredibly efficient loading of indexes. How slow it would become would depend on how much data was stored in the schema. Yet this would ensure that the schema and the index were synchronized. Any changes to one would be represented in the other.

Given how controversial storing a schema in an index would be, do you think it’s worthwhile? Have you encountered synchronisation issues between your indexes and your schemas? Would you prefer control over where your schema is stored, allowing you to choose Cloud storage or maybe another database?

Does any of this even matter?

You might be thinking that I’ve totally wasted my time here and that actually there is no need for a schema module in Lucene. It could be argued that while having a schema is one of Solr’s major features, being schemaless is one of Lucene’s and that it should stay that way. Maybe it’s best left up to Lucene users to create their own custom schema solutions if they need one. What do you think? Do you have some sort of schema notion in your Lucene applications? If so, how does it work? If you use Solr, do you like how its schema works? If you could change anything, what would you change? I’d love to hear your thoughts.

There’s More Lucene in Solr than You Think!

April 11th, 2012 by

We’ve been providing Lucene & Solr consultancy and training services for quite a few years now, and it’s always interesting to see how these two technologies are perceived by different companies and their technical people. More precisely, I find it interesting how little Solr users know about Lucene and, more so, how unaware they are of how important it is to know about it.

A recurring pattern we notice is that companies looking for a cheap and good search solution hear about Solr and decide to download it and play around with it a bit. This is usually done within the context of a small PoC to eliminate initial investment risks. So one or two technical people are made responsible for that; they download the Solr distribution and start following the Solr tutorial that is published on the Solr website. They realize that it’s quite easy to get things up and running using the examples Solr ships with and very quickly decide that this is the right way to go. So what do they do next? They take their PoC codebase (including all Solr configurations) and slightly modify and extend it, just to support their real systems, and in no time they get to the point where Solr can index all the data and then serve search requests. And that’s it… they roll out with it, and very often just put this in production. It is then often the case that after a couple of weeks we get a phone call from them asking for help. And why is that?

Examples are what they are – Just examples

I have always argued that the examples bundled in the Solr distribution are a double-edged sword. On the one hand, they can be very useful to showcase how Solr can work, and they provide a good reference for the different setups it can have. On the other hand, they give a false sense of security: if the example configurations are good enough for the examples, they’ll be good enough for other systems in production as well. In reality, this is of course far from being the case. The examples are just what they are – examples. It’s most likely that they are far from anything you’d need to support your search requirements.

Take the Solr schema, for example. This is one of the most important configuration files in Solr, and it determines many of the factors that influence search quality. Sure, there are certain field types which you can probably always use (the primitive types), but when it comes to text fields and the text analysis process, this is something you need to look at more closely and in most cases customize to your needs. Beyond that, it’s also important to understand how different fields behave with respect to the different search functionality you need. What roles (if any) can a field play in the context of this functionality? For some functionality (e.g. free text search) you need the fields to be analyzed; for other functionality (e.g. faceting) you don’t. You need to have a very clear idea of the search functionality you want to support and, based on that, define which normal/dynamic/copy fields should be configured. The example configurations don’t give you this insight, as they target the dummy data and the example functionality they aim to showcase – not yours!

And it’s not just about the schema; the solrconfig.xml in the examples is also far more verbose than you actually need or want it to be. Far too many companies just use these example configurations in their production environment, and I find that a pity. Personally, I like to view these configuration files as also serving as a kind of documentation for your search solution – but when they are kept in a mess, full of useless information and redundant configuration, they obviously cannot.

It’s Lucene – not Solr

One of the greater misconceptions about Solr is that it’s a product on its own and that by reading the user manual (which is an overstatement for a semi-structured and messy collection of wiki pages), one can just set it up and put it in production. What people fail to realize is that Solr is essentially just a service wrapper around Lucene, and that the quality of the search solution you’re building largely depends on Lucene. Yeah, sure… Solr provides important additions on top of Lucene, like caching and a few enhanced query features (e.g. function queries and the dismax query parser), but the bottom line is that the most influential factors in search quality lie deep down in the schema definition, which essentially determines how Lucene will work under the hood. This obviously requires a proper understanding of Lucene… there’s just no way around it!

But honestly, I can’t really “blame” users for getting this wrong. If you look at the public (open and commercial) resources that companies are selling to users, they actually promote this ignorance by presenting Solr as a “stands on its own” product. Books, public trainings and open documentation all hardly discuss Lucene in detail and instead focus more on “how you get Solr to do X, Y, Z”. I find it quite a shame and actually quite misleading. You know what? I truly believe that users are smart enough to understand – on their own – what parameters they should send to Solr to enable faceting on a specific field… come on… these are just request parameters, so let them figure these things out. Instead, I find it much more informative and important to explain to them how faceting actually works under the hood. This way they understand the impact of their actions and configurations and are not left disoriented in the dark once things don’t work as they’d hoped. For this reason, we designed our Solr training to incorporate a relatively large portion of Lucene introduction. And take it from me… our feedback clearly indicates that users really appreciate it!


There you have it… let it sink in: when downloading Solr, you’re also downloading Lucene. When configuring Solr, you’re also configuring Lucene. And if there are issues with Solr, they are often related to Lucene as well. So to really know Solr, do yourself a favor and start getting to know Lucene! And you don’t need to be a Java developer for that; it’s not the code itself that you need to master. Understanding how Lucene works internally, on a detailed yet conceptual level, should be more than enough for most users.

Result grouping made easier

March 26th, 2012 by

Lucene has had result grouping for a while now, as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors. As a Lucene user you need to use several of these Collectors in your searches. However, these Collectors have many constructor arguments, so it can become quite cumbersome to use grouping in pure Lucene apps. The example below illustrates this.

Result grouping using the grouping collectors directly
  // Assumes indexSearcher (a name chosen here) is an open IndexSearcher, and that groupSort, docSort,
  // groupOffset, topNGroups, docOffset, docsPerGroup and searchTerm are defined by the caller.
  boolean fillFields = true;
  TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset+topNGroups);

  boolean cacheScores = true;
  double maxCacheRAMMB = 4.0;
  CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
  indexSearcher.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

  Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);

  if (topGroups == null) {
    // No groups matched
    return;
  }

  boolean getScores = true;
  boolean getMaxScores = true;
  TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort,
      docOffset+docsPerGroup, getScores, getMaxScores, fillFields);

  // Also count all matching groups during the second pass
  TermAllGroupsCollector allGroupsCollector = new TermAllGroupsCollector("author");
  Collector secondPass = MultiCollector.wrap(c2, allGroupsCollector);

  if (cachedCollector.isCached()) {
    // Cache fit within maxCacheRAMMB, so we can replay it:
    cachedCollector.replay(secondPass);
  } else {
    // Cache was too large; must re-execute query:
    indexSearcher.search(new TermQuery(new Term("content", searchTerm)), secondPass);
  }

  TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
  groupsResult = new TopGroups<BytesRef>(groupsResult, allGroupsCollector.getGroupCount());

  // Render groupsResult...

In the above example basic grouping with caching is used and also the group count is retrieved. As you can see there is quite a lot of coding involved. Recently a grouping convenience utility has been added to the Lucene grouping module to alleviate this problem. As the code example below illustrates, using the GroupingSearch utility is much easier than interacting with actual grouping collectors.

Normally the document count is returned as the hit count. However, when groups rather than documents are used as hits, the document count will not work for pagination. For this reason the group count can be used to get correct pagination. The group count is the number of unique groups matching the query. In this case it can be used as the hit count, since the individual hits are groups.
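The pagination argument is simple arithmetic, sketched here in plain Java without any Lucene dependency. GroupPaginationSketch, pageCount and the numbers are made up for illustration; in a real application the group count would come from the grouping result.

```java
/** Hypothetical sketch: why the group count, not the document count, should drive pagination. */
final class GroupPaginationSketch {

    /** Number of pages needed to show totalHits items with pageSize items per page. */
    static int pageCount(int totalHits, int pageSize) {
        if (totalHits < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize must be > 0");
        }
        return (totalHits + pageSize - 1) / pageSize; // integer ceiling division
    }

    public static void main(String[] args) {
        int matchingDocuments = 95; // made-up number of documents matching the query
        int matchingGroups = 12;    // made-up number of unique groups among those documents
        int groupsPerPage = 5;

        // Paging on the document count would promise far more pages than there are groups to show.
        System.out.println("Pages by document count: " + pageCount(matchingDocuments, groupsPerPage));
        System.out.println("Pages by group count:    " + pageCount(matchingGroups, groupsPerPage));
    }
}
```

With these numbers the document count suggests 19 pages, while there are only enough groups to fill 3.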

Result grouping using the GroupingSearch utility
GroupingSearch groupingSearch = new GroupingSearch("author");
groupingSearch.setCachingInMB(4.0, true);
groupingSearch.setAllGroups(true); // needed for totalGroupCount below
TermQuery query = new TermQuery(new Term("content", searchTerm));
// indexSearcher: name assumed here for an open IndexSearcher
TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
// Render result...
Integer totalGroupCount = result.totalGroupCount; // The group count if setAllGroups is set to true, otherwise this is null

The GroupingSearch utility has only been added to trunk, meaning that it will be released with Lucene 4.0. If you can’t wait, you can always use a nightly build or check out trunk yourself. It is important to keep in mind that the GroupingSearch utility uses the already existing grouping collectors to perform the actual grouping. The GroupingSearch utility has four constructors, one for each grouping type: grouping by indexed terms, function, doc values and doc block. The first one is used in the example above. The rest are described below.

Result grouping by function
FloatFieldSource field1 = new FloatFieldSource("field1");
FloatFieldSource field2 = new FloatFieldSource("field2");
SumFloatFunction sumFloatFunction = new SumFloatFunction(new ValueSource[]{field1, field2});
GroupingSearch groupingSearch = new GroupingSearch(sumFloatFunction, new HashMap<Object, Object>());
// indexSearcher: name assumed here for an open IndexSearcher
TopGroups<MutableValue> result = groupingSearch.search(indexSearcher, query, 0, 10);

Grouping by function uses the ValueSource abstraction from the Lucene queries module; consequently the grouping module depends on the queries module. In the above example grouping is performed on the sum of field1 and field2. The group type in the result is always MutableValue when grouping by a function.

Result grouping by doc values
boolean diskResident = true;
DocValues.Type docValuesType = DocValues.Type.BYTES_VAR_SORTED;
GroupingSearch groupingSearch = new GroupingSearch("author", docValuesType, diskResident);
// indexSearcher: name assumed here for an open IndexSearcher
TopGroups<BytesRef> result1 = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

// grouping by var int docvalues
docValuesType = DocValues.Type.VAR_INTS;
groupingSearch = new GroupingSearch("author", docValuesType, diskResident);
TopGroups<Long> result2 = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

Grouping by doc values requires you to specify a DocValues.Type up front and whether the doc values should be read disk resident. It is important that the DocValues.Type is the same as was used when indexing the data. A different DocValues.Type can lead to a different group type in the result, as you can see in the above code sample. DocValues.Type.VAR_INTS results in a Long type and DocValues.Type.BYTES_VAR_SORTED in a BytesRef type.

Result grouping by doc block
Filter lastDocInBlock = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupend", "x"))));
GroupingSearch groupingSearch = new GroupingSearch(lastDocInBlock);
Query query = new TermQuery(new Term("content", "random"));
// indexSearcher: name assumed here for an open IndexSearcher
TopGroups<?> result = groupingSearch.search(indexSearcher, query, 0, 10);
// Render result...

Grouping by doc block requires you to specify a Filter that marks the last document of each block. Obviously your data has to be indexed in blocks using the IndexWriter’s addDocuments(…) method.

The GroupingSearch utility class doesn’t cover all use cases yet. It only works locally, meaning it doesn’t help you with distributed grouping, and it lacks a few features like grouped facets. I think this utility class is a good start towards making result grouping a bit easier to use than it was before. Work on making result grouping easier to use for pure Lucene apps hasn’t finished, and features like distributed grouping will be made easier to use.

Have you used result grouping in your search solution either directly with Lucene or via Solr? Adding result grouping to your search solution on a large scale can be challenging! Let us know how you solved your requirements with result grouping by posting a comment.

Lucene Versions – Stable, Development, 3.x and 4.0

March 25th, 2012 by

With Solr and Lucene 3.6 soon becoming the last featureful 3.x release and the release of 4.0 slowly drawing near, I thought it might be useful just to recap what all the various versions mean to you the user and why two very different versions are soon going to be made available.

A Brief History of Time

Prior to Solr and Lucene 3.1 and the merger of the developments of both projects, both were developed using single paths. This meant that all development was done on trunk and all releases were made from trunk. Although a simple development model, there were two major problems:

  • In order to establish a stable codebase to create a release from, trunk would need to go into a code freeze for several weeks. For fast paced projects like Solr and Lucene, this only served to stall development and frustrate contributors and users alike.
  • With backwards compatibility between minor versions compulsory, it was near impossible for large backwards incompatible, but highly desirable, changes to be made. Any large scale changes generally got bogged down in attempts to provide a compatibility layer. One way round this was to make regular major version releases. However this meant major releases were often rushed out before they were ready in order to get some popular feature to users.

The end result was that releases were very infrequent, sometimes rushed, and generally a mess.

With the merger of the development of Solr and Lucene and a reassessment of the development model, it was decided that the projects should follow other open source projects and use a multi-path model consisting of stable and development versions. Trunk at that stage was branched to create the stable 3.x branch, and trunk was changed to be the development 4.x version.

This decision has had the following benefits:

  • Development of the 3.x branch could focus on stable backwards compatible changes and bug fixes.
  • Since 3.x is stable, releases could be made much more regularly and without having to stop the more wild development of trunk.
  • Development of trunk can be unimpeded by stable releases and doesn’t need to focus on maintaining backwards compatibility.

So what does this mean for you the user?

Solr and Lucene 3.6 will be the last major release from the stable 3.x branch. This means it will be the last major release backwards compatible with all previous 3.x versions. Any future 3.x release will be merely a bug fix release and given how successful the 3.x versions have been, I doubt there will be many if any of those. Sometime after the release of 3.6, trunk will be branched to create the new stable 4.x branch and a stable 4.0 release will be made. Trunk will then move onto being the next development version 5.x.

Consequently, if you’re using a 3.x version I recommend that you take the opportunity to upgrade to 3.6 when it is released, so that you can make the most of being able to just drop the libraries into your application. If however you’re either using a pre 3.x version and are looking to upgrade, or using a 3.x version but needing one or more of the amazing new features currently in trunk, I recommend that you hold on a little while longer until 4.0 becomes the new stable release.


The change in development model has meant that Solr and Lucene are able to make more regular and clean releases. With the end of the 3.x era drawing near, users will soon be able to get their hands on the arguably more powerful 4.x stable releases. In preparation for the (hopefully imminent) release of 4.0, I will be giving you an introduction to some of its new and exciting features.

If you’ve already begun upgrading to 4.x, share with us your experiences. Or if you’re thinking of upgrading and need some assistance, drop us a line and we’ll see how we can be of assistance.

Using your Lucene index as input to your Mahout job – Part I

March 5th, 2012 by

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program. This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and a MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.


When working with Mahout text clustering or classification you preprocess your data so it can be understood by Mahout. Mahout contains input tools such as seqdirectory and seqemailarchives for fetching data from different input sources and transforming them into text sequence files. The resulting sequence files are then fed into seq2sparse to create Mahout vectors. Finally you can run one of Mahout’s algorithms on these vectors to do text clustering.

The lucene2seq program

Recently a new input tool has been added, lucene2seq, which allows you to read from stored fields of a Lucene index to create text sequence files. This is different from the existing lucene.vector program, which reads term vectors from a Lucene index and transforms them into Mahout vectors straight away. When using the original text content you can take full advantage of Mahout’s collocation identification algorithm, which improves clustering results.

Let’s look at the lucene2seq program in more detail by running

$ bin/mahout lucene2seq --help

This will print out all the program’s options.

Job-Specific Options:                                                           
  --output (-o) output       The directory pathname for output.                 
  --dir (-d) dir             The Lucene directory                               
  --idField (-i) idField     The field in the index containing the id           
  --fields (-f) fields       The stored field(s) in the index containing text   
  --query (-q) query         (Optional) Lucene query. Defaults to               
  --maxHits (-n) maxHits     (Optional) Max hits. Defaults to 2147483647        
  --method (-xm) method      The execution method to use: sequential or         
                             mapreduce. Default is mapreduce                    
  --help (-h)                Print out help                                     
  --tempDir tempDir          Intermediate output directory                      
  --startPhase startPhase    First phase to run                                 
  --endPhase endPhase        Last phase to run

The required parameters are the Lucene directory path(s), the output path, the id field and the list of stored fields. The tool fetches all documents and, for each one, creates a key-value pair where the key equals the value of the id field and the value equals the concatenated values of the stored fields. The optional parameters are a Lucene query, a maximum number of hits and the execution method, sequential or MapReduce. The tool can be run like any other Mahout tool.
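To make the key-value mapping concrete, here is a minimal sketch in plain Java. It does not use the Mahout or Lucene APIs: a document is modelled as a simple map of stored field names to values, the class and method names are made up for illustration, and the single-space separator between concatenated fields is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
final class SequenceRecordSketch {

    /** Key of the record: the value of the id field. */
    static String key(Map<String, String> storedFields, String idField) {
        return storedFields.get(idField);
    }

    /** Value of the record: the requested stored fields, concatenated in the given order. */
    static String value(Map<String, String> storedFields, List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            String text = storedFields.get(field);
            if (text == null) {
                continue; // field is not stored for this document
            }
            if (sb.length() > 0) {
                sb.append(' '); // separator is an assumption
            }
            sb.append(text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("docid", "42");
        doc.put("title", "Java");
        doc.put("body", "Java is a programming language.");
        System.out.println(key(doc, "docid") + " -> " + value(doc, Arrays.asList("title", "body")));
    }
}
```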

Converting an index of Wikipedia articles to sequence files

To demonstrate lucene2seq we will convert an index of Wikipedia articles to sequence files. Check out the Lucene 3x branch, download a part of the Wikipedia articles dump and run a benchmark algorithm to create an index of the articles in the dump.

$ svn checkout lucene_3x
$ cd lucene_3x/lucene/contrib/benchmark
$ mkdir temp work
$ cd temp
$ wget
$ bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2

The next step is to run a benchmark ‘algorithm’ to index the Wikipedia dump. Contrib benchmark contains several of these algorithms in the conf directory. For this demo we only index a small part of the Wikipedia dump, so edit the conf/wikipediaOneRound.alg file so it points to enwiki-latest-pages-articles1.xml-p000000010p000010000. For an overview of the syntax of these benchmarking algorithms, check out the benchmark.byTask package-summary Javadocs.

Now it’s time to create the index

$ cd ..
$ ant run-task -Dtask.alg=conf/wikipediaOneRound.alg -Dtask.mem=2048M

The next step is to run lucene2seq on the generated index under work/index. Check out the lucene2seq branch from GitHub

$ git clone
$ git checkout lucene2seq
$ mvn clean install -DskipTests=true

Change back to the lucene 3x contrib/benchmark work dir and run

$ <path/to>/bin/mahout lucene2seq -d index -o wikipedia-seq -i docid -f title,body -q 'body:java' -xm sequential

This creates sequence files of all documents that contain the term ‘java’. From here you can run seq2sparse followed by a clustering algorithm to cluster the text contents of the articles.

Running the sequential version in Java

The lucene2seq program can also be run from Java. First create a LuceneStorageConfiguration bean and pass in the list of index paths, the sequence files output path, the id field and the list of stored fields in the constructor.

LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(configuration, asList(index), seqFilesOutputPath, "id", asList("title", "description"));

You can then optionally set a Lucene query and max hits via setters

luceneStorageConf.setQuery(new TermQuery(new Term("body", "Java")));

Now you can run the tool by calling the run method with the configuration as a parameter

LuceneIndexToSequenceFiles lucene2Seq = new LuceneIndexToSequenceFiles();
lucene2Seq.run(luceneStorageConf);


In this post I showed you how to use lucene2seq on an index of Wikipedia articles. I hope this tool will make it easier for you to start using Mahout text clustering. In a future blog post I will discuss how to run the MapReduce version on very large indexes. Feel free to post comments or feedback below.

Different ways to make auto suggestions with Solr

February 15th, 2012 by

Nowadays almost every website has a full text search box as well as an auto-suggestion feature to help users find what they are looking for by typing as few characters as possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typos. That’s a meaningful example which contains multi-term suggestions based on the most popular queries, combined with spelling correction.

Figure 1: The way Google makes auto complete suggestions: multi-term query suggestions and spelling correction

There are different ways to make auto complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is to compare the available options in order to identify the best solution tailored to your needs, rather than describe any one specific approach in depth.

It’s common practice to make auto-suggestions based on the indexed data. A user is usually looking for something that can be found within the index, which is why we’d like to show words that are similar to the current query and at the same time relevant within the index. On the other hand, it is also recommended to provide query suggestions; we can for example capture all the user queries which return more than zero results and index them in a dedicated Solr core, so that we can use that information to make auto-suggestions as well. What actually matters is that we are going to make suggestions based on what’s inside the index; for this purpose it’s not relevant whether the index contains user queries or “normal data”, since the solutions we are going to consider can be applied in both cases.

Some questions before starting

In order to make the right choice you should first of all ask yourself some questions:

  • Which Solr version are you working with? If you’re working with an old version (1.x for example) it is worth upgrading. If you can’t upgrade you’ll probably have fewer options to choose from, unless you’re willing to manually apply some patches.
  • Do you want to make single term or multiple term suggestions? You should basically decide if you want to suggest single words which can complete the word the user has partially written, or even complete sentences.
  • Do you want to filter the suggestions based on the actual search? The user could have previously selected a facet entry, filtering his results to a specific subset. Every search should match with that specific context, so it is common practice to have the auto-suggestions reflect the user filters. Unfortunately some of the solutions we have available don’t support any filter.
  • How do you want to sort the auto-suggestions? It’s important to show the best suggestion on top, and each solution you are going to explore has different sorting options.
  • Do you want to make auto-suggestions based on multivalued fields? Multivalued fields are for example commonly used for tags, since every document can have more than one tag and you may want to suggest a tag while the user is typing it.
  • Do you want to make auto-suggestions based on prefix queries or even infix queries? While it’s always possible to suggest words starting with a prefix, not all the solutions are able to suggest words that contain the actual query.
  • What’s the impact of each solution in terms of performance and index size? The answer depends on the index you’re working with and needs to take into account that some solutions can increase the index size, while all of them will affect performance.

Faceting using the prefix parameter

The first option we have is available since Solr 1.2 and is based on a special facet that includes only the results starting with a prefix, which the user has partially typed, making use of the facet.prefix parameter. This solution works only for single term suggestions starting with a particular prefix (not infix) and you can sort results only alphabetically or by count. It works even with multivalued fields, and it is possible to apply any filter queries so that the suggestions reflect the current context of the search.
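As a rough illustration of what a prefix facet computes, the sketch below counts, in plain Java, how many documents contain each field value starting with the typed prefix and returns the values sorted by count. It is not Solr code: the class name, method name and in-memory document model are made up for this example. In a real request the equivalent would typically be expressed with parameters such as facet=true, facet.field, facet.prefix, facet.limit and, for filtering, fq.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Solr.
final class PrefixFacetSketch {

    /**
     * Each document is a list of values of one (possibly multivalued) field.
     * Returns up to 'limit' values starting with 'prefix', ordered by
     * document count (highest first), ties in alphabetical order.
     */
    static List<String> suggest(List<List<String>> docs, String prefix, int limit) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> doc : docs) {
            // A document counts once per distinct value
            for (String value : new HashSet<>(doc)) {
                if (value.startsWith(prefix)) {
                    counts.merge(value, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Stable sort: entries with equal counts keep their alphabetical order
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (suggestions.size() == limit) {
                break;
            }
            suggestions.add(entry.getKey());
        }
        return suggestions;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("solr", "search"),
                Arrays.asList("solr", "spellcheck"),
                Arrays.asList("lucene"),
                Arrays.asList("sorting", "solr"));
        System.out.println(suggest(docs, "so", 10));
    }
}
```

Filtering on the current search context would correspond to narrowing the list of documents before counting.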

Use of NGrams as part of the analysis chain

The second solution is available from Solr 1.3 and relies on the use of NGramFilterFactory or EdgeNGramFilterFactory as part of the analysis chain. It means you’ll have a dedicated field that can be searched by typing word fragments, much as you would with wildcard queries. Every word in the index will be split into several NGrams; you can reduce the number of NGrams (and the size of the index) by increasing the minGramSize parameter or switching to the EdgeNGramFilterFactory, which works in only one direction, by default from the beginning edge of an input token. With NGramFilterFactory you can use infix and prefix queries, while with EdgeNGramFilterFactory only prefix queries are possible. This looks like a really flexible way to make auto-suggestions since it relies on a specific field with its own configurable analysis chain. You can easily filter your results and have them sorted by relevance, also using boosting and the eDisMax query parser. Furthermore, this solution is faster than the previous one. On the other hand, if we want to make auto-suggestions based on a field which contains many terms, we should consider that the index size will increase considerably, since for each term we index roughly (term length – minGramSize + 1) grams when using EdgeNGrams, assuming maxGramSize is at least the term length. This option works even with multivalued fields, but the index size would obviously increase even more.
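To get a feeling for how many grams end up in the index, the following plain Java sketch generates edge n-grams and full n-grams for a single term. It is only an approximation of what the two filters produce: the class and method names are made up for this example, and the order in which Lucene emits the grams may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Lucene filter implementation.
final class NGramSketch {

    /** Grams anchored at the start of the term, from minGramSize up to maxGramSize characters. */
    static List<String> edgeNGrams(String term, int minGramSize, int maxGramSize) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGramSize, term.length());
        for (int len = minGramSize; len <= longest; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    /** Grams starting at every position of the term, from minGramSize up to maxGramSize characters. */
    static List<String> nGrams(String term, int minGramSize, int maxGramSize) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGramSize, term.length());
        for (int len = minGramSize; len <= longest; len++) {
            for (int start = 0; start + len <= term.length(); start++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Edge n-grams only support prefix matching
        System.out.println(edgeNGrams("lucene", 2, 15));
        // Full n-grams also support infix matching, at the cost of more grams
        System.out.println(nGrams("lucene", 2, 15));
    }
}
```

For the six-letter term “lucene” with a minGramSize of 2, the edge variant yields 5 grams while the full variant yields 15, which shows why the choice of filter and minGramSize matters for index size.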

Use of the TermsComponent

One more solution, available from Solr 1.4, is based on the use of the TermsComponent, which provides access to the indexed terms in a field and the number of documents that match each term. This option is even faster than the previous one. You can make prefix queries using the terms.prefix parameter, or infix queries using the terms.regex parameter, which is available starting from Solr 3.1. Only single term suggestions are possible, and unfortunately you can’t apply any filter. Furthermore, user queries will not be analyzed in any way; you’ll have access to the raw indexed data, which means you could run into problems with whitespace or case-sensitive queries, since you’ll be searching directly through the indexed terms.
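The behaviour described above can be mimicked with a sorted map standing in for the term dictionary of a field. This is a minimal sketch in plain Java, not the TermsComponent itself: the class and method names are made up, and it assumes that a regular expression has to match the whole term, so an infix lookup is written as ".*fragment.*".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Solr.
final class TermDictionarySketch {

    /** Terms starting with the raw prefix; no analysis, so matching is case-sensitive. */
    static List<String> byPrefix(NavigableMap<String, Integer> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // Terms sharing a prefix are contiguous in a sorted dictionary
        for (String term : terms.tailMap(prefix, true).keySet()) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    /** Terms fully matching the regular expression; requires a scan over all terms. */
    static List<String> byRegex(NavigableMap<String, Integer> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String term : terms.keySet()) {
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // term -> number of documents containing it (made-up numbers)
        NavigableMap<String, Integer> terms = new TreeMap<>();
        terms.put("Solr", 1);
        terms.put("lucene", 5);
        terms.put("search", 4);
        terms.put("solr", 7);
        terms.put("sort", 2);
        System.out.println(byPrefix(terms, "so"));    // "Solr" is not matched: raw terms are case-sensitive
        System.out.println(byRegex(terms, ".*ol.*")); // infix lookup
    }
}
```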

Use of the Suggester

Due to the limitations of the above solutions, Solr developers have worked on a new component created exactly for this task. This option is the most recent and recommended one, available since Solr 3.1 and based on the SpellCheckComponent, the same component you can use for spelling correction. What’s new is a SolrSpellChecker implementation for making suggestions, called Suggester, which makes use of the Lucene suggest module. It all started with the SOLR-1316 issue, from which the Suggester was created. The collate functionality was then improved with the SOLR-2010 issue. Finally, the task was completed with LUCENE-3135, which backported the Lucene suggest module, used by the Solr Suggester class, to the 3.x branch. This solution has its own separate index, which you can automatically build on every commit. Using collation you can have multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes this solution even more flexible.

Let’s summarize

The following table contains pros and cons for each solution I mentioned above, from the slowest to the fastest one. Even though the last option is the most flexible, it requires more tuning. Of course, more power also means more responsibility, so if your requirements are just single term suggestions with filtering and you don’t have particular performance problems, the old-fashioned facet approach works perfectly out of the box.


Figure 2: Comparison table between the mentioned ways to make auto complete suggestions with Solr


This blog entry has hopefully shown you some ways in which you can use auto-suggestions with Solr, along with their pros and cons. I hope this will help you make the right choices from the beginning, tailored to your requirements. Please do share any additional considerations I may not have covered, as well as your experiences. We’re also intrigued to hear how you deal with the same problems in your search applications. Feel free to leave a comment or ask a question if anything is unclear!

Query time joining in Lucene

January 22nd, 2012 by

Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.

Let’s say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.


In a relational database it would look something like the image above.

Query time joining has been around in Solr for quite a while. It’s a really useful feature if you want to search with a relational flavor. Prior to the query time join your index needed to be prepared in a specific way in order to search across different types of data. You could either use Lucene’s index time block join or merge your domain objects into one Lucene document. However, with the join query you can store different entities as separate documents, which gives you more flexibility but comes with a runtime cost.

Query time joining in Lucene is pretty straightforward, and entirely encapsulated in JoinUtil.createJoinQuery. It requires the following arguments:

  1. fromField. The from field to join from.
  2. toField. The to field to join to.
  3. fromQuery. The query executed to collect the from terms. This is usually the user specified query.
  4. fromSearcher. The searcher on which the fromQuery is executed.
  5. multipleValuesPerDocument. Whether the fromField contains more than one value per document (multivalued field). If this option is set to false the from terms can be collected in a more efficient manner.

The static join method returns a query that can be executed on an IndexSearcher to retrieve all documents that have terms in the toField that match the collected from terms. Only the entry point for joining is exposed to the user; the actual implementation is completely hidden, allowing Lucene committers to change the implementation without breaking API backwards compatibility.

The query time joining is based on indexed terms and is currently implemented as a two-pass search. The first pass collects all the terms from a fromField (in our case the article identifier field) that match the fromQuery. The second pass returns all documents that have terms in a toField (in our case the article identifier field in a comment document) matching the terms collected in the first pass.
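The two passes can be sketched with plain in-memory collections. The code below is not how Lucene implements the join; it only mirrors the idea. Documents are modelled as maps, the fromQuery as a predicate, and all class names, method names and sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not the Lucene join implementation.
final class TwoPassJoinSketch {

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    /** First pass: collect the fromField terms of all documents matching the fromQuery. */
    static Set<String> collectFromTerms(List<Map<String, String>> fromDocs,
                                        Predicate<Map<String, String>> fromQuery,
                                        String fromField) {
        Set<String> terms = new HashSet<>();
        for (Map<String, String> d : fromDocs) {
            String term = d.get(fromField);
            if (term != null && fromQuery.test(d)) {
                terms.add(term);
            }
        }
        return terms;
    }

    /** Second pass: return the documents whose toField matches one of the collected terms. */
    static List<Map<String, String>> matchToDocs(List<Map<String, String>> toDocs,
                                                 String toField,
                                                 Set<String> fromTerms) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> d : toDocs) {
            if (fromTerms.contains(d.get(toField))) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> articles = Arrays.asList(
                doc("id", "1", "title", "query time joining"),
                doc("id", "2", "title", "byte norms"));
        List<Map<String, String>> comments = Arrays.asList(
                doc("id", "1", "article_id", "2"),
                doc("id", "2", "article_id", "1"));
        Set<String> terms = collectFromTerms(articles,
                a -> a.get("title").contains("byte") && a.get("title").contains("norms"), "id");
        System.out.println(matchToDocs(comments, "article_id", terms));
    }
}
```

Because the second pass only needs the collected terms, it can run against a completely different collection of documents than the first pass.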

The query that is returned from the static join method can also be executed on a different IndexSearcher than the IndexSearcher used as an argument in the static join method. This flexibility allows anyone to join data from different indexes, provided that the toField exists in that index. In our example this means the article and comment data can reside in two different indices. The article index might not change very often, but the comment index might. This allows you to fine tune each index to its specific needs.

Let’s see how one can use the query time joining! Assuming that we have indexed the content that is shown in the image above, we can now use the query time joining. Let’s search for the comments that have ‘byte norms’ as article title:

IndexSearcher articleSearcher = ...
IndexSearcher commentSearcher = ...
String fromField = "id";
boolean multipleValuesPerDocument = false;
String toField = "article_id";
// This query should yield the article with id 2 as result
BooleanQuery fromQuery = new BooleanQuery();
fromQuery.add(new TermQuery(new Term("title", "byte")), BooleanClause.Occur.MUST);
fromQuery.add(new TermQuery(new Term("title", "norms")), BooleanClause.Occur.MUST);
// The from terms are collected from the article index; the join query itself runs on the comment index
Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, articleSearcher);
TopDocs topDocs = commentSearcher.search(joinQuery, 10);

If you run the above code snippet, topDocs will contain one hit. This hit refers to the Lucene id of the comment which has value 1 in the field with name “id”. Instead of seeing the article as the result, you see the comment that belongs to the article that matches the user’s query.

You could also change the example and return all articles that match a certain comment query. In this example multipleValuesPerDocument is set to false and the fromField (the id field) only contains one value per document. However, the example would still work if the multipleValuesPerDocument variable were set to true, but it would then work in a less efficient manner.

The query time joining isn’t finished yet. There is still work to do and we encourage you to help with it!

  1. Query time joining that uses doc values instead of the terms in the index. During text analysis the original text is in many cases changed. It might happen that your id is omitted or modified before it is added to the index. As you might expect, this can result in unexpected behaviour during searching. A common workaround is to add an extra field to your index that isn’t analysed. However, this just adds a logical field that doesn’t actually add meaning to your index. With doc values you wouldn’t need an extra logical field, and the values aren’t analysed.
  2. More sophisticated caching. Currently not much caching happens. Documents that are frequently joined, because the fromTerm is hit often, aren’t cached at all.

Query time joining is quite straightforward to use and provides a solution for searching through relational data. As described, there are other ways of doing this. How did you solve your relational requirements in your Lucene based search solution? Let us know and share your experiences and approaches!