Trifork Blog

Result grouping / Field Collapsing with Solr

October 20th, 2009

In a number of search projects that I have done using Lucene and Solr, the index contained a lot of almost identical data. From a user's perspective, the first result pages were full of documents that looked very similar: for instance, searching for a specific car brand returned a full page of the same car model, where only the edition differed. What is actually desired is to show only the different models. Only when a user is interested in a certain model should the user see all the editions of that model, by clicking on the result. We simply want to group our search results based on some criteria. Although this is not supported out-of-the-box by Lucene/Solr, it is possible using a patch that I've created and contributed to Solr. This blog entry explains what result grouping (also known as field collapsing) is and how you can start using it in your own projects.


Result grouping allows you to group results by a predefined field (e.g. a model field). Only the most relevant documents per distinct value of that field are kept in the result. The specified sort determines the relevance of each document. By default Solr sorts on score, but the sort can also use a field value or a computed value such as distance. In the Solr community result grouping is better known as field collapsing.

Assume we are searching for books and run one search with field collapsing and one without. The image compares the two results.
[Image: fieldcollapse]
As illustrated in the image, documents with similar values are removed from the result; only the most relevant documents are kept in the result set.

Field collapsing can to some extent be compared with the SQL GROUP BY statement. Although you cannot yet use functions like sum() or avg() to gather statistics, it does remove the less relevant documents and keeps a count of how many documents were removed per distinct field value. In the most recent version of the patch it is possible to collect the field values of the collapsed documents. This allows you to execute your own function on the collapsed documents.
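To make the idea concrete, here is a minimal sketch in plain Java of what collapsing does to an already ranked result list: it keeps the first hit per field value and counts how many hits were removed. The class and method names (CollapseSketch, Hit, groupHeads, collapseCounts) are made up for illustration and are not taken from the patch, which works on Lucene document ids rather than on in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: mimics the effect of collapsing, not the patch's implementation. */
final class CollapseSketch {

    /** A search hit: a document id plus the value of the field we collapse on. */
    static final class Hit {
        final String id;
        final String fieldValue;

        Hit(String id, String fieldValue) {
            this.id = id;
            this.fieldValue = fieldValue;
        }
    }

    /** Keeps the first (most relevant) hit per field value; the input must already be sorted. */
    static List<Hit> groupHeads(List<Hit> ranked) {
        Map<String, Hit> heads = new LinkedHashMap<>();
        for (Hit hit : ranked) {
            heads.putIfAbsent(hit.fieldValue, hit);
        }
        return new ArrayList<>(heads.values());
    }

    /** Counts per field value how many hits were removed (group size minus the head). */
    static Map<String, Integer> collapseCounts(List<Hit> ranked) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Hit hit : ranked) {
            Integer seen = counts.get(hit.fieldValue);
            // The first hit of a group is the head (count 0); each later hit is collapsed.
            counts.put(hit.fieldValue, seen == null ? 0 : seen + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Hit> ranked = Arrays.asList(
                new Hit("1", "Tolkien"), new Hit("2", "Tolkien"), new Hit("3", "Rowling"),
                new Hit("4", "Tolkien"), new Hit("5", "Rowling"));
        for (Hit head : groupHeads(ranked)) {
            System.out.println(head.id + " " + head.fieldValue);
        }
        System.out.println(collapseCounts(ranked)); // {Tolkien=2, Rowling=1}
    }
}
```

Of the five hits in this example only documents 1 and 3 remain, much like a GROUP BY that keeps one row per group together with a count.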

Setting up field collapsing

Unfortunately Solr does not support field collapsing out-of-the-box yet. The functionality is still under development, but it can already be used, and many people have done so successfully. If you browse to the Jira issue SOLR-236 you can see the current status of the field collapsing functionality. Download the latest patch, apply it to the latest Solr Subversion trunk and you are good to go.

Configuring field collapsing

Field collapsing is currently implemented in Solr as a SearchComponent and thus must be configured in the solrconfig.xml. The following line adds the field collapse component to Solr:

<searchComponent name="query" 
          class="org.apache.solr.handler.component.CollapseComponent" />

The QueryComponent is by default configured implicitly under the name query. Registering the CollapseComponent under the name query makes sure that the request handlers automatically use the CollapseComponent instead of the default QueryComponent.

It is also important to know upfront on which field you want to collapse. It is not possible to collapse on all types of fields. Currently, if you collapse on a field that is tokenized or multivalued, an exception is thrown and the search is aborted.

I usually create dedicated field collapse fields in my schema.xml with a collapse_ prefix. I think that this is a good practice, and it emphasizes what that particular field is used for. You can use any type of field you want (as long as it is not tokenized and not multivalued); the non-analyzed field types like StrField and IntField are good candidates.

Group your results

Now that you have configured field collapsing you can actually group your search results. To enable field collapsing you need to specify the collapse.field parameter in your request to Solr. Assume we want to group results by author, using a dedicated field named 'collapse_author'. This would result in the following url:
http://localhost:8080/solr/select?q=*:*&collapse.field=collapse_author
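If you build this request from code rather than typing it into a browser, the parameter values should be URL-encoded. The following is a small sketch using only the Java standard library; the class and method names (CollapseRequestSketch, buildUrl) are hypothetical, and the host and port are simply the ones from the example url above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: assembles a select url with field collapsing enabled. */
final class CollapseRequestSketch {

    /** Appends the parameters to the base url, URL-encoding every name and value. */
    static String buildUrl(String baseUrl, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(URLEncoder.encode(param.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(param.getValue(), "UTF-8"));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        params.put("collapse.field", "collapse_author");
        // prints http://localhost:8080/solr/select?q=*%3A*&collapse.field=collapse_author
        System.out.println(buildUrl("http://localhost:8080/solr/select", params));
    }
}
```

The only visible difference from the hand-written url is that the colon in the query is encoded as %3A.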

The request returns a search result similar to the following (this particular example was collapsed on a collapse_city field):

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">117</int>
 .....
<lst name="collapse_counts">
 <str name="field">collapse_city</str>
 <lst name="doc">
  <int name="190810">48</int>
  <int name="192224">9</int>
  ...
 </lst>
 <lst name="count">
  <int name="Amsterdam">48</int>
  <int name="Rotterdam">9</int>
  ... 	
 </lst>
</lst>
<result name="response" numFound="26" start="0" maxScore="1.9735361">
 <doc>
  <str name="city">Amsterdam</str>
  <str name="id">190810</str>
  ...
 </doc>
 ...
</result>
</response>

There are two differences between this response and a response without field-collapsing:

  1. A list named collapse_counts is added to the response. It contains the collapse counts per field value and per document identifier. The document identifiers in collapse_counts refer to the documents in the normal response.
  2. The response contains only the most relevant documents per group, also known as the group heads. The term 'group' here means all documents with the same field value.

The collapse_counts list contains two other lists: the doc list and the count list. Both contain the collapse counts for the search result. The doc list associates the collapse counts with the result set by using the identifiers of the group heads as pointers, whereas the count list uses the field values. It is important to know that both lists refer only to documents or field values in the current result page and not to documents beyond it.
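As a rough sketch of how a client could read these two lists, the following Java snippet parses a trimmed copy of the response above with the JDK's built-in DOM parser. The class name CollapseCountsSketch, the readCounts helper and the SAMPLE constant are made up for this example; a real client would normally let SolrJ do the parsing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Illustration only: extracts the doc and count lists from a collapse_counts response. */
final class CollapseCountsSketch {

    /** Trimmed-down version of the example response shown above. */
    static final String SAMPLE =
            "<response><lst name=\"collapse_counts\">"
            + "<str name=\"field\">collapse_city</str>"
            + "<lst name=\"doc\"><int name=\"190810\">48</int><int name=\"192224\">9</int></lst>"
            + "<lst name=\"count\"><int name=\"Amsterdam\">48</int><int name=\"Rotterdam\">9</int></lst>"
            + "</lst></response>";

    /** Reads the int entries of the sub-list ("doc" or "count") inside collapse_counts. */
    static Map<String, Integer> readCounts(String xml, String listName) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Integer> counts = new LinkedHashMap<>();
        NodeList lists = dom.getElementsByTagName("lst");
        for (int i = 0; i < lists.getLength(); i++) {
            Element lst = (Element) lists.item(i);
            Node parent = lst.getParentNode();
            boolean insideCollapseCounts = parent instanceof Element
                    && "collapse_counts".equals(((Element) parent).getAttribute("name"));
            if (!insideCollapseCounts || !listName.equals(lst.getAttribute("name"))) {
                continue;
            }
            NodeList ints = lst.getElementsByTagName("int");
            for (int j = 0; j < ints.getLength(); j++) {
                Element entry = (Element) ints.item(j);
                counts.put(entry.getAttribute("name"),
                        Integer.parseInt(entry.getTextContent().trim()));
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readCounts(SAMPLE, "doc"));   // {190810=48, 192224=9}
        System.out.println(readCounts(SAMPLE, "count")); // {Amsterdam=48, Rotterdam=9}
    }
}
```

The doc list is keyed by the id of each group head and the count list by the field value, so both carry the same counts for the current page.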

Besides the collapse.field parameter, there are more parameters that you can specify to tweak the groups in your result. They are described on the Field Collapsing page on the Solr wiki.

Collapsing algorithms

There are two distinct ways of collapsing your search results:

  1. Adjacent field collapsing. As the word adjacent implies, this algorithm only collapses documents with the same field value that appear next to each other in the non collapsed result set.
  2. Non adjacent field collapsing, also known as normal field collapsing. This algorithm collapses as described in the beginning of this blog entry and is the default collapsing algorithm.

The type of field collapsing can be controlled with the collapse.type parameter. When the value adjacent is specified the adjacent algorithm kicks in and when the value normal is specified the normal algorithm kicks in.
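The difference between the two algorithms is easiest to see on a small ranked list of field values. The sketch below is plain Java written for this explanation, not code from the patch; the names CollapseTypeSketch, normal and adjacent are made up, and it assumes that only one document is kept per group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

/** Illustration only: contrasts normal and adjacent collapsing on ranked field values. */
final class CollapseTypeSketch {

    /** Normal (non adjacent): keep the first document per field value anywhere in the list. */
    static List<String> normal(List<String> rankedFieldValues) {
        return new ArrayList<>(new LinkedHashSet<>(rankedFieldValues));
    }

    /** Adjacent: drop a document only when the document directly before it has the same value. */
    static List<String> adjacent(List<String> rankedFieldValues) {
        List<String> kept = new ArrayList<>();
        String previous = null;
        boolean first = true;
        for (String value : rankedFieldValues) {
            if (first || !Objects.equals(value, previous)) {
                kept.add(value);
            }
            previous = value;
            first = false;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("A", "A", "B", "A", "B", "B");
        System.out.println(normal(ranked));   // [A, B]
        System.out.println(adjacent(ranked)); // [A, B, A, B]
    }
}
```

With adjacent collapsing the same field value can show up more than once in the result, as long as another value sits in between.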

Including collapsed results

On some occasions it is handy to know specific field values of the collapsed documents. In the most recent versions of the field collapse patch it is possible to include collapsed results. This can be achieved by using the collapse.includeCollapsedDocs.fl parameter. The patch expects a comma separated list of field names to include, or a star (*) that instructs field collapsing to include all fields.

When the search has completed, a collapsed documents result similar to the following is returned:

<lst name="collapsedDocs">
   <result name="Amsterdam" numFound="48" start="0">
      <doc>
         <str name="id">191178</str>
         ...
      </doc>
      ...
   </result>
   <result name="Rotterdam" numFound="9" start="0">
      ...
   </result>
</lst>

The collapsedDocs list is part of the collapse_counts response and, as you can see, the collapsed documents are grouped under each distinct field value.

Using SolrJ

If you are using SolrJ to integrate with your Solr instance, you can use the field collapse methods that the patch adds.
On the SolrQuery class I have added two methods:

  1. enableFieldCollapsing(String) which accepts a field name as argument.
  2. includeCollapsedDocuments(String...) which accepts zero or more field names. When no field names are given all fields are returned, otherwise only the specified field names are returned.

On the QueryResponse class one method is added:

  1. getFieldCollapseResponse() which returns the FieldCollapseResponse. This object contains all the field collapse information.

The FieldCollapseResponse has four getter methods:

  1. getCollapseField() returns the name of the field that was used for collapsing.
  2. getFieldValueCollapseCounts() returns a list of FieldValueCollapseCount, that contains a field value with a collapse count.
  3. getDocumentIdCollapseCounts() returns a list of DocumentIdCollapseCount, that contains a document id with a collapse count.
  4. getCollapsedDocuments() returns a map with field value as key and a SolrDocumentList with the collapsed documents as value.

These methods can ease development when using field collapsing while integrating with a front-end system.

Field collapsing and facets

Field collapsing in combination with facets can be confusing at first, because faceting can be performed on either the 'collapsed' or the 'non collapsed' result set. The facet counts on the 'collapsed' result set are usually lower than the facet counts on the 'non collapsed' result set. Which one you want is up to you, because you can influence this behavior. The parameter collapse.facet determines on which result set faceting is performed. This parameter can have the value facet.before to facet on the non collapsed result set or facet.after to facet on the collapsed result set. The default behavior is to facet on the collapsed result set. From the field collapse perspective, the performance of faceting on the collapsed or the non collapsed result set is the same.
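The effect on the facet counts can be illustrated with a few lines of plain Java. This is a sketch for explanation only and not how the patch computes facets; the names FacetCollapseSketch, Doc, facetCounts and groupHeads, and the car data, are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: facet counts before and after collapsing. */
final class FacetCollapseSketch {

    /** A document with the value we collapse on and the value we facet on. */
    static final class Doc {
        final String collapseValue;
        final String facetValue;

        Doc(String collapseValue, String facetValue) {
            this.collapseValue = collapseValue;
            this.facetValue = facetValue;
        }
    }

    /** Counts how many documents carry each facet value. */
    static Map<String, Integer> facetCounts(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            counts.merge(doc.facetValue, 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps the first document per collapse value (one document per group). */
    static List<Doc> groupHeads(List<Doc> ranked) {
        Map<String, Doc> heads = new LinkedHashMap<>();
        for (Doc doc : ranked) {
            heads.putIfAbsent(doc.collapseValue, doc);
        }
        return new ArrayList<>(heads.values());
    }

    public static void main(String[] args) {
        List<Doc> ranked = Arrays.asList(
                new Doc("Golf", "red"), new Doc("Golf", "blue"),
                new Doc("Golf", "red"), new Doc("Polo", "red"));
        System.out.println(facetCounts(ranked));             // non collapsed: {blue=1, red=3}
        System.out.println(facetCounts(groupHeads(ranked))); // collapsed: {red=2}
    }
}
```

In the collapsed case the blue edition disappears from the facets entirely, because it was not a group head.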

Field collapsing and performance

Unfortunately field collapsing does influence the search time in a negative way. When doing a search with field collapsing enabled the search time can be 5 to 10 times slower than doing a search without field collapsing enabled. There are more things that can make your search time even worse:

  • Using Adjacent collapsing as collapse type. Adjacent collapsing can be an order of magnitude slower than non adjacent field collapsing. I have seen cases where performance dropped by more than nine times compared to normal field collapsing.
  • Using a collapse threshold higher than 1 in combination with normal collapsing. This has to do with the way the normal collapsing algorithm processes the documents that may be kept in the result. For a collapse threshold higher than 1 in combination with adjacent collapsing the performance will not worsen.
  • Including collapsed documents in the response. How much this feature increases the search time depends on how many documents are being collapsed and how many are being returned in the response. The latter decreases performance the most, because the returned documents have to be read from the index and be sent over the wire. If for example, 8000 documents were collapsed for a specific field value, you can imagine how enormous the increase in response time will be.

JTeam's involvement in SOLR-236

As already mentioned, you can find the field collapsing patch in Solr JIRA (SOLR-236). This patch has been around for quite some time now, but due to an increasing demand from our clients, in the last year we at JTeam put a lot of effort into improving it and making it production ready. Some of the enhancements we have made recently include:

  • Performance improvement with the normal field collapse algorithm.
  • Performance improvement when faceting on the non collapsed result set.
  • The ability to include documents that have been collapsed.
  • Improved code quality by adding unit and integration tests, and a redesign of the solution that resulted in cleaner and thus more maintainable code.
  • Extended the SolrJ API to allow easy integration when using field collapsing.

As always, we're committed to continue working with the community and contributing to this issue as much as we can. We find this feature extremely handy and we're definitely not alone, as the demand for it is extremely high (it so happens to be the most voted for feature in Solr's JIRA).

30 Responses

  1. November 18, 2009 at 23:54 by Paul Oakes

    Hi Martijn;

    Thanks for your work on creating this patch. Since you are very familiar with the evolution of this patch I was wondering if you can offer some advice on how to convince my deployment engineer to accept the latest patch for SOLR-236 for production use?

    Thanks,

    Paul

  2. November 19, 2009 at 00:03 by Paul Oakes

    One more question; will this work you've put towards the field-collapse-5.patch be incorporated into the 1.5 version? Thanks!

  3. November 24, 2009 at 23:33 by Peter

    This looks nice!
    How is this function compared to faceting and clustering?

  4. November 29, 2009 at 23:17 by Martijn van Groningen

    Sorry guys for the late response.

    @Paul
    Your first question is easy. The current patch is more stable than ever before :-) Also the latest patch has some nice caching that can improve your search time.

    Yes this patch will be added to Solr 1.5. SOLR-236 is the most voted issue in Jira and this issue has been around for about 2,5 years, so I guess that is something that ensures it will be added to next release.

    @Peter
    Well, field-collapsing reduces the results presented to the end user. It creates groups and only shows the most relevant document per group in the search results. So a search result of 100+ pages can become 1 or 2 pages (if used properly). For example, duplicates are removed, which makes the result more presentable to the end user, or you can display only one company per search and not all its branches. This is something you cannot do with faceting or clustering.

  5. November 29, 2009 at 23:39 by Mark

    Martijn,

    I notice that Field collapsing collapses the documents that "do not" have the collapse.field present into one of the collapse groups... I need to stop this behavior from happening and just leave those documents in the normal resultset... Any tips on how I can alter the patch to support this? I assume I am looking for a comparator or something that is used during the collapse calculation.

    thanks,

  6. December 1, 2009 at 13:11 by Martijn van Groningen

    Mark,

    Currently this is not really flexible. What you can do now is change the following for normal collapsing around line 88 in NonAdjacentDocumentCollapser:

    int currentId = i.nextDoc(); // existing line
    String currentValue = values.lookup[values.order[currentId]]; // existing line
    if (currentValue == null) { // add this if block to the patch
        continue;
    }
    

    The if block will ensure that documents that do not have the collapse field are not collapsed.

    In the next version of the patch I will add the notion of a CollapseCriteria. The CollapseCriteria will decide if a field value is equal to the field value of the collapse group. The above if block should then be added to the CollapseCriteria.

    Hope this helps

  7. December 4, 2009 at 15:28 by Cristian

    Hey Martijn,

    I need to know if there is a way to sort the results by the number of collapsed documents. I am using the collapsible patch with the Solr deduplication and I want to sort them by numbers of duplicates in a descending order (the ones that have more duplicates first).
    Please let me know if you have any idea of how I could do this.

    Thanks

  8. December 8, 2009 at 09:07 by Martijn van Groningen

    Hi Cristian,

    That is an interesting requirement. One way to do this is to implement a new FieldComparator that holds for each uncollapsed document id a count of how many times it was collapsed. The FieldComparator can then sort based on that information.

    You need to add this field comparator via a SortField to the Sort of the request before executing DocListAndSet results = searcher.getDocListAndSet(...) in the CollapseComponent.doProcess method.

    Martijn

  9. [...] for one of our current customers. You can find out more about field collapsing from Martijn’s blog [...]

  10. March 16, 2010 at 05:27 by Jonathan Rochkind

    The "CollapseCriteria" idea seems really interesting, potentially enabling all sorts of flexible behavior. Is that feature still in the cards?

  11. March 26, 2010 at 20:43 by Jonathan Hendler

    Thanks for this post, and especially the patch. Just wrote a quick write up mentioning the feature.

    http://supercalafragilisticexpialadocio.us/field-collapsing-in-solr

  12. May 18, 2010 at 23:01 by Cruz Fernandez

    Martijn,

    I am looking at your response for Christian (ordering collapsed results).

    From what I can see in the solr-lucene code, the ordering of the results depends on FieldComparator, but comparators are only implemented for the fields stored in the document inside Lucene.

    Is the collapse component just calling:

    DocListAndSet results = searcher.getDocListAndSet(…)

    for the ordering?

    Thanks in advance.

  13. [...] Introduction [...]

  14. June 10, 2010 at 02:32 by Neville Atkinson

    Hi,

    I'm having an issue similar to one mentioned by a couple of people on the Apache Jira page for Solr-236, namely that in order to support paging of results I need the numFound value in the response to reflect the number of total matching documents, not just the number of documents in the response. You gave a workaround of commenting out lines 99 and 106-110 of NonAdjacentDocumentCollapser, but if I do that NonAdjacentDocumentCollapser clearly won't compile:

    099 NonAdjacentCollapseGroup collapseDoc = collapsedDocs.get(currentValue);
    100 if (collapseDoc == null) {
    101 // new collapsing value => create a new record for it
    102 collapseDoc = new NonAdjacentCollapseGroup(0, 0, documentComparator, collapseThreshold, currentValue);
    103 collapsedDocs.put(currentValue, collapseDoc);
    104 collapsedGroupPriority.add(collapseDoc);
    105
    106 if (collapsedGroupPriority.size() > maxNumberOfGroups) {
    107 NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first();
    108 collapsedDocs.remove(inferiorGroup.fieldValue);
    109 collapsedGroupPriority.remove(inferiorGroup);
    110 }
    111 }

    Could you please post the lines which should be removed. I'm using the latest version of the patch (2010-05-16 12:05 PM) and revision 950613 of Solr.

    Thanks.

  15. June 10, 2010 at 11:09 by Martijn van Groningen

    Cruz,

    The getDocListAndSet(...) is used for retrieving the document ids to be displayed. Ordering for these documents is done inside this method.

    When this method is called the collapse component has already created the collapse groups. The collapsed docset is given to this method as argument. The ordering of the documents inside the collapse groups is done in the DocumentCollapsers (by default NonAdjacentDocumentCollapser).

    Hope this answers your question.

  16. June 10, 2010 at 11:15 by Martijn van Groningen

    Neville,

    Line 99 in your example was the problem.

    Remove the following blocks:

    collapsedGroupPriority.add(collapseDoc);
    

    and:

    if (collapsedGroupPriority.size() > maxNumberOfGroups) {
       NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first();
       collapsedDocs.remove(inferiorGroup.fieldValue);
       collapsedGroupPriority.remove(inferiorGroup);
    }
    

    You should not have compilation errors when you have removed these blocks.

  17. June 10, 2010 at 22:23 by Neville Atkinson

    Thanks Martijn, that worked perfectly.

  18. June 24, 2010 at 23:06 by Hannes Korte

    Hi Martijn,

    thanks for the interesting article. While testing the field collapsing feature I noticed, that there is a typo in the sentence "This can be achieved by using the collapse.includeCollapseDocs.fl parameter".

    The name of the parameter is "collapse.includeCollapsedDocs.fl", there is a "d" missing. Just to let you know.

    Best regards,
    Hannes

  19. June 25, 2010 at 08:47 by Martijn van Groningen

    Thanks for noticing. I have fixed the typo.

  20. October 12, 2010 at 17:50 by David

    Hi

    I followed the instructions in http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/comment-page-1/#comment-1206 and now I only receive results that can be collapsed, instead of those that can be collapsed and all of the other results

    Any ideas?

  21. October 13, 2010 at 12:03 by Martijn van Groningen

    Hi David,

    I'm not sure what you mean with:
    now I only receive results that can be collapsed, instead of those that can be collapsed and all of the other results

    The change you're referring to has to do with not collapsing documents when the collapse field is null. Field collapsing will keep documents out of the results when their value for the field specified in collapse.field has already appeared more than collapse.max times.

    Martijn

  22. October 13, 2010 at 12:41 by David

    Hey Martijn,

    Say I have 100 documents. 50 of those documents have the book id field and can be collapsed on it into, say, 10 groups of 5 documents. The rest don't have the field at all, so the value will be null. Before the patch mentioned above, all of those documents would be collapsed under a null value group, and the total number of records displayed in the result set would be 10 (the first document per book id) plus 1 (the head of all the other documents, collapsed on the null field value)

    50

    after the patch I only get the first 10 book id documents and none of the rest.

    I need a way to group all those documents that match the grouping and leave all of the single documents alone.

    Hope this makes sense

    Regards

    Dave

  23. October 13, 2010 at 12:44 by David

    Ok that kind of stripped my code will try again with just the code

    <lst name="first_null_doc_found_id">
    <int name="collapseCount">50</int>
    <null name="fieldValue"/>
    </lst>

  24. October 13, 2010 at 12:53 by Martijn van Groningen

    Hi David,

    I think the following piece of code will fix your problem.

    int currentId = i.nextDoc(); // existing line
    String currentValue = values.lookup[values.order[currentId]]; // existing line
    if (currentValue == null) { // add this if block to the patch
        addDoc(currentId);
        continue;
    }
    

    With the piece of code mentioned here: http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/comment-page-1/#comment-1206 documents that don't have the field specified in collapse.field are left out of the result. The code change in this comment will include them.

    I hope this helps.

    Martijn

  25. November 17, 2010 at 21:11 by bill m

    Thanks for the patch.

    We're interested in using this functionality with the latest production build, i.e. patching against the 1.4.1 tag.

    Sorry if this sounds like a stupid question (we're not lucene-solr hackers): could you provide some patching instructions?

    Is it enough to use SOLR-236-1_4_1.patch? Or do we need to apply additional .patches on the page (the jira issue page has a few dozen patches).

    Additionally, SOLR-236 has "child" issues. Do we need patches for these also?

    thanks

    bill

  26. November 18, 2010 at 10:43 by Martijn van Groningen

    To get field collapsing to work only the SOLR-236-1_4_1.patch is necessary. The child issues are related to the work on the trunk (4.0).

    Keep in mind that this patch does not perform well on large indexes (1M+ documents). If you need field collapsing on large indexes you can try the trunk, if that fits your requirements. On how many documents are you planning to do field collapsing?

    Cheers,

    Martijn

  27. December 6, 2010 at 15:31 by Thalaiselvam

    Hi,
    I am using Solr 1.4 with .NET and would like to implement "Field.Collapse", so I have added the line

    in Config.xml. But while restart the tomcat, it didn't start it throws the error...

    'org.apache.solr.handler.component.CollapseComponent' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413) at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525) at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:833) at org.apache.solr.core.SolrCore.(SolrCore.java:551) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422) at org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3838) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4488) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526) at

    if required any patches...

    Kindly advoice..

    Thalaiselvam N

  28. December 7, 2010 at 12:07 by Thalaiselvam

    Iam getting error while enable the line "searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent"" in SolrConfig.xml

    error line is

    ‘org.apache.solr.handler.component.CollapseComponent’ at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413) at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525) at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:833) at org.apache.solr.core.SolrCore.(SolrCore.java:551) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422) at org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3838) at org.apache.catalina.core.StandardContext.start(StandardContext.java:4488) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526) at

    if required any patches…

    Kindly advoice..

    Thalaiselvam N

  29. [...] original post can be found at http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ [...]

  30. May 13, 2011 at 12:07 by Karanveer Singh

    Hi,

    After adding the required things in solrconfig.xml, I tried to run the example configuration using java -jar start.jar, and it gave a ClassNotFound error. It can't seem to find the CollapseComponent class

    Is there something else that needs to be added in the solrconfig.xml file?

    Thanks.
