Trifork Blog

Improved field collapse response

November 11th, 2009

In the most recent contribution to field collapsing I have improved the response format. The old format was not properly structured, the names of its elements were not self-explanatory, and in some situations the response was even flawed. In my opinion a better response format was necessary to improve the stability of the patch and to make the response easier to parse.

Let's take a look at the original response format.

<lst name="collapse_counts">
   <str name="field">city</str>
   <lst name="doc">
      <int name="233238">1</int>
      <int name="234338">1</int>
   </lst>
   <lst name="count">
      <int name="Amsterdam">1</int>
      <int name="Rotterdam">1</int>
   </lst>
   <lst name="collapsedDocs">
       <result name="Amsterdam" numFound="1" start="0">
          <doc>
             <str name="id">213133</str>
             <str name="city">Amsterdam</str>
          </doc>
       </result>
       <result name="Rotterdam" numFound="1" start="0">
          <doc>
             <str name="id">213123</str>
             <str name="city">Rotterdam</str>
          </doc>
       </result>
    </lst>
    <lst name="aggregatedResults">
       <lst name="sum(stock)">
           <str name="Amsterdam">10</str>
               ...
       </lst>
       <lst name="min(price)">
           <str name="Amsterdam">5.99</str>
            ...
       </lst>
    </lst>
</lst>

As you can see there were a number of things that needed improvement:

  1. The structure itself. Let's say you want to retrieve the collapse information for one document in the response. In the old response format you have to look up that information in several different places in the collapse response.
  2. The names of some elements, like doc and count, are somewhat vague.
  3. The count list uses the field value as its unique key, but with adjacent collapsing the field value alone is no longer unique. This can lead to parsing errors, because two documents in the response can have the same field value while sitting at different positions in the search result.
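
To make the third point concrete, here is a minimal Java sketch. It assumes a client that reads the old count list into a map keyed by field value; the CollapseKeyDemo class and its Group record are made up for this illustration and are not part of Solr or the patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: these types are hypothetical and not part of Solr or the patch.
class CollapseKeyDemo {

    // A collapse group: id of the document that stayed in the result,
    // its collapse field value, and the number of documents collapsed under it.
    record Group(String docId, String fieldValue, int collapseCount) {}

    // Old format: counts keyed by field value. A repeated value overwrites the earlier entry.
    static Map<String, Integer> countsByFieldValue(List<Group> groups) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Group group : groups) {
            counts.put(group.fieldValue(), group.collapseCount());
        }
        return counts;
    }

    // New format: counts keyed by document id, which is unique per collapse group.
    static Map<String, Integer> countsByDocId(List<Group> groups) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Group group : groups) {
            counts.put(group.docId(), group.collapseCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Adjacent collapsing: Amsterdam appears twice, separated by Rotterdam.
        List<Group> groups = List.of(
                new Group("101", "Amsterdam", 2),
                new Group("104", "Rotterdam", 1),
                new Group("106", "Amsterdam", 3));

        System.out.println(countsByFieldValue(groups)); // the first Amsterdam count is lost
        System.out.println(countsByDocId(groups));      // all three groups survive
    }
}
```

Keyed by field value, the two Amsterdam groups collide and only one count remains. Keyed by document id, every group keeps its own entry.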

Now take a look at the new response format:

<lst name="collapse_counts">
    <str name="field">city</str>
    <lst name="results">
        <lst name="233238">
            <str name="fieldValue">Amsterdam</str>
            <int name="collapseCount">1</int>
            <result name="collapsedDocs" numFound="1" start="0">
                <doc>
                     <str name="id">213133</str>
                     <str name="city">Amsterdam</str>
                </doc>
            </result>
            <lst name="aggregate">
                 <str name="sum(stock)">10</str>
                 <str name="min(price)">5.99</str>
            </lst>
        </lst>
        <lst name="234338">
            <str name="fieldValue">Rotterdam</str>
            <int name="collapseCount">1</int>
            <result name="collapsedDocs" numFound="1" start="0">
                <doc>
                    <str name="id">213123</str>
                    <str name="city">Rotterdam</str>
                </doc>
            </result>
            <lst name="aggregate">
                <str name="sum(stock)">5</str>
                <str name="min(price)">3.99</str>
            </lst>
        </lst>
    </lst>
</lst>

I think this response format makes much more sense. The response is now centered around collapse groups. A collapse group represents the documents that were collapsed during the search. A collapse group is identified by its most relevant document, which is the document that did not get collapsed and remained present in the search result.

<lst name="results">
     <!-- collapse group identified by the id of the most relevant document of the collapse group -->
     <lst name="233238">
          <!-- elements providing information about the collapse group such as collapse count and field value of the most relevant document -->
     </lst>
      ....
</lst>
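
Because everything about a group now sits under one element, a client that reads the raw XML can collect it in a single pass. The following is only a sketch using the JDK's DOM parser: it assumes the collapse_counts fragment is handed in as the root element (in a real Solr response it is nested inside the full response), and the CollapseXmlSketch class is a made-up name, not part of SolrJ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustration only: a hypothetical helper, not part of SolrJ or the patch.
class CollapseXmlSketch {

    // Maps each collapse group id to its collapseCount.
    // Assumes the <lst name="collapse_counts"> fragment is the root element.
    static Map<String, Integer> collapseCounts(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse collapse response", e);
        }

        Map<String, Integer> counts = new LinkedHashMap<>();
        Element results = child(doc.getDocumentElement(), "lst", "results");
        if (results == null) {
            return counts;
        }
        // Every <lst> directly under "results" is one collapse group.
        for (Node n = results.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between elements
            }
            Element group = (Element) n;
            Element count = child(group, "int", "collapseCount");
            if ("lst".equals(group.getTagName()) && count != null) {
                counts.put(group.getAttribute("name"),
                        Integer.parseInt(count.getTextContent().trim()));
            }
        }
        return counts;
    }

    // Finds the first direct child with the given tag and name attribute, or null.
    static Element child(Element parent, String tag, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && tag.equals(((Element) n).getTagName())
                    && name.equals(((Element) n).getAttribute("name"))) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<lst name='collapse_counts'><str name='field'>city</str>"
                + "<lst name='results'>"
                + "<lst name='233238'><str name='fieldValue'>Amsterdam</str>"
                + "<int name='collapseCount'>1</int></lst>"
                + "</lst></lst>";
        System.out.println(collapseCounts(xml));
    }
}
```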

Aggregate information like sum and average is located inside the collapse group, so it is clear which group it belongs to.
I have also updated the SolrJ code for the new response format. The API did change, so any existing code that uses the field collapse SolrJ API will no longer work.
Let's take a look at the new methods:

FieldCollapseResponse result = queryResponse.getFieldCollapseResponse();
// returns a list of all collapse groups
List<FieldCollapseResponse.CollapseGroup> groups = result.getCollapseGroups();
// returns the collapse group identifier
String collapseGroupId = groups.get(0).getCollapseGroupId();
// returns the number of documents collapsed under the collapse group
int collapseCount = groups.get(0).getCollapseCount();
// returns the field value of the document that has the same id as the collapse group identifier
String fieldValue = groups.get(0).getFieldValue();
// returns all collapsed documents of the collapse group
SolrDocumentList collapsedDocuments = groups.get(0).getCollapsedDocuments();
// returns the results of the executed aggregate functions
Map<String, String> aggregateFunctions = groups.get(0).getAggregateFunctions();

As you can see, the API is centred around the collapse group as well.
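
As a rough usage sketch, the snippet below walks over a list of collapse groups and sums the collapse counts and one aggregate. So that it compiles without the patch, it uses a hypothetical Group interface that mirrors only the getters shown above; the StubGroup record and the AggregateSketch class are also made up. Since getAggregateFunctions() returns strings, the values are parsed before they are added up.

```java
import java.util.List;
import java.util.Map;

// Illustration only: Group is a hypothetical stand-in for
// FieldCollapseResponse.CollapseGroup, reduced to the getters used here.
class AggregateSketch {

    interface Group {
        String getCollapseGroupId();
        int getCollapseCount();
        Map<String, String> getAggregateFunctions();
    }

    // Simple in-memory implementation used to feed the sketch with data.
    record StubGroup(String id, int count, Map<String, String> aggregates) implements Group {
        public String getCollapseGroupId() { return id; }
        public int getCollapseCount() { return count; }
        public Map<String, String> getAggregateFunctions() { return aggregates; }
    }

    // Total number of documents collapsed across all groups.
    static int totalCollapsed(List<? extends Group> groups) {
        int total = 0;
        for (Group group : groups) {
            total += group.getCollapseCount();
        }
        return total;
    }

    // Sums one aggregate, e.g. "sum(stock)", over all groups that report it.
    static double sumAggregate(List<? extends Group> groups, String function) {
        double total = 0;
        for (Group group : groups) {
            String value = group.getAggregateFunctions().get(function);
            if (value != null) {
                total += Double.parseDouble(value); // aggregates arrive as strings
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Group> groups = List.of(
                new StubGroup("233238", 1, Map.of("sum(stock)", "10", "min(price)", "5.99")),
                new StubGroup("234338", 1, Map.of("sum(stock)", "5", "min(price)", "3.99")));

        System.out.println("collapsed documents: " + totalCollapsed(groups));
        System.out.println("total stock: " + sumAggregate(groups, "sum(stock)"));
    }
}
```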
