Result grouping / Field Collapsing with Solr

by Martijn van Groningen, October 20, 2009

In a number of search projects that I have done using Lucene and Solr, there was a lot of almost identical data. From a user's perspective, the first result pages were full of documents that looked very similar: a search for a specific car brand, for instance, would return a full page of the same car model where only the edition differed. What is actually desired is to show only the distinct models. Then, and only when a user is interested in a certain model, the user can view all editions of that model by clicking on the result. We simply want to group our search results based on some criterion. Although this is not supported out-of-the-box by Lucene/Solr, it is possible using a patch that I’ve created and contributed to Solr. This blog entry explains what result grouping (also known as field collapsing) is and how you can start using it in your own projects.

Result grouping allows you to group results by a predefined field (e.g. a model field). Only the most relevant documents per distinct value of that field are kept in the result. The specified sort determines the relevance of each document. By default Solr sorts by score, but the sort can also be on a field value or a computed value such as distance. In the Solr community result grouping is better known as field collapsing.

Assume we are searching for books. The image shows one search with field collapsing and one without.

As illustrated in the image, documents with similar values are removed from the result; only the most relevant documents are kept in the result set.

Field collapsing can to some extent be compared with the SQL GROUP BY statement. Although you cannot yet use functions like sum() or avg() to gather statistics, it does remove the less relevant documents and keeps a count of how many documents were removed per distinct field value. In the most recent version of the patch it is possible to collect the field values of the collapsed documents. This allows you to execute your own function on the collapsed documents.
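
The idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how the patch is implemented; the function name `collapse` and the sample books are made up for this example, and sorting by score is assumed.

```python
from collections import Counter


def collapse(docs, field):
    """Keep the highest-scoring document per distinct value of `field`.

    Returns (heads, collapse_counts): the kept documents in ranked order and,
    per field value, the number of documents that were removed.
    """
    # Rank by score, highest first (the default sort in Solr).
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    heads = []
    seen = set()
    removed = Counter()
    for doc in ranked:
        value = doc[field]
        if value in seen:
            removed[value] += 1  # a less relevant document of a known group
        else:
            seen.add(value)
            heads.append(doc)  # first (most relevant) document of its group
    return heads, dict(removed)


# Hypothetical sample data.
books = [
    {"id": 1, "author": "Tolkien", "score": 2.1},
    {"id": 2, "author": "Tolkien", "score": 1.7},
    {"id": 3, "author": "Herbert", "score": 1.9},
    {"id": 4, "author": "Tolkien", "score": 0.8},
    {"id": 5, "author": "Herbert", "score": 0.5},
]

heads, counts = collapse(books, "author")
print([d["id"] for d in heads])  # [1, 3]
print(counts)                    # {'Tolkien': 2, 'Herbert': 1}
```

The returned counts play the role of the collapse counts: they tell you how many documents were dropped for each author.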

Setting up field collapsing

Unfortunately Solr does not support field collapsing out-of-the-box yet. The functionality is still under development, but it can already be used, and many people have done so successfully. If you browse to the Jira issue SOLR-236 you can see the current status of the field collapsing functionality. Download the latest patch, apply it to the latest Solr Subversion trunk and you are good to go.

Configuring field collapsing

Field collapsing is currently implemented in Solr as a SearchComponent and thus must be configured in the solrconfig.xml. The following line adds the field collapse component to Solr:

<searchComponent name="query"
          class="org.apache.solr.handler.component.CollapseComponent" />

The QueryComponent is by default configured implicitly under the name query. Registering the CollapseComponent under the name query ensures that the request handlers automatically use the CollapseComponent instead of the default QueryComponent.

It is also important to know up front which field you want to collapse on. It is not possible to collapse on all types of fields. Currently, if you collapse on a field that is tokenized or multivalued, an exception is thrown and the search is aborted.

I usually create dedicated collapse fields in my schema.xml with a collapse_ prefix. I think this is good practice, as it emphasizes what that particular field is used for. You can use any type of field you want (as long as it is not tokenized and not multivalued); non-analyzed field types like StrField and IntField are good candidates.

Group your results

Now that you have configured field collapsing you can actually group your search results. To enable field collapsing you need to specify the collapse.field parameter in your request to Solr. Assume we want to group results by author, using a field named ‘collapse_author’. This results in the following URL:
http://localhost:8080/solr/select?q=*:*&collapse.field=collapse_author
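
If you build the request from code, the URL can be assembled with the Python standard library. This is a minimal sketch: the helper name `collapse_url` and its `extra_params` argument are my own invention for this example, and the host, port and path are simply taken from the URL above.

```python
from urllib.parse import urlencode


def collapse_url(base_url, query, collapse_field, extra_params=None):
    """Build a select URL that collapses the results on `collapse_field`."""
    params = {"q": query, "collapse.field": collapse_field}
    if extra_params:
        params.update(extra_params)  # any further collapse.* parameters
    # Keep '*' and ':' readable so that q=*:* is not percent-encoded.
    return base_url + "?" + urlencode(params, safe="*:")


url = collapse_url("http://localhost:8080/solr/select", "*:*", "collapse_author")
print(url)
# http://localhost:8080/solr/select?q=*:*&collapse.field=collapse_author
```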

The request returns a search result similar to the following (this example response was collapsed on a collapse_city field):

<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">117</int>
 .....
<lst name="collapse_counts">
 <str name="field">collapse_city</str>
 <lst name="doc">
  <int name="190810">48</int>
  <int name="192224">9</int>
  ...
 </lst>
 <lst name="count">
  <int name="Amsterdam">48</int>
  <int name="Rotterdam">9</int>
  ...
 </lst>
</lst>
<result name="response" numFound="26" start="0" maxScore="1.9735361">
 <doc>
  <str name="city">Amsterdam</str>
  <str name="id">190810</str>
  ...
 </doc>
 ...
</result>
</response>

There are two differences between this response and a response without field collapsing:

A list with the name collapse_counts is added to the response, containing the collapse counts per field value and per document identifier. The document identifiers in collapse_counts refer to the documents in the normal response.
The response only contains the most relevant document per group, also known as the group head. The term ‘group’ here means all documents with the same field value.

The collapse_counts list contains two other lists: the doc list and the count list. Both contain the collapse counts for the search result. The doc list associates the collapse counts with the result set by using the identifiers of the group heads as pointers, whereas the count list uses the field values. It is important to know that both lists refer only to documents or field values in the current result page and not to documents beyond that.
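
On the client side, the two lists can be read with any XML parser. The sketch below uses Python's xml.etree.ElementTree on a hand-trimmed version of the response above; the elided parts are simply left out, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the response shown above (elided parts left out).
SAMPLE = """
<response>
 <lst name="collapse_counts">
  <str name="field">collapse_city</str>
  <lst name="doc">
   <int name="190810">48</int>
   <int name="192224">9</int>
  </lst>
  <lst name="count">
   <int name="Amsterdam">48</int>
   <int name="Rotterdam">9</int>
  </lst>
 </lst>
 <result name="response" numFound="26" start="0">
  <doc><str name="city">Amsterdam</str><str name="id">190810</str></doc>
 </result>
</response>
"""


def parse_collapse_counts(root):
    """Return (collapse field, counts per document id, counts per field value)."""
    counts = root.find("lst[@name='collapse_counts']")
    field = counts.find("str[@name='field']").text
    by_doc = {e.get("name"): int(e.text) for e in counts.find("lst[@name='doc']")}
    by_value = {e.get("name"): int(e.text) for e in counts.find("lst[@name='count']")}
    return field, by_doc, by_value


def heads_with_counts(root, by_doc):
    """Pair every group head in the result with its collapse count."""
    pairs = []
    for doc in root.find("result[@name='response']"):
        doc_id = doc.find("str[@name='id']").text
        pairs.append((doc_id, by_doc.get(doc_id, 0)))
    return pairs


root = ET.fromstring(SAMPLE.strip())
field, by_doc, by_value = parse_collapse_counts(root)
heads = heads_with_counts(root, by_doc)
print(field)     # collapse_city
print(by_value)  # {'Amsterdam': 48, 'Rotterdam': 9}
print(heads)     # [('190810', 48)]
```

Looking up counts by document identifier is convenient when you render the result list, while the counts per field value are handy when you only have the field value at hand.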

Besides the collapse.field parameter, there are more parameters that you can specify to tweak the groups in your result. They are described on the Field Collapsing page on the Solr wiki.

Collapsing algorithms

There are two distinct ways of collapsing your search results:

Adjacent field collapsing, as the word adjacent implies, only collapses documents with the same field value that appear next to each other in the non-collapsed result set.
Non-adjacent field collapsing, also known as normal field collapsing, collapses as described at the beginning of this blog entry and is the default collapsing algorithm.

The type of field collapsing can be controlled with the collapse.type parameter. When the value adjacent is specified the adjacent algorithm kicks in and when the value normal is specified the normal algorithm kicks in.
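
The difference between the two is easiest to see on a small ranked list. The following sketch is a conceptual illustration only, not the patch's actual code; the function names are hypothetical and the input is the list of collapse field values in ranked order.

```python
def collapse_normal(values):
    """Keep the first occurrence of each value anywhere in the ranked list."""
    seen = set()
    kept = []
    for position, value in enumerate(values):
        if value not in seen:
            seen.add(value)
            kept.append(position)
    return kept


def collapse_adjacent(values):
    """Drop a document only if it directly follows one with the same value."""
    kept = []
    previous = object()  # sentinel that is unequal to every field value
    for position, value in enumerate(values):
        if value != previous:
            kept.append(position)
        previous = value
    return kept


# Field values of the non-collapsed result set, most relevant first.
ranked = ["A", "A", "B", "A", "B", "B", "C"]

print(collapse_normal(ranked))    # [0, 2, 6]
print(collapse_adjacent(ranked))  # [0, 2, 3, 4, 6]
```

With adjacent collapsing the same field value can show up more than once in the result, as long as the occurrences are separated by a document with another value.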

Including collapsed results

On some occasions it is handy to know specific field values of the collapsed documents. In the most recent versions of the field collapse patch it is possible to include collapsed results. This can be achieved by using the collapse.includeCollapsedDocs.fl parameter. The patch expects a comma-separated list of field names to include, or a star (*) that instructs field collapsing to include all fields.

When the search has completed, a collapsed document result similar to the following is returned:

<lst name="collapsedDocs">
   <result name="Amsterdam" numFound="48" start="0">
      <doc>
          <str name="id">191178</str>
          ...
      </doc>
      ...
   </result>
   <result name="Rotterdam" numFound="9" start="0">
   ...
   </result>
</lst>

The collapsedDocs list is part of the collapse_counts response and, as you can see, the collapsed documents are grouped under each distinct field value.
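
A client could turn this structure into a map from field value to collapsed documents. The sketch below does that with Python's xml.etree.ElementTree on a trimmed version of the fragment above, with the elided parts left out; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the fragment shown above (elided parts left out).
COLLAPSED_DOCS = """
<lst name="collapsedDocs">
 <result name="Amsterdam" numFound="48" start="0">
  <doc><str name="id">191178</str></doc>
 </result>
 <result name="Rotterdam" numFound="9" start="0"/>
</lst>
"""


def parse_collapsed_docs(xml_text):
    """Map each field value to (numFound, ids of the returned collapsed docs)."""
    root = ET.fromstring(xml_text.strip())
    groups = {}
    for result in root.findall("result"):
        ids = [doc.find("str[@name='id']").text for doc in result.findall("doc")]
        groups[result.get("name")] = (int(result.get("numFound")), ids)
    return groups


groups = parse_collapsed_docs(COLLAPSED_DOCS)
print(groups)
# {'Amsterdam': (48, ['191178']), 'Rotterdam': (9, [])}
```

In this trimmed sample numFound is larger than the number of documents listed, so the sketch keeps both values instead of assuming that every collapsed document is present.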

Using SolrJ

If you are using SolrJ to integrate with your Solr instance you can use the added field collapse methods.
On the SolrQuery class I have added two methods:

enableFieldCollapsing(String) which accepts a field name as argument.
includeCollapsedDocuments(String...) which accepts zero or more field names. When no field names are given all fields are returned, otherwise only the specified field names are returned.

On the QueryResponse class one method is added:

getFieldCollapseResponse() which returns the FieldCollapseResponse. This object contains all the field collapse information.

The FieldCollapseResponse has four getter methods:

getCollapseField() returns the name of the field that was used for collapsing.
getFieldValueCollapseCounts() returns a list of FieldValueCollapseCount objects, each of which contains a field value with a collapse count.
getDocumentIdCollapseCounts() returns a list of DocumentIdCollapseCount objects, each of which contains a document id with a collapse count.
getCollapsedDocuments() returns a map with the field value as key and a SolrDocumentList with the collapsed documents as value.

These methods can ease development when using field collapsing while integrating with a front-end system.

Field collapsing and facets

Field collapsing in combination with facets can be confusing the first time. The reason is that faceting can be performed on either the ‘collapsed’ or the ‘non collapsed’ result set. The facet counts on the ‘collapsed’ result set are usually lower than the facet counts on the ‘non collapsed’ result set. Which one you want is up to you, because you can influence this behavior. The parameter collapse.facet determines on which result set faceting is performed. This parameter can have the value facet.before to facet on the non collapsed result set or facet.after to facet on the collapsed result set. The default behavior is to facet on the collapsed result set. From the field collapse perspective, the performance of faceting on either the collapsed or the non collapsed result set is the same.
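
A small simulation shows why the counts differ. This is a conceptual sketch in plain Python with made-up car data and hypothetical helper names; it does not call Solr and says nothing about how the patch computes facets internally.

```python
from collections import Counter

# Hypothetical documents, collapsed on "model" and faceted on "fuel".
docs = [
    {"id": 1, "model": "Golf", "fuel": "petrol", "score": 3.0},
    {"id": 2, "model": "Golf", "fuel": "diesel", "score": 2.5},
    {"id": 3, "model": "Golf", "fuel": "petrol", "score": 2.0},
    {"id": 4, "model": "Polo", "fuel": "petrol", "score": 1.5},
    {"id": 5, "model": "Polo", "fuel": "diesel", "score": 1.0},
]


def group_heads(docs, field):
    """Return the highest-scoring document per distinct value of `field`."""
    heads = {}
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        heads.setdefault(doc[field], doc)  # only the first one per value sticks
    return list(heads.values())


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    return dict(Counter(doc[field] for doc in docs))


non_collapsed = facet_counts(docs, "fuel")
collapsed = facet_counts(group_heads(docs, "model"), "fuel")

print(non_collapsed)  # {'petrol': 3, 'diesel': 2}
print(collapsed)      # {'petrol': 2}
```

On the collapsed result set the diesel facet disappears completely, because both group heads happen to be petrol editions. This is exactly the kind of surprise to keep in mind when choosing between the two modes.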

Field collapsing and performance

Unfortunately, field collapsing does have a negative effect on search time. A search with field collapsing enabled can be 5 to 10 times slower than the same search without it. Several things can make your search time even worse:

Using adjacent collapsing as the collapse type. Adjacent collapsing can be an order of magnitude slower than non-adjacent field collapsing. I have seen cases where searches were more than nine times slower than with normal field collapsing.
Using a collapse threshold higher than 1 in combination with normal collapsing. This has to do with the way the normal collapsing algorithm processes the documents that may be kept in the result. With adjacent collapsing, a collapse threshold higher than 1 does not worsen performance.
Including collapsed documents in the response. How much this feature increases the search time depends on how many documents are collapsed and how many are returned in the response. The latter decreases performance the most, because the returned documents have to be read from the index and sent over the wire. If, for example, 8000 documents were collapsed for a specific field value, you can imagine how large the increase in response time will be.

JTeam’s involvement in SOLR-236

As already mentioned, you can find the field collapsing patch in Solr's JIRA (SOLR-236). This patch has been around for quite some time now, but due to increasing demand from our clients, in the last year we at JTeam have put a lot of effort into improving it and making it production ready. Some of the enhancements we have made recently include:

Performance improvement with the normal field collapse algorithm.
Performance improvement when faceting on the non collapsed result set.
The ability to include documents that have been collapsed.
Improved code quality by adding unit and integration tests, and a redesign of the solution that resulted in cleaner and thus more maintainable code.
Extended the SolrJ API to allow easy integration when using field collapsing.

As always, we’re committed to continue working with the community and contributing to this issue as much as we can. We find this feature extremely handy and we’re definitely not alone as the demand for it is extremely high (it so happens to be the most voted for feature in Solr’s JIRA).