Search Result Grouping / Field Collapsing in Lucene / Solr
Grouping of search results, also known as field collapsing, is often a requirement for search projects. As described earlier, this functionality was added to Solr and happens to be one of its most wanted features. Recently result grouping was also added to Lucene, as a contrib in the 3.x branch and as a module in 4.0. Having the functionality in Lucene makes the feature much more flexible to use. Work is currently under way to add the result grouping contrib in the 3.x branch to Solr; see SOLR-2524 for more information. This means that grouping will most likely be available in Solr 3.2!
History
It all began about 4 years ago when the SOLR-236 issue was created. Back then result grouping was known as field collapsing, and the functionality was more focused on collapsing documents in the result set that have the same field value. The patch attached to this issue expanded over time and more people started using it. Features were added and improvements were made by many people. The field collapse feature stayed as a patch in Jira for more than 3 years. The only option for Solr users who wanted to use it was to patch Solr and run that custom build. This is obviously error prone, and many questions on the subject were sent to the Solr mailing lists. Besides that, there were many other Jira issues and patches related to field collapsing, which confused people even more!
Last September result grouping became available in the trunk (4.0-dev). The field collapse functionality was rewritten as a grouping functionality (SOLR-1682) and the performance improved dramatically. Result grouping by function was also added, so the feature changed slightly.
More recently, effort was put into LUCENE-1421. This Jira issue was created with the intent to expose result grouping to Lucene. The grouping feature in the Solr trunk was rewritten and put into a grouping module in Lucene. It has also been backported to the 3.x branch as a Lucene contrib. Currently the only features it doesn't support are grouping by function and by query. LUCENE-3099 has been created to add these capabilities to Lucene soon.
Result Grouping in Lucene
Grouping in Lucene is implemented as a set of collectors that are really easy to use, as the following code samples show. The FirstPassGroupingCollector collects the top N groups, ranking each group by its most relevant document. The SecondPassGroupingCollector then gathers the top documents within those top N groups.
// groupSort, docSort, the offsets, topNGroups, docsPerGroup, searchTerm and
// indexSearcher are assumed to be defined by the surrounding code.
boolean getScores = true;
boolean getMaxScores = true;
boolean fillFields = true; // declared before the first pass, which uses it

// First pass: find the top groups.
FirstPassGroupingCollector c1 = new FirstPassGroupingCollector("author", groupSort, groupOffset + topNGroups);
indexSearcher.search(new TermQuery(new Term("content", searchTerm)), c1);
Collection<SearchGroup> topGroups = c1.getTopGroups(groupOffset, fillFields);
if (topGroups == null) {
    // No groups matched
    return;
}

// Second pass: collect the top documents within each of those groups.
SecondPassGroupingCollector c2 = new SecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset + docsPerGroup, getScores, getMaxScores, fillFields);
indexSearcher.search(new TermQuery(new Term("content", searchTerm)), c2);
TopGroups groupsResult = c2.getTopGroups(docOffset);
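To make the division of labour between the two passes more concrete, here is a minimal plain-Java sketch that groups an in-memory list of hits by author. It does not use any Lucene classes: Hit, topGroups and topDocsPerGroup are hypothetical names made up for this illustration, and ranking groups by their best score is just one possible group sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: mimics the two grouping passes on an in-memory list.
final class TwoPassGroupingSketch {

    // Hypothetical stand-in for a scored search hit with a group field.
    static final class Hit {
        final int docId;
        final String author;
        final float score;

        Hit(int docId, String author, float score) {
            this.docId = docId;
            this.author = author;
            this.score = score;
        }
    }

    // First pass: rank groups by their best-scoring hit and keep the top N.
    static List<String> topGroups(List<Hit> hits, int topNGroups) {
        Map<String, Float> bestScore = new LinkedHashMap<>();
        for (Hit hit : hits) {
            Float best = bestScore.get(hit.author);
            if (best == null || hit.score > best) {
                bestScore.put(hit.author, hit.score);
            }
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(bestScore.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<String> groups = new ArrayList<>();
        for (Map.Entry<String, Float> entry : ranked.subList(0, Math.min(topNGroups, ranked.size()))) {
            groups.add(entry.getKey());
        }
        return groups;
    }

    // Second pass: keep the best docsPerGroup hits for each selected group.
    static Map<String, List<Hit>> topDocsPerGroup(List<Hit> hits, List<String> groups, int docsPerGroup) {
        Map<String, List<Hit>> perGroup = new LinkedHashMap<>();
        for (String group : groups) {
            perGroup.put(group, new ArrayList<Hit>());
        }
        for (Hit hit : hits) {
            List<Hit> bucket = perGroup.get(hit.author);
            if (bucket != null) {
                bucket.add(hit); // hits from groups outside the top N are ignored
            }
        }
        for (List<Hit> bucket : perGroup.values()) {
            bucket.sort((a, b) -> Float.compare(b.score, a.score));
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return perGroup;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1, "alice", 0.9f), new Hit(2, "bob", 0.8f),
                new Hit(3, "alice", 0.7f), new Hit(4, "carol", 0.5f));
        List<String> groups = topGroups(hits, 2);
        for (Map.Entry<String, List<Hit>> e : topDocsPerGroup(hits, groups, 1).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " doc(s)");
        }
    }
}
```

The second pass needs the outcome of the first, which is why the real collectors require the query to be executed twice.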
If the searches are expensive, you might want to consider using the CachingCollector. This collector can cache the document IDs and scores from the first pass search and replay them during the second pass search, so the query does not have to be executed twice. See the grouping documentation for its usage.
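The cache-and-replay idea can be sketched in a few lines of plain Java. This is not the Lucene CachingCollector itself: HitConsumer, CachingConsumer and the maxCachedHits cap are hypothetical names invented for this illustration, and the cap only hints at the fact that a cache like this needs some memory limit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: records hits during the first pass and replays them later.
final class ReplayCacheSketch {

    // Hypothetical stand-in for a collector that receives scored hits.
    interface HitConsumer {
        void collect(int docId, float score);
    }

    static final class CachingConsumer implements HitConsumer {
        private final HitConsumer delegate;
        private final int maxCachedHits;
        private List<Integer> docIds = new ArrayList<>();
        private List<Float> scores = new ArrayList<>();

        CachingConsumer(HitConsumer delegate, int maxCachedHits) {
            this.delegate = delegate;
            this.maxCachedHits = maxCachedHits;
        }

        @Override
        public void collect(int docId, float score) {
            delegate.collect(docId, score); // the first pass always sees the hit
            if (docIds == null) {
                return; // cache was abandoned earlier
            }
            if (docIds.size() >= maxCachedHits) {
                docIds = null; // too many hits: give up caching
                scores = null;
                return;
            }
            docIds.add(docId);
            scores.add(score);
        }

        boolean isCached() {
            return docIds != null;
        }

        // Feeds the recorded hits to another consumer without searching again.
        void replay(HitConsumer other) {
            if (!isCached()) {
                throw new IllegalStateException("cache was abandoned; run the search again");
            }
            for (int i = 0; i < docIds.size(); i++) {
                other.collect(docIds.get(i), scores.get(i));
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> secondPass = new ArrayList<>();
        CachingConsumer cache = new CachingConsumer((doc, score) -> { }, 100);
        cache.collect(3, 1.0f);
        cache.collect(7, 0.5f);
        if (cache.isCached()) {
            cache.replay((doc, score) -> secondPass.add(doc));
        }
        System.out.println("replayed " + secondPass.size() + " hits");
    }
}
```

When the cache has been abandoned, the caller simply falls back to running the search a second time.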
There is also another collector, the AllGroupsCollector, that is concerned with collecting all groups that match a query. This can be used, for example, to get a total count based on groups rather than on documents.
// First pass search has been executed
boolean getScores = true;
boolean getMaxScores = true;
boolean fillFields = true;
AllGroupsCollector c3 = new AllGroupsCollector("author");
SecondPassGroupingCollector c2 = new SecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset + docsPerGroup, getScores, getMaxScores, fillFields);
indexSearcher.search(new TermQuery(new Term("content", searchTerm)), MultiCollector.wrap(c2, c3));
TopGroups groupsResult = c2.getTopGroups(docOffset);
groupsResult = new TopGroups(groupsResult, c3.getGroupCount());
The AllGroupsCollector can be nicely combined with the SecondPassGroupingCollector in the second pass search by wrapping both in a MultiCollector. The AllGroupsCollector can also be used independently of the other collectors.
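The idea of feeding one search to several collectors at once, one of which only counts distinct groups, can be illustrated with the plain-Java sketch below. None of this is Lucene code: GroupConsumer, DistinctGroupCounter, DocCounter and fanOut are hypothetical names used only for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: one stream of hits is forwarded to several consumers.
final class GroupCountSketch {

    // Hypothetical stand-in for a collector that sees each hit's group value.
    interface GroupConsumer {
        void collect(int docId, String groupValue);
    }

    // Counts how many distinct group values matched.
    static final class DistinctGroupCounter implements GroupConsumer {
        private final Set<String> seen = new HashSet<>();

        @Override
        public void collect(int docId, String groupValue) {
            seen.add(groupValue);
        }

        int groupCount() {
            return seen.size();
        }
    }

    // Counts matching documents, regardless of their group.
    static final class DocCounter implements GroupConsumer {
        private int docs;

        @Override
        public void collect(int docId, String groupValue) {
            docs++;
        }

        int docCount() {
            return docs;
        }
    }

    // Forwards every hit to each of the given consumers, in order.
    static GroupConsumer fanOut(GroupConsumer... consumers) {
        List<GroupConsumer> targets = Arrays.asList(consumers.clone());
        return (docId, groupValue) -> {
            for (GroupConsumer target : targets) {
                target.collect(docId, groupValue);
            }
        };
    }

    public static void main(String[] args) {
        DistinctGroupCounter groups = new DistinctGroupCounter();
        DocCounter docs = new DocCounter();
        GroupConsumer both = fanOut(groups, docs);
        both.collect(1, "alice");
        both.collect(2, "bob");
        both.collect(3, "alice");
        System.out.println(docs.docCount() + " docs in " + groups.groupCount() + " groups");
    }
}
```

Because the counter does not depend on the other consumer, it can just as well be used on its own.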
Result Grouping in Solr
Currently the grouping in the Solr trunk doesn't use the Lucene grouping module; it uses its own grouping implementation. The reason Solr is not using the grouping module yet is that grouping by function and by query needs to be supported first. Meanwhile, grouping hasn't been implemented in Solr 3.1 at all. The downside is that users of that version still need to patch and build their own version to be able to group results. Even worse, most users use one of the obsolete patches in SOLR-236 that have been adapted to work with Solr 3.1. That is one of the reasons why I created SOLR-2524.
The SOLR-2524 issue is concerned with integrating the Lucene grouping contrib into the 3.x branch of Solr. This issue also serves as a reference for integrating the grouping module into the trunk version of Solr (4.0). Grouping in the 3.x branch of Solr will support the same response formats and request parameters as described on the Solr FieldCollapse wiki page. The only parameters it doesn't support (yet) are those for grouping by function and by query.
If all goes well, this issue will be committed soon and included in the Solr 3.2 release, giving Solr users the grouping feature out of the box!
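As a rough illustration of what such a request could look like, the sketch below builds a select URL that groups by a field using the group, group.field and group.limit parameters. The base URL, the field names and the helper class itself are assumptions made for this example; check the wiki page for the parameters your Solr version actually accepts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: assembles a grouped Solr query URL; it sends no request.
final class SolrGroupRequestSketch {

    // Parameters for grouping the matches of a query by a single field.
    static Map<String, String> groupByField(String query, String field, int docsPerGroup) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("group", "true");        // switch result grouping on
        params.put("group.field", field);   // field whose values define the groups
        params.put("group.limit", String.valueOf(docsPerGroup)); // docs returned per group
        return params;
    }

    // Appends the URL-encoded parameters to the base URL in insertion order.
    static String buildUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Assumed local Solr instance and example field names.
        String baseUrl = "http://localhost:8983/solr/select";
        System.out.println(buildUrl(baseUrl, groupByField("content:lucene", "author", 3)));
    }
}
```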