Result grouping made easier
Lucene has had result grouping for a while now: as a contrib in Lucene 3.x and as a module in the upcoming 4.0 release. In both releases the actual grouping is performed with Lucene Collectors, and as a Lucene user you need to combine several of these Collectors in your searches. Because these Collectors take many constructor arguments, grouping can become quite cumbersome to use in pure Lucene apps. The example below illustrates this.
// Declared up front because the first pass already needs it
boolean fillFields = true;
TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset + topNGroups);
boolean cacheScores = true;
double maxCacheRAMMB = 4.0;
CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);
Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);
if (topGroups == null) {
  // No groups matched
  return;
}
boolean getScores = true;
boolean getMaxScores = true;
TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset + docsPerGroup, getScores, getMaxScores, fillFields);
TermAllGroupsCollector allGroupsCollector = new TermAllGroupsCollector("author");
// Wrap both collectors so the second pass and the group count share one search
Collector secondPass = MultiCollector.wrap(c2, allGroupsCollector);
if (cachedCollector.isCached()) {
  // Cache fit within maxCacheRAMMB, so we can replay it:
  cachedCollector.replay(secondPass);
} else {
  // Cache was too large; must re-execute query:
  s.search(new TermQuery(new Term("content", searchTerm)), secondPass);
}
TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
groupsResult = new TopGroups<BytesRef>(groupsResult, allGroupsCollector.getGroupCount());
// Render groupsResult...
In the above example basic grouping with caching is used, and the group count is retrieved as well. As you can see, there is quite a lot of coding involved. Recently a grouping convenience utility was added to the Lucene grouping module to alleviate this problem. As the GroupingSearch code example below illustrates, using the GroupingSearch utility is much easier than interacting with the actual grouping collectors.
Normally the document count is returned as the hit count. However, when groups rather than documents are used as hits, the document count does not work for pagination. For this reason the group count can be used to get correct pagination. The group count is the number of unique groups matching the query, and in that case it can serve as the hit count, since the individual hits are groups.
GroupingSearch groupingSearch = new GroupingSearch("author");
groupingSearch.setGroupSort(groupSort);
groupingSearch.setFillSortFields(fillFields);
groupingSearch.setCachingInMB(4.0, true);
groupingSearch.setAllGroups(true);
TermQuery query = new TermQuery(new Term("content", searchTerm));
TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
// Render result...
// The group count if setAllGroups is set to true, otherwise this is null
Integer totalGroupCount = result.totalGroupCount;
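To make the pagination point about group counts concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class GroupPagination, its methods and the sample author values are hypothetical and only illustrate why the number of pages should be derived from the group count when each hit is a group.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows why paging over groups
// needs the group count instead of the document count.
final class GroupPagination {

    /** Counts the distinct group values among the matching documents. */
    static int groupCount(List<String> groupValuePerDocument) {
        return new HashSet<>(groupValuePerDocument).size();
    }

    /** Number of pages needed to show hitCount hits with pageSize hits per page. */
    static int pageCount(int hitCount, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (hitCount + pageSize - 1) / pageSize; // ceiling division
    }

    public static void main(String[] args) {
        // Assumed "author" value of each document matching some query.
        List<String> authors = Arrays.asList(
                "alice", "bob", "alice", "carol", "bob", "alice", "dave");
        int pageSize = 2; // groups shown per page

        // Seven documents would suggest four pages...
        System.out.println("pages by document count: " + pageCount(authors.size(), pageSize));
        // ...but there are only four groups, so two pages are enough.
        System.out.println("pages by group count: " + pageCount(groupCount(authors), pageSize));
    }
}
```

With GroupingSearch, the value passed as hitCount would presumably be result.totalGroupCount once setAllGroups(true) has been called.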
The GroupingSearch utility has only been added to trunk, meaning that it will be released with Lucene 4.0. If you can’t wait, you can always use a nightly build or check out the trunk yourself. It is important to keep in mind that the GroupingSearch utility uses the already existing grouping collectors to perform the actual grouping. The utility has four different constructors, one for each grouping type: grouping by indexed terms, by function, by doc values and by doc block. The first one is used in the GroupingSearch example above. The rest are described below.
FloatFieldSource field1 = new FloatFieldSource("field1");
FloatFieldSource field2 = new FloatFieldSource("field2");
SumFloatFunction sumFloatFunction = new SumFloatFunction(new ValueSource[]{field1, field2});
GroupingSearch groupingSearch = new GroupingSearch(sumFloatFunction, new HashMap<Object, Object>());
...
TopGroups<MutableValue> result = groupingSearch.search(searcher, query, 0, 10);
Grouping by function uses the ValueSource abstraction from the Lucene queries module; consequently the grouping module depends on the queries module. In the above example grouping is performed on the sum of field1 and field2. The group type in the result is always MutableValue when grouping by a function.
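What grouping on a computed value means can be sketched in plain Java, without any Lucene classes. The FunctionGroupingSketch class and its Doc type below are hypothetical stand-ins: documents whose field1 + field2 evaluates to the same value end up in the same group. This only illustrates the idea and is not how the grouping collectors are implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Lucene code: group documents by the
// value of a function (here field1 + field2) instead of an indexed term.
final class FunctionGroupingSketch {

    /** Assumed in-memory stand-in for a document with two float fields. */
    static final class Doc {
        final int id;
        final float field1;
        final float field2;

        Doc(int id, float field1, float field2) {
            this.id = id;
            this.field1 = field1;
            this.field2 = field2;
        }
    }

    /** Groups document ids by field1 + field2, keeping first-seen group order. */
    static Map<Float, List<Integer>> groupBySum(List<Doc> docs) {
        Map<Float, List<Integer>> groups = new LinkedHashMap<>();
        for (Doc doc : docs) {
            float groupValue = doc.field1 + doc.field2; // the "function" value
            groups.computeIfAbsent(groupValue, k -> new ArrayList<>()).add(doc.id);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc(1, 1.0f, 2.0f));
        docs.add(new Doc(2, 0.5f, 2.5f));
        docs.add(new Doc(3, 4.0f, 0.0f));
        // Docs 1 and 2 share a sum, doc 3 forms its own group.
        System.out.println(groupBySum(docs));
    }
}
```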
boolean diskResident = true;
DocValues.Type docValuesType = DocValues.Type.BYTES_VAR_SORTED;
GroupingSearch groupingSearch = new GroupingSearch("author", docValuesType, diskResident);
...
TopGroups<BytesRef> result1 = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
// grouping by var int docvalues
DocValues.Type docValuesType = DocValues.Type.VAR_INTS;
GroupingSearch groupingSearch = new GroupingSearch("author", docValuesType ...
TopGroups<Long> result2 = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
Grouping by doc values requires you to specify a DocValues.Type up front, as well as whether the doc values should be read from disk (disk resident) rather than loaded into memory. It is important that the DocValues.Type is the same as the one used when indexing the data. A different DocValues.Type can lead to a different group type in the result, as you can see in the above code sample: DocValues.Type.VAR_INTS results in a Long type and DocValues.Type.BYTES_VAR_SORTED in a BytesRef type.
Filter lastDocInBlock = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupend", "x"))));
GroupingSearch groupingSearch = new GroupingSearch(lastDocInBlock);
Query query = new TermQuery(new Term("content", "random"));
TopGroups<?> result = groupingSearch.search(indexSearcher, query, 0, 10);
// Render result
...
Grouping by doc block requires you to specify a Filter that marks the last document of each block. Obviously your data has to be indexed in blocks using the IndexWriter’s addDocuments(…) method.
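The role of the "last document in block" marker can be sketched in plain Java. The DocBlockSketch class below is hypothetical and uses a boolean array in place of a Lucene Filter. Because the documents of a block are stored next to each other, one marker per block is enough to recover the group boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Lucene code: recover groups from
// consecutive document ids plus a marker on the last document of each block.
final class DocBlockSketch {

    /**
     * Splits doc ids 0..n-1 into blocks. lastDocInBlock[i] is true when
     * document i closes a block (the role the Filter plays in the text).
     */
    static List<List<Integer>> splitIntoBlocks(boolean[] lastDocInBlock) {
        List<List<Integer>> blocks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int docId = 0; docId < lastDocInBlock.length; docId++) {
            current.add(docId);
            if (lastDocInBlock[docId]) {
                blocks.add(current); // marker reached: the block is complete
                current = new ArrayList<>();
            }
        }
        // Trailing documents without a closing marker are ignored in this sketch.
        return blocks;
    }

    public static void main(String[] args) {
        boolean[] markers = {false, false, true, true, false, true};
        System.out.println(splitIntoBlocks(markers));
    }
}
```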
The GroupingSearch utility class doesn’t cover all use cases yet. It only works locally, meaning it doesn’t help you with distributed grouping, and it lacks a few features such as grouped facets. Still, I think this utility class is a good start in making result grouping a bit easier to use than it was before. Work on making result grouping easier to use for pure Lucene apps hasn’t finished, and features like distributed grouping will be made easier to use as well.
Have you used result grouping in your search solution either directly with Lucene or via Solr? Adding result grouping to your search solution on a large scale can be challenging! Let us know how you solved your requirements with result grouping by posting a comment.