IndexDocValues – their applications
From a user’s perspective Lucene’s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues’ values can be retrieved quickly and efficiently as Simon Willnauer describes in his first IndexDocValues blog post. There are many applications that can benefit from using IndexDocValues for search functionality like flexible scoring, faceting, sorting, joining and result grouping.
In general, functionality like sorting, faceting and result grouping usually needs an extra dedicated field. Take an author field as an example: you may tokenize its values for full-text search, which can cause faceting or result grouping to behave unexpectedly. Adding an untokenized field with a prefix, like facet_author, fixes this problem.
IndexDocValues, however, are not tokenized or otherwise manipulated, so you can simply add an IndexDocValuesField with the same name to achieve the same result as with an additional field. Using IndexDocValues you end up with fewer logical fields, and in this case less is more.
Some applications that use Lucene run in memory-limited environments; consequently, functionality like grouping and sorting cannot be used with a medium to large index. IndexDocValues can be retrieved in a disk-resident manner, meaning that document values aren’t loaded into the Java heap space. Applications using hundreds of fields, each of which could be used for sorting, can quickly run out of memory when FieldCaches are pulled. Disk-resident IndexDocValues can help to overcome this limitation, at the price of a search performance regression of between 40% and 80%.
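To make the tokenization problem concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class FacetSketch, its naive whitespace "tokenizer" and the sample author names are all made up for illustration; a real analyzer does more than split and lowercase, but the effect on facet counts is similar.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FacetSketch {

    // Counts facet values after naive whitespace tokenization and lowercasing,
    // a rough stand-in for faceting on a field analyzed for full-text search.
    public static Map<String, Integer> facetTokenized(List<String> authors) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String author : authors) {
            for (String token : author.toLowerCase().split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts facet values on the untouched value, as an untokenized field would.
    public static Map<String, Integer> facetUntokenized(List<String> authors) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String author : authors) {
            counts.merge(author, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String> authors = Arrays.asList("John Doe", "John Doe", "Jane Doe");
        System.out.println(facetTokenized(authors));   // facets per token
        System.out.println(facetUntokenized(authors)); // facets per full author name
    }
}
```

With the tokenized variant the facet values are individual words such as "doe" rather than author names, which is rarely what a facet on authors should show.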
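As a rough illustration of why many sortable fields exhaust the heap, the sketch below estimates the size of heap-resident sort caches under a simplifying assumption: one 4-byte int per document per field. The class name and the index size are hypothetical, and string fields typically need more than this, so treat the result as a lower bound rather than a measurement.

```java
public class FieldCacheEstimate {

    // Lower-bound estimate: one 4-byte int per document for every cached field.
    public static long intCacheBytes(long maxDoc, int fields) {
        return maxDoc * 4L * fields;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // hypothetical index size
        int fields = 200;          // hypothetical number of sortable fields
        long bytes = intCacheBytes(maxDoc, fields);
        System.out.printf("%d bytes (about %.2f GiB)%n",
                bytes, bytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```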
At this stage a lot of Lucene / Solr’s features are trying to catch up with IndexDocValues but some of them are already there. Let’s take a closer look at some of the features they already have.
As mentioned, using IndexDocValues you end up with fewer logical fields. In Solr this should be even simpler: you should just be able to mark a field to use IndexDocValues. Unfortunately this has not yet been exposed in Solr’s schema.xml. There is an open issue (SOLR-2753) that aims to address this limitation. If you need IndexDocValues and you are using Solr trunk, it’s a great opportunity to get involved in the Solr development community. If you feel like it, give it a go and code something up.
Result grouping by IndexDocValues is still a work in progress but will be committed within the next week or two. Making use of IndexDocValues for result grouping is as straightforward as grouping by regular indexed values via FieldCache. If you are in Lucene land you can simply use the IndexDocValues grouping collectors and get started (see Listing 1).
    Query query = new TermQuery(new Term("description", "lucene"));
    Sort groupSort = new Sort(new SortField("name", SortField.Type.STRING));
    Sort sortWithinGroup = new Sort(new SortField("age", SortField.Type.INT));
    int maxGroupsToCollect = 10;
    int maxDocsPerGroup = 3;
    String groupField = "groupField";
    ValueType docValuesType = ValueType.VAR_INTS;
    boolean diskResident = true;
    int groupOffset = 0;
    int offsetInGroup = 0;

    // First pass: collect the top groups.
    AbstractFirstPassGroupingCollector first = DVFirstPassGroupingCollector.create(
        groupSort, maxGroupsToCollect, groupField, docValuesType, diskResident
    );
    searcher.search(query, first);
    Collection<SearchGroup<BytesRef>> topGroups = first.getTopGroups(groupOffset, false);

    // Second pass: collect the top documents within each of those groups.
    AbstractSecondPassGroupingCollector<BytesRef> second = DVSecondPassGroupingCollector.create(
        groupField, diskResident, docValuesType, topGroups, groupSort, sortWithinGroup,
        maxDocsPerGroup, true, true, false
    );
    searcher.search(query, second); // the second pass has to run before its results are read
    TopGroups<BytesRef> result = second.getTopGroups(offsetInGroup);
Tip: If you’re dealing with IndexSearcher instances, use Lucene’s SearcherManager.
Note that the code is the same for all other types of grouping collectors, as described in the grouping contrib page. The only new options are diskResident and the value type (docValuesType in Listing 1). The diskResident option controls whether the values are loaded into the Java heap space or read directly from disk, relying on the filesystem’s memory. The value type defines the type of the value, like FIXED_INTS_32, FLOAT_32 or BYTES_FIXED_STRAIGHT. The ValueType variants also provide flexibility in how those values are stored. When indexing IndexDocValues you have to define “what” and “how” you are storing, and this should be consistent throughout your indexing process. However, Lucene provides some under-the-hood TypePromotion if you index different types on the same field across different indexing sessions.
Tip: The bytes type has a few variants, one of which is sorted. Using the sorted variant gives much better performance when grouping.
Sorting by float, int and bytes typed doc values is already committed to trunk and can be used in Lucene land. Making use of this feature follows the same path as FieldCache based sorting. By default Lucene uses the FieldCache unless you enable doc values via SortField#setUseIndexValues(boolean):
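For readers new to result grouping, the two-pass idea behind the collectors can be sketched in plain Java without Lucene. Everything here is a hypothetical stand-in: the Doc class, ranking groups by their best score, and the use of full sorts, where real collectors keep bounded priority queues and read group values from doc values or the FieldCache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoPassGroupingSketch {

    // Hypothetical stand-in for a matching document: id, group value and score.
    public static final class Doc {
        final int id;
        final String group;
        final float score;

        public Doc(int id, String group, float score) {
            this.id = id;
            this.group = group;
            this.score = score;
        }
    }

    // First pass: pick the top groups, ranked by the best score seen in each group.
    public static List<String> firstPass(List<Doc> hits, int maxGroups) {
        Map<String, Float> best = new HashMap<>();
        for (Doc d : hits) {
            best.merge(d.group, d.score, Float::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Float.compare(best.get(b), best.get(a)));
        return new ArrayList<>(groups.subList(0, Math.min(maxGroups, groups.size())));
    }

    // Second pass: for the chosen groups only, keep the best docs per group.
    public static Map<String, List<Integer>> secondPass(
            List<Doc> hits, List<String> topGroups, int maxDocsPerGroup) {
        Map<String, List<Doc>> perGroup = new LinkedHashMap<>();
        for (String g : topGroups) {
            perGroup.put(g, new ArrayList<>());
        }
        for (Doc d : hits) {
            List<Doc> bucket = perGroup.get(d.group);
            if (bucket != null) { // docs in groups that did not make the cut are skipped
                bucket.add(d);
            }
        }
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Doc>> e : perGroup.entrySet()) {
            List<Doc> docs = e.getValue();
            docs.sort((a, b) -> Float.compare(b.score, a.score));
            List<Integer> ids = new ArrayList<>();
            for (Doc d : docs.subList(0, Math.min(maxDocsPerGroup, docs.size()))) {
                ids.add(d.id);
            }
            result.put(e.getKey(), ids);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up hits for illustration.
        List<Doc> hits = Arrays.asList(
            new Doc(0, "a", 1.0f), new Doc(1, "b", 3.0f), new Doc(2, "a", 2.5f),
            new Doc(3, "c", 0.5f), new Doc(4, "b", 0.7f));
        List<String> topGroups = firstPass(hits, 2);
        System.out.println(secondPass(hits, topGroups, 1));
    }
}
```

The second pass only needs the group value of each hit, which is why a fast per-document lookup, whether heap-resident or disk-resident, matters so much for grouping.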
    IndexSearcher searcher = new IndexSearcher(...);
    SortField field = new SortField("myField", SortField.Type.STRING);
    field.setUseIndexValues(true); // sort on doc values instead of the FieldCache
    Sort sort = new Sort(field);
    TopDocs topDocs = searcher.search(query, sort);
In general, using IndexDocValues instead of FieldCache is pretty straightforward. No additional API changes are required.
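Conceptually, both the FieldCache and doc values give the sorter a lookup from document id to value. The plain-Java sketch below illustrates that idea only; the class ColumnSortSketch and the sample values are hypothetical, and it says nothing about how Lucene stores or reads the column internally.

```java
import java.util.Arrays;

public class ColumnSortSketch {

    // Sorts the matching doc ids by a per-document value column (docId -> value).
    public static Integer[] sortByColumn(int[] matchingDocs, long[] valuesByDocId) {
        Integer[] docs = Arrays.stream(matchingDocs).boxed().toArray(Integer[]::new);
        Arrays.sort(docs, (a, b) -> Long.compare(valuesByDocId[a], valuesByDocId[b]));
        return docs;
    }

    public static void main(String[] args) {
        long[] valuesByDocId = {30L, 10L, 20L, 40L}; // made-up values for docs 0..3
        int[] matchingDocs = {0, 1, 3};              // docs that matched the query
        System.out.println(Arrays.toString(sortByColumn(matchingDocs, valuesByDocId)));
    }
}
```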
Besides sorting and result grouping there is more IndexDocValues work to do. Other features such as faceting and joining should have doc value based implementations as well.
I think that features using indexed term based implementations will slowly move towards doc values based implementations. In my opinion this is the way forward: the inverted index is then really used for what it was built for in the first place, free text search.