{"id":6391,"date":"2011-11-14T11:19:42","date_gmt":"2011-11-14T10:19:42","guid":{"rendered":"http:\/\/blog.trifork.nl\/?p=6391"},"modified":"2011-11-14T11:19:42","modified_gmt":"2011-11-14T10:19:42","slug":"indexdocvalues-their-applications","status":"publish","type":"post","link":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/","title":{"rendered":"IndexDocValues &#8211; their applications"},"content":{"rendered":"<p id=\"aui_3_2_0_1991\">From a user\u2019s perspective Lucene\u2019s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues\u2019 values can be retrieved quickly and efficiently as Simon Willnauer describes in his first <a href=\"http:\/\/blog.trifork.nl\/2011\/10\/27\/introducing-lucene-index-doc-values\/\">IndexDocValues<\/a> blog post. There are many applications that can benefit from using IndexDocValues for search functionality like <a href=\"http:\/\/blog.trifork.nl\/2011\/11\/16\/apache-lucene-flexiblescoring-with-indexdocvalues\/\">flexible scoring<\/a>, faceting, sorting, joining and result grouping.<\/p>\n<p class=\"p1\"><span class=\"s1\">In general, functionality like sorting, faceting and result grouping usually needs an extra dedicated field. Let&#8217;s take an <em>author<\/em> field as an example: you may tokenize its values for full-text search, which can cause faceting or result grouping to behave unexpectedly. Simply adding an untokenized field with a prefix, like <em>facet_author<\/em>, fixes this problem. <\/span><\/p>\n<p class=\"p1\"><span class=\"s1\">IndexDocValues, however, are not tokenized or otherwise manipulated, so you can simply add an IndexDocValuesField with the same name to achieve the same result as with an additional field. 
With IndexDocValues you end up with fewer logical fields, and in this case less is more.<\/span><\/p>\n<p class=\"p1\"><span class=\"s1\">Some applications that use Lucene run in a memory-limited environment; consequently, features like grouping and sorting cannot be used with a medium to large index. IndexDocValues can be retrieved in a disk-resident manner, meaning that document values aren\u2019t loaded into the Java heap space. Applications using hundreds of fields, each of which could be used for sorting, can quickly run out of memory when FieldCaches are pulled. Disk-resident IndexDocValues can help to overcome this limitation at the price of a search performance regression of between 40% and 80%.<\/span><\/p>\n<p class=\"p1\"><span class=\"s1\">At this stage many of Lucene\u2019s and Solr\u2019s features are still catching up with IndexDocValues, but some of them are already there. Let\u2019s take a closer look at some of the features that already support them.\u00a0<\/span><\/p>\n<h1><span class=\"s1\">Solr Schema<\/span><\/h1>\n<p class=\"p1\"><span class=\"s1\">As mentioned, using IndexDocValues you end up with fewer logical fields. In Solr this should be even simpler: you should just be able to mark a field to use IndexDocValues, but unfortunately this has not yet been exposed in Solr&#8217;s <em>schema.xml<\/em>. However, there is an open issue (<a href=\"https:\/\/issues.apache.org\/jira\/browse\/SOLR-2753\">SOLR-2753<\/a>) aiming to address this limitation. If you need IndexDocValues and you are using the Solr trunk, it&#8217;s a great opportunity to get involved in the Solr development community. If you feel like it, give it a go and code something up.<\/span><\/p>\n<h1><span class=\"s1\">Result grouping<\/span><\/h1>\n<p class=\"p1\"><span class=\"s1\">Result grouping by IndexDocValues is still a work in progress but will be committed within the next week or two. 
Making use of IndexDocValues for result grouping is as straightforward as grouping by regular indexed values via FieldCache. If you are in Lucene land you can simply use the IndexDocValues grouping collectors to get started (see Listing 1).<\/span><\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">Query query = new TermQuery(new Term(&quot;description&quot;, &quot;lucene&quot;));\nSort groupSort = new Sort(new SortField(&quot;name&quot;, SortField.Type.STRING));\nSort sortWithinGroup = new Sort(new SortField(&quot;age&quot;, SortField.Type.INT));\n\nint maxGroupsToCollect = 10;\nint maxDocsPerGroup = 3;\nString groupField = &quot;groupField&quot;;\nValueType docValuesType = ValueType.VAR_INTS;\nboolean diskResident = true;\nint groupOffset = 0;\nint offsetInGroup = 0;\n\n\/\/ First pass: collect the top groups according to groupSort.\nAbstractFirstPassGroupingCollector first = DVFirstPassGroupingCollector.create(\n        groupSort, maxGroupsToCollect, groupField, docValuesType, diskResident\n);\n\nsearcher.search(query, first);\n\n\/\/ Second pass: collect the top documents within each of the top groups.\nCollection&lt;SearchGroup&lt;BytesRef&gt;&gt; topGroups = first.getTopGroups(groupOffset, false);\nAbstractSecondPassGroupingCollector&lt;BytesRef&gt; second = DVSecondPassGroupingCollector.create(\n        groupField, diskResident, docValuesType, topGroups, groupSort, sortWithinGroup, maxDocsPerGroup, true, true, false\n);\n\n\/\/ The second collector has to see the matching documents as well.\nsearcher.search(query, second);\n\nTopGroups&lt;BytesRef&gt; result = second.getTopGroups(offsetInGroup);<\/pre>\n<div class=\"portlet-msg-info\">Listing 1: Use IndexDocValues with ResultGrouping<\/div>\n<p class=\"p1\"><span class=\"s1\"><strong>Tip:<\/strong> If you\u2019re dealing with IndexSearcher instances, use Lucene\u2019s <a href=\"http:\/\/blog.mikemccandless.com\/2011\/09\/lucenes-searchermanager-simplifies.html\" target=\"_blank\" rel=\"noopener\">SearcherManager<\/a>.<\/span><\/p>\n<p class=\"p1\">Note that the code is the same for all other types of grouping collectors as described in the <a 
href=\"http:\/\/lucene.apache.org\/java\/3_4_0\/api\/contrib-grouping\/org\/apache\/lucene\/search\/grouping\/package-summary.html\" target=\"_blank\" rel=\"noopener\">grouping contrib page<\/a>. The only new options are <strong>diskResident<\/strong> and <strong>valueType<\/strong>. The <strong>diskResident<\/strong> option controls whether the values are read directly from disk, utilizing the filesystem&#8217;s memory, instead of being loaded into the Java heap space. The <strong>valueType<\/strong> defines the type of the value, like <a href=\"https:\/\/builds.apache.org\/job\/Lucene-trunk\/javadoc\/core\/org\/apache\/lucene\/index\/values\/ValueType.html#FIXED_INTS_32\"><em>FIXED_INTS_32<\/em><\/a>, <a href=\"https:\/\/builds.apache.org\/job\/Lucene-trunk\/javadoc\/core\/org\/apache\/lucene\/index\/values\/ValueType.html#FLOAT_32\"><em>FLOAT_32<\/em><\/a> or <em><a href=\"https:\/\/builds.apache.org\/job\/Lucene-trunk\/javadoc\/core\/org\/apache\/lucene\/index\/values\/ValueType.html#BYTES_FIXED_STRAIGHT\">BYTES_FIXED_STRAIGHT<\/a><\/em>. The ValueType variants also provide flexibility in how those values are stored. When indexing IndexDocValues you have to define both &#8220;what&#8221; and &#8220;how&#8221; you are storing, and this should be consistent throughout your indexing process. However, Lucene provides some under-the-hood TypePromotion if you index different types on the same field across different indexing sessions.<\/p>\n<p class=\"p1\"><strong>Tip<\/strong>: The bytes type has a few variants, one of which is sorted. Using the sorted variant gives you much better performance when grouping.<\/p>\n<h1><span class=\"s1\">Sorting<\/span><\/h1>\n<p id=\"aui_3_2_0_1914\" class=\"p1\"><span class=\"s1\">Sorting by float, int and bytes typed doc values has already been committed to the trunk and can be used in Lucene land. 
Making use of this feature for sorting follows the same path as FieldCache based sorting. By default Lucene uses the FieldCache unless you call <a href=\"http:\/\/builds.apache.org\/\/job\/Lucene-trunk\/javadoc\/core\/org\/apache\/lucene\/search\/SortField.html#setUseIndexValues(boolean)\">SortField#setUseIndexValues(boolean)<\/a>.<\/span><\/p>\n<pre class=\"brush: java; title: ; notranslate\" title=\"\">IndexSearcher searcher = new IndexSearcher(...);\nSortField field = new SortField(&quot;myField&quot;, SortField.Type.STRING);\n\/\/ Sort on IndexDocValues instead of the FieldCache.\nfield.setUseIndexValues(true);\nSort sort = new Sort(field);\n\/\/ Retrieve the top 10 hits in sorted order; the query is defined elsewhere.\nTopDocs topDocs = searcher.search(query, 10, sort);<\/pre>\n<div class=\"portlet-msg-info\">Listing 2: Use IndexDocValues with sorting<\/div>\n<p>In general, using IndexDocValues instead of FieldCache is pretty straightforward. No additional API changes are required.<\/p>\n<h1>Conclusion<\/h1>\n<p>Besides sorting and result grouping there is more IndexDocValues work to do. Other features such as faceting and joining should have doc values based implementations as well.<\/p>\n<p>I think that features currently built on indexed terms will slowly move towards doc values based implementations. In my opinion this is the way forward, because the inverted index can then be used for what it was built for in the first place: <strong>free text search<\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>From a user\u2019s perspective Lucene\u2019s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues\u2019 values can be retrieved quickly and efficiently as Simon Willnauer describes in his first IndexDocValues blog post. 
There are many applications that can benefit from using IndexDocValues for search functionality like flexible scoring, faceting, sorting, [&hellip;]<\/p>\n","protected":false},"author":77,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[15,65],"tags":[35,33],"class_list":["post-6391","post","type-post","status-publish","format-standard","hentry","category-enterprise-search","category-big_data_search","tag-lucene","tag-solr"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>IndexDocValues - their applications - Trifork Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"IndexDocValues - their applications - Trifork Blog\" \/>\n<meta property=\"og:description\" content=\"From a user\u2019s perspective Lucene\u2019s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues\u2019 values can be retrieved quickly and efficiently as Simon Willnauer describes in his first IndexDocValues blog post. 
There are many applications that can benefit from using IndexDocValues for search functionality like flexible scoring, faceting, sorting, [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/\" \/>\n<meta property=\"og:site_name\" content=\"Trifork Blog\" \/>\n<meta property=\"article:published_time\" content=\"2011-11-14T10:19:42+00:00\" \/>\n<meta name=\"author\" content=\"Martijn van Groningen\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Martijn van Groningen\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/\",\"url\":\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/\",\"name\":\"IndexDocValues - their applications - Trifork Blog\",\"isPartOf\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#website\"},\"datePublished\":\"2011-11-14T10:19:42+00:00\",\"author\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/72d3e6a70910facfdef86dd93ced0e57\"},\"breadcrumb\":{\"@id\":\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/trifork.nl\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"IndexDocValues &#8211; their 
applications\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/trifork.nl\/blog\/#website\",\"url\":\"https:\/\/trifork.nl\/blog\/\",\"name\":\"Trifork Blog\",\"description\":\"Keep updated on the technical solutions Trifork is working on!\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/trifork.nl\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/72d3e6a70910facfdef86dd93ced0e57\",\"name\":\"Martijn van Groningen\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/505caa844fb66f275a027798c993c363?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/505caa844fb66f275a027798c993c363?s=96&d=mm&r=g\",\"caption\":\"Martijn van Groningen\"},\"url\":\"https:\/\/trifork.nl\/blog\/author\/martijn\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"IndexDocValues - their applications - Trifork Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/","og_locale":"en_US","og_type":"article","og_title":"IndexDocValues - their applications - Trifork Blog","og_description":"From a user\u2019s perspective Lucene\u2019s IndexDocValues is a bunch of values per document. Unlike Stored Fields or FieldCache, the IndexDocValues\u2019 values can be retrieved quickly and efficiently as Simon Willnauer describes in his first IndexDocValues blog post. 
There are many applications that can benefit from using IndexDocValues for search functionality like flexible scoring, faceting, sorting, [&hellip;]","og_url":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/","og_site_name":"Trifork Blog","article_published_time":"2011-11-14T10:19:42+00:00","author":"Martijn van Groningen","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Martijn van Groningen","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/","url":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/","name":"IndexDocValues - their applications - Trifork Blog","isPartOf":{"@id":"https:\/\/trifork.nl\/blog\/#website"},"datePublished":"2011-11-14T10:19:42+00:00","author":{"@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/72d3e6a70910facfdef86dd93ced0e57"},"breadcrumb":{"@id":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/trifork.nl\/blog\/indexdocvalues-their-applications\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/trifork.nl\/blog\/"},{"@type":"ListItem","position":2,"name":"IndexDocValues &#8211; their applications"}]},{"@type":"WebSite","@id":"https:\/\/trifork.nl\/blog\/#website","url":"https:\/\/trifork.nl\/blog\/","name":"Trifork Blog","description":"Keep updated on the technical solutions Trifork is working 
on!","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/trifork.nl\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/72d3e6a70910facfdef86dd93ced0e57","name":"Martijn van Groningen","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/505caa844fb66f275a027798c993c363?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/505caa844fb66f275a027798c993c363?s=96&d=mm&r=g","caption":"Martijn van Groningen"},"url":"https:\/\/trifork.nl\/blog\/author\/martijn\/"}]}},"_links":{"self":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/6391","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/users\/77"}],"replies":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/comments?post=6391"}],"version-history":[{"count":0,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/6391\/revisions"}],"wp:attachment":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/media?parent=6391"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/categories?post=6391"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/tags?post=6391"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}