{"id":6423,"date":"2012-01-19T19:09:34","date_gmt":"2012-01-19T18:09:34","guid":{"rendered":"http:\/\/blog.trifork.nl\/?p=6423"},"modified":"2012-01-19T19:09:34","modified_gmt":"2012-01-19T18:09:34","slug":"simon-says-single-byte-norms-are-dead","status":"publish","type":"post","link":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/","title":{"rendered":"Simon says: Single Byte Norms are Dead!"},"content":{"rendered":"<p class=\"p1\"><span class=\"s1\"><a href=\"http:\/\/lucene.apache.org\/\">Apache Lucene<\/a><\/span> turned 10 last year with a limitation that has bugged many users since day one. You may know that Lucene\u2019s core scoring model is based on <a href=\"http:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf\"><span class=\"s1\">TF\/IDF<\/span><\/a> (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Vector_space_model\"><span class=\"s1\">Vector Space Model<\/span><\/a>). Lucene encapsulates all related calculations in a class called Similarity. Besides the pure TF\/IDF factors, Similarity also provides a norm value per document that is, by default, a float value composed of length normalization and field boost. Nothing special so far! However, this float value is encoded into a single byte before it is written to the index. 
Lucene trades some precision for space on disk and, eventually, in memory, since <i>norms<\/i> are loaded into memory per field upon first access.<\/p>\n<p class=\"p1\">In many cases this loss of precision is a fair trade-off, but once you need to store more information based on statistics collected during indexing, you end up writing your own auxiliary data structure or \u201cforking\u201d Lucene for your app and messing with the source.<\/p>\n<p class=\"p3\">The upcoming version of Lucene has already added support for many more scoring models, such as:<\/p>\n<ul class=\"ul1\">\n<li class=\"li3\"><a href=\"http:\/\/theses.gla.ac.uk\/1570\/\"><span class=\"s2\">Divergence from Randomness<\/span><\/a><\/li>\n<li class=\"li3\"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Language_model\"><span class=\"s2\">Language Models<\/span><\/a><\/li>\n<li class=\"li3\"><a href=\"http:\/\/dl.acm.org\/citation.cfm?id=1835490\"><span class=\"s2\">Information Based Models<\/span><\/a><\/li>\n<li class=\"li3\"><a href=\"http:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\"><span class=\"s2\">Okapi BM25<\/span><\/a><\/li>\n<\/ul>\n<p class=\"p2\">The abstractions added to Lucene to implement those models already open the door for applications that want to either roll their own \u201cawesome\u201d scoring model or modify the low-level scorer implementations. Yet, norms are still one byte!<\/p>\n<h3><b>Use DocValues to add scoring factors<\/b><\/h3>\n<p class=\"p4\">Let\u2019s look at a real-world example where additional scoring factors are required. Imagine you have a hierarchical system like a geo-search application and, besides your ordinary document boost, you also want to boost your documents based on their hierarchical entity.<\/p>\n<p class=\"p2\">In the previous Lucene version you could add the hierarchy information as a payload for each term and fetch it at runtime. 
Using payloads, however, has several downsides: besides storing redundant data, you can also incur massive runtime costs. In Lucene 4 you can use <a href=\"http:\/\/blog.trifork.nl\/2011\/10\/27\/introducing-lucene-index-doc-values\/\"><span class=\"s1\">DocValues<\/span><\/a> to store your boosts \/ categories per document in a nice &amp; efficient data structure; one of my <a href=\"http:\/\/blog.trifork.nl\/2011\/11\/16\/apache-lucene-flexiblescoring-with-indexdocvalues\/\"><span class=\"s1\">previous posts<\/span><\/a> shows how they can be used for scoring. Nice, problem solved! But wait, norms are still one byte?<\/p>\n<h3><b>Exposing the power of DocValues via Similarity<\/b><\/h3>\n<p class=\"p3\">The limitation of not being able to emit more than 8 bits of information during indexing becomes important once you look into machine-learning problems where boosts need to be fine-grained, or into alternative scoring models like the <b>lnu<\/b> weighting scheme <i>(Singhal et al. in New Retrieval Approaches in SMART: TREC 4)<\/i>. The <b>lnu<\/b> scheme needs to store the number of unique terms in a document as well as the total number of terms in the document. Those values won\u2019t fit in a single byte, nor can you store them in a simple DocValues field, since this information is not available until the document is indexed.<\/p>\n<p class=\"p2\">To overcome this limitation, Lucene 4 replaces its old single-byte norms implementation with <i>DocValues<\/i>, which are already available for other per-document values. This change landed two weeks <a href=\"https:\/\/issues.apache.org\/jira\/browse\/LUCENE-3628\"><span class=\"s1\">ago<\/span><\/a>, but simply changing the implementation didn\u2019t help much. 
Lucene needed a way to allow the user to encode the information from <a href=\"http:\/\/lucene.apache.org\/java\/3_5_0\/api\/core\/org\/apache\/lucene\/index\/FieldInvertState.html\"><span class=\"s1\">FieldInvertState<\/span><\/a> into the index. We decided to keep the notion of a Norm but allow the user to write out arbitrary fixed-size values directly from the <a href=\"http:\/\/lucene.apache.org\/java\/3_5_0\/api\/core\/org\/apache\/lucene\/search\/Similarity.html\"><span class=\"s1\">Similarity<\/span><\/a>. We <a href=\"https:\/\/issues.apache.org\/jira\/browse\/LUCENE-3687\"><span class=\"s1\">changed<\/span><\/a> the Similarity#computeNorm method accordingly:<\/p>\n<pre class=\"brush:java\">@Override\npublic void computeNorm(FieldInvertState state, Norm norm) {\n    \/\/ Pack the field length into the high 16 bits and the unique term count\n    \/\/ into the low 16 bits (assumes both values fit into 16 bits).\n    norm.setInt((state.getLength() &lt;&lt; 16) | state.getUniqueTermCount());\n}<\/pre>\n<div class=\"portlet-msg-info\">Figure 1: Encoding the field length and the number of unique terms as a norm value in Lucene 4.0<\/div>\n<p class=\"p3\">This allows you to encode norms in Lucene 4 with the precision and the information you need for your application, especially if the defaults don\u2019t suit your needs. Good luck and let us know how you get on!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apache Lucene turned 10 last year with a limitation that bugged many many users from day one. You may know Lucene\u2019s core scoring model is based on TF\/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. 
Among pure TF\/IDF factors Similarity also provides a norm value per document that is, [&hellip;]<\/p>\n","protected":false},"author":107,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[15,65],"tags":[35,33,61],"class_list":["post-6423","post","type-post","status-publish","format-standard","hentry","category-enterprise-search","category-big_data_search","tag-lucene","tag-solr","tag-elasticsearch"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Simon says: Single Byte Norms are Dead! - Trifork Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Simon says: Single Byte Norms are Dead! - Trifork Blog\" \/>\n<meta property=\"og:description\" content=\"Apache Lucene turned 10 last year with a limitation that bugged many many users from day one. You may know Lucene\u2019s core scoring model is based on TF\/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. 
Among pure TF\/IDF factors Similarity also provides a norm value per document that is, [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/\" \/>\n<meta property=\"og:site_name\" content=\"Trifork Blog\" \/>\n<meta property=\"article:published_time\" content=\"2012-01-19T18:09:34+00:00\" \/>\n<meta name=\"author\" content=\"Simon Willnauer\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Simon Willnauer\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/\",\"url\":\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/\",\"name\":\"Simon says: Single Byte Norms are Dead! 
- Trifork Blog\",\"isPartOf\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#website\"},\"datePublished\":\"2012-01-19T18:09:34+00:00\",\"author\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/88be6f0de12503d08f3d5f18796e4051\"},\"breadcrumb\":{\"@id\":\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/trifork.nl\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Simon says: Single Byte Norms are Dead!\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/trifork.nl\/blog\/#website\",\"url\":\"https:\/\/trifork.nl\/blog\/\",\"name\":\"Trifork Blog\",\"description\":\"Keep updated on the technical solutions Trifork is working on!\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/trifork.nl\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/88be6f0de12503d08f3d5f18796e4051\",\"name\":\"Simon Willnauer\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/254a556e9dde04a2d02ed76e5971a0fd?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/254a556e9dde04a2d02ed76e5971a0fd?s=96&d=mm&r=g\",\"caption\":\"Simon Willnauer\"},\"description\":\"I am a Apache Lucene PMC and core committer and work mainly on scalable distributed information retrieval 
systems as well as the Lucene core engine. I'm also a co-organizer of BerlinBuzzwords (http:\/\/www.berlinbuzzwords.de) an annual conference on Scalability Berlin.\",\"sameAs\":[\"http:\/\/www.jteam.nl\"],\"url\":\"https:\/\/trifork.nl\/blog\/author\/simonw\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Simon says: Single Byte Norms are Dead! - Trifork Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/","og_locale":"en_US","og_type":"article","og_title":"Simon says: Single Byte Norms are Dead! - Trifork Blog","og_description":"Apache Lucene turned 10 last year with a limitation that bugged many many users from day one. You may know Lucene\u2019s core scoring model is based on TF\/IDF (Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among pure TF\/IDF factors Similarity also provides a norm value per document that is, [&hellip;]","og_url":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/","og_site_name":"Trifork Blog","article_published_time":"2012-01-19T18:09:34+00:00","author":"Simon Willnauer","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Simon Willnauer","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/","url":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/","name":"Simon says: Single Byte Norms are Dead! 
- Trifork Blog","isPartOf":{"@id":"https:\/\/trifork.nl\/blog\/#website"},"datePublished":"2012-01-19T18:09:34+00:00","author":{"@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/88be6f0de12503d08f3d5f18796e4051"},"breadcrumb":{"@id":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/trifork.nl\/blog\/simon-says-single-byte-norms-are-dead\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/trifork.nl\/blog\/"},{"@type":"ListItem","position":2,"name":"Simon says: Single Byte Norms are Dead!"}]},{"@type":"WebSite","@id":"https:\/\/trifork.nl\/blog\/#website","url":"https:\/\/trifork.nl\/blog\/","name":"Trifork Blog","description":"Keep updated on the technical solutions Trifork is working on!","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/trifork.nl\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/88be6f0de12503d08f3d5f18796e4051","name":"Simon Willnauer","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/254a556e9dde04a2d02ed76e5971a0fd?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/254a556e9dde04a2d02ed76e5971a0fd?s=96&d=mm&r=g","caption":"Simon Willnauer"},"description":"I am a Apache Lucene PMC and core committer and work mainly on scalable distributed information retrieval systems as well as the Lucene core engine. 
I'm also a co-organizer of BerlinBuzzwords (http:\/\/www.berlinbuzzwords.de) an annual conference on Scalability Berlin.","sameAs":["http:\/\/www.jteam.nl"],"url":"https:\/\/trifork.nl\/blog\/author\/simonw\/"}]}},"_links":{"self":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/6423","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/users\/107"}],"replies":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/comments?post=6423"}],"version-history":[{"count":0,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/6423\/revisions"}],"wp:attachment":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/media?parent=6423"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/categories?post=6423"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/tags?post=6423"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}