Simon says: Single Byte Norms are Dead!

by Simon Willnauer, January 19, 2012

Apache Lucene turned 10 last year with a limitation that has bugged many, many users from day one. You may know that Lucene’s core scoring model is based on TF/IDF (the Vector Space Model). Lucene encapsulates all related calculations in a class called Similarity. Among the pure TF/IDF factors, Similarity also provides a norm value per document that is, by default, a float value composed of length normalization and the field boost. Nothing special so far! However, this float value is encoded into a single byte and written to the index. Lucene trades some precision loss for space on disk and eventually in memory, since norms are loaded into memory per field upon first access.
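To see where the precision goes, here is a simplified standalone sketch of this kind of byte encoding: a 3-bit mantissa and a 5-bit exponent, modeled after the SmallFloat encoding Lucene uses for norms (this is an illustrative reimplementation, not Lucene’s actual class):

```java
public class ByteNormSketch {
    // Encode a float into a single byte: 3 mantissa bits, 5 exponent bits,
    // modeled after the small-float encoding Lucene uses for norms.
    public static byte floatToByte(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);             // keep sign, exponent and 3 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow: round up to smallest nonzero
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                 // overflow: largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Decode the byte back into a float; most of the mantissa is gone.
    public static float byteToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }
}
```

Powers of two survive the round trip, but a boost of 0.9f decodes back as 0.875f — that is the precision a single-byte norm gives up.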

In many cases this precision loss is a fair trade-off, but once you find yourself in a situation where you need to store more information based on statistics collected during indexing, you end up either writing your own auxiliary data structure or “forking” Lucene for your app and messing with the source.

The upcoming version of Lucene already added support for a lot more scoring models, including:

- Okapi BM25
- Divergence from Randomness (DFR)
- Information-Based models
- Language Models

The abstractions added to Lucene to implement those models already open the door for applications that want to either roll their own “awesome” scoring model or modify the low-level scorer implementations. Yet norms are still one byte!

Use DocValues to add scoring factors

Let’s look at some real-world examples where additional scoring factors are required. Imagine you have a hierarchical system, like a geo-search application, and besides your ordinary document boost you also want to boost your documents based on their hierarchical entity.

In previous Lucene versions you could add the hierarchy information as a payload for each term and fetch it at runtime. Using payloads, however, has multiple downsides; besides the redundant data you are storing, you can also incur massive runtime costs. In Lucene 4 you can use DocValues to store your boosts / categories per document in a nice & efficient data structure; one of my previous posts shows how they can be used for scoring. Nice, problem solved! But wait, norms are still one byte?
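Conceptually, DocValues give you a column-stride field: one primitive value per document, addressable by docID. A minimal, Lucene-free sketch of how such a per-document boost could feed into a score (the class name and the flat array are illustrative assumptions, not the DocValues API):

```java
// Conceptual sketch of a column-stride per-document value, as DocValues
// provide it: one primitive value per docID, here held in a flat array.
public class PerDocBoostSketch {
    // Hypothetical per-document category boosts, indexed by docID.
    private final float[] boosts;

    public PerDocBoostSketch(float[] boosts) {
        this.boosts = boosts;
    }

    // Multiply a raw TF/IDF score by the document's stored boost.
    public float score(int docID, float rawTfIdfScore) {
        return rawTfIdfScore * boosts[docID];
    }
}
```

Unlike payloads, the value is stored once per document rather than once per term occurrence, and the lookup is a plain array access instead of a postings traversal.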

Exposing the power of DocValues via Similarity

The limitation of not being able to emit more than 8 bits worth of information during indexing becomes important once you are looking into machine-learning problems where boosts need to be fine-grained, or into alternative scoring models like the lnu weighting scheme (Singhal et al., New Retrieval Approaches in SMART: TREC 4). The lnu scheme needs to store the number of unique terms in a document as well as the total number of terms in the document. Those values won’t fit in a single byte, nor can you store them in a simple DocValues field, since this information is not available until the document has been indexed.

To overcome this limitation, Lucene 4 replaces its old single-byte norms implementation with DocValues, which are already available for other per-document values. This already happened two weeks ago, but simply changing the implementation didn’t help much. Lucene needed a way to allow the user to encode the information from FieldInvertState into the index. We decided to keep the notion of a Norm but allow the user to write out arbitrary fixed-size values directly from the Similarity. We changed the Similarity#computeNorm method accordingly:

public void computeNorm(FieldInvertState state, Norm norm) {
    norm.setInt(state.getLength() << 16 | state.getUniqueTermCount());
}

Figure 1: Encoding field length and number of unique terms as a norm value in Lucene 4.0
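At search time the two statistics can be recovered with the inverse bit arithmetic. A self-contained sketch of the pack/unpack round trip (plain Java, no Lucene API involved; both values are assumed to fit in 16 bits):

```java
public class PackedNormSketch {
    // Pack field length (high 16 bits) and unique term count (low 16 bits)
    // into one int, mirroring the computeNorm example above.
    public static int pack(int length, int uniqueTermCount) {
        return length << 16 | uniqueTermCount;
    }

    // Recover the field length from the high 16 bits.
    public static int length(int norm) {
        return norm >>> 16;
    }

    // Recover the unique term count from the low 16 bits.
    public static int uniqueTermCount(int norm) {
        return norm & 0xFFFF;
    }
}
```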

This allows you to encode norms in Lucene 4 with the precision and the information you need for your application, especially if the defaults don’t suit your needs. Good luck, and let us know how you get on!