Apache Lucene Flexible Scoring with IndexDocValues

by Simon Willnauer, November 16, 2011

During Google Summer of Code 2011, David Nemeskey, a PhD student, proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework. Prior to this, and in all Lucene versions released so far, the Vector Space Model was tightly bound into Lucene. If you found yourself in a situation where another scoring model worked better for your use case, you basically had two choices: either override all existing Scorers in Queries and implement your own model, provided you have all the statistics available, or switch to some other search engine that provides alternative models or extension points.

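With Lucene 4.0 this is history! David Nemeskey and Robert Muir added an extensible API as well as index-based statistics, such as the sum of total term frequency and the sum of document frequency per field, to support multiple scoring models. Lucene 4.0 comes with several models built on this API, among them Okapi BM25, language models, Divergence from Randomness, and information-based models.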

Lucene’s central scoring class Similarity has been extended to return dedicated Scorers like ExactDocScorer and SloppyDocScorer to calculate the actual score. This refactoring basically moved the actual score calculation out of the QueryScorer into a Similarity to allow implementing alternative scoring within a single method. Lucene 4.0 also comes with a new SimilarityProvider which lets you define a Similarity per field. Each field could use a slightly different similarity or incorporate additional scoring factors like IndexDocValues.
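To make the per-field idea concrete, here is a minimal, self-contained sketch. It does not use Lucene's API: FieldScoring and PerFieldScoring are hypothetical names invented for this illustration, and the scoring formulas are arbitrary placeholders. It only shows the lookup pattern a SimilarityProvider enables, with one model per field and a fallback for everything else.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a scoring model; not a Lucene type. */
interface FieldScoring {
    float score(int termFreq, float lengthNorm);
}

/** Hypothetical per-field lookup that mirrors the idea of a SimilarityProvider. */
class PerFieldScoring {
    private final Map<String, FieldScoring> byField = new HashMap<>();
    private final FieldScoring fallback;

    PerFieldScoring(FieldScoring fallback) {
        this.fallback = fallback;
    }

    /** Registers a dedicated model for one field and returns this for chaining. */
    PerFieldScoring register(String field, FieldScoring scoring) {
        byField.put(field, scoring);
        return this;
    }

    /** Returns the model registered for the field, or the fallback if there is none. */
    FieldScoring get(String field) {
        return byField.getOrDefault(field, fallback);
    }
}
```

A caller would register a special model for, say, a "title" field and let all other fields share the fallback.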

Boosting Similarity with IndexDocValues

Now that we have a selection of scoring models and the freedom to extend them, we can tailor the scoring function exactly to our needs. Let’s look at a specific use case: custom boosting. Imagine you indexed websites and calculated a PageRank for each, but Lucene’s index-time boosting mechanism is not flexible enough for you. In that case you could use IndexDocValues to store the PageRank. First of all you need to get your data into Lucene, i.e. store the PageRank in an IndexDocValues field. Figure 1 shows an example.

IndexWriter writer = ...;
float pageRank = ...;
Document doc = new Document();
// add a standalone IndexDocValues field
IndexDocValuesField valuesField = new IndexDocValuesField("pageRank");
valuesField.setFloat(pageRank);
doc.add(valuesField);
doc.add(...); // add your title etc.
writer.addDocument(doc);
writer.commit();

Figure 1. Adding custom boost / score values as IndexDocValues

Once we have indexed our documents, we can proceed to implement our custom Similarity to incorporate the PageRank into the document score. However, most of us won’t be in a situation where we can or want to come up with an entirely new scoring model, so we are likely to use one of the scoring models already available in Lucene. But even if we are not entirely sure which one we are going to use eventually, we can already implement the PageRankSimilarity (see Figure 2).

public class PageRankSimilarity extends Similarity {

  private final Similarity sim;

  public PageRankSimilarity(Similarity sim) {
    this.sim = sim; // wrap another similarity
  }

  @Override
  public ExactDocScorer exactDocScorer(Stats stats, String fieldName,
      AtomicReaderContext context) throws IOException {
    final ExactDocScorer sub = sim.exactDocScorer(stats, fieldName, context);
    // simply pull an IndexDocValues Source for the pageRank field
    final Source values = context.reader.docValues("pageRank").getSource();

    return new ExactDocScorer() {
      @Override
      public float score(int doc, int freq) {
        // multiply the pagerank into your score
        return (float) values.getFloat(doc) * sub.score(doc, freq);
      }
      @Override
      public Explanation explain(int doc, Explanation freq) {
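        // placeholder: delegate to the wrapped scorer; a complete version
        // would also report the pageRank factor
        return sub.explain(doc, freq);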
      }
    };
  }
  @Override
  public byte computeNorm(FieldInvertState state) {
    return sim.computeNorm(state);
  }

  @Override
  public Stats computeStats(CollectionStatistics collectionStats,
                float queryBoost, TermStatistics... termStats) {
    return sim.computeStats(collectionStats, queryBoost, termStats);
  }
}

Figure 2. Custom Similarity delegate using IndexDocValues
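If you want to try the delegate pattern without a Lucene checkout, the following sketch reduces it to its core. It is an illustration under assumptions, not Lucene code: DocScorer and BoostingScorer are hypothetical names, and a plain float array stands in for the IndexDocValues Source.

```java
/** Hypothetical minimal scorer interface; ExactDocScorer plays this role in Figure 2. */
interface DocScorer {
    float score(int doc, int freq);
}

/** Wraps any scorer and multiplies its result by a per-document boost. */
class BoostingScorer implements DocScorer {
    private final DocScorer delegate;
    private final float[] boostPerDoc; // stand-in for the IndexDocValues Source

    BoostingScorer(DocScorer delegate, float[] boostPerDoc) {
        this.delegate = delegate;
        this.boostPerDoc = boostPerDoc;
    }

    @Override
    public float score(int doc, int freq) {
        // same shape as Figure 2: boost value times the wrapped model's score
        return boostPerDoc[doc] * delegate.score(doc, freq);
    }
}
```

Because the wrapper only depends on the interface, the underlying model can be swapped without touching the boosting logic, which is the point of wrapping another Similarity.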

With most calls delegated to some other Similarity of your choice, boosting documents by PageRank is as simple as it gets. All you need to do is pull a Source from the IndexReader passed in via AtomicReaderContext. (Atomic in this context means that it is a leaf reader in the Lucene IndexReader hierarchy, also referred to as a SegmentReader.) The IndexDocValues#getSource() method loads the values for this field atomically on the first request and buffers them in memory until the reader goes out of scope (or until you manually unload them; I might cover that in a different post). Make sure you don’t use IndexDocValues#load(), which pulls in the values on each invocation.
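The difference between the two access paths can be sketched with a small load-once holder. This is not how Lucene implements it: LazyValues and its loadCount() counter are hypothetical, and the method names are reused only to mirror the behaviour described above.

```java
import java.util.function.Supplier;

/** Hypothetical load-once holder; illustrates the caching behaviour, not Lucene's code. */
class LazyValues {
    private final Supplier<float[]> loader;
    private float[] cached;   // null until the first getSource() call
    private int loadCount;    // how often the loader actually ran

    LazyValues(Supplier<float[]> loader) {
        this.loader = loader;
    }

    /** Loads on the first call, then keeps returning the buffered array. */
    synchronized float[] getSource() {
        if (cached == null) {
            cached = loader.get();
            loadCount++;
        }
        return cached;
    }

    /** Runs the loader again on every call and never touches the cache. */
    synchronized float[] load() {
        loadCount++;
        return loader.get();
    }

    synchronized int loadCount() {
        return loadCount;
    }
}
```

A scorer is created for every segment on every query, so going through the load-every-time path would repeat the expensive read each time, while the cached path pays for it only once per reader.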

Can I use this in Apache Solr?

Apache Solr already lets you define custom similarities in its schema.xml file. Inside the <types> section you can define a custom similarity per <fieldType>, as shown in Figure 3 below.

<fieldType name="text" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.76</float>
  </similarity>
</fieldType>

Figure 3. Using BM25 Scoring Model in Solr

Unfortunately, IndexDocValues are not yet exposed in Solr. There is an open issue that aims to add support for them, but it has not seen any progress yet. If you feel you can benefit from IndexDocValues and all its features, and you want to get involved in Apache Lucene & Solr, feel free to comment on the issue. I’d be delighted to help you work towards IndexDocValues support in Solr!

What is next?

I haven’t decided what is next in this series of posts, but it’s likely to be yet another use case for IndexDocValues, like Grouping and Sorting, or a closer look at how IndexDocValues are integrated into Lucene’s Flexible Indexing.