Document Frequency Limited MultiTermQuerys

by Chris MaleMarch 19, 2012

If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed there there are many variations in the spelling of words.  In some cases they are intentional such as omissions of vowels to reduce message length, in other cases they are unintentional typos and spelling mistakes.

Querying this kind of data since only matching the traditional spelling of a word can lead to many valid results being missed.  One way to includes matches on variations of a word is to use Lucene’s MultiTermQuerys such as FuzzyQuery or WildcardQuery.  For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”.  Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.

You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.

Lets explore how you can focus your MultiTermQuerys on the most common terms in your index.

MultiTermQuery Rewriting & FilteredTermsEnum

Before getting into any code, lets quickly discuss how a MultiTermQuery like FuzzyQuery works.  MultiTermQuerys can be tought of as a high-level Queries; they aren’t directly executable but instead must be rewritten into an executable Query such as BooleanQuery using Query.rewrite(IndexReader).  Internally, MultiTermQuery.rewrite(...) retrieves a FilteredTermsEnum from its concrete implementation and iterates over its terms, adding them to the executable Query.

For FuzzyQuery, the FilteredTermsEnum will return all the terms in the index that are within the fuzzy margin of the target term.  For “hotel~” this could be “hotel”, “hot3l”, “hotl” or “hotal”.

It is this FilteredTermsEnum that we can manipulate to weed out those terms which are infrequent.

DocFreqFilteredTermsEnum

The following code is an implementation of FilteredTermsEnum which wraps another and filters out those terms which are found in less than a defined percentage of Documents in the index:

Lucene 3.x

(Note, in Lucene 3.x FilteredTermsEnum is called FilteredTermEnum)

 public class DocFreqFilteredTermEnum extends FilteredTermEnum {

  private final float rewrittenTermDocFreqPercent;
  private final int maxDoc;

  public DocFreqFilteredTermEnum(
            FilteredTermEnum termEnum,
            float rewrittenTermDocFreqPercent,
            int maxDoc) throws IOException {
    this.rewrittenTermDocFreqPercent = rewrittenTermDocFreqPercent;
    this.maxDoc = maxDoc;
    setEnum(termEnum);
  }

  @Override
  protected boolean termCompare(Term term) {
    return ((float) actualEnum.docFreq() / maxDoc) >= rewrittenTermDocFreqPercent;
  }

  @Override
  public float difference() {
    return ((FilteredTermEnum) actualEnum).difference();
  }

  @Override
  protected boolean endEnum() {
    // Keep going and let the wrapped FilteredTermEnum decide when to stop
    return false;
  }

}

This code can be used with a FuzzyQuery to filter out those terms which occur in less than 1% of Documents as follows:

 FuzzyQuery fuzzyQuery = new FuzzyQuery(term) {
  @Override
  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    return new DocFreqFilteredTermEnum(super.getEnum(reader), 0.01, reader.maxDoc());
  }
};

Or alternatively, as of Lucene 3.6 you can override RewriteMethod.getTermsEnum(IndexReader, MultiTermQuery) to return DocFreqFilteredTermEnum, so that you don’t need any subclass any MultiTermQuery implementation.

Lucene 4.x

 public class DocFreqFilteredTermsEnum extends FilteredTermsEnum {

  private final float rewrittenTermDocFreqPercent;
  private final int maxDoc;

  public DocFreqFilteredTermsEnum(
            FilteredTermsEnum termEnum,
            float rewrittenTermDocFreqPercent,
            int maxDoc) throws IOException {
    super(termsEnum);
    this.rewrittenTermDocFreqPercent = rewrittenTermDocFreqPercent;
    this.maxDoc = maxDoc;
  }

  @Override
  protected AcceptStatus accept(BytesRef term) {
    return (((float) actualEnum.docFreq() / maxDoc) >= rewrittenTermDocFreqPercent) ?
    AcceptStatus.YES : AcceptStatus.NO_AND_SEEK;
  }
}
</pre> <p> Again to use this code:</p> <pre class="brush:java"> FuzzyQuery fuzzyQuery = new FuzzyQuery(term) {
  @Override
  protected FilteredTermEnum getTermsEnum(IndexReader reader) throws IOException {
    return new DocFreqFilteredTermsEnum(super.getTermsEnum(reader), 0.01, reader.maxDoc());
  }
};

And again, like Lucene 3.6, you can alternatively override RewriteMethod.getTermsEnum(IndexReader, MultiTermQuery).

Conclusion

Having control over the terms used in a MultiTermQuery can considerably improve your search performance while still allowing you to overcome spelling variations.  Have you encountered a scenario where you’ve been executing MultiTermQuerys over use generated data? Did they impact the performance of your searches? Was that due to the larger number of terms being queried? If so, we’d love to hear about your experiences.