Document Frequency Limited MultiTermQuerys
If you’ve ever looked at user generated data such as tweets, forum comments or even SMS text messages, you’ll have noticed there there are many variations in the spelling of words. In some cases they are intentional such as omissions of vowels to reduce message length, in other cases they are unintentional typos and spelling mistakes.
Querying this kind of data since only matching the traditional spelling of a word can lead to many valid results being missed. One way to includes matches on variations of a word is to use Lucene’s MultiTermQuerys
such as FuzzyQuery
or WildcardQuery
. For example, to find matches for the word “hotel” and all its variations, you might use the queries “hotel~” and “h*t*l”. Unfortunately, depending on how many variations there are, the queries could end up matching 10s or even 100s of terms, which will impact your performance.
You might be willing to accept this performance degradation to capture all the variations, or you might want to only query those terms which are common in your index, dropping the infrequent variations and giving your users maximum results with little impact on performance.
Lets explore how you can focus your MultiTermQuerys
on the most common terms in your index.
MultiTermQuery Rewriting & FilteredTermsEnum
Before getting into any code, lets quickly discuss how a MultiTermQuery
like FuzzyQuery
works. MultiTermQuerys
can be tought of as a high-level Queries; they aren’t directly executable but instead must be rewritten into an executable Query such as BooleanQuery
using Query.rewrite(IndexReader)
. Internally, MultiTermQuery.rewrite(...)
retrieves a FilteredTermsEnum
from its concrete implementation and iterates over its terms, adding them to the executable Query.
For FuzzyQuery, the FilteredTermsEnum will return all the terms in the index that are within the fuzzy margin of the target term. For “hotel~” this could be “hotel”, “hot3l”, “hotl” or “hotal”.
It is this FilteredTermsEnum
that we can manipulate to weed out those terms which are infrequent.
DocFreqFilteredTermsEnum
The following code is an implementation of FilteredTermsEnum
which wraps another and filters out those terms which are found in less than a defined percentage of Documents in the index:
Lucene 3.x
(Note, in Lucene 3.x FilteredTermsEnum
is called FilteredTermEnum
)
public class DocFreqFilteredTermEnum extends FilteredTermEnum { private final float rewrittenTermDocFreqPercent; private final int maxDoc; public DocFreqFilteredTermEnum( FilteredTermEnum termEnum, float rewrittenTermDocFreqPercent, int maxDoc) throws IOException { this.rewrittenTermDocFreqPercent = rewrittenTermDocFreqPercent; this.maxDoc = maxDoc; setEnum(termEnum); } @Override protected boolean termCompare(Term term) { return ((float) actualEnum.docFreq() / maxDoc) >= rewrittenTermDocFreqPercent; } @Override public float difference() { return ((FilteredTermEnum) actualEnum).difference(); } @Override protected boolean endEnum() { // Keep going and let the wrapped FilteredTermEnum decide when to stop return false; } }
This code can be used with a FuzzyQuery
to filter out those terms which occur in less than 1% of Documents as follows:
FuzzyQuery fuzzyQuery = new FuzzyQuery(term) { @Override protected FilteredTermEnum getEnum(IndexReader reader) throws IOException { return new DocFreqFilteredTermEnum(super.getEnum(reader), 0.01, reader.maxDoc()); } };
Or alternatively, as of Lucene 3.6 you can override RewriteMethod.getTermsEnum(IndexReader, MultiTermQuery)
to return DocFreqFilteredTermEnum
, so that you don’t need any subclass any MultiTermQuery
implementation.
Lucene 4.x
public class DocFreqFilteredTermsEnum extends FilteredTermsEnum { private final float rewrittenTermDocFreqPercent; private final int maxDoc; public DocFreqFilteredTermsEnum( FilteredTermsEnum termEnum, float rewrittenTermDocFreqPercent, int maxDoc) throws IOException { super(termsEnum); this.rewrittenTermDocFreqPercent = rewrittenTermDocFreqPercent; this.maxDoc = maxDoc; } @Override protected AcceptStatus accept(BytesRef term) { return (((float) actualEnum.docFreq() / maxDoc) >= rewrittenTermDocFreqPercent) ? AcceptStatus.YES : AcceptStatus.NO_AND_SEEK; } } </pre> <p> Again to use this code:</p> <pre class="brush:java"> FuzzyQuery fuzzyQuery = new FuzzyQuery(term) { @Override protected FilteredTermEnum getTermsEnum(IndexReader reader) throws IOException { return new DocFreqFilteredTermsEnum(super.getTermsEnum(reader), 0.01, reader.maxDoc()); } };
And again, like Lucene 3.6, you can alternatively override RewriteMethod.getTermsEnum(IndexReader, MultiTermQuery)
.
Conclusion
Having control over the terms used in a MultiTermQuery
can considerably improve your search performance while still allowing you to overcome spelling variations. Have you encountered a scenario where you’ve been executing MultiTermQuerys over use generated data? Did they impact the performance of your searches? Was that due to the larger number of terms being queried? If so, we’d love to hear about your experiences.