Query time joining in Lucene

by Martijn van Groningen, January 22, 2012

Recently query time joining has been added to the Lucene join module in the Lucene svn trunk. The query time joining will be included in the Lucene 4.0 release and there is a possibility that it will also be included in Lucene 3.6.

Let's say we have articles and comments. With the query time join you can store these entities as separate documents. Each comment and article can be updated without re-indexing large parts of your index. Even better would be to store articles in an article index and comments in a comment index! In both cases a comment would have a field containing the article identifier.

In a relational database it would look something like the image above.

Query time joining has been around in Solr for quite a while. It’s a really useful feature if you want to search with a relational flavor. Prior to the query time join, your index needed to be prepared in a specific way in order to search across different types of data. You could either use Lucene’s index time block join or merge your domain objects into one Lucene document. With the join query, however, you can store different entities as separate documents, which gives you more flexibility but comes with a runtime cost.

Query time joining in Lucene is pretty straightforward, and entirely encapsulated in JoinUtil.createJoinQuery. It requires the following arguments:

fromField. The from field to join from.
toField. The to field to join to.
fromQuery. The query executed to collect the from terms. This is usually the user specified query.
fromSearcher. The searcher on which the fromQuery is executed.
multipleValuesPerDocument. Whether the fromField contains more than one value per document (multivalued field). If this option is set to false the from terms can be collected in a more efficient manner.

The static join method returns a query that can be executed on an IndexSearcher to retrieve all documents that have terms in the toField that match the collected from terms. Only the entry point for joining is exposed to the user; the actual implementation is completely hidden, allowing Lucene committers to change the implementation without breaking API backwards compatibility.

The query time joining is based on indexed terms and is currently implemented as a two-pass search. The first pass collects all the terms from a fromField (in our case the article identifier field) of the documents that match the fromQuery. The second pass returns all documents whose toField (in our case the article identifier field in a comment document) contains terms matching those collected in the first pass.
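The two passes can be pictured with plain Java collections. The sketch below is not Lucene's implementation: it assumes a toy in-memory model in which every document is a Map<String, String>, and the class and method names (TwoPassJoinSketch, collectFromTerms, matchToField) as well as the sample documents are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class TwoPassJoinSketch {

    // First pass: collect the fromField values of all documents matching the from query.
    static Set<String> collectFromTerms(List<Map<String, String>> fromDocs,
                                        Predicate<Map<String, String>> fromQuery,
                                        String fromField) {
        Set<String> terms = new HashSet<>();
        for (Map<String, String> doc : fromDocs) {
            if (fromQuery.test(doc) && doc.containsKey(fromField)) {
                terms.add(doc.get(fromField));
            }
        }
        return terms;
    }

    // Second pass: return the documents whose toField value is among the collected terms.
    static List<Map<String, String>> matchToField(List<Map<String, String>> toDocs,
                                                  String toField,
                                                  Set<String> fromTerms) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : toDocs) {
            if (fromTerms.contains(doc.get(toField))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample data: two articles and two comments.
        List<Map<String, String>> articles = List.of(
            Map.of("id", "1", "title", "query time joining"),
            Map.of("id", "2", "title", "byte norms"));
        List<Map<String, String>> comments = List.of(
            Map.of("id", "1", "article_id", "2", "text", "nice post"),
            Map.of("id", "2", "article_id", "1", "text", "thanks"));

        Set<String> fromTerms = collectFromTerms(
            articles, doc -> doc.get("title").contains("byte"), "id");
        List<Map<String, String>> hits = matchToField(comments, "article_id", fromTerms);
        System.out.println(hits.size() + " comment(s) joined to articles " + fromTerms);
    }
}
```

Because the second pass only needs the collected terms, the two lists of documents do not have to live in the same place, which is the property that makes joining across separate indexes possible.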

The query returned from the static join method can also be executed on a different IndexSearcher than the one passed as an argument to the static join method. This flexibility allows anyone to join data from different indexes, provided that the toField exists in that index. In our example this means the article and comment data can reside in two different indices. The article index might not change very often, but the comment index might. This allows you to fine-tune each index to its specific needs.

Let's see how one can use query time joining! Assuming that we have indexed the content shown in the image above, let's search for the comments whose article has ‘byte norms’ in its title:

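import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.join.JoinUtil;

IndexSearcher articleSearcher = ...
IndexSearcher commentSearcher = ...
String fromField = "id";                    // article identifier in the article documents
boolean multipleValuesPerDocument = false;  // "id" holds one value per document
String toField = "article_id";              // article identifier in the comment documents
// This query should yield the article with id 2 as result
BooleanQuery fromQuery = new BooleanQuery();
fromQuery.add(new TermQuery(new Term("title", "byte")), BooleanClause.Occur.MUST);
fromQuery.add(new TermQuery(new Term("title", "norms")), BooleanClause.Occur.MUST);
// Join from the articles matching fromQuery to the comments that reference them
Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, articleSearcher);
TopDocs topDocs = commentSearcher.search(joinQuery, 10);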

If you run the above code snippet, topDocs will contain one hit. This hit refers to the Lucene id of the comment that has the value 1 in the field named “id”. Instead of seeing the article as the result, you see the comment that belongs to the article matching the user’s query.

You could also change the example to return all articles that match a certain comment query. In this example multipleValuesPerDocument is set to false and the fromField (the id field) only contains one value per document. The example would still work if multipleValuesPerDocument were set to true, but it would then work in a less efficient manner.

The query time joining isn’t finished yet. There is still work to do and we encourage you to help with it!

Query time joining that uses doc values instead of the terms in the index. During text analysis the original text is in many cases changed. It might happen that your id is omitted or modified before it is added to the index. As you might expect, this can result in unexpected behaviour during searching. A common work-around is to add an extra field to your index that doesn’t go through text analysis. However, this just adds a logical field that doesn’t actually add meaning to your index. With doc values you wouldn’t need an extra logical field, and the values aren’t analysed.
More sophisticated caching. Currently not much caching happens. Documents that are frequently joined, because the fromTerm is hit often, aren’t cached at all.
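
To make the first item above concrete, here is a minimal sketch of how analysis can change an identifier. The analyze method is a hypothetical stand-in for a lowercasing, tokenizing analyzer, not Lucene's analysis chain, and the identifier ART-2 is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class AnalyzedIdSketch {

    // Hypothetical stand-in for text analysis: lowercase, then split on non-alphanumerics.
    static List<String> analyze(String value) {
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                     .filter(token -> !token.isEmpty())
                     .collect(Collectors.toList());
    }

    // Without analysis the whole value is kept as a single term.
    static List<String> keepAsIs(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        String articleId = "ART-2"; // made-up identifier
        System.out.println("analyzed terms:   " + analyze(articleId));
        System.out.println("unanalyzed terms: " + keepAsIs(articleId));
        System.out.println("analyzed terms still contain the id: "
            + analyze(articleId).contains(articleId));
    }
}
```

With the analyzed variant the terms are art and 2, so a join on the exact value ART-2 has nothing to match against; the unanalyzed variant keeps the identifier intact.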

Query time joining is quite straightforward to use and provides a solution for searching through relational data. As described, there are other ways of doing this. How did you solve your relational requirements in your Lucene-based search solution? Let us know and share your experiences and approaches!