Apache Lucene & Solr 3.5.0

by Chris Male, December 14, 2011

Just a little over two weeks ago Apache Lucene and Solr 3.5.0 were released. The released artifacts can be found here and here respectively. As part of the Lucene project’s effort to do regular releases, 3.5.0 is another solid release providing a handful of new features and bug fixes. The following is a review of the release, focusing on the changes I found particularly interesting.

Apache Lucene 3.5.0

Lucene 3.5.0 has a number of very important fixes and changes to both its core index management and userland APIs:

  • LUCENE-2205: Internally, Lucene manages a dictionary of the terms in its index, which is heavily optimized for quick access. However, the dictionary can consume a lot of memory, especially when the index holds millions or billions of unique terms. LUCENE-2205 considerably reduces this memory consumption (by 3-5x) through a rewrite of the data structures and classes used to maintain and interact with the dictionary.
  • LUCENE-2215: Strangely enough, despite deep paging being a common use case, Lucene has never provided an easy and efficient API for it. Instead, users have had to use the existing TopDocs-driven API, which is very inefficient when used with large offsets, or roll their own Collector. LUCENE-2215 addresses this limitation by adding searchAfter methods to IndexSearcher, which efficiently find results that come ‘after’ a provided document in result sets.
  • LUCENE-3454: As discussed here, Lucene’s optimize index management operation has been renamed to forceMerge to correct the common misunderstanding that the operation is vital. Some users had considered it so vital that they optimized after each document was added. Since 3.5.0 is a minor release, IndexWriter.optimize() has only been deprecated. However, it has already been removed from Lucene’s trunk, so users are advised to move over to forceMerge where appropriate.
  • LUCENE-3445, LUCENE-3486: As part of the effort to provide userland classes with easy-to-use APIs for managing and interacting with Lucene indexes, LUCENE-3445 adds a SearcherManager which handles the boilerplate code so often written to manage IndexSearchers across threads and reopens of the underlying IndexReaders. LUCENE-3486 goes one step further by adding a SearcherLifetimeManager, which provides an easy-to-use API for ensuring that users use the same IndexSearcher as they ‘drill down’ or page through results. Interacting with a new IndexSearcher during paging can change the order of results, resulting in a confusing user experience.
  • LUCENE-3426: When using NGrams (for the term “ABCD”, the NGrams could be “AB”, “BC”, “CD”) with PhraseQuerys, the Queries can be optimized by removing any redundant terms (the PhraseQuery “AB BC CD” can be reduced to “AB CD”). LUCENE-3426 provides a new NGramPhraseQuery which performs such optimizations, where possible, on Query rewrite. The benefit is a 30-50% performance improvement in some cases, which is especially valuable for CJK users, where NGrams are prevalent.
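
To make the NGram reduction from LUCENE-3426 concrete, here is a minimal sketch in plain Java. It does not use Lucene and is not the NGramPhraseQuery implementation; the class and method names are hypothetical. It assumes the grams sit at consecutive positions and that keeping every n-th gram plus the final gram is enough, since those grams together still cover every character of the original term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class NGramPhraseSketch {

    /** Splits text into overlapping n-grams, e.g. "ABCD" with n=2 gives AB, BC, CD. */
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /**
     * Returns the gram positions worth keeping in the phrase:
     * every n-th position plus the last one. The skipped grams
     * are redundant because their characters are already covered.
     */
    static List<Integer> requiredPositions(int gramCount, int n) {
        List<Integer> keep = new ArrayList<>();
        int last = gramCount - 1;
        for (int pos = 0; pos < gramCount; pos++) {
            if (pos % n == 0 || pos == last) {
                keep.add(pos);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<String> grams = ngrams("ABCD", 2);
        // Print each surviving gram with the position it must keep in the phrase.
        for (int pos : requiredPositions(grams.size(), 2)) {
            System.out.println(grams.get(pos) + "@" + pos);
        }
    }
}
```

Running it prints AB@0 and CD@2. The positions matter: the reduced phrase only matches the same documents if “CD” is still required two positions after “AB”.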

Lucene 3.5.0 of course contains many smaller changes and bug fixes.  See here for full information about the release.

Apache Solr 3.5.0

Benefiting considerably from Lucene 3.5.0, Solr 3.5.0 also contains a handful of useful changes:

  • SOLR-2066: Result grouping continues to be one of Solr’s most sought-after features, and its power and flexibility grow with SOLR-2066, which adds distributed grouping support. Although it comes at a cost of 3 round trips to each shard, SOLR-2066 all but closes the book on what was once considered an extremely difficult feature to add to Solr, and sets Solr apart from alternative search systems.
  • SOLR-1979: When creating a multi-lingual search system, it is often useful to be able to identify the language of a document as it comes into the system. SOLR-1979 adds out-of-the-box support for this by adding a langid Solr module containing a LanguageIdentifierUpdateProcessor, which leverages Apache Tika’s language detection abilities. In addition to identifying which language a document is in, the UpdateProcessor can map data into language-specific fields, a common way of supporting documents of different languages in a multi-lingual search system.
  • SOLR-2881: With the sorting of Documents that have missing values in a field (known as sortMissingLast) improved in Lucene’s trunk and 3x branch, support for using sortMissingLast with Solr’s Trie fields has been added. Consequently, it is now possible to control whether Documents with no value in a Trie field appear first or last when sorted.
  • SOLR-2769: Solr users are now able to use Hunspell for Lucene through the HunspellStemFilterFactory. The factory allows the affix file and multiple dictionary files to be specified, allowing Solr users to use some of the over 100 Hunspell dictionaries used in projects like OpenOffice and Mozilla Firefox in their analysis chain. This is very useful for users who have to support rarely used languages.
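
The choice SOLR-2881 exposes for Trie fields, whether Documents without a value sort first or last, can be sketched in plain Java. This is only an illustration of the ordering and not Solr code: the class name is hypothetical, null stands in for a Document with no value in the field, and only an ascending sort is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Solr.
class SortMissingSketch {

    /** Sorts ascending; null stands in for a document with no value in the field. */
    static List<Integer> sort(List<Integer> values, boolean missingLast) {
        Comparator<Integer> natural = Comparator.naturalOrder();
        // Wrap the natural ordering so that nulls go to the chosen end.
        Comparator<Integer> order = missingLast
                ? Comparator.nullsLast(natural)
                : Comparator.nullsFirst(natural);
        List<Integer> sorted = new ArrayList<>(values);
        sorted.sort(order);
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(30, null, 10, null, 20);
        System.out.println(sort(prices, true));   // missing values last
        System.out.println(sort(prices, false));  // missing values first
    }
}
```

The first call prints [10, 20, 30, null, null] and the second prints [null, null, 10, 20, 30].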

Solr 3.5.0 also contains many smaller fixes and changes.  See the CHANGES.txt for full information about the release.

Lucene & Solr 3.6.0?

With changes still being made to the 3x branch of both Lucene and Solr, and Lucene and Solr 4 not yet released, it is very likely that 3.6.0 will be released in a couple of months’ time.