Solr and Lucene 3.1 Release
The new release of Solr and Lucene 3.1, available here and here, is the first major release for Solr in almost two years and the first joint release of both projects. With each project having resolved several hundred issues leading up to the release, let's take a look at the major improvements and new features, including the much sought-after spatial search, UIMA integration and improved analysis support for Unicode.
Together At Last
As mentioned, the release of Solr and Lucene 3.1 is the first joint release of both projects since the merger of their development in 2010. The merger has perhaps had the greatest impact on Solr, which has always struggled to keep up with the changes in Lucene’s API. With the combined source code and testing, a major component of the work leading up to Solr 3.1 was bringing Solr up to date. A particular focus was moving all of Solr’s extensive analysis support over to Lucene’s improved TokenStream API. The result is that Solr users now have access to the latest of Lucene’s powerful features and performance improvements, and benefit from its extensive test suite.
So what does Solr 3.1 contain? With over 200 JIRA issues resolved, it’s safe to say that Solr 3.1 is jam-packed with bug fixes, improvements and new features. The following is a summary of the issues which I feel really stand out:
- Spatial Search in Solr: The heavily requested support for spatial search in Solr is now available. Users can now use a full range of function queries and filters to find documents within a radius of a point, boost those closest to the centre of the circle, and even sort by distance. Having been involved in this issue for a number of years now, it’s really great to see it finally available.
- Sorting by Function Query: As part of the work on SOLR-773, support has been added for sorting search results using Function Queries. This means, for example, that you can now sort your results so that the documents closest to the centre of your search come first, or sort by a mathematical expression over a price contained in each document.
- PolyFields / Complex FieldTypes: Until Solr 3.1, Solr’s FieldTypes had a 1:1 relationship with the Fields that were indexed by Lucene. As part of the work on SOLR-773, this was changed to allow a 1:N relationship, so that Solr can have more powerful complex FieldTypes that hide complexity from users. A prime example of these FieldTypes in action is the new LatLonType, which presents its two index-level fields as a single field to the user.
- UIMA Integration: Apache UIMA is a powerful framework that supports annotating and extracting entities from blocks of unstructured text. SOLR-2129 added an UpdateProcessor that allows users to integrate UIMA into the processing that is done while adding documents. See here for more information on how best to utilise this fantastic feature.
- Extended DisMax: Solr’s DismaxParser has grown in popularity by leaps and bounds. However, it has always been limited to a small subset of Lucene’s query syntax. The new ExtendedDismaxQParser adds support for the full range of Lucene’s query syntax and improved character escaping logic to better handle the -, + and : characters.
- Browse Page: Solr’s VelocityResponseWriter has been used to create a simple search frontend for Solr. Found at /browse in the Solr example, it can be used to quickly produce proofs of concept and demos that illustrate Solr’s power, and during application development as a way to quickly visualize results, facets and even clusters.
- Other issues of note are the addition of CSV ResponseWriting, distributed support for Solr’s SpellCheckComponent, and support for Lucene’s FastVectorHighlighter.
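To make the spatial filtering and sort-by-function items above a little more concrete, here is a minimal sketch of how such a request might be assembled on the client side. It only builds a query string using the JDK; the `geofilt` filter, the `sfield`, `pt` and `d` parameters and the `geodist()` function are taken from Solr's spatial search documentation as I understand it, while the field name `store`, the class name and the coordinates are assumptions made for illustration. Check the parameter names against the documentation for your Solr version before relying on them.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the query-string part of a Solr select request.
class SpatialQueryExample {

    static String buildQuery(double lat, double lon, double radiusKm) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");                   // match everything, then filter
        params.put("fq", "{!geofilt}");           // keep documents within d of pt
        params.put("sfield", "store");            // assumed name of a LatLonType field
        params.put("pt", lat + "," + lon);        // centre of the search
        params.put("d", String.valueOf(radiusKm)); // radius of the search
        params.put("sort", "geodist() asc");      // sort by function: nearest first

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    // URL-encodes a single key or value as UTF-8.
    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always supported", ex);
        }
    }

    public static void main(String[] args) {
        // The /solr/select path is the one used by the Solr example setup.
        System.out.println("/solr/select?" + buildQuery(45.15, -93.85, 5));
    }
}
```

Because the filter and the sort share the same `sfield` and `pt` parameters, the point and field only need to be stated once per request.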
For Lucene, the 3.1 release is a continuation of the work that was started in the 3.0.x series of releases. While it does not have as many new features as Solr 3.1, it too contains over 200 resolved issues, the majority of them bug fixes. One aspect that has once again received a great deal of attention is Lucene’s Analyzers, Tokenizers and Stemmers. The following is once again a summary of the issues which I feel stand out:
- StandardTokenizer using UAX#29: Lucene’s popular StandardTokenizer has received a much-needed facelift through the use of JFlex’s Unicode functionality. While the original StandardTokenizer was primarily focused on supporting European languages, the improved version can now be used for most of the world’s languages.
- Deprecation of Language Specific Tokenizers: Since LUCENE-2167 means the StandardTokenizer can now be used for most languages, the language specific Tokenizers which dogged Lucene have been deprecated, and users are now recommended to use StandardTokenizer instead.
- Lightweight / Minimal European Stemmers: A little-known fact about Snowball stemming is that it is very aggressive, often removing too many characters and producing the same stem for very different words. This can degrade both search performance and result quality. The stemmers introduced in LUCENE-2503 use very lightweight and quick algorithms that do the minimum amount of stemming necessary for each language. These are very much worth using if you are analysing large amounts of European text.
- IndexWriterConfig: Those instantiating IndexWriters have often found themselves confronted by a bewildering array of constructors and setters. The IndexWriterConfig introduced in LUCENE-2294 allows the IndexWriter to be configured once in a dedicated class, which can then be used, almost factory-like, as part of the creation of IndexWriters.
- Numeric* no longer experimental: Lucene’s de facto standard for range queries, NumericRangeQuery, has had its @experimental label removed. This means you can have greater confidence in its functionality and, more importantly, in its API, as it is now covered by Lucene’s backwards compatibility policy.
- New and Improved Directories: With the limits of Lucene’s performance constantly being pushed, one area that is often overlooked is the Directory implementations. A light has been shone on this in Lucene 3.1, with MMapDirectory being improved and new Linux and Windows native implementations added which get around the operating system’s often inhibiting filesystem caching. Note that the native Directories are still very experimental and useful for indexing only.
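To give a feel for what "the minimum amount of stemming" means in the stemmer item above, here is a toy English plural stripper. It is not one of the stemmers added in LUCENE-2503, and it is not Lucene code: the class name, the rules and the length guard are assumptions chosen purely for illustration, loosely in the spirit of simple "S-stemming". The point is only that a light stemmer touches a handful of suffixes and otherwise leaves the word alone.

```java
import java.util.Locale;

// Toy illustration of light stemming; NOT one of the LUCENE-2503 stemmers.
class ToyPluralStemmer {

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() < 3) {
            return w; // assumed guard: leave very short words untouched
        }
        // "queries" -> "query", but leave "...eies" / "...aies" for later rules
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // "...es" loses only its final "s", except after a, e or o
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);
        }
        // plain plural "s", but not "...us" or "...ss"
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"documents", "queries", "glass", "status"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

A stemmer this conservative will miss some conflations that a more aggressive algorithm would catch, but it is far less likely to merge unrelated words into the same stem, which is the trade-off the light stemmers are aiming for.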
While less important to the everyday user, Solr and Lucene 3.1 also contain a series of improvements for developers:
- Enhanced Maven Support: While the debate about the role of Maven in Solr and Lucene seems to have died down, work has been done to improve its support, if only as a second-class citizen. Fully maintained POM templates can now be found in the source code which can be used to create proper POMs with all dependencies listed.
- Parallelized Tests: Anyone who has run Solr and Lucene’s tests knows that such an extensive test suite has one downside: really long build times. However, Solr and Lucene 3.1 now make use of Ant’s support for running JUnit tests in parallel, allowing developers to make use of all their CPU cores.
- Eclipse and IntelliJ support: Most Solr and Lucene developers use either Eclipse or IntelliJ, at least some of the time. While in the past it has been daunting to set up both projects in either IDE, new Ant tasks have been created which will do the heavy lifting for you. Just use ‘ant eclipse’ or ‘ant idea’.
I thoroughly recommend that any Solr and Lucene users currently on a pre-3.x version upgrade to v3.1. With so many improvements and powerful new features, you will really see the benefits. And if nothing else, it will serve as a stepping stone to the unreleased but very powerful Solr and Lucene 4.