The State and Future of Spatial Search
The release of Solr 3.1, containing Solr’s official spatial search support, has coincided with a new debate about the future of spatial search in Solr and Lucene. JTeam has been involved in the development of spatial search support for a number of years and we maintain our own spatial search plugin for Solr. Consequently, this debate is of great interest to us. As it seems unclear at this stage whether the development of spatial search should stay within the Lucene project, or even within the ASF at all, I feel it’s important to get some opinions on the matter, and who better to turn to than our readers? So let’s take a quick journey through the history of spatial search in Solr and Lucene, talk about what is currently supported and what’s not, and discuss the issues confronting its future.
A Brief History of Spatial Search
Spatial search for Solr and Lucene began with the LocalSolr/LocalLucene project hosted on SourceForge. This project provided rudimentary support for finding documents within a given radius of a point and sorting them by their distance from that point. It used a Cartesian Tier system to improve filtering performance by reducing a bounding box lookup to a handful of TermQuerys. After some success, this project was encouraged to move into Lucene, and so LocalLucene became the spatial contrib that is currently still in Lucene.
Unfortunately, upon joining Lucene, the contrib lost the attention of its maintainers and suffered because of it. Despite many attempts (including some led by myself) to fix bugs or rewrite the whole codebase, the contrib remained poorly designed, documented and tested. However, it did what it was designed to do (or so we thought), and consequently was quite popular.
Spatial search support was not added to Solr at the same time, and it wasn’t until almost 2 years later that attention was given to how best Solr could be adapted to meet the needs of spatial search. This effort (tracked in SOLR-773) required a considerable number of changes to Solr, such as support for sorting by FunctionQuery and the creation of complex FieldTypes.
During the work on SOLR-773, fundamental flaws in the spatial contrib, such as incorrect projections, were discovered that precluded Solr from using the contrib’s Cartesian Tier support. Since this was at the core of the contrib, it was decided that Solr would not use the tier system in its spatial support (it would instead use simple bounding boxes or nothing at all) and that the contrib would be deprecated and eventually removed.
The question raised at that point was what, if anything, would replace the contrib?
Spatial Search in Solr 3.1
Despite the spatial contrib’s issues, work on adding spatial search in Solr 3.1 was completed. The support has the following features:
- PolyFieldTypes which abstract away the complexity of indexing spatial data.
- A SpatialFilter which uses NumericRangeQuery based bounding boxes to easily filter out documents outside of a certain range.
- A range of FunctionQuery functions for calculating the distance from points to documents, using a wide variety of different calculations and algorithms.
- Support for sorting by FunctionQuery, allowing search results to be ordered from closest to furthest.
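To make the point-distance approach concrete, here is a minimal Python sketch of the same three steps: a cheap bounding box filter, an exact distance calculation, and a sort by distance. It is not Solr’s code. The function names, the haversine formula as the distance calculation, the Earth radius constant and the sample coordinates are all my own assumptions for illustration, and the bounding box ignores the poles and the date line.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_km):
    """Lat/lon box enclosing the search circle (no pole or date line handling)."""
    angular = radius_km / EARTH_RADIUS_KM
    dlat = math.degrees(angular)
    dlon = math.degrees(math.asin(math.sin(angular) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def search(docs, lat, lon, radius_km):
    """Return (doc_id, distance_km) pairs within the radius, closest first."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    hits = []
    for doc_id, (doc_lat, doc_lon) in docs.items():
        # Cheap range check first, comparable in spirit to a range-query filter
        if not (min_lat <= doc_lat <= max_lat and min_lon <= doc_lon <= max_lon):
            continue
        # Exact distance only for documents that survive the box filter
        dist = haversine_km(lat, lon, doc_lat, doc_lon)
        if dist <= radius_km:
            hits.append((doc_id, dist))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical sample documents, each a single point
docs = {
    "amsterdam": (52.3702, 4.8952),
    "utrecht": (52.0907, 5.1214),
    "paris": (48.8566, 2.3522),
}
for doc_id, dist in search(docs, 52.3702, 4.8952, 50.0):
    print(doc_id, round(dist, 1))
```

The box check does the bulk of the filtering with simple numeric comparisons, so the more expensive trigonometry only runs on the documents that remain.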
One issue not resolved in 3.1 is how best to allow the calculated results of the FunctionQuerys to be added to the search results. This is being addressed as part of the broader ‘pseudo-field’ work in SOLR-2444 and SOLR-1298.
Despite being comprehensive, functional and efficient, Solr 3.1’s support has its own limitations. Firstly, the support is what I refer to as point-distance driven: documents can only be represented as points in a space, rather than as arbitrary shapes, and the filtering/boosting/sorting of documents depends on their distance from a given point. Some argue that this is sufficient for 80% of use cases, but it is not so useful for anybody wanting to incorporate Solr into a GIS application. The problem with extending the support is that as soon as you begin to deal with other shapes, the complexity of spatial search increases rapidly.
The second issue is that Solr’s support was built entirely in, and is tightly coupled to, Solr. For Solr users this is fine, and from a development perspective it means Solr’s support can be quickly changed if need be. However, for Lucene users who are facing the spatial contrib being deprecated and removed, Solr’s support is not a viable alternative unless they want to add the full Solr core dependency to their projects.
Spatial Module
To address the second issue, it was suggested that a new spatial module (to sit in Lucene 4’s new module section) be created, containing what is salvageable from the existing spatial contrib and what can be extracted from Solr’s spatial support. This module would be better designed, documented and tested than the spatial contrib, and would be used by both Solr and Lucene.
The creation of this module is in progress, and I expect the first steps will be completed in the next few months. However, the nagging problem of the first issue raised above has remained on the minds of some of the lead spatial developers. Is point-distance sufficient? Or should we add support for more complex shapes? If we do, how do we do that in a way that’s maintainable?
Polygons, Geohashes and JTS
In the last few months, Solr and Lucene committer Ryan McKinley (with help from JTeam and myself) began work on an alternative spatial search implementation that looks beyond point-distance support and supports searching for documents defined by bounding boxes and arbitrary polygons. This new implementation uses a Strategy-based API, allowing the spatial algorithm to be chosen from a range of implementations, and new ones to be added easily. At the core of Ryan’s work is what’s called the prefix grid, an implementation that uses quad tree hashes to allow arbitrary shapes to be indexed and queried.
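To give a feel for the quad tree idea, the following Python sketch covers a bounding box with quad tree cells and tests whether a point falls in one of them by comparing string prefixes. This is my own simplified illustration, not Ryan’s implementation: the A/B/C/D cell labels, the function names and the choice of 8 levels are all assumptions, and matching is only accurate to the size of the smallest cell.

```python
WORLD = (-90.0, 90.0, -180.0, 180.0)  # (min_lat, max_lat, min_lon, max_lon)


def quad_cell(lat, lon, levels):
    """Token of the quad tree cell containing a point, one letter per level."""
    min_lat, max_lat, min_lon, max_lon = WORLD
    token = ""
    for _ in range(levels):
        mid_lat = (min_lat + max_lat) / 2
        mid_lon = (min_lon + max_lon) / 2
        north = lat >= mid_lat
        east = lon >= mid_lon
        # A = north-west, B = north-east, C = south-west, D = south-east
        token += {(True, False): "A", (True, True): "B",
                  (False, False): "C", (False, True): "D"}[(north, east)]
        min_lat, max_lat = (mid_lat, max_lat) if north else (min_lat, mid_lat)
        min_lon, max_lon = (mid_lon, max_lon) if east else (min_lon, mid_lon)
    return token


def cover_box(box, max_levels, cell=WORLD, token=""):
    """Tokens of the cells that cover box = (min_lat, max_lat, min_lon, max_lon)."""
    c_min_lat, c_max_lat, c_min_lon, c_max_lon = cell
    b_min_lat, b_max_lat, b_min_lon, b_max_lon = box
    # Prune cells that do not intersect the box at all
    if (c_max_lat <= b_min_lat or c_min_lat >= b_max_lat
            or c_max_lon <= b_min_lon or c_min_lon >= b_max_lon):
        return []
    contained = (b_min_lat <= c_min_lat and c_max_lat <= b_max_lat
                 and b_min_lon <= c_min_lon and c_max_lon <= b_max_lon)
    # Stop early for fully contained cells, or when the finest level is reached
    if token and (contained or len(token) == max_levels):
        return [token]
    mid_lat = (c_min_lat + c_max_lat) / 2
    mid_lon = (c_min_lon + c_max_lon) / 2
    children = {
        "A": (mid_lat, c_max_lat, c_min_lon, mid_lon),
        "B": (mid_lat, c_max_lat, mid_lon, c_max_lon),
        "C": (c_min_lat, mid_lat, c_min_lon, mid_lon),
        "D": (c_min_lat, mid_lat, mid_lon, c_max_lon),
    }
    tokens = []
    for letter, child in children.items():
        tokens.extend(cover_box(box, max_levels, child, token + letter))
    return tokens


def matches(point_token, shape_tokens):
    """A point is inside the covered area if any shape token is a prefix of it."""
    return any(point_token.startswith(t) for t in shape_tokens)


# Hypothetical example: a box roughly around the Netherlands
shape = cover_box((50.0, 54.0, 3.0, 8.0), max_levels=8)
print(matches(quad_cell(52.3702, 4.8952, 8), shape))  # Amsterdam
print(matches(quad_cell(40.7128, -74.0060, 8), shape))  # New York
```

Because every cell token is a prefix of the tokens of the cells inside it, a large area can be described by a few short tokens plus a fringe of longer ones along its edge, which is what makes the scheme a natural fit for term-based indexes.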
At the same time, contributions were being made to Solr by David Smiley that would extend Solr’s support to use geohashes to represent the points of documents and to create efficient bounding box filtering functionality along the same lines as Cartesian Tier (without the bugs of course) and Ryan’s prefix grid.
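For readers unfamiliar with geohashes, here is a small Python sketch of the standard public geohash encoding. It interleaves longitude and latitude bits and writes them out in base 32, so points that are close together usually share a common prefix. This is not David’s Solr code, and the function names are my own; it only shows why a prefix comparison can act as a cheap bounding box filter.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash(lat, lon, length=8):
    """Encode a lat/lon point as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def in_cell(point_hash, cell_prefix):
    """True if the point lies in the geohash cell named by cell_prefix."""
    return point_hash.startswith(cell_prefix)


print(geohash(42.605, -5.603, 5))  # ezs42
print(in_cell(geohash(42.605, -5.603, 8), "ezs4"))
```

One caveat with any prefix scheme is that two nearby points can sit on opposite sides of a cell boundary and share no prefix, so a real filter has to check neighbouring cells as well.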
As mentioned earlier, supporting more complex shapes comes at the price of complexity. To offset this, both Ryan and David use the Java Topology Suite (JTS), a powerful open source Java library that provides a wide range of geometric functionality. Unfortunately, JTS uses the LGPL license, which is incompatible with the ASF’s ASL.
LGPL, ASF, OSGEO
The license incompatibility of JTS represents a greater issue confronting spatial search support in Lucene. Although the Apache SIS project is working towards providing a geographic suite that is ASF friendly, JTS currently provides the best source of the functionality needed to support complex shape spatial search. This raises the question: can JTS be used in Lucene’s spatial search support? The short answer is no. Although runtime dependencies on LGPL libraries are supported, compile time LGPL dependencies are frowned upon in ASF projects.
Equally, you could argue that spatial search is not at the core of Solr and Lucene, which focus on free-text search. Although spatial search uses free-text concepts to be efficient, it could be seen as a user of Lucene, just as Hibernate Search is, and as such doesn’t need to be in Lucene.
Do these issues mean that spatial search should go elsewhere? If so, where?
One possible option is OSGeo, an open source community focused on geographic functionality. It has fewer constraints on licensing, allowing LGPL libraries to be used at compile time. It’s quite well known within the geographic/GIS community, but not nearly as well known generally as the ASF and Lucene. Would spatial search for Lucene suffer if it were located outside of Lucene / the ASF? Would this split the community, and what impact would that have?
80/20
An argument against moving spatial search out of Lucene, and against using JTS at all, is that the point-distance functionality currently implemented in Solr is sufficient to meet 80% of the spatial use cases likely to be encountered in Solr and Lucene. There are often situations in Lucene where precision is sacrificed for efficiency and simplicity (such as stemming during analysis), so why should spatial be any different? If users are looking for complex and precise spatial search, maybe they should be looking outside of the Lucene ecosystem?
I find it hard to evaluate this argument because, yes, in almost all of JTeam’s projects point-distance search has been sufficient. But if the option for more complex spatial search were available, maybe our clients would ask for it, or maybe we’d suggest it?
The Future of SSP
These issues of course then raise questions about the future of our Spatial Search Plugin (SSP). We created SSP originally to meet the needs that we saw in the community and in our own projects – simple efficient point-distance driven search for Solr (and technically Lucene). We developed it to target Solr’s 1.4.x versions, which did not have spatial search support. However Solr 3.1 has now been released with its official spatial search support.
Having SSP developed in-house also allowed us to respond quickly to issues and bugs from our user base and to issue new versions quickly. Although bugs are usually addressed quickly in Lucene, releases are infrequent, with over 12 months between most Solr releases.
Is there still a need for SSP now that Solr has its own spatial search support and there will eventually be a module for Lucene? If so, should SSP incorporate complex search support? Or should it stick to providing efficient support for the 80%?
Conclusion
Spatial search is definitely a hot topic in the Lucene community at the moment. With so many different opinions and options, it’s hard to get a consensus about how best to move forward. So tell me, what do you think?