SSP 2.0 – Spatial Search Plugin for Solr

by Chris Male, December 22, 2010

It has been over a year since we released our Spatial Solr Plugin (SSP) to the community and its great to see that its serving so many users so well. During that time there has also been a great deal of work done on adding official spatial search support to Solr. Much of this work is now complete. However as it is only available in the yet unreleased Solr 3x and 4x, it remains out of the grasp of many users. This coupled with the fact that the future of spatial search support in Lucene is unclear means that there continues to be a place for SSP in the community.

Also during this period we received a lot of feedback from our SSP users about bugs they had found (thanks!) and features they wished SSP had. We also found some bugs ourselves in the Cartesian tier code which had come from Lucene. Faced with the need to fix these bugs as soon as possible and the desire to extend SSP to meet our users’ wishes, we decided to create SSP 2.0, a totally newly designed and written plugin for Solr providing highly efficient and simple spatial search support. In this blog I will introduce SSP 2.0, explain some of the motivations behind our changes and discuss the future for SSP 1.0 users.

Efficient Filtering vs Poles and Meridians

The largest change made in SSP 2.0 was the removal of the Cartesian tier filtering. While an excellent way to quickly and efficiently filter out documents outside of an area, the tier code suffered from bugs in regards to its handling of poles (the North and South variety) and meridians. The code was also very complex and difficult to maintain. Many of the bugs were also identified by Grant Ingersoll in the Lucene trunk. There its future seems dim, with it most likely being deprecated in Lucene 3x and removed in 4x. Consequently when writing SSP 2.0, I sought for a way to remove this code while still providing efficient filtering.

As it turns out a simple and effective way to do this is to exploit the performance of NumericRangeQuerys (NRQs) and create a bounding box around the search circle. Although requiring two NRQs, the bounding box driven approach addressed the problems of the poles and meridians while still providing efficient filtering.

An additional benefit of removing the Cartesian tier code is that the process of indexing data was simplified. It is no longer necessary for fields to be added at index time through an UpdateRequestProcessor since the bounding box NRQs query against the latitude and longitude fields that users are already indexing. This increases indexing performance while reducing the size of your index.

Trading Performance for Performance

When designing SSP 1.0, one of our primary focuses was ensuring high performance. Spatial searches are expensive in comparison to their free text search query cousins, consequently we sought to reduce the cost where possible. We did this in two ways. First, we used multi-threading to compute distances in parallel and secondly, we stored the computed distances and re-used them for sorting and display purposes. It turned out however that this came at a price. Storing the computed distances in the SpatialFilter meant that we could not let the results of the filtering be cached in Solr’s caches. This meant that the filtering had to occur with every query, even for the same search area. Furthermore, storing the computed distances for hundreds of thousands of documents meant that SSP had high resource requirements in a concurrent querying environment.

In SSP 2.0 we made the decision to change from a Lucene Filter which stored the computed distances, to a Lucene Query which scored documents by distance. This meant that the computed distance would become part of the overall score for a query, reducing the memory footprint. The result of the Query could also be cached, meaning that frequently executed queries would have little to no cost. It also had the additional benefit of allowing boosting by distance which I discuss later. It however also meant that we no longer had cached distances to use during sorting and result displaying. Nor could we use multi-threading to reduce the cost of computing distances since Querys are scored in single threads. Consequently we looked for ways to reduce the cost of these operations.

As it turns out, it is possible to compute a simple form of the distance between two points which can be used for comparison sake but not displaying. Consequently we change SSP to use this simple form for scoring and for sorting. This optimization calculation managed to offset some of the performance lost by having to calculate the distance for a document during both filtering and sorting.

For including for true distances in the search results, we adapted our GeoDistanceComponent which was already retrieving the distances from our distance cache, to calculate the distances for the page of results being returned to the user. The cost of calculating distances for say, 10 documents, is negligible and not noticeable.

Distance Relevancy

As mentioned above, SSP 2.0 is based on a Lucene Query which scores documents using a simplified distance. This means that documents which are closer to the centre of our search area are now considered more relevant. This is logical since I would rather go to a pizza restaurant 100m away rather than 1km away. Like other Querys, SSP also supports score boosting. This means that users of SSP can decide for themselves whether distance plays a large or small role in the relevancy of their search results. See below where I discuss the new query syntax for examples.

Simplified Query Syntax

One feature requested by one of our users was to ability to search in multiple search areas, each with their own latitude, longitude and radius. This presented a usability challenge to SSP’s query syntax which was built around the idea of a single search area. Consequently we used this as a motivation to address and simplify our query syntax. We’ve removed as many parameters as possible and reduced the search area syntax to a single parameter circles which concisely defines the areas that will be searched. The following example illustrates the change:

SSP 1.0: q={!spatial lat=52.347 long=4.453 radius=10 unit=km}title:pizza
SSP 2.0: q={!spatial circles=52.347,4.453,10}title:pizza

Here the latitude, longitude and radius are now defined in a unified tuple. It is possible to define multiple search areas by separating the tuples with ‘//’ such as circle=52.347,4.453,10//40.543,10.983,100. Gone are the parameters for units (all distances are now assumed to be in km), calc (all calculations are done using the arc calculator) and threadCount (multi-threading is no longer used).

Two additional parameters have been added:

qtype:Allows the user to specify the QParserPlugin type they wish to use for parsing the free text search component of the query, if one is defined. Users wishing to use dismax queries would set this to dismax. Example {!spatial qtype=dismax …}pizza
boost:Allows the user to define the boost to applied to the spatial queries. Default is 1.0 meaning no boost. Example {!spatial boost=3.0 …}pizza

Query vs Filter Query

A little known feature about SSP is that its not necessary for the spatial aspect of your query to be your main query (q), it can in fact be a filter query (fq). This was not of much use in SSP 1.0 since its Qparser required you to have a query string and because the results of the spatial filtering could not be cached, there wasn’t much benefit to using a filter query.

In SSP 2.0 we addressed these limitations. The SpatialQParser now correctly handles a query string being omitted and since the results of the SpatialQuery are now cachable, you can now benefit from Solr’s filterCache.

Assuming you would normally send the request q={!spatial …}title:pizza, it is now possible to send the request q=title:pizza&fq={!spatial …}.

When should you use which? Well, one limitation of SSP’s support for sorting by distance is that this is handled during the parsing of your main query. Therefore if you wish to sort by distance, you must use the SpatialQParser for your main query. Other than that, it really depends on your use case. If you are doing frequent searches in a few search areas with differing query string, then its best to move the spatial search to a filter query so it can be cached. However if you are often doing a few spatial searches with the same query strings, then its best to have it as your main search query so they can be cached together with the query string results.

Bringing it all Together

So, how do you configure SSP 2.0? The following is an example configuration:

schema.xml

...
<field name="lat" type="double" indexed="true" stored="true"/>
<field name="lng" type="double" indexed="true" stored="true"/>
...

solrconfig.xml

...
<queryParser name="spatial" class="nl.jteam.search.solrext.spatial.SpatialQParserPlugin">
	<str name="latField">lat</str>
	<str name="lngField">lng</str>
</queryParser>

<searchComponent name="geoDistance" class="nl.jteam.search.solrext.spatial.GeoDistanceComponent"/>

<requestHandler name="standard" class="solr.SearchHandler" >
    <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="distanceField">distance</str>
    </lst>
    <arr name="last-components">
        <str>geoDistance</str>
    </arr>
</requestHandler>

Thats all thats required. Gone is the need for the additional dynamic field in the schema.xml, and gone is the need for the UpdateRequestProcessor in the solrconfig.xml.

Summary of Changes

The following is a summary of the major changes made in SSP 2.0:

Cartesian tier code has been removed and replaced by dual NRQ bound boxes
SpatialFilter has been replaced by SpatialQuery which scores documents using a simplified distance and supports boosting
Spatial queries can be now be cached
The SpatialQParserPlugin can be used to define a filter query (fq) instead of a query (q) when support for sorting is not required
Multi-threaded filtering is no longer supported
All distances are now assumed to be in kilometres
All returned distances are computed using the arc great circle calculation
Support for searching in multiple areas has been added
Query syntax has been simplified with redundant parameters removed, search area definition combined into a single circle parameter
qtype local parameter has been added allow the user to define which QParserPlugin should be used to parse the free text component of a search
boost local parameter has been added to allow the user to add a boost to SpatialQuerys
SpatialTierQueryParserPlugin has been renamed to SpatialQParserPlugin

SSP 1.0

While we are very pleased by the improvements made in SSP 2.0, we appreciate that many people are using SSP 1.0. If possible, we recommend that you upgrade from SSP 1.0 to 2.0 so that you can benefit from the improvements discussed above. The process for upgrading is very simple:

Add the new SSP jar to your Solr installation and remove the existing SSP jar
Replace your configuration of the SpatialTierQueryParserPlugin which that of the SpatialQParserPlugin shown above.
Remove the configuration of the SpatialTierUpdateRequestProcessor
Remove the dynamic _tier_* field from your schema.
Consider re-indexing your lat and lng fields to use a TrieDoubleField with a greater precisionStep. This is optional assuming you used a TrieDoubleField in your SSP 1.0 setup as well. The greater precisionStep will increase the performance of the underlying bounding box queries

For those who are unable to upgrade, JTeam is committed to continue supporting SSP 1.0 users. To find out more about how we can do this, contact us.

Future and Roadmap

JTeam has every intention to continue to develop SSP and will strive to patch any bugs as soon as possible. The following features are on our roadmap:

Add single NRQ morton number bounding boxes
Polygon search areas
Clustering / collapsing of results that are physically close to one another
Distance faceting
Continual performance improvements and optimizations

Merry Christmas!