Geo-Location Search with Solr and Lucene
Like many people, I use Google Maps to find businesses near my house. This kind of search differs considerably from usual free text searches since I’m no longer just asking for companies that include the phrase pizza, I’m also asking for companies that are near a certain point. This functionality has until recently been found mostly in closed source commercial search applications. However, in the last 6 months support for spatial search has begun to be added to Apache Lucene and Solr, much of which has been developed here at JTeam.
Spatial Lucene
Spatial Lucene, a Lucene contrib which has been extended in LUCENE-1732, is at the heart of the spatial search support in Lucene and Solr. It provides a proper Lucene Filter called SpatialFilter which filters out those documents in the index that are outside of the radius of a certain point. Conceptually the filtering need not be very complicated – calculate the distance each document is from the point and filter out those that are too far away. However if this were done for each document in an index of say, 1 million documents, then the filtering would take 10s of seconds – unacceptable given that normal free text searches usually take 10s of milliseconds. Consequently SpatialFilter uses a two step filtering process which considerably reduces the total time needed.
Cartesian Grid Filtering
The first step is applying a Cartesian grid filter. This filter works by wrapping a grid around the Earth, storing in each document which grid square it is in, working out which grid squares are within the radius of center of the search, and filtering out those documents which are not in this squares. This process is remarkably efficient since the calculations of which squares each document is in, is done when the document is first indexed, and finding those documents that are in certain squares can be done by TermQueries. In fact for an index of 1 million documents and a search radius of 30km, this process takes a mere 40ms.
See here for more information about the Cartesian Grid filtering process.
Distance Filtering
Depending on the size of the grid squares and how much they were overlapped by the search area, there is likely to be many documents that passed the Cartesian Grid filtering which are not in the search area. Hence the second step in the filtering process removes these by calculating distances from the center of the search, to each of the documents that passed the first step. While the mathematics behind these calculations is very well known, it is very complicated and involves arc trigonometry. Done sequentially, it can take 2s to calculate the distance for 200,000 documents – still unacceptable. Consequently the distance filtering is distributed across multiple threads which results in the time taken to calculate the distances for the same 200,000 documents using 2 threads, being reduced to 400-500ms – much more acceptable.
The time taken to carry out the calculation for each document can also be reduced by assuming that the world is flat. While this is not going to be acceptable over long distances, for many smaller areas the error this would introduce is minimal. Consequently Spatial Lucene allows the user to choose between two implementations of a GeoDistanceCalculator interface – SphericalGeoDistanceCalculator and FlatPlaneGeoDistanceCalculator.
Latitude, Longitude and Geohashes
I have been intentionally vague on exactly what information defines the location of a document. This is because while the distance calculations done in the distance filtering process assume each document has a latitude and a longitude (or simply x and y co-ordinates), Spatial Lucene does not require this information be stored as 2 distinct fields. It also supports the data being stored in a geohashes, which are decoded when necessary.
Using Spatial Lucene
Lucene users can use the SpatialFilter directly, but are responsible for adding the Cartesian Grid information to their documents. Solr users can use Spatial Lucene through the QParser and UpdateProcessor contributed in SOLR-773, otherwise known as Local Solr.
Local Solr
To Solr developers and users, Spatial Lucene should be blackbox. Instead the focus of Local Solr is how to best to incorporate Spatial Lucene, whilst having the least effect on the standard Solr indexing and querying process. To do this, Local Solr contains 3 major components: the SpatialTierQParser which supports translating spatial search queries into instances of SpatialFilter, SpatialTierUpdateProcessor which adds the Cartesian Grid information to documents, and the DistanceFieldValueSource which adds the calculated distances to the search results.
SpatialTierQParser
The SpatialTierQParser supports the following standard syntax for spatial searches:
q={!spatial lat=4.56 lng=10.89 radius=10 calc=arc unit=km}name:pizza
From this, the SpatialTierQParser will construct a FilteredQuery consisting of the user’s actual query, name:pizza, and an instance of SpatialFilter which will filter out those documents that are more than 10km away from latitude 4.56 longitude 10.89. The arguments calc and unit are purely optional, with the only required arguments being lat, lng and radius.
To use the SpatialTierQParser, a user must add the following to their solrconfig.xml:
<br> <queryParser name="spatial" class="org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin" /><br>
and include latField=lat&lngField=lng in search requests.
SpatialTierUpdateProcessor
Solr users are used to adding documents in many different ways, whether its in XML, CSV posts or through the DataImportHandler and do not want to have to add in the Cartesian Grid information themselves. To do this for them, the SpatialTierUpdateProcessor can be added to the UpdateProcessorChain, which each of the UpdateHandlers use. This Processor uses the latitude and longitude of each document to work out what Carestian grid squares the document is in, and adds this information to the documents before they are indexed. To the user, the updating process has not changed at all and they can remain oblivious to complex processing that is occurring underneath.
To use the SpatialTierUpdateProcessor, a user must add the following to their solrconfig.xml:
<br> <updateRequestProcessorChain><br> <processor class="org.apache.solr.spatial.tier.SpatialTierUpdateProcessorFactory"><br> <str name="latField">lat</str><br> <str name="lngField">lng</str><br> <str name="tierPrefix">_tier_</str><br> <int name="startTier">9</int><br> <int name="endTier">17</int><br> </processor><br> <processor class="solr.RunUpdateProcessorFactory"/><br> <processor class="solr.LogUpdateProcessorFactory"/><br> </updateRequestProcessorChain><br>
And to their schema.xml:
<br> <field name="lat" type="sdouble" indexed="true" stored="true" required="true" multiValued="false" /><br> <field name="lng" type="sdouble" indexed="true" stored="true" required="true" multiValued="false" /><br> <dynamicField name="_tier_*" type="sdouble" indexed="true" stored="false"/><br>
DistanceFieldValueSource
Solr user’s naturally want to see in the search results how far each document is from the center of their search. Without it, whats the point really. But distances are not fields that can be added to documents – they change from one search to the next. Further complicating the problem, Solr does not currently support anyway of adding arbitrary information to search results, only the search response, but that goes against our goal of minimizing the effect of integrating Spatial Lucene.
The solution to this included in SOLR-773 is the idea of the FieldValueSource and FieldValueSourceRegistry. FieldValueSources, which are registered with the FieldValueSourceRegistry at query time, are used by each ResponseWriter to add new fields consisting of arbitrary data, to the search results at query time. Hence the DistanceFieldValueSource, which reads the distances stored in the SpatialFilter, is able to add the calculated distances to the results being returned by Solr, as though they were another field in the documents.
To use the DistanceFieldValueSource, simply include distanceField=geo_distance&fl=geo_distance in search requests.
Multi-Threaded Configuration
In addition to these 3 components, Local Solr allows the user to configure how many threads are given to Spatial Lucene for its multi-threaded filtering. This can be done by first configuring Solr’s ExecutorService (also introduced in SOLR-773) by adding the following to the solrconfig.xml:
<br> <executorService corePoolSize="2" maxPoolSize="10" keepAlive="2"/><br>
and then by including the parameter threadCount in search requests, stating how many threads Spatial Lucene should use to process the request e.g. threadCount=2
Conclusion and Future of Spatial Search
Hopefully I have shown you how extensive and mature the support for spatial search in Lucene and Solr has become and that with not much effort, your Solr users will be able to search for pizza delivery shops near their house. I would love to hear from anyone who is using either LUCENE-1732 or SOLR-773.
Future versions of these patches may include support for search with regular polygons, and the introduction of distance facets, allowing Solr users to be able to filter their results based on the calculated distances.