Migrating Apache Solr to Elasticsearch
Elasticsearch is an innovative and advanced open source distributed search engine, based on Apache Lucene. Over the past several years at Trifork we have been doing a lot of search implementations. Driven by the fact that every other customer wanted the ‘Google experience’ (just a text box: type some text and get relevant results) as part of their application, we started by building our own solutions on top of Apache Lucene. That worked quite well, as Lucene is the de facto standard when it comes to information retrieval. But soon enough, influenced by sites like Amazon, CNET and Funda in the Netherlands, people wanted to offer their users more ways to drill down into the search results by using facets. We briefly ran our own (now discontinued) open source project, FacetSearch, but Solr quickly started getting traction and we decided to jump on that bandwagon.
Starting with Solr
So we started using Solr for our projects and became vocal about our capabilities, which led to even more (international) Solr consultancy and training work. And as Trifork is not in the game to just use open source, but also to contribute back to the community, this led to several contributions (spatial search, result grouping, etc.) and eventually to having several committers on the Lucene (now including Solr) project.
We go back a long way…
While we were well into Solr, Shay Banon, whom we knew from our SpringSource days, started creating his own scalable search solution, Elasticsearch. Although it was, from a technical perspective, a better choice for building scalable search solutions, we didn’t adopt it from the beginning. The main reason was that it was basically a one-man show (a very good one, I might add!), and we didn’t feel comfortable recommending Elasticsearch to our customers: if Shay got hit by a bus, it would mean the end of the project. Luckily, all this changed when Shay and some of the old crew from JTeam (the rest of JTeam is now Trifork Amsterdam) decided to join forces and launch Elasticsearch.com, the commercial company behind Elasticsearch. Now it’s all systems go: what was then our main hurdle has been removed, and we can use Elasticsearch and, moreover, guarantee continuity for the project.
Switching from Solr to Elasticsearch
Obviously we are not alone in the world, and not that unique in our opinions, so we were not the only ones to change our strategy around search solutions. Many others started considering Elasticsearch, doing comparisons and eventually switching from Solr to Elasticsearch. We still regularly get requests to help companies make the comparison. And although there are still reasons why you may want to go for Solr, in the majority of cases (especially when scalability and real-time search are important) the balance more often than not tips in favor of Elasticsearch.
This is why Luca Cavanna from Trifork has written a plugin (a river) for Elasticsearch that will help you migrate from your existing Solr to Elasticsearch. Basically, Elasticsearch pulls the content from an existing Solr cluster and indexes it locally. This plugin allows you to easily set up an Elasticsearch cluster next to your existing Solr, which helps you get up to speed quickly and enables a smooth transition. Obviously, the tool is mostly meant for that purpose: to help you get started. When you decide to switch to Elasticsearch permanently, you would switch your indexing to feed content from your sources directly into Elasticsearch; keeping Solr in the middle is not a recommended setup.
The following description of how to use it is taken from the README.md file of the Solr to Elasticsearch river / plugin.
Getting started
The first thing you need to do is download the plugin. Then create a directory called solr-river in the plugins folder of Elasticsearch (create the plugins folder in the Elasticsearch home directory first if it does not exist yet). Next, unzip the ZIP file and put its contents (all the JAR files) in the newly created folder.
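On a Unix-like system, those steps might look like the following sketch. The Elasticsearch home path and the plugin download URL are assumptions here, so adjust them to your own setup and to the actual release ZIP:

```shell
# Illustrative installation sketch; ES_HOME and the plugin URL are assumptions.
ES_HOME="$HOME/elasticsearch"

# Create the plugins folder (if missing) and the solr-river directory inside it
mkdir -p "$ES_HOME/plugins/solr-river"

# Download the plugin ZIP and unpack its JAR files into the new directory
# (hypothetical URL; use the actual release link):
# curl -LO https://example.com/elasticsearch-river-solr.zip
# unzip -j elasticsearch-river-solr.zip '*.jar' -d "$ES_HOME/plugins/solr-river"
```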
Configure the river
The Solr river allows you to query a running Solr instance and index the returned documents in elasticsearch. It uses the SolrJ library to communicate with Solr.
It’s recommended that the SolrJ version used be the same as the Solr version installed on the server the river queries. The SolrJ version distributed with the plugin is 3.6.1. It is still possible to query other Solr versions, though: the default response format is javabin, but you can work around compatibility issues by simply switching to the xml format using the wt parameter.
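For example, a river pointed at a Solr version that doesn’t speak the bundled javabin format could force the xml response format; this fragment is a sketch, and the Solr URL is just a placeholder:

```
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "wt" : "xml"
    }
}
```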
All the common query parameters are supported.
The solr river is not meant to keep solr and elasticsearch in sync, that’s why it automatically deletes itself on completion, so that the river doesn’t start up again at every node restart. This is the default behaviour, which can be disabled through the close_on_completion parameter.
Installation
Here is how you can easily create the river and index data from Solr, providing just the Solr URL and the query to execute:
```
curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8080/solr/",
        "q" : "*:*"
    }
}'
```
All supported parameters are optional. The following example request contains all the parameters that are supported together with the corresponding default values applied when not present.
```
{
    "type" : "solr",
    "close_on_completion" : "true",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*",
        "fq" : "",
        "fl" : "",
        "wt" : "javabin",
        "qt" : "",
        "uniqueKey" : "id",
        "rows" : 10
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "bulk_size" : 100,
        "max_concurrent_bulk" : 10,
        "mapping" : "",
        "settings" : ""
    }
}
```
The fq and fl parameters can be provided either as an array or as a single value.
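For instance, both of these solr sections would be accepted; the filter queries and field list here are made up purely for illustration:

```
"solr" : {
    "url" : "http://localhost:8983/solr/",
    "q" : "*:*",
    "fq" : ["inStock:true", "category:books"],
    "fl" : "id,title,price"
}
```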
You can provide your own mapping while creating the river, as well as the index settings, which will be used when creating the new index if needed.
The index is created if it does not already exist; otherwise, the documents are added to the existing index with the configured name.
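As a sketch, creating the river with explicit index settings and a mapping might look like the following. The field name, analyzer choice and shard count are illustrative assumptions, and the exact shape of the mapping value (standard Elasticsearch put-mapping syntax keyed by the type name) is an assumption as well, not something the plugin's defaults prescribe:

```
curl -XPUT localhost:9200/_river/solr_river/_meta -d '
{
    "type" : "solr",
    "solr" : {
        "url" : "http://localhost:8983/solr/",
        "q" : "*:*"
    },
    "index" : {
        "index" : "solr",
        "type" : "import",
        "settings" : { "number_of_shards" : 1 },
        "mapping" : {
            "import" : {
                "properties" : {
                    "title" : { "type" : "string", "analyzer" : "standard" }
                }
            }
        }
    }
}'
```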
The documents are indexed using the bulk API. You can control the size of each bulk request (default 100 documents) and the maximum number of concurrent bulk requests (default 10). Once that limit is reached, indexing slows down while waiting for one of the ongoing bulk operations to finish; no documents will be lost.
Limitations
- only stored fields can be retrieved from Solr, and therefore only those get indexed in elasticsearch
- the river is not meant to keep elasticsearch in sync with Solr, but only to import data once; it’s possible to register the river multiple times in order to import different sets of documents, though, even from different Solr instances
- it’s recommended to create the mapping based on the existing Solr schema, in order to apply the correct text analysis while importing the documents; in the future there might be an option to auto-generate it from the Solr schema
We hope the tool helps. Do share your feedback with us; we’re always interested to hear how it worked out for you, and shout if we can help further with training or consultancy.