Importing data from another Solr

by Luca Cavanna, November 8, 2011

The Data Import Handler is a popular way to import data into a Solr instance. It provides out-of-the-box integration with databases, XML sources, e-mails and documents. A Solr instance often has multiple sources, and the import process is usually expensive in terms of time and resources. Furthermore, if you make schema changes you will probably need to reindex all your data; the same happens when you want to upgrade to a Solr version that is not backward compatible with your existing indexes. We can call this the “re-index bottleneck”: once you’ve done the first data import involving all your external sources, you will never want to do it the same way again, especially on large indexes and complex systems.

Retrieving stored fields from a running Solr

An easier solution is to query your existing Solr instance, retrieve all its stored fields and reindex them into a new instance. Everyone can write their own script to achieve this, but wouldn’t it be useful to have this functionality available out of the box inside Solr? This is why the SOLR-1499 issue was created about two years ago. The idea was to add a new EntityProcessor which retrieves data from another Solr instance using Solrj. Recently, effort has been put into getting this feature committed to Solr’s dataimport contrib module: bugs have been fixed and test coverage has been increased. Hopefully it will be released with Solr 3.5.
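To see what such a hand-rolled script involves, here is a minimal Python sketch of the paging logic it needs: pulling every stored document from the source index a page at a time. The host URL, field names and page size are illustrative only; this is essentially the windowing that the SolrEntityProcessor performs internally.

```python
from urllib.parse import urlencode

def page_urls(base_url, num_found, query="*:*", rows=50, fields=None):
    """Build the sequence of /select URLs needed to fetch every stored
    document from a source Solr index, `rows` documents at a time."""
    urls = []
    for start in range(0, num_found, rows):
        params = {"q": query, "start": start, "rows": rows, "wt": "json"}
        if fields:
            # restrict the response to a subset of stored fields
            params["fl"] = ",".join(fields)
        urls.append(base_url.rstrip("/") + "/select?" + urlencode(params))
    return urls

# 120 documents fetched 50 at a time -> three requests (start=0, 50, 100)
urls = page_urls("http://localhost:8983/solr", num_found=120, rows=50)
```

A real script would issue these requests, collect the stored fields from each response, and post the documents to the target instance’s update handler.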

Let’s give it a try ourselves!

First of all we need to set up the Solr instance which will act as the data source: we can use the standard Solr example as explained here. Then we need to set up the Solr instance into which we will import the data: until the feature is committed, we need to check out the trunk or the 3x branch. In this example we will check out the latter and apply the latest SOLR-1499 patch to it.

Let’s run the new Solr instance through ant:

 ant run-example -Dexample.solr.home=example/example-DIH/solr/ -Dexample.jetty.port=8888

The patch itself contains a new example of Data Import Handler based on the new functionality, here is the configuration fragment for the request handler (solrconfig.xml):

 <requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler" name="/dataimport">
    <lst name="defaults">
        <str name="config">solr-data-config.xml</str>
    </lst>
</requestHandler>

Note that the default value for the clean parameter is true, which means the index will be cleaned up before every import; this is annoying, especially if you need to import data from multiple cores in separate phases. The default values for the commit and optimize parameters are also true, so you might want to avoid committing or optimizing after every import if your index is big and you rely on the autocommit behaviour. The following fragment may be useful to specify our own values and lock them down:

 <lst name="invariants">
  <str name="clean">false</str>
  <str name="commit">false</str>
  <str name="optimize">false</str>
</lst>
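Putting the two fragments together, the invariants block sits alongside the defaults inside the request handler definition in solrconfig.xml:

```xml
<requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler" name="/dataimport">
    <lst name="defaults">
        <str name="config">solr-data-config.xml</str>
    </lst>
    <lst name="invariants">
        <str name="clean">false</str>
        <str name="commit">false</str>
        <str name="optimize">false</str>
    </lst>
</requestHandler>
```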

This is an example of solr-data-config.xml containing the basic parameters: the url of the Solr instance acting as the source, and the query to be executed.

 <dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor" query="*:*" url="http://localhost:8983/solr/">
    </entity>
  </document>
</dataConfig>
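With the configuration in place, the import is started by requesting the handler with the full-import command (host, port and core path depend on your setup; this URL assumes the Jetty port chosen above):

```text
http://localhost:8888/solr/dataimport?command=full-import
```

The same handler can be polled with command=status to follow the progress of the import.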

The following are some of the additional parameters you can configure:

  • rows: the number of rows to retrieve with each query; the default is 50;
  • fields: the fl parameter, used to retrieve only a subset of fields;
  • timeout: the timeout for each query; the default is 5 minutes.
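For example, the entity above could be configured to fetch 100 documents per request, retrieving only two fields, with a one-minute timeout (the field names here are illustrative):

```xml
<entity name="sep" processor="SolrEntityProcessor"
        url="http://localhost:8983/solr/"
        query="*:*"
        rows="100"
        fields="id,name"
        timeout="60"/>
```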

You can also specify one or more field elements inside the entity element if you need to transform fields, for example renaming them to match a different schema. In the example above the fields in the source index are the same as in the target index, so we don’t need to specify any field elements.
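As a hypothetical example, if the target schema stored the source’s name field under a field called title, the standard DIH field mapping would look like this:

```xml
<entity name="sep" processor="SolrEntityProcessor" query="*:*"
        url="http://localhost:8983/solr/">
    <field column="name" name="title"/>
</entity>
```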

Conclusion

This blog entry has shown how to reindex your data without involving all your external sources. Note that this approach works only if all your fields are configured as stored. We are still working on this feature; hopefully the SolrEntityProcessor will be included in the Solr 3.5 release.