Elasticsearch beyond “Big Data” – running elasticsearch embedded

by Luca Cavanna, September 13, 2012

Trifork has a long track record in doing projects, training and consulting around open source search technologies. Currently we are working on several interesting search projects using elasticsearch. Elasticsearch is an open source, distributed, RESTful search engine built on top of Apache Lucene. In contrast to, for instance, Apache Solr, elasticsearch is built as a highly scalable distributed system from the ground up, allowing you to shard and replicate multiple indices over a large number of nodes. This architecture makes scaling from one server to several hundred a breeze. But it turns out elasticsearch is not only good for what everyone calls “Big Data”: it is also very well suited for indexing only small amounts of documents, and even for running embedded within an application, while still providing the flexibility to scale up later when needed.

Most databases offer full-text search capabilities on the data they store. However, in our experience more is often needed, and that is where Lucene-based solutions come in. Elasticsearch is currently our technology of choice when it comes to greenfield projects, as it provides all the features you typically need and combines them with scalability.

In this case, elasticsearch is part of a bigger project we are doing for the University of Amsterdam (UvA). We use elasticsearch to “cache” course information that is retrieved from a PeopleSoft SiS system and make it searchable. We decided to fire up a local elasticsearch node within an existing Spring web application, using it as an embedded search engine.

We already had a Spring web application in place that retrieved data from PeopleSoft SiS through web services. One of the goals of the project was to allow users to search through the course data. The application was meant to run on a single Tomcat instance, and we wanted to avoid running a separate application server just to provide the search features. Since the application was single-instance, it was feasible to store the index on the same server, on the local file system. For these reasons we started to look into an embedded solution, which would also avoid any HTTP traffic for indexing and searching.

We first considered plain Lucene: the search requirements looked pretty simple at first, but using Lucene directly requires quite some code, while the search servers built on top of it let you concentrate more on your data flow. And even when you don’t write Lucene code yourself, you still need to know quite a lot about it in order to use it properly. Elasticsearch makes your life easier here and provides additional features as well, e.g. caching. That’s why we looked into embedding elasticsearch in our application, which is pretty easy to do thanks to its Java API. All the REST APIs provided by elasticsearch are in fact exposed through Java APIs, since that is effectively how elasticsearch itself processes every request internally.

The first thing you need to do is add the elasticsearch dependency to your POM (assuming you are using Maven). Note that the elasticsearch artifacts are hosted in the Sonatype repository:

<dependency>
   <groupId>org.elasticsearch</groupId>
   <artifactId>elasticsearch</artifactId>
   <version>0.19.9</version>
</dependency>

After that you can either create a Client object in your application code in order to send requests to an existing elasticsearch instance, or a Node object in order to start a new node, optionally joining an existing cluster.
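As a sketch, the two options look roughly like this with the 0.19.x Java API (the class name, host and port are placeholders of our own):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class ClientExamples {

    // Option 1: connect to an already running elasticsearch instance
    // over the transport protocol (9300 is the default transport port)
    public static Client transportClient() {
        return new TransportClient()
            .addTransportAddress(
                new InetSocketTransportAddress("localhost", 9300));
    }

    // Option 2: start a new node inside this JVM and obtain a client from it
    public static Client nodeClient() {
        Node node = NodeBuilder.nodeBuilder().node();
        return node.client();
    }
}
```

In this project we went for the second option, as described below.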

In order to create and start our embedded elasticsearch node within the existing Spring application, we created a FactoryBean and combined it with the InitializingBean and DisposableBean interfaces. The following is the afterPropertiesSet method, which effectively fires up the node:

@Override
public void afterPropertiesSet() throws Exception {
    ImmutableSettings.Builder settings =
        ImmutableSettings.settingsBuilder();
    settings.put("node.name", "orange11-node");
    settings.put("path.data", "/data/index");
    settings.put("http.enabled", false);
    node = NodeBuilder.nodeBuilder()
        .settings(settings)
        .clusterName("orange11-cluster")
        .data(true).local(true).node();
}

Our cluster will be called orange11-cluster rather than the default elasticsearch. Our node will hold data and is a local node, which means that other nodes can join the same cluster only if they run within the same Java process; because of this parameter, the transport port used for inter-node communication (9300 by default) will not be used. Through the settings we provide the name of the node, the directory where the index will be stored, and a boolean that controls whether the HTTP connector is enabled. We don’t need HTTP in production, but it’s useful during development to have direct access to our elasticsearch instance. Let’s not forget to stop our node together with the application:

@Override
public void destroy() throws Exception {
    node.close();
}
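Putting the two lifecycle methods together, the complete factory bean might look like the following sketch (the class name NodeFactoryBean is our own; error handling omitted):

```java
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.FactoryBean;
import org.springframework.beans.factory.InitializingBean;

public class NodeFactoryBean implements FactoryBean<Node>,
        InitializingBean, DisposableBean {

    private Node node;

    @Override
    public void afterPropertiesSet() throws Exception {
        ImmutableSettings.Builder settings =
            ImmutableSettings.settingsBuilder();
        settings.put("node.name", "orange11-node");
        settings.put("path.data", "/data/index");
        settings.put("http.enabled", false);
        // Build and start the embedded node
        node = NodeBuilder.nodeBuilder()
            .settings(settings)
            .clusterName("orange11-cluster")
            .data(true).local(true).node();
    }

    @Override
    public Node getObject() throws Exception {
        return node;
    }

    @Override
    public Class<?> getObjectType() {
        return Node.class;
    }

    @Override
    public boolean isSingleton() {
        return true;
    }

    @Override
    public void destroy() throws Exception {
        node.close();
    }
}
```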

Once our node is started we need to connect to it in order to index and search data, which means creating a Client object out of the existing Node. We can again use a FactoryBean for this. The elasticsearch Client object is thread-safe and its lifecycle is meant to match that of the application itself, so you don’t need to create an instance for each request: a singleton client for the whole application is fine. We’ll inject the Node into the ClientFactoryBean and create the Client object like this:

@Override
public void afterPropertiesSet() throws Exception {
    client = node.client();
}

Again, let’s not forget the destroy method to close the client when the application is stopped:

@Override
public void destroy() throws Exception {
    client.close();
}

We are now ready to use the Client object to submit requests to the elasticsearch node using the Java API. For example, we can create the orange11 index like this, providing our own settings and mapping:

CreateIndexRequest request =
    Requests.createIndexRequest("orange11")
        .settings(yourSettings)
        .mapping("blog", yourMapping);
CreateIndexResponse response =
    client.admin().indices().create(request).actionGet();

We can then index a document like this:

IndexRequest indexRequest =
    Requests.indexRequest("orange11")
        .type("blog")
        .id("1")
        .source(jsonDocument);
IndexResponse indexResponse =
    client.index(indexRequest).actionGet();
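The jsonDocument passed to source() above is simply a JSON string. As a minimal sketch, with hypothetical field names of our own (in a real application you would use a JSON library, or elasticsearch’s XContentBuilder, rather than concatenating strings):

```java
public class JsonDocumentExample {

    // Builds a tiny JSON document by hand; assumes title and content
    // contain no characters that need JSON escaping
    public static String courseDocument(String title, String content) {
        return "{\"title\":\"" + title + "\","
             + "\"content\":\"" + content + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(courseDocument("Intro to Search", "Lucene basics"));
        // → {"title":"Intro to Search","content":"Lucene basics"}
    }
}
```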

And here is how we can search using a query string, one of the many queries provided by the powerful elasticsearch query DSL:

QueryStringQueryBuilder queryStringBuilder =
    QueryBuilders.queryString(query)
        .field("title", 2)
        .field("content");
SearchRequestBuilder requestBuilder =
    client.prepareSearch("orange11")
        .setTypes("blog")
        .setQuery(queryStringBuilder);
SearchResponse response = requestBuilder.execute().actionGet();
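Once the SearchResponse comes back, the hits can be iterated over; a sketch, assuming the 0.19.x API (the helper class and what we print are our own):

```java
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.SearchHit;

public class SearchResultExample {

    // Prints the total hit count, then the id and JSON source of each hit
    public static void printHits(SearchResponse response) {
        System.out.println("Total hits: "
            + response.getHits().getTotalHits());
        for (SearchHit hit : response.getHits()) {
            System.out.println(hit.getId() + " -> "
                + hit.getSourceAsString());
        }
    }
}
```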

We are really happy with this solution since it is exactly what we were looking for, and it performs really well too. We are using a single shard with no replicas, and the index is stored only locally, which can be dangerous. In our case that’s not a problem, since a complete re-index retrieving all the data from the external web service takes only a couple of minutes; that’s why we are not too worried about losing our data. And if one day we need to scale up, it’ll be pretty simple: it’s just a matter of installing an external elasticsearch cluster, slightly modifying the code that creates the Client object, removing the embedded Node from our codebase and doing a complete re-index with an increased number of shards. Our data will then automatically be distributed over the nodes belonging to the cluster.
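For instance, once an external cluster is available, the ClientFactoryBean could create the Client like this instead of calling node.client() (a sketch; the class name and host names are placeholders of our own):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class RemoteClientFactory {

    // Connects to an external cluster over the transport protocol;
    // the embedded Node is no longer needed at this point
    public static Client createClient() {
        Settings settings = ImmutableSettings.settingsBuilder()
            .put("cluster.name", "orange11-cluster")
            .build();
        return new TransportClient(settings)
            .addTransportAddress(
                new InetSocketTransportAddress("es-host-1", 9300))
            .addTransportAddress(
                new InetSocketTransportAddress("es-host-2", 9300));
    }
}
```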

Even though we are only using a subset of elasticsearch’s features for this project, it still adds a lot of value. Conclusion: running elasticsearch embedded within your application can be an easy way to add powerful search capabilities, without having to install and manage a separate process.