What's so cool about elasticsearch?
Whenever you start using a new product and suggest it to customers or colleagues, you need to be prepared to answer one question: “Why should I use it?” The answer could be as simple as “Because it’s cool!”, which of course is the case with elasticsearch, but at some point you may need to explain why. I recently had to answer the question “So what’s so cool about elasticsearch?”, so I thought it might be worthwhile sharing my own answer in this blog.
So here it goes…
First of all, what you’ll notice as soon as you start up is how easy elasticsearch is to use. You index your JSON documents, then you can query and retrieve them, with no configuration needed. One of the reasons is that it’s schema-less, which means it uses some nice defaults to index your data unless you specify your own mapping. More precisely, it has an automatic type guessing mechanism which detects the type of the fields you are indexing, and by default it uses the Lucene StandardAnalyzer to index string fields.
If you need something beyond the standard choices you can always define your own mapping, simply using the Put Mapping API. In fact every feature is exposed as a REST API.
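To make this a bit more concrete, here is a rough sketch of how a Put Mapping request could be built with nothing but the Python standard library. The node address, the `blog` index, the `post` type and the field names are all made up for illustration, and the URL layout follows the typed API of the elasticsearch releases from around the time of writing, so check it against the version you run. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: a node running locally on the default HTTP port.
BASE_URL = "http://localhost:9200"

def put_mapping_request(index, doc_type, properties):
    """Build (but do not send) a Put Mapping request for one type."""
    body = {doc_type: {"properties": properties}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/{doc_type}/_mapping",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical index, type and fields.
mapping_request = put_mapping_request(
    "blog",
    "post",
    {
        # Replace the default analyzer on one field, type a date explicitly.
        "title": {"type": "string", "analyzer": "simple"},
        "published": {"type": "date"},
    },
)
print(mapping_request.get_method(), mapping_request.full_url)
# To actually send it: urllib.request.urlopen(mapping_request)
```

Fields that are not listed keep the defaults, so you only spell out what you want to override.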
Now let’s look at a few examples
How do you index documents? Using the Index API: you send an HTTP PUT request to your node, specifying the index, the type and the id of the document in the URL, with the document itself in JSON format as the request body.
How do you get back a document by id? Using the Get API: you send an HTTP GET request specifying the index, the type and the id of the document in the URL.
And what if you want to make a query? You can use the Search API to submit your query and get your results back, again in JSON format.
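The three calls above could be sketched like this in Python, using only the standard library. The local node address and the `blog`/`post` names are assumptions of mine, and the helper functions are hypothetical, not part of any elasticsearch client. The requests are built and printed, not sent, so you can run this without a node.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, default port

def build_request(method, path, body=None):
    """Build an HTTP request with an optional JSON body."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url=BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )

def index_document(index, doc_type, doc_id, document):
    # Index API: index, type and id in the URL, document as the body.
    return build_request("PUT", f"/{index}/{doc_type}/{doc_id}", document)

def get_document(index, doc_type, doc_id):
    # Get API: same URL, no body.
    return build_request("GET", f"/{index}/{doc_type}/{doc_id}")

def search(index, query):
    # Search API: the query goes in the JSON body.
    return build_request("POST", f"/{index}/_search", {"query": query})

index_req = index_document("blog", "post", 1, {"title": "What's so cool about elasticsearch?"})
get_req = get_document("blog", "post", 1)
search_req = search("blog", {"term": {"title": "elasticsearch"}})

for req in (index_req, get_req, search_req):
    print(req.get_method(), req.full_url)
# Pass any of them to urllib.request.urlopen(...) to send it to a running node.
```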
Up until now I have only mentioned the basics, but there is way more than this. You can modify your settings, see the cluster state and detailed information about every node that belongs to it, delete an index (the same as you would do manually by deleting the index from the filesystem), or even temporarily close one and then reopen it. You (almost) never need to log in to the server to change the configuration.
And what about search, I hear you ask…
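As a small cheat sheet, these are the kind of calls I mean, written down as method and path pairs in Python. The index name `blog` is hypothetical, and the paths are the ones I know from the releases around the time of writing, so treat this as a sketch rather than a reference.

```python
def admin_calls(index):
    """Map a few administrative operations to (HTTP method, URL path)."""
    return {
        "cluster_state": ("GET", "/_cluster/state"),
        "update_settings": ("PUT", f"/{index}/_settings"),
        "close_index": ("POST", f"/{index}/_close"),
        "open_index": ("POST", f"/{index}/_open"),
        "delete_index": ("DELETE", f"/{index}"),
    }

calls = admin_calls("blog")  # "blog" is a made-up index name
for name, (method, path) in calls.items():
    print(f"{name}: {method} {path}")
```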
Well, when it comes to search, elasticsearch provides its own Query Domain Specific Language (DSL), through which you can express whatever complex query you want. You pick the query that you need and write it in JSON format. Some of the queries allow you to nest other queries as well. The query DSL is intuitive and powerful at the same time, great qualities given that queries are never trivial in real projects. You usually end up needing to search on different fields with different weights, apply filters, boost documents based on the value of some pre-defined fields (Facebook likes, recency), and add facets, highlighting and so on. All of this can be done through a single query, so there is a need to express this complexity: the elasticsearch query DSL is the answer.
In our Orange11 search training we give an introduction on how Lucene works internally, which I think is important to address here, even though you are not going to write Lucene code. We talk about what an inverted index looks like, what a segment is, a commit and a flush, and so on. Knowing what happens behind the scenes is really important to understand how a search engine works and, more importantly, how to make good use of it. Elasticsearch actually helps you out with this, since its APIs are directly related to Lucene operations and use the same names. This applies to the query DSL too. For instance, if you’ve used a Term Query you might know that elasticsearch internally uses a Lucene TermQuery to execute it. The same goes for other queries, which means that while learning elasticsearch you get to know Lucene too, since the terminology is the same.
Let’s move on to faceting
This is probably the main reason why a lot of people use full-text search engines nowadays, even if they don’t need to search for text. Facets are almost everywhere and are really important for the user experience. Elasticsearch comes with different types of facets. The Terms Facet is the basic one but it’s really flexible: for instance, it allows you to explicitly exclude some terms from the entries, or to show only the entries that match a provided regular expression. You can add a script to a Terms Facet to specify which entries need to be included or excluded, as well as to edit each entry. But there’s more: you can also make a facet on multiple fields, as well as on a script field, meaning that you can provide the terms on which you want to make the facet through a script. Also, the Histogram and the Date Histogram facets are really useful for numeric and date fields and allow you to analyze your data in depth and make histograms out of it. You can have a look yourself at the other facets that elasticsearch provides out of the box.
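To give an idea of what such a query looks like, here is a minimal sketch that builds a search body as a Python dict and prints it as JSON. The field names (`title`, `body`, `tags`) and the boost value are assumptions for the example; the query types themselves (`bool`, `match`, `term`) and the `highlight` section are part of the query DSL, but check the exact options against the version you use.

```python
import json

def weighted_search(text, tag):
    """Search the body, score title matches higher, and highlight hits."""
    return {
        "query": {
            "bool": {
                # The text has to match the body...
                "must": [{"match": {"body": {"query": text}}}],
                # ...while these nested clauses only raise the score.
                "should": [
                    {"match": {"title": {"query": text, "boost": 3.0}}},
                    {"term": {"tags": tag}},
                ],
            }
        },
        "highlight": {"fields": {"body": {}}},
    }

search_body = weighted_search("near real time", "lucene")
print(json.dumps(search_body, indent=2))
```

Everything lives in one JSON document, which is what makes it practical to send as the body of a single Search API request.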
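Here is a rough sketch of a search body that asks for a few of these facets, built as a Python dict. The field names (`tags`, `published`), the excluded term and the regular expression are made up. The `facets` section is written the way the facet API looks in the releases I have worked with; later elasticsearch versions replaced facets with aggregations, so this will not work unchanged there.

```python
import json

def facets_request():
    """Build a search body that matches everything and asks for three facets."""
    return {
        "query": {"match_all": {}},
        "facets": {
            # Top 10 tags, leaving out an entry we do not want to show.
            "tags": {"terms": {"field": "tags", "size": 10, "exclude": ["misc"]}},
            # Only the tags that match a regular expression.
            "java_tags": {"terms": {"field": "tags", "regex": "java.*"}},
            # Number of documents per month.
            "posts_per_month": {
                "date_histogram": {"field": "published", "interval": "month"}
            },
        },
    }

facet_body = facets_request()
print(json.dumps(facet_body, indent=2))
```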
A Perco what…?
A Percolator. This is a really cool feature that elasticsearch provides. The Percolator allows you to do the opposite of what you’d usually do. Through the Percolate API you can actually index queries and assign a unique identifier to each of them. Instead of querying the index, you can then send your documents and get back which of the registered queries match, and the set of queries to check can be filtered as well. Let’s say you collect articles from the internet and want to know which users would be interested in them. You can register in elasticsearch the topics that every user is interested in as queries, each with an assigned id, which could be the id of the user for instance; after that you can “percolate” the articles and get back which users would be interested in them.
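The idea is easier to see in a toy model. The sketch below is not how elasticsearch implements percolation and does not use its API: it is a tiny in-memory class of my own where a "query" is just a set of terms that must all appear in the text, registered under a user id.

```python
import re

class ToyPercolator:
    """Toy model: store queries first, then match documents against them."""

    def __init__(self):
        self._queries = {}  # query id -> set of terms that must all be present

    def register(self, query_id, required_terms):
        self._queries[query_id] = {term.lower() for term in required_terms}

    def percolate(self, text):
        """Return the ids of the registered queries that match the text."""
        tokens = set(re.findall(r"\w+", text.lower()))
        return sorted(
            query_id
            for query_id, terms in self._queries.items()
            if terms <= tokens  # every required term occurs in the document
        )

percolator = ToyPercolator()
percolator.register("alice", ["lucene"])
percolator.register("bob", ["search", "distributed"])
percolator.register("carol", ["football"])

article = "Elasticsearch is a distributed search engine built on Lucene"
matching_users = percolator.percolate(article)
print(matching_users)  # ['alice', 'bob']
```

The roles are swapped compared to a normal search: the queries are the stored data and the document is the input.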
Common sense, oops search I mean
Elasticsearch is also smart about solving some common search problems. For example, it’s common practice to have a “catch all” field containing all the text that you index, no matter what field it comes from, so that you can search on it unless you want to search only on a specific field. With elasticsearch that’s provided out of the box through the special _all field.
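In a query this simply means you can point at `_all` instead of a specific field. A minimal sketch, where the helper function and the `title` field are hypothetical and the `match` query syntax should be checked against your version:

```python
import json

def text_search(text, field="_all"):
    """Search one field, falling back to the catch-all _all field."""
    return {"query": {"match": {field: text}}}

anywhere = text_search("lucene")
title_only = text_search("lucene", field="title")
print(json.dumps(anywhere))
print(json.dumps(title_only))
```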
Furthermore, it’s usually hard to explain to people who haven’t worked with Lucene the difference between an indexed field and a stored field. People want to search and get back what they indexed, while with Lucene you need to explicitly specify whether you want to store each field; when you search you get back only the fields that you actually stored, while all the others are gone. That means that you can search on them if they are indexed, but you’ll never get back the original field content that you indexed, simply because it’s not there. Well, elasticsearch stores the whole source document that you index by default, so that it’s there when you need it. This becomes really handy if you forget to store a field that you need to show within the search results (a price to pay for poor performance), otherwise you’d need to reindex your data. Last but not least, this feature allows you to use elasticsearch as a NoSQL store.
And since a Lucene index is composed of write-once segments, if you want to update a document in Lucene you need to delete it and reindex it; there’s no other way. A search engine built on top of Lucene can try to work around this, in order to avoid the need to resubmit the whole document. The problem is that it needs to know what the original document looked like in order to be able to delete it and resubmit the new version automatically, so that users need to provide only the required changes rather than the whole updated document. That wouldn’t be possible unless you stored all your fields in Lucene, and that’s where the special _source field comes into the picture. You can easily update documents in elasticsearch through the Update API. It automatically retrieves the original document (which is possible since it stores the whole source document) and reindexes it, applying the requested changes.
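The difference between indexed, stored and source is easier to see in a toy model. The class below is my own simplification, not Lucene or elasticsearch code: every field is indexed (searchable), only some fields are stored, and optionally the whole original document is kept as well.

```python
import re
from collections import defaultdict

class ToyIndex:
    """Toy model of indexed fields, stored fields and a kept source document."""

    def __init__(self, stored_fields, keep_source):
        self.stored_fields = set(stored_fields)
        self.keep_source = keep_source
        self.postings = defaultdict(set)  # (field, term) -> ids of matching docs
        self.stored = {}                  # doc id -> stored fields only
        self.source = {}                  # doc id -> full original document

    def index(self, doc_id, document):
        # Every field is indexed, so every field can be searched.
        for field, value in document.items():
            for term in re.findall(r"\w+", str(value).lower()):
                self.postings[(field, term)].add(doc_id)
        # Only the explicitly stored fields can be returned later...
        self.stored[doc_id] = {
            f: v for f, v in document.items() if f in self.stored_fields
        }
        # ...unless the whole source document is kept too.
        if self.keep_source:
            self.source[doc_id] = dict(document)

    def search(self, field, term):
        doc_ids = sorted(self.postings.get((field, term.lower()), ()))
        returned = self.source if self.keep_source else self.stored
        return [returned[doc_id] for doc_id in doc_ids]

book = {"title": "Lucene in Action", "price": 40}

stored_only = ToyIndex(stored_fields=["title"], keep_source=False)
stored_only.index(1, book)
with_source = ToyIndex(stored_fields=["title"], keep_source=True)
with_source.index(1, book)

# Both find the document through the price field, which was never stored.
print(stored_only.search("price", "40"))  # [{'title': 'Lucene in Action'}]
print(with_source.search("price", "40"))  # [{'title': 'Lucene in Action', 'price': 40}]
```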
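The mechanism can be sketched with a toy in-memory store. This is a simplified model of my own, not the Update API or its actual implementation: it just shows why keeping the source makes a partial update possible on top of a delete-and-reindex engine.

```python
class ToyStore:
    """Toy model of a partial update built on whole-document reindexing."""

    def __init__(self):
        self.sources = {}   # doc id -> full source document
        self.versions = {}  # doc id -> how many times the doc was (re)indexed

    def index(self, doc_id, document):
        # The only write the engine supports: replace the whole document.
        self.sources[doc_id] = dict(document)
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1

    def update(self, doc_id, changes):
        # 1. Fetch the original document from the stored source.
        original = self.sources[doc_id]
        # 2. Apply only the requested changes.
        merged = {**original, **changes}
        # 3. Reindex the full, merged document.
        self.index(doc_id, merged)
        return merged

store = ToyStore()
store.index("1", {"title": "What's so cool about elasticsearch?", "likes": 1})
updated = store.update("1", {"likes": 2})
print(updated)             # title unchanged, likes is now 2
print(store.versions["1"]) # 2: the update was a full reindex behind the scenes
```

The caller only sends the changed field; step 1 is the part that would be impossible without the stored source.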
No waiting time needed
One other common problem you have with Lucene is that you usually don’t have your documents available for search immediately, and the operation which makes them available is the most expensive one. Elasticsearch works around this problem; as a result the search is Near Real Time (NRT), which means that by default you need to wait, in the worst case, just one second to have your documents available for search once you have indexed them. And you don’t need to take care of this yourself; elasticsearch does it for you by default. More precisely, elasticsearch uses the Lucene NRT API and refreshes the IndexReader automatically every second (by default, and of course configurable). Furthermore, the Get API returns a document by id in real time thanks to the write-ahead transaction log that is kept internally, which allows you to know what happened to the index even though a flush operation (which stores the segments on disk) hasn’t happened yet.
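Here is a toy model of that behaviour. It is a heavy simplification of my own and not how elasticsearch is implemented: a plain dict stands in for the transaction log, another one for what the current reader can see, and `refresh()` is called by hand instead of every second.

```python
class ToyShard:
    """Toy model: get by id is real time, search only sees refreshed docs."""

    def __init__(self):
        self.pending = {}     # stands in for the transaction log
        self.searchable = {}  # what the current index reader can see

    def index(self, doc_id, document):
        # A write is recorded right away but is not searchable yet.
        self.pending[doc_id] = document

    def get(self, doc_id):
        # Get by id checks the log first, so it sees the latest write.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.searchable.get(doc_id)

    def search_ids(self):
        # Search only goes through the current reader.
        return sorted(self.searchable)

    def refresh(self):
        # Reopen the reader: pending writes become visible to search.
        self.searchable.update(self.pending)
        self.pending.clear()

shard = ToyShard()
shard.index("1", {"title": "No waiting time needed"})

print(shard.get("1"))      # found immediately
print(shard.search_ids())  # [] until the next refresh
shard.refresh()
print(shard.search_ids())  # ['1']
```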
And it’s fast too
What I haven’t mentioned up to now is that elasticsearch is simply really fast, and it’s always a pleasure to have a look at its nice codebase in order to see how it works internally. The project is well organized and consistent too. I also haven’t covered its distributed nature, which is well known, or its multi-tenancy capabilities and flexibility. For more details you can have a look at the Data Design Patterns talk that Shay Banon (the author of elasticsearch) gave at the last Berlin Buzzwords.
That’s all for now, even though there are other things that I really like about elasticsearch. I hope I’ve convinced you to at least give it a try if you haven’t done so already. I must admit it still surprises me how, every time I implement elasticsearch, I keep finding new features that I didn’t know about or different aspects that are simply a pleasure to work with. Those are simply the reasons why I think elasticsearch is really cool. How about you?