Java clients' behavior during a split-brain situation in Elasticsearch
In my previous blog post I explained what the split-brain problem is for elasticsearch and how to avoid it, but only touched briefly on how it manifests. In this post I’m going to expand on what actually happens to your indexing and query requests after a split-brain has occurred. As I’m sure you’re already aware, it depends! It depends on the type of client you use. Because Java is my specialty, I’m going to write about the two types of clients elasticsearch supports through the Java API: the transport client and the node client.
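To make the distinction concrete, here is roughly how each client is created with the 0.90.x Java API. This is only a sketch: the cluster name, class names and the second node's address are placeholders I made up for illustration, not the exact setup used in the test below.

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class ClientFactory {

    // Transport client: a remote client that round-robins requests over the
    // addresses you give it, without joining the cluster itself.
    public static Client transportClient() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "splitbrain-demo") // placeholder cluster name
                .build();
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("192.168.0.1", 9300))
                .addTransportAddress(new InetSocketTransportAddress("192.168.0.2", 9300));
    }

    // Node client: starts a local, client-only node that joins the cluster
    // (no shards are stored on it) and therefore receives the cluster state.
    public static Client nodeClient() {
        Node node = NodeBuilder.nodeBuilder()
                .clusterName("splitbrain-demo") // placeholder cluster name
                .client(true)                   // no data stored on this node
                .node();
        return node.client();
    }
}

The key difference to keep in mind for the rest of this post: the transport client only knows about the addresses you hand it explicitly, while the node client learns the cluster topology by actually joining the cluster.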
Simulating the split-brain problem
The first thing I did was set up two virtual machines running Ubuntu 13.10 inside Oracle’s VirtualBox. Using the “Bridged Adapter” for the network settings, I had a “virtual” elasticsearch cluster up and running in no time.
After creating a single-shard, single-replica index in elasticsearch, I wrote a small Java application that continuously indexed documents into and queried the newly created index, using two separate threads (one for indexing and one for querying).
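For reference, a minimal sketch of what such a test application might look like against the 0.90.x Java API is shown below. The index name, document structure and the ClientFactory helper are assumptions made up for this post, not the actual code I ran.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.index.query.QueryBuilders;

public class SplitBrainTest {

    public static void main(String[] args) {
        final Client client = ClientFactory.transportClient(); // or ClientFactory.nodeClient()

        // Create the index with a single shard and a single replica.
        client.admin().indices().prepareCreate("splitbrain")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("number_of_shards", 1)
                        .put("number_of_replicas", 1))
                .execute().actionGet();

        // Indexing thread: continuously indexes small documents.
        new Thread(new Runnable() {
            public void run() {
                int id = 0;
                while (true) {
                    client.prepareIndex("splitbrain", "doc", String.valueOf(id))
                            .setSource("{\"counter\": " + id + "}")
                            .execute().actionGet();
                    id++;
                }
            }
        }).start();

        // Querying thread: continuously runs a match_all search and prints the hit count.
        new Thread(new Runnable() {
            public void run() {
                while (true) {
                    SearchResponse response = client.prepareSearch("splitbrain")
                            .setQuery(QueryBuilders.matchAllQuery())
                            .execute().actionGet();
                    System.out.println("hits: " + response.getHits().getTotalHits());
                }
            }
        }).start();
    }
}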
In order to simulate the split-brain I simply added rules to iptables, blocking all traffic from one of the machines to the other:
iptables -A INPUT -s 192.168.0.1 -j DROP
iptables -A OUTPUT -d 192.168.0.1 -j DROP
I executed the above commands after indexing ~200 documents. I ran the same test two times, once using the transport client and once using the node client. All tests were executed using version 0.90.5 of elasticsearch (latest at the time of writing).
Testing with the transport client
Not surprisingly (as the transport client is not cluster-aware), breaking the communication between the two nodes of the cluster did not result in any indexing or querying failures with the transport client. After the iptables rules were applied, elasticsearch took about a minute before each node decided that the other had failed. During this time, the transport client paused all indexing and querying. Because I set no timeouts in my code, all requests simply waited quietly for elasticsearch to report a working cluster. As expected, each node then elected itself as master, with each node now holding a distinct copy of the index.
Indexing and querying requests were randomly directed to each node. It goes without saying that the two copies of the index diverged with each indexing request. No failures or warnings were observed – from the client’s point of view everything was working as expected for the full duration of the test.
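As an aside, if you don’t want requests to block indefinitely while the cluster sorts itself out, you can bound the wait on the client side. A minimal sketch, reusing the client and index name from the example above (the five-second value is arbitrary):

import java.util.concurrent.TimeUnit;
import org.elasticsearch.action.index.IndexResponse;

// Wait at most 5 seconds for the response instead of blocking until the
// cluster reports itself healthy again; a timeout exception is thrown otherwise.
IndexResponse response = client.prepareIndex("splitbrain", "doc")
        .setSource("{\"counter\": 1}")
        .execute()
        .actionGet(5, TimeUnit.SECONDS);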
Testing with the node client
Because the node client actually creates a node that joins the elasticsearch cluster, I expected this test to yield different results than the transport client test. After executing the iptables rules, indexing requests paused completely, but querying requests were handled as if nothing had happened, in a round-robin manner between the nodes. The queries returned different results, as the replica shard was out of sync with the primary at the moment I cut off the communication between the nodes.
After each of the two nodes decided that the other had failed, the node client randomly chose one of them as the new cluster master. From that point on, both indexing and querying requests were directed only to this new master. Again, no failures were observed for the duration of the test, but the node client did report that it switched masters (an INFO-level log statement).
Because the node client communicated with only a single node after the split-brain occurred, the two copies of the index did not diverge, but only one of the two was actually up-to-date. I believe that this is an advantage of the node client versus the transport client.
What about a cluster restart?
After a split-brain occurs, the only way to fix it is to resolve the underlying problem and restart the cluster. This is where it gets tricky and slightly scary. When the elasticsearch cluster starts, it elects a master (most likely the first node that starts). Regardless of the fact that the two copies of the index may differ, elasticsearch will consider the shards that reside on the elected master to be the “master copies” and push them to all the other nodes in the cluster. This can have serious implications. Let’s imagine that you were using the node client and one node holds the correct copy of your index. If your other node starts first and is elected as master, it will push its outdated copy of the index to the node holding the correct one, overwriting it and losing valid data.
In conclusion
So what should you do when trying to recover from a split-brain? Well, my first recommendation is to keep a backup, in case reindexing all the data is not viable. Secondly, if a split-brain situation has occurred, take great care in deciding how to restart your cluster. Shut down all the nodes and make conscious decisions about which node to start first. If needed, start each node independently and analyze its copy of the data. If it’s not the copy you want to keep, shut the node down and delete the contents of its data folder (possibly after backing it up). Once you’ve decided which node holds the copy you want to keep, start it and check its logs to confirm that it has been elected master. After that you can safely start the other nodes in your cluster.
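To make “analyze its copy of the data” slightly more concrete: one simple (if crude) check is to start a single node in isolation, point a client only at that node, and compare document counts between the copies. This is just a sketch, reusing the placeholder names from the examples above, and a bare count obviously won’t catch every kind of divergence:

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class InspectNode {

    public static void main(String[] args) {
        // Connect to one node only, so the result reflects that node's copy of the index.
        TransportClient client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "splitbrain-demo")
                .build());
        client.addTransportAddress(new InetSocketTransportAddress("192.168.0.1", 9300));

        // match_all search with size 0: we only care about the total hit count.
        long count = client.prepareSearch("splitbrain")
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(0)
                .execute().actionGet()
                .getHits().getTotalHits();
        System.out.println("documents on this node: " + count);

        client.close();
    }
}

Running this against each node in turn is a quick way to spot which copy missed the writes that happened after the split.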
If you’re having trouble with your elasticsearch cluster, or maybe need a second pair of eyes to validate design decisions, don’t hesitate to contact us.