Processing 1M tps with Axon Framework and the Disruptor
LMAX, a trading company in the UK, recently open sourced one of their core components: the Disruptor. This component allows reduces execution overhead by removing the necessity for locks, while still keeping processing order guarantees. A pretty ingenious piece of engineering, if you ask me. I tried to apply the disruptor to the Axon Command Bus, just to see what it potential is. The results are pretty astonishing.
The Disruptor is a “framework for concurrent programming”. It allows developer to create task-based workflows using a multiple threads. The disrupter can be used to parallelize certain tasks, while still allowing some sequential processing. It removes the need for queues in between these processes. It has a few features that clearly distinguishes it from the concurrency constructs available since Java 5. The main one being the lack of locking mechanisms.
Consider a process where 4 steps need to be taken (A through D) to completely process a task. If task B and C both depend on the result of task A, but have no dependency on each other, they may be executed in parallel. In this example, task D depends on the result of both B and C.
Creating such a flow using mechanisms in Java 5, such as the queue is complex but most of all: slow. Threads need to acquire locks on queues to put or poll items.
The disruptor has another approach. Its main component is the RingBuffer, which is a circular buffer implementation that can hold the tasks to execute. Different tasks read and write the items contained in the buffer, independent of each other. Well, mostly. Somehow, you need to be sure that Task D won’t be processing an item unless both B and C have finished with it.
Different “consumers” on the ring buffer keep an eye on progress of the consumers they depend on. They do that by tracking the sequence numbers of the tasks that the other consumers have processed. This allows the Disruptor to minimize the amount of inter-consumer communication. An example: imagine that Consumer D just finished processing item number 8, It will need to know if it allowed to process 9. It will ask Consumer B and C “where are you”? They might answer “23” and “12”. In that case, consumer D knows that it is safe to process 9, 10 and 11, without asking B and C for their progress in the meantime. After processing 11, it will need to ask again.
Reducing the amount of inter-thread communication allows the CPU to optimize the use of its caches. The LMAX team calls this Mechanical Sympathy. To get the best results, code should match the way CPUs work.
The Axon Command Bus benchmark
The disruptor pattern fits the way commands are processed in a CQRS based architecture. While certain processes “pre-load” an aggregate, another could execute the command handler, while others store events in the event store and publish it on the event bus. I thought I’d give it a try and see how far I would get.
Implementing a proof-of-concept style command bus using the disruptor turned out to be pretty easy. The jar comes with a number of helper classes that help you optimize processing speed (think about the “obvious things”, like applying cache line padding to prevent false sharing).
My benchmark application contains a simple configuration: a command handler that loads an aggregate (only 1 is used in the benchmark) and executes the “doSomething” method on it. That method generates a single event that needs to be stored in an in-memory event store and is published to an event bus without listeners attached to it. The goal of this benchmark is to focus on the speed of “bare command processing”.
I ran the benchmark on my laptop, which has an Intel Core i7 640M processor (2.8 GHz, 2 cores, 4 threads).
When using the SimpleCommandBus and a CachingGenericEventSourcingRepository, I got around 150 000 commands per second on my machine. It’s probably more than most applications get thrown at them, but something to start bragging about in a bar.
Then I created a similar application with a Disruptor based command bus. It executed little over than 250 000 commands in a second. That’s almost twice as fast. But the results disappointed me. There must be a way to improve its speed even more.
So I started tweaking. Axon uses java.util.UUID as unique event identifiers. I though I’d try to remove them completely. Don’t worry, it just for testing. Guess what: I got around 700 000 commands per second. Now were getting somewhere. But not having identifiers is not really an option. But creating random UUID
I changed the UUID generation mechanism to a time-based version. I was thrown back to about 500k, but at least I have my identifiers back.
That’s when I noticed that I was running it all on a 32 bit VM. When running the same benchmarks on a 64 bit JVM, the results nearly doubled. That sounds logical, but Oracle states that migrating to a 64bit VM will degrade performance. With the time-based UUID and some other minor optimizations, I managed to squeeze out 1.3M commands per second. That’s 5x more than the lock based mechanism can do under the same circumstances.
It looks like the disruptor really does process the same logic faster on the same hardware than a lock-based mechanism. Coding wise, it does require a bit of getting used to. But once the producers, consumers and their dependencies are identified, it’s all pretty straightforward to get up-and-running.
I’m eager to get started on a production-ready version of the Disruptor based command bus. I noticed that some of the choices made in the core components in Axon, such as random UUIDs as Event identifiers, prevent application to reach extreme performances. That will need some work. Will be continued…