Fake It Till You Make It: Fault Injection with Spring using Headers and Baggage Propagation

by Joris Kuipers, March 13, 2026

At Trifork, we build custom software that powers the core business of many clients. For Nederlandse Loterij we’re developing an integration platform that handles, among many other use cases, the purchasing of lottery tickets. This involves some fairly complex orchestration logic, where we are coordinating between the lottery system that issues the tickets and the player account management system that handles the payments and, eventually, registration of the winnings.

As we are in the process of reworking a lot of this functionality, our QA tester is making sure that all possible scenarios are properly handled: not just the happy flow, but also all failure cases where one of the systems involved has issues in the middle of this orchestration flow.

To give you an idea of what this flow looks like: when a player makes a purchase, we first ask the lottery system to verify the purchase and provide the cost. We then make a reservation for that amount in the player’s wallet, and on success ask the lottery system to provision the ticket. We then put a message on a queue to ensure that the reservation gets completed with the definitive amount, and return the ticket to the player.
The reservation is then completed asynchronously, which should eventually succeed because the amount has already been reserved.

Many things could fail during this flow. Sometimes a failure leaves us knowing what happened: if we can’t connect to a system at all, or a known specific error is returned, we can be sure that our request wasn’t processed. Sometimes we have no clue: a system didn’t respond in a timely fashion, or responded with an unknown internal error.

If the call to provision the ticket fails, we want to make sure that we cancel everything that has happened so far using a compensating transaction. If the message handling to finalize the reservation fails, we simply want to keep trying (calls are idempotent) and alert if the retries fail as well, since the player already received their ticket and there is a reservation for it.

The problem is: how do you let a tester simulate all these potential error scenarios?
For other cases it’s sometimes as simple as temporarily breaking the connection to a backend system, e.g. by configuring a wrong password or URL. However, for complex flows like purchasing this doesn’t work: the initial verification call would already fail and nothing else would happen.

Fault Injection

We decided to introduce a mechanism for injecting faults for testing purposes on a per-request basis.
This can be done by providing a request header with the HTTP request that’s made to our REST API. In the value of this header, you indicate two things: where the error should happen and what error should be simulated.

That header would look something like this:

X-Force-Fault: LOTTERY_TICKET_CREATE=TIMEOUT_AFTER_SUCCESS

This tells the code that handles the creation of a lottery ticket by calling the lottery system to simulate a timeout after the ticket was, in fact, successfully provisioned. Other faults that we support are timeouts without result, technical errors or known validation errors.
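
As a minimal sketch of how such a header value could be split into its two parts, assuming a single POINT=FAULT pair per header: the ForcedFault record and the FaultType enum below are hypothetical names used for illustration. Only LOTTERY_TICKET_CREATE and TIMEOUT_AFTER_SUCCESS come from the example above; the other enum constants are placeholders for the remaining fault categories.

```java
import java.util.Optional;

// Hypothetical fault categories; only TIMEOUT_AFTER_SUCCESS is a name taken from the header example.
enum FaultType { TIMEOUT_AFTER_SUCCESS, TIMEOUT, TECHNICAL_ERROR, VALIDATION_ERROR }

// Hypothetical value type: where the fault should happen and which fault to simulate.
record ForcedFault(String injectionPoint, FaultType type) {

    // Parses "INJECTION_POINT=FAULT_TYPE"; returns empty for a missing or malformed value.
    static Optional<ForcedFault> parse(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        int separator = headerValue.indexOf('=');
        if (separator < 0) {
            return Optional.empty();
        }
        String point = headerValue.substring(0, separator).trim();
        String fault = headerValue.substring(separator + 1).trim();
        if (point.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new ForcedFault(point, FaultType.valueOf(fault)));
        } catch (IllegalArgumentException unknownFault) {
            // Unknown fault name: ignore it rather than failing the request.
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional for anything unexpected means that a typo in the header results in a normal, fault-free request, which seems a reasonable default for a testing aid.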

Simulating the errors is the easy part: just throw the exception that would also be thrown if the simulated failure actually occurred. The tricky part is how to check in your code whether fault injection has been requested.

In a single monolithic system without any asynchronous calls, you could of course just check the HTTP request header. In a Spring-MVC web application, you can access the current request (if there is any) by looking it up via the RequestContextHolder.
However, our platform is a distributed system consisting of many services. That means that when the initial HTTP request is made, it typically calls an internal service to handle the orchestration logic. We want to ensure that the X-Force-Fault header is then propagated somehow.

The situation is even trickier when we put a message on a queue to be handled asynchronously: if we want to force a fault in the message handling, we cannot rely on HTTP headers as there is no request then. Also, in this scenario we only want to simulate the fault once: subsequent retries should ignore the requested fault, so that we can test if our application properly retries and if the backend system handles retries in an idempotent fashion.

So what can we do?

Baggage

I’m a big fan of the Discworld book series by the late Terry Pratchett. A recurring “character” in those books is The Luggage: it’s a large magical chest with hundreds of little legs that will move through anything in its way to stay with its owner while keeping their baggage safe.

As it turns out, Micrometer Tracing (which is the tracing library used by Spring) supports a similar notion called “baggage”, a concept that I believe originally comes from Brave.

Baggage is basically a collection of key-value pairs that can be associated with a trace, and that will magically be propagated along with the trace across its spans.

For HTTP requests, this is done using HTTP request headers. This is however also supported for use cases like sending and receiving SQS messages when using Spring Cloud AWS (thanks to the excellent work of Tomaz Fernandes).

So what we can do is configure Spring Boot to propagate our X-Force-Fault header by configuring it as a so-called remote baggage field. Here’s what that looks like in a properties file:

management.tracing.baggage.remote-fields=X-Force-Fault

For HTTP requests, this causes the header to both be propagated as-is in calls to backends and also be included as part of a “baggage” header that combines all fields in a single value based on the W3C Baggage Propagation specification. For SQS messages, only that single baggage key-value is added as a message attribute, as SQS only supports 10 attributes per message and the framework shouldn’t exhaust them when you configure multiple remote fields.

In our code, we now can check for the presence of the baggage field when we need to know if we need to force a fault.
Micrometer allows you to do this via its Tracer, which you can inject as a Spring bean and which provides an abstraction over the actual tracing mechanism that’s used. Our application uses Brave rather than OpenTelemetry for trace propagation, so we can also call the Brave API directly. Its BaggageField uses the current trace context obtained from a ThreadLocal variable, which is a little simpler as it doesn’t require any dependency injection:

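import brave.baggage.BaggageField;
import java.util.Map;

public class BaggageFetcher {
   // Returns the value of the given baggage field in the current trace context, or null if absent.
   // The name is matched case-insensitively, like an HTTP header name.
   public static String getBaggage(String key) {
       return BaggageField.getAllValues().entrySet().stream()
           .filter(kv -> kv.getKey().equalsIgnoreCase(key))
           .findFirst()
           .map(Map.Entry::getValue)
           .orElse(null);
   }
}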

Then we can write a small TesterCaller helper to wrap code that needs to force the faults, which I won’t show here.
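
For readers who want a starting point, here is one possible shape for such a helper. This is a hypothetical sketch, not the actual TesterCaller: the class name TesterCallerSketch, the extra faultHeader parameter (a stand-in for a lookup like BaggageFetcher.getBaggage("X-Force-Fault")), the TIMEOUT fault name and the choice of exception are all assumptions. It only uses the JDK so that it can be tried in isolation.

```java
import java.io.UncheckedIOException;
import java.net.SocketTimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch; the real TesterCaller is not shown in this post.
final class TesterCallerSketch {

    private TesterCallerSketch() {}

    // Runs the call, simulating the requested fault if the header value targets this injection point.
    // faultHeader stands in for a lookup such as BaggageFetcher.getBaggage("X-Force-Fault").
    static <T> T withRequestedForcedError(Supplier<T> call, String injectionPoint, Supplier<String> faultHeader) {
        String requested = faultHeader.get();
        String prefix = injectionPoint + "=";
        if (requested == null || !requested.startsWith(prefix)) {
            return call.get(); // no fault requested for this injection point
        }
        String fault = requested.substring(prefix.length());
        switch (fault) {
            case "TIMEOUT_AFTER_SUCCESS":
                call.get(); // the backend does the work, but the caller never sees the result
                throw timeout();
            case "TIMEOUT": // assumed name: a timeout without the backend having processed anything
                throw timeout();
            default:
                return call.get(); // unknown fault names are ignored in this sketch
        }
    }

    // Assumed exception type; real code would throw whatever its HTTP client throws on a timeout.
    private static RuntimeException timeout() {
        return new UncheckedIOException(new SocketTimeoutException("Simulated read timeout"));
    }
}
```

The interesting case is TIMEOUT_AFTER_SUCCESS: the wrapped call is really executed and its result discarded, so the backend ends up in the state it would be in after a genuine lost response.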

One-off faults in message listeners

As mentioned, for code that handles messages from a queue we want to support fault injection as well, but only for the first delivery attempt: subsequent attempts should ignore the requested fault, as the scenario that we want our tester to be able to simulate is that of a transient error that will be retried successfully. It wouldn’t make sense to just let every delivery attempt fail in the same way.

The BaggageFetcher should work just fine for message handling code, as long as baggage propagation support has been implemented for your message-oriented middleware (MoM). As mentioned, Spring Cloud AWS has us covered here. If you’re using something else, make sure to check that this works for your MoM of choice as well!

That leaves the problem of knowing if we’re inside the first delivery attempt. Using Spring Cloud AWS, this is what I did for SQS:

@Bean
<T> MessageInterceptor<T> receiveCountMessageInterceptor() {
   return new MessageInterceptor<T>() {
       @Override
       public Message<T> intercept(Message<T> message) {
           var receiveCount = message.getHeaders().get(SQS_APPROXIMATE_RECEIVE_COUNT);
           if (receiveCount != null) {
               BaggageField.create(SQS_APPROXIMATE_RECEIVE_COUNT).updateValue(receiveCount.toString());
           }
           return message;
       }
   };
}

This message interceptor checks the “receive count” for every message that’s received and stores it in another baggage field. There is a difference with our other baggage field, though: this one is really only used as a convenient ThreadLocal variable of sorts and doesn’t need to be propagated.
When you’re using Brave, you can indicate this by configuring a separate property:

management.tracing.baggage.local-fields=Sqs_Msa_ApproximateReceiveCount

That property’s value is the value of the io.awspring.cloud.sqs.listener.SqsHeaders.MessageSystemAttributes#SQS_APPROXIMATE_RECEIVE_COUNT constant, BTW. Local fields are a Brave-specific feature, but of course you could make this work with OpenTelemetry as well by using the Tracer API and a remote field.

With that, we can now check both if a fault has been requested AND if we’re in the first delivery attempt:

if ("1".equals(BaggageField.getAllValues().get(SQS_APPROXIMATE_RECEIVE_COUNT))) {
   TesterCaller.withRequestedForcedError(() -> lotteryClient.createTicket(req), LOTTERY_TICKET_CREATE);
} else {
   lotteryClient.createTicket(req);
}
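
To illustrate the intended effect in isolation, here is a small simulation that involves neither SQS nor Brave: a loop plays the role of redelivery, and the fault is applied only when the receive count equals 1. All names are hypothetical, and the exception type is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of redelivery; no SQS or tracing library involved.
final class FirstAttemptFaultDemo {

    // Handles one delivery; fails only on the first attempt when a fault was requested.
    static String handle(int receiveCount, boolean faultRequested) {
        if (receiveCount == 1 && faultRequested) {
            throw new IllegalStateException("Simulated transient failure");
        }
        return "reservation completed";
    }

    // Redelivers until handling succeeds or maxAttempts is reached, recording each outcome.
    static List<String> deliver(boolean faultRequested, int maxAttempts) {
        List<String> outcomes = new ArrayList<>();
        for (int receiveCount = 1; receiveCount <= maxAttempts; receiveCount++) {
            try {
                outcomes.add(handle(receiveCount, faultRequested));
                break; // success: the message would be acknowledged
            } catch (IllegalStateException e) {
                outcomes.add("attempt " + receiveCount + " failed");
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // prints [attempt 1 failed, reservation completed]
        System.out.println(deliver(true, 3));
    }
}
```

Without the receive-count check, every redelivery would fail in the same way and the retry path could never be observed succeeding.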

Testing our testing code

On testing this, I initially had some issues: somehow it seemed like the baggage field wasn’t properly propagated when using SQS, although I could see that the message had the expected attribute.

While debugging this, I found that there was a problem in the Micrometer code that parses the baggage field’s contents: it didn’t handle key-value pairs with values containing an equal sign properly. I reported this and not only did the team fix this really quickly, they provided point releases containing the fix immediately after that as well! 🙏
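
The general pitfall is easy to reproduce. The sketch below is an illustration only and is not Micrometer’s actual parsing code: it assumes a simplified baggage value made of comma-separated key=value entries, ignoring percent-encoding and entry properties. Splitting each entry on the first equal sign only keeps a value such as LOTTERY_TICKET_CREATE=TIMEOUT_AFTER_SUCCESS intact, whereas a plain split("=") would cut that entry into three parts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only; this is not Micrometer's actual parsing code.
final class BaggageEntryParsing {

    private BaggageEntryParsing() {}

    // Parses comma-separated "key=value" entries, splitting each entry on the FIRST '=' only.
    static Map<String, String> parse(String baggageHeader) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String entry : baggageHeader.split(",")) {
            int separator = entry.indexOf('=');
            if (separator > 0) {
                String key = entry.substring(0, separator).trim();
                String value = entry.substring(separator + 1).trim();
                fields.put(key, value);
            }
        }
        return fields;
    }
}
```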

With that fix in place, we are now running this on our test environments with several supported injection points and possible faults. Adding additional injection points is trivial, and our tester is very happy that she can finally validate that flows which don’t follow the happy case are behaving correctly as well!