The Poison Message SQS Problem

AWS’s SQS is a reliable and performant solution for moving requests around a system. A couple of years ago, AWS added SQS as a Lambda event source, so a Lambda function can be invoked automatically with batches of messages from a queue for as long as there are messages to process.

There are lots of important considerations in tuning this, but there’s a particular problem with the approach.

SQS delivers messages in approximately the order of arrival, with the exception of its FIFO queues, which guarantee delivery order.

In general, when you’ve processed a message you delete it from the queue, and if you need to retry a message you let it return to the queue after its visibility timeout expires. This is great: you get retries for virtually no effort, and it gives you a strong delivery guarantee.
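As a rough sketch of that consume loop in Python with boto3 (the queue URL and the process function are placeholders, not from the original post):

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"  # placeholder

    def process(body):
        ...  # your business logic; raises on failure

    def consume_once():
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for message in response.get("Messages", []):
            try:
                process(message["Body"])
            except Exception:
                # Do nothing: the message becomes visible again after the
                # visibility timeout and will be retried.
                continue
            # Success: delete the message so it is never redelivered.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"],
            )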

The Poison Batch Problem

With the Lambda event source integration, if some of the messages in the batch handed to your Lambda can be processed right now but some cannot, then you have a dilemma:

  • Fail the batch and your unprocessed messages will be retried – but so will the processed ones
  • Don’t fail the batch and the unprocessed messages will be lost

This is because the integration code at the AWS end only has the overall success/failure result of the invocation to apply to every message in the batch.
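To make the dilemma concrete, here is a minimal sketch of a handler (process stands in for your own business logic; nothing here is from the original post). Whichever branch you take at the end, something is lost:

    def process(body):
        ...  # placeholder business logic; raises on failure

    def handler(event, context):
        failed = False
        for record in event["Records"]:
            try:
                process(record["body"])
            except Exception:
                failed = True

        if failed:
            # Option 1: raise, and the whole batch (successes included)
            # becomes visible again and is retried.
            raise RuntimeError("some messages in the batch failed")
        # Option 2: never raise and return normally, and the integration
        # deletes the whole batch, silently dropping the failed messages.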

This is simply not good enough for real-life usage.

Well, it kind of is. If you design your systems to seldom fail, then it’s fine. But if there are downstream dependencies that fail intermittently, this problem starts to bite.

Does The Dead Letter Queue Help?

With a redrive policy that points the data at a dead letter queue, is there some hope of salvation?
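For reference, a redrive policy is just a queue attribute. A sketch of setting one with boto3 (the queue URL, DLQ ARN, and receive count are placeholders, not values from the post):

    import json
    import boto3

    sqs = boto3.client("sqs")

    sqs.set_queue_attributes(
        QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue",  # placeholder
        Attributes={
            "RedrivePolicy": json.dumps({
                "deadLetterTargetArn": "arn:aws:sqs:eu-west-1:123456789012:my-queue-dlq",  # placeholder
                "maxReceiveCount": 5,  # move to the DLQ after five failed receives
            })
        },
    )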

If anything, it’s kind of worse. After a few retries, during which the successful messages may have been processed several times over (or, depending on how you handle things, some of them have while others have been stuck behind a message that always fails), all the messages from the batch are sent to the dead letter queue.

Maybe there’s a tool to replay messages from the dead letter queue back onto the main queue. (We’ve all built one, right?)

What happens then? The same adjacent messages fall into the same batches and the whole process repeats.

The Poison Message

In general, the poison message, the one that can never be processed, stays in the same neighbourhood of the queue as its batch-mates. In other words, its effect repeats and repeats.

A Solution

Though the Lambda integration will delete messages from the queue when the Lambda succeeds, nothing stops you from explicitly deleting the successful messages yourself within the Lambda. This removes them from any further retries.

If you wait until every message in the batch has been attempted before ending the Lambda with success or failure, then a later message can no longer be blocked by a poisoned earlier one.
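A sketch of that pattern, assuming the queue URL is available from the Lambda’s configuration and process is a placeholder for your own logic:

    import os
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = os.environ["QUEUE_URL"]  # assumed to be set on the Lambda

    def process(body):
        ...  # placeholder business logic; raises on failure

    def handler(event, context):
        failures = 0
        for record in event["Records"]:
            try:
                process(record["body"])
            except Exception:
                failures += 1
                continue  # leave the failed message alone so it is retried
            # Delete the successful message ourselves so a later batch
            # failure cannot put it back on the queue.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=record["receiptHandle"],
            )

        if failures:
            # Fail the invocation only after every message has been attempted;
            # the explicitly deleted messages are gone, so only the failed
            # ones (including any poison message) return to the queue.
            raise RuntimeError(f"{failures} message(s) failed and will be retried")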

This does have a weird side effect. The metric for deletions from the queue will appear to be roughly double the rate of messages arriving, because most messages get deleted twice: once explicitly inside the Lambda, and once by the integration when the batch succeeds. This doesn’t appear to cause SQS any harm, as it is built to tolerate duplicate requests, with multiple servers potentially holding copies of the same message.
