Read the Small Print – ElastiCache Serverless Fun

TL;DR

When using AWS ElastiCache Serverless, things don’t work unless you ensure your connection to the cache is made using TLS.

Tim fixed it for me.

Context

We’re using a Redis cache to keep session information about logged-in users. This needs to tolerate the volume of logins, both in terms of performance and memory capacity.

In our lower environments, the huge cluster size we guessed we might need was quite expensive to run.

In prod, there’s a risk of under provisioning for peaks we can’t predict.

AWS released serverless Redis back in November, and we thought we’d give it a go.

Capacity Planning

In general, fixed pre-sized resources that are kept busy can be a better cost model than on-demand resources. But when load fluctuates wildly, or proves unpredictable, the on-demand usage model solves the problem of under/over-provisioning, and can also reduce the total cost if the load is really uneven.

More importantly, a serverless elastic solution takes away the rather dark art of right-sizing a cluster.

That said, there’s no guarantee that serverless will scale fast enough for our traffic. We still need to do performance testing, by which I really mean performance tuning.

Another advantage of scalable resources is that, when you go on to performance test them, you can observe the scale the resource actually reached during various tests and use that to identify what an appropriate fixed-size resource might look like.

There’s no escaping performance analysis.

We Gave Serverless Redis a Try (and it didn’t work)

I merged too soon. Too soon in the experiment and too soon as far as our pipeline maturity was concerned.

We had a working Redis ElastiCache cluster. I switched the infrastructure code over to serverless, and we hit merge.

The pipeline correctly planned the infra change, and successfully deployed it to dev. The integration tests then failed.

After a couple of tweaks, I backed out the change and had another look at it.

Step 1 was to deploy it to a lower-than-dev environment and manually test it. These tests failed.

Step 1.5 was to ask why I didn’t have an automated way to do this, so we introduced some small-scale contract tests on the module and ensured that they would run on a branch. This gave me a new way of experimenting with changes like this before committing to a merge.

Step 2 was to beat my chest and cry to the skies – WHY, WHY, WHY does connecting to Redis Serverless just freeze? It worked before.

Tim Fixed It

I had a lot of guesses about why it might be wrong, and I started down an XY-problem track in solving the issue. In other words, I’d decided that it must be a networking problem, so I went looking for networking solutions. I also assumed that it was an issue with our particular VPC setup.

I was wrong.

However, when Tim offered to help, he asked if I could share an example project that demonstrated the problem. He asked after hacking together something in the console that worked for him.

So, I set about making a small SST project with a serverless Redis cluster in it, to prove that this would not work in our VPC. It didn’t.

I then took that project and ran it on my personal account with a vanilla, out-of-the-box VPC. It also didn’t work.

This was the test that broke my networking hypothesis.

I didn’t know why (I already told you, it was that I hadn’t set TLS on the connection to Redis, but I didn’t know that then). But at least I had a better question for someone to look at with fresh eyes.

The Fix Worked

It was easy to add tls: {} as a property to my ioredis client. This setup broke the tests between that client and Redis running in Docker during local testing. It also surprised me that connections to the classic Redis cluster had previously been somewhat non-secure. Who knew, eh?
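As a sketch of what that fix looks like (the helper function and flag names here are illustrative, not our actual code): ElastiCache Serverless only accepts TLS connections, so the ioredis client has to opt in via its tls property, while a plain local Docker Redis needs the property left off.

```typescript
// Illustrative shape of the ioredis options we care about.
interface RedisOptions {
  host: string;
  port: number;
  tls?: Record<string, unknown>;
}

// Hypothetical helper: `useTls` should be true for ElastiCache Serverless
// (or any cluster with in-transit encryption), false for local Docker Redis.
function buildRedisOptions(host: string, useTls: boolean): RedisOptions {
  const options: RedisOptions = { host, port: 6379 };
  if (useTls) {
    options.tls = {}; // an empty object enables TLS with default settings
  }
  return options;
}

// The options are then passed straight to the client:
//   const redis = new Redis(buildRedisOptions(endpoint, true));
```

Making TLS conditional like this is what kept the local Docker tests working after the fix, since that Redis wasn’t serving TLS at all.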

More importantly, when I made the fix, I now had a bunch of early-warning tests to run on my merge request pipeline to double check that the fix didn’t b0rk the build.

The Small Print

Despite one of the answers on AWS re:Post mentioning it in passing, and despite the documentation talking about SSL, also in passing, I hadn’t cottoned on to the fact that AWS serverless Redis requires a special config in the client to turn on TLS.

I half feel like there should be a warning in the AWS docs, but then perhaps I’m the only idiot who doesn’t turn on TLS in their Redis clients.

Who knows.

However, the outcome of this thing going wrong is positive:

  • Tim was awesome
  • The issue never escaped dev because the integration tests caught it
  • We responsibly backed out the change while fixing it – causing only a few minutes of team disruption
  • I had a reasonable use case to introduce a few contract tests to the module before a dev deploy (and indeed after)
  • The issue was resolved using science

So, read the small print and use TLS.
