The runbook.cloud blog

Five AWS problems you didn’t know you could have

I created my first AWS account in May 2006, and since then things have changed immensely. At the time there were just two services - S3 and SQS. As I write in August 2018, there are nearly 100!

With all this extra power comes, inevitably, some extra complexity, and that often leads to unexpected results. I’ve compiled a brief list of problems that can bite even long-time users of the platform, generally at the most inopportune time.

You shouldn’t be surprised to learn that runbook.cloud can automatically detect every one of these issues, as well as more than 300 others, and recommend how to solve them. Better still, it can do it without any specific configuration - just add runbook.cloud to your AWS account and it will auto-discover what you have and what checks need to be run.

Lambda silently unable to run new functions

If you’ve read through the Lambda docs, you might be aware that there is a limit to the number of Lambda functions you can run concurrently - 1000 at time of writing. AWS introduced concurrency limits to allow you to prioritise some functions over others when working within this limit.

What isn’t generally as well known is that if you’re running Lambda inside a VPC, you can hit other limits much more quickly. Worse, when Lambda hits these limits it will simply fail to invoke new functions - silently, logging no error messages. These limits are why the Lambda documentation contains this sentence: Don’t put your Lambda function in a VPC unless you have to.

The first of the two things you should worry about here is the number of ENIs (Elastic Network Interfaces) you have. By default, your account will be limited to 350. At least one ENI will be created and used for each resource in your VPC. An EC2 instance will have at least one, so will an RDS instance, and an ELB will have at least one per availability zone. Whilst a Lambda function is running inside a VPC it will have one ENI per concurrent execution. This means if you have Lambda serving 300 users simultaneously, you’ll be using 300 ENIs just for Lambda. It therefore becomes pretty easy to hit this limit. Fortunately, this is a soft limit you can ask AWS support to raise for you.
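As a rough illustration of the arithmetic above, the sketch below estimates how much ENI headroom is left once Lambda concurrency is accounted for. It takes the one-ENI-per-concurrent-execution model described above at face value. The function name and the figure of 60 ENIs for other resources are made up for the example.

```python
def eni_headroom(limit: int, other_enis: int, lambda_concurrency: int) -> int:
    """Return how many ENIs remain; a negative result means the limit is exceeded.

    Assumes one ENI per concurrent Lambda execution, as described in the text.
    """
    return limit - (other_enis + lambda_concurrency)


# 350 is the default limit quoted above; 60 ENIs for EC2, RDS and ELB is a
# hypothetical figure chosen purely for illustration.
print(eni_headroom(limit=350, other_enis=60, lambda_concurrency=300))  # -10
```

A negative result here means new invocations would start failing before Lambda's own concurrency limit is anywhere in sight.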

You can also get into the same situation if the VPC subnet your Lambda function is running in has run out of IP addresses. Every ENI uses at least one IP address. AWS recently added the ability to add secondary CIDR blocks to your VPC, which lets you create additional subnets to deal with this issue.
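To get a feel for how quickly a subnet can run dry, here is a minimal sketch using Python's standard ipaddress module. It relies on the fact that AWS reserves five addresses in every subnet (the first four and the last one). The CIDR block and the count of addresses already in use are hypothetical values.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Number of addresses in the subnet that resources can actually use."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def addresses_left_for_lambda(cidr: str, addresses_in_use: int) -> int:
    """Addresses still free for Lambda ENIs, never reported below zero."""
    return max(usable_addresses(cidr) - addresses_in_use, 0)


# Hypothetical subnet with 40 addresses already taken by other resources.
print(usable_addresses("10.0.1.0/24"))               # 251
print(addresses_left_for_lambda("10.0.1.0/24", 40))  # 211
```

With a /24 subnet, a function that needs a few hundred concurrent executions can exhaust the address space on its own.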

EBS volume failure

It’s so rare that most people don’t even realise it can happen, but EBS volumes can and do fail. Amazon state an annual failure rate of between 0.1% and 0.2%, although not all failures result in complete loss of the volume. Because of the rarity of the problem, it often brings much head scratching and confusion, particularly in the case of a partial failure, where the volume continues to function but with degraded performance.
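A per-volume rate this low adds up across a large fleet. The sketch below computes the chance that at least one volume fails in a year, assuming failures are independent (a simplifying assumption of this example, not something Amazon states). The fleet size of 500 volumes is hypothetical.

```python
def p_any_failure(volumes: int, annual_failure_rate: float) -> float:
    """Probability that at least one of `volumes` fails within a year.

    Assumes each volume fails independently with the same annual rate.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** volumes


# 0.1% and 0.2% are the bounds quoted above; 500 volumes is a made-up fleet.
for afr in (0.001, 0.002):
    chance = p_any_failure(500, afr)
    print(f"AFR {afr:.1%}: {chance:.1%} chance of at least one failure per year")
```

Under these assumptions a fleet of 500 volumes has roughly a 39% to 63% chance of seeing at least one failure each year, so "rare" for a single volume is not rare for a fleet.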

If you’re unlucky enough to fall prey to an EBS volume failure, your only option is to restore from an EBS snapshot, or another backup. EBS snapshots are stored on S3, which has very, very high data durability - 99.999999999%!

RDS volume failure

An extension of the problem above - RDS instances (Aurora excepted) use EBS, and so can fall prey to the same unlikely situation. If you’re running RDS in multi-AZ mode, your data is replicated to a completely separate set of EBS volumes in another data centre, so whilst a volume failure might trigger a fail-over event, it won’t be so catastrophic.

ElastiCache can hit CPU limits with plenty of free CPU

This problem is specific to the Redis variant of ElastiCache, and was so difficult to spot that AWS introduced an additional metric specifically to help find it (EngineCPUUtilization).

The issue here is that Redis is a single-threaded application, and can only use a single CPU core. The CPUUtilization statistics published to CloudWatch by the Redis service are averaged across all cores. On an 8-core node, for example, one fully saturated core shows up as only 12.5% CPU usage, meaning you can be limited on CPU even though the graph looks almost idle.
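The 12.5% figure is simply 100% divided by 8 cores. The sketch below works out that ceiling for any core count, and scales a host-level reading back up to a worst-case estimate for the single Redis core. Both function names are made up for this example, and the estimate assumes all measured load comes from the Redis thread.

```python
def single_core_ceiling(cores: int) -> float:
    """Host-level CPUUtilization (%) reported when exactly one core is saturated."""
    return 100.0 / cores


def redis_core_estimate(cpu_utilization: float, cores: int) -> float:
    """Worst-case utilisation (%) of the Redis core for a host-level reading.

    Assumes all measured load comes from Redis's single thread.
    """
    return min(cpu_utilization * cores, 100.0)


print(single_core_ceiling(8))          # 12.5
print(redis_core_estimate(10.0, 8))    # 80.0
```

Where EngineCPUUtilization is available, it is the more direct signal, and this kind of back-of-envelope scaling is only a fallback.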

Availability Zones can fail

This might be less surprising than the other items in this post - after all, the AWS SLA is quite specific that it only applies when more than one AZ is down for your account in a given region. What might be more surprising is that availability zones with the same name can be completely different per account. This is true both in ordering (for example, eu-west-1a in one account might be eu-west-1b in another) and in terms of completely different data centres. Many regions in AWS have grown so large that they now span many different buildings, and it is possible for two accounts in the same region to have no overlapping facilities. Don’t worry though - they will still be connected by the same, insanely high capacity inter-AZ network, so VPC peering will still function at very low latencies.