'Best Practices' are the IT world's superstitions
runbook.cloud is built entirely on AWS Lambda. It is the first application I’ve worked on that is built 100% on serverless technologies, and I’ve learnt a great deal along the way. One thing in particular: don’t be dogmatic about ‘best practices’.
Early on in the build process for runbook.cloud I realised that the pricing model we had come up with and our Lambda costs were out of sync - we were going to lose money on a high percentage of customers if we pressed ahead with the design as it stood. We came up with a solution which massively reduced cost without significantly increasing complexity - and did so by ignoring one of the main serverless orthodoxies.
Serverless functions should almost always be a single thread, single task
One of the fundamental principles I had when coming up with the concept for runbook.cloud was that beyond pointing the app at the relevant AWS account(s), there should be no configuration required. If you have to tell us what needs monitoring we’ve failed - after all, part of the pitch is that we diagnose and suggest remedies for problems you might not even know were possible. In order to do this, we have to call a variety of AWS APIs, across all regions, for each customer. We do this every five minutes, so that the maximum delta between a customer spinning up a new resource and us monitoring it is as short as possible.
This poses a problem if you assume, as we did, that each Lambda should do one thing at a time. AWS currently has 15 standard regions. As originally written, we checked each region for each customer in a separate Lambda invocation every five minutes - 15 regions × 12 runs per hour, giving a total of 180 invocations per hour per customer. These Lambdas took between 500 and 4000 ms to execute. Ignoring the generous free tier, resource discovery alone was going to cost us over $55 per month per customer - something that clearly wasn’t going to work with our planned pricing.
“Programming languages have attempted to solve concurrency in their respective runtimes. FaaS platforms invert the problem.” - Kelsey Hightower (@kelseyhightower), August 22, 2018
Fortunately for us, we made the decision from the outset to write everything in Go. We chose Go primarily because we believe it is the language that best lends itself to readable code, but it has another feature that is rightly very popular - Goroutines. Goroutines are a way of writing code which can run concurrently, whilst letting the Go runtime decide how many threads should actually be running at once. It’s perfectly reasonable to have hundreds of thousands of Goroutines running concurrently, especially when most of them are blocked waiting on a response, as in our case.
In a single morning we refactored the code to use a single Lambda invocation every minute, operating on 20% of the customer base each time, so each customer is still checked every five minutes. This Lambda spawns a Goroutine per customer, which spawns a Goroutine per region, which spawns a Goroutine per AWS service. The Lambda execution time hasn’t increased significantly, because as before we’re mainly waiting on network I/O - we’re just waiting on a lot more responses at the same time in a single Lambda function. Cost per customer, however, is now much more manageable, and it falls further with every sign-up.
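The fan-out described above can be sketched roughly as follows. This is a minimal illustration rather than the runbook.cloud code: the Finding type, the discover function (which only sleeps to simulate a blocking AWS API call) and the customer, region and service names are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Finding is a hypothetical record of one discovery call.
type Finding struct {
	Customer, Region, Service string
}

// discover is a hypothetical stand-in for a real AWS API call. It only
// sleeps, to simulate time spent blocked on network I/O.
func discover(customer, region, service string) Finding {
	time.Sleep(50 * time.Millisecond)
	return Finding{Customer: customer, Region: region, Service: service}
}

// discoverAll starts one goroutine per customer, each of which starts one
// per region, each of which starts one per service, then waits for all of
// them to finish.
func discoverAll(customers, regions, services []string) []Finding {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex // guards findings
		findings []Finding
	)
	for _, c := range customers {
		wg.Add(1)
		go func(c string) {
			defer wg.Done()
			for _, r := range regions {
				wg.Add(1) // added before the parent's Done, so Wait cannot return early
				go func(r string) {
					defer wg.Done()
					for _, s := range services {
						wg.Add(1)
						go func(s string) {
							defer wg.Done()
							f := discover(c, r, s)
							mu.Lock()
							findings = append(findings, f)
							mu.Unlock()
						}(s)
					}
				}(r)
			}
		}(c)
	}
	wg.Wait()
	return findings
}

func main() {
	start := time.Now()
	findings := discoverAll(
		[]string{"customer-a", "customer-b"},
		[]string{"eu-west-1", "us-east-1", "ap-southeast-2"},
		[]string{"ec2", "rds"},
	)
	fmt.Printf("%d calls in %v\n", len(findings), time.Since(start).Round(10*time.Millisecond))
}
```

Because every simulated call spends its time blocked, the twelve calls in this sketch should complete in roughly the time of one call rather than twelve in sequence, which is the same effect that keeps the Lambda execution time flat.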
Since refactoring in this way, we’ve started using Goroutines throughout the code base. In particular, when calling DynamoDB to fetch multiple items across different partition keys, doing this concurrently has brought a significant speed up.
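A rough sketch of that pattern, assuming nothing about the real code: fetchAll looks up several keys concurrently and returns the results in the order the keys were given. The lookup function and fakeTable are hypothetical stand-ins for a DynamoDB single-item read, so the example runs without the AWS SDK.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errNotFound = errors.New("item not found")

// fakeTable and lookup are hypothetical stand-ins for a table and a
// single-item read. lookup sleeps to simulate network latency.
var fakeTable = map[string]string{
	"customer#1": "alpha",
	"customer#2": "beta",
	"customer#3": "gamma",
}

func lookup(key string) (string, error) {
	time.Sleep(20 * time.Millisecond)
	v, ok := fakeTable[key]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

// fetchAll calls get for every key concurrently. Each goroutine writes only
// to its own index, so no mutex is needed and the results keep key order.
// It returns the first error in key order, if any.
func fetchAll(keys []string, get func(string) (string, error)) ([]string, error) {
	items := make([]string, len(keys))
	errs := make([]error, len(keys))
	var wg sync.WaitGroup
	for i, k := range keys {
		wg.Add(1)
		go func(i int, k string) {
			defer wg.Done()
			items[i], errs[i] = get(k)
		}(i, k)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return items, nil
}

func main() {
	items, err := fetchAll([]string{"customer#3", "customer#1", "customer#2"}, lookup)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(items)
}
```

Writing each result to its own slice index is a deliberate choice here: it avoids locking and means callers can rely on the results lining up with the keys they passed in.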
I’m certainly not advocating for large monoliths in Lambda. I’d definitely agree with the idea that serverless functions should conceptually be doing a single task. However, I think the idea that a Lambda has to be single threaded to achieve this is wrong. It follows that choosing a programming language for Lambda functions on the basis of that assumption is also a mistake.
Most of all - don’t assume ‘best practice’ is always the best way for your use case.