Our (not so smooth) journey productising a serverless app on AWS
Serverless is the future. I'm sure of it. The cloud's great promise is to help companies focus on business value rather than infrastructure - and serverless is our best attempt at delivering on that promise so far.
But the current reality of serverless is something I am far less sure of. Navigating the jungle of Lambda Functions, API Gateways, S3 Buckets, VPNs, VPCs, CloudFormation Templates, Log Groups, Internet Gateways, Policy Attachments, IAM Roles and all the other bits and bobs required does not quite feel like what we had in mind when we envisioned the simple, self-orchestrating, serverless world of tomorrow.
That's not to say that serverless on AWS is bad - it isn't. Once everything works, it works beautifully. But one can tell at every step along the way that its ecosystem and associated best practices are still young - and in large part immature.
To show you what I mean, I've written this account of our winding and at times frustrating journey building, testing, deploying and running Arcentry as a serverless app on AWS.
What it is we were building
Arcentry is an app that lets users create isometric diagrams of cloud and open source architectures. While its frontend is a complex, feature-rich mix of HTML and WebGL components, its backend is actually fairly straightforward: account and user management, document creation and updating, payments via Stripe, a bit of metric collection and a nifty little feature that lets users raise suggestions directly as Trello cards - that's it.
All in all, nothing out of the ordinary. So how did we manage to have such a hard time getting it going as a serverless app?
The humble beginnings
We started by doing the last thing you'd want to do: we used AWS' resources directly. We structured our services as snippets of code that could run as Lambda functions and wrote bash scripts to zip them, upload them to AWS and configure the necessary timeouts, memory allocations and so on.
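To give a sense of how low-level this was, here's roughly the kind of step our scripts performed - sketched with the AWS SDK for Node.js rather than bash, and with function names, paths and settings that are purely illustrative:

```typescript
// deploy-function.ts - a rough sketch of our early, hand-rolled deployment step
// (names, paths and settings are illustrative, not our actual values)
import { Lambda } from 'aws-sdk';
import { readFileSync } from 'fs';

const lambda = new Lambda({ region: 'eu-central-1' });

async function deployFunction(name: string, zipPath: string): Promise<void> {
    // upload the zipped code for an existing function...
    await lambda.updateFunctionCode({
        FunctionName: name,
        ZipFile: readFileSync(zipPath)
    }).promise();

    // ...and set timeout / memory by hand - for every single function
    await lambda.updateFunctionConfiguration({
        FunctionName: name,
        Timeout: 10,
        MemorySize: 256
    }).promise();
}

deployFunction('create-account', './dist/create-account.zip')
    .catch(err => { console.error(err); process.exit(1); });
```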
Of course, Lambda functions don't do much by themselves, so we extended our scripts to set up the necessary API Gateways - the internet-facing endpoints that translate incoming HTTP requests into Lambda invocations.
Our scripts were already getting quite complex at this point, but it wasn't until we started looking into execution roles, VPCs and deployment stages that it dawned on us that we were on the wrong path. Everything we'd spent days of development time scripting was just the basic scaffolding for a run-of-the-mill HTTP service - surely a generic enough task to have been solved by someone else already?
It turned out it had been. There are a number of general-purpose cloud definition templating languages and frameworks, such as HashiCorp's Terraform or AWS' own CloudFormation, but we chose the more purpose-built Serverless Framework instead.
Things sped up significantly once we'd adopted it. Serverless lets you write a single definition file that specifies your functions, their associated URLs, HTTP methods and any number of cloud-provider-specific settings such as zones, timeouts and network configs. Then, using a single command, serverless converts your code and configuration into the necessary CloudFormation templates - including all auxiliary aspects such as roles, gateways, log groups and so on - zips it all up and deploys it to Amazon. Beautiful and highly recommendable.
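To give a rough idea, here's a minimal sketch of such a definition - written as a serverless.ts config using the @serverless/typescript types (the framework also accepts plain YAML); the service name, handler, runtime and settings are illustrative, not our actual setup:

```typescript
// serverless.ts - a minimal sketch of a Serverless Framework service definition
// (service name, handler, region and runtime are illustrative)
import type { AWS } from '@serverless/typescript';

const config: AWS = {
    service: 'accounts',
    provider: {
        name: 'aws',
        runtime: 'nodejs18.x',
        region: 'eu-central-1',
        stage: 'dev',
        timeout: 10,
        memorySize: 256
    },
    functions: {
        createAccount: {
            handler: 'src/create-account.handler',
            // this single entry produces the API Gateway resource, method,
            // permission, log group and role wiring for you
            events: [{ http: { method: 'post', path: 'accounts' } }]
        }
    }
};

module.exports = config;
```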
How to develop and test a serverless app
With our initial scripting woes out of the way, we were finally free to start thinking about development practices, test strategy and the "definition of done" for our brave new serverless world. Please note: the following is simply what worked for us - I'm not making any claim to have discovered the universal principles of serverless development.
Don't bother working locally
There seems to be a preference amongst developers to replicate their entire backend on their local machine. This is hard enough with traditional and containerized apps: as your stack grows it becomes increasingly resource-hungry - and with each additional component having a certain chance of behaving differently locally than it does on the server, the likelihood of your local backend accurately portraying the server environment shrinks as complexity increases.
With serverless apps, this is even more true. Serverless apps are natives to the cloud and rely on a whole network of other services and concepts. Our advice: don't bother replicating any of this locally. Instead, use API Gateway stages to maintain separate dev, test and prod environments and deploy your function after every change. This sounds like a lot of overhead, but serverless' per-function deployment and logging capabilities mean that there's only a 5-15 second wait until your code change is ready to test.

Test on an API level
Unit testing assumes that the parts of your code can be tested in isolation - with external dependencies adhering to strict contracts that can be simulated for the test. That is not necessarily the case for serverless apps: your Lambda function is the unit of code, and external dependencies tend to be cloud services with complex and dynamic behaviors. We thus avoided unit tests and tested directly against our HTTP API, sending requests and evaluating responses. This also allowed us to bundle tests into suites, testing more comprehensive user journeys (signup -> login -> create first document -> place item -> upgrade to paid -> downgrade to free -> delete account).
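Here's a sketch of what such an API-level test can look like - the endpoint URL, routes and payloads are made up, and we're using Node's built-in test runner purely for illustration; any test framework works:

```typescript
// account.test.ts - a sketch of an API-level test against the deployed dev stage
// (URL, routes and payloads are illustrative)
import { test } from 'node:test';
import assert from 'node:assert/strict';

const API = 'https://example.execute-api.eu-central-1.amazonaws.com/dev';

test('signup and login journey', async () => {
    // create an account via the real, deployed endpoint
    const signup = await fetch(`${API}/accounts`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ email: 'test@example.com', password: 'hunter2' })
    });
    assert.equal(signup.status, 201);

    // log in with the freshly created credentials
    const login = await fetch(`${API}/login`, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify({ email: 'test@example.com', password: 'hunter2' })
    });
    assert.equal(login.status, 200);
    const { token } = await login.json();
    assert.ok(token);
});
```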
Deployment? Already done
The benefit of the approach described above is that you don't really need to think about deployment - because your development environment is already your deployment environment. All it takes is creating a separate version of it - in our case by using a fully separate AWS account with its own, more powerful DB cluster and separate resources - and pointing your serverless config at it. Or at least so we thought - it turns out there were actually quite a few obstacles to overcome:
Trouble ahead
Armed with our new best practices, development progressed quickly. We kept churning out function after function and our serverless configuration file grew nicely - until one morning it didn't.
We were just about to add our 35th or 36th endpoint when we were greeted with this error:
The CloudFormation template is invalid:
Template format error: Number of resources, 201,
is greater than maximum allowed, 200
Wut? Serverless uses CloudFormation templates under the hood - and it turns out that there's a maximum of 200 resources per template. Since every Lambda function comes with its associated Version, LogGroup, Permission, Resource, Method, RestApi and Role, this really isn't much.
Unfortunately, this problem is not at all trivial to solve. There are a number of ways to overcome it, but all of them require a fundamental restructuring of one's project. For us, this meant breaking up our single deployment into many individual services, each with its own dependencies, serverless config, API Gateway and so on.
Storing Secrets the hard way
So far we'd stored our secrets (database passwords, Stripe keys, etc.) as stage variables on the API Gateway - this had the benefit that we could easily switch between stages (dev, test), each with its own database and endpoints. But now, after restructuring the project, we had more than ten API Gateways - one for each service group - and copying our most secret information around ten times didn't seem right. We thus begrudgingly moved all of it to AWS Secrets Manager - an encrypted key-value store that's global to all services. It works well, but every single function now has to start by making an HTTP request to the Secrets Manager to retrieve its passwords...hm...
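For illustration, here's a sketch of that lookup with the Node.js AWS SDK - the secret name is made up, and caching the result outside the handler at least limits it to one request per warm container:

```typescript
// get-secrets.ts - a sketch of fetching credentials from Secrets Manager on cold start
// (the secret name 'arcentry/prod' is illustrative)
import { SecretsManager } from 'aws-sdk';

const secretsManager = new SecretsManager();
let cachedSecrets: Record<string, string> | null = null;

export async function getSecrets(): Promise<Record<string, string>> {
    // only hit the Secrets Manager API once per warm container
    if (cachedSecrets) return cachedSecrets;

    const result = await secretsManager
        .getSecretValue({ SecretId: 'arcentry/prod' })
        .promise();

    const parsed = JSON.parse(result.SecretString || '{}') as Record<string, string>;
    cachedSecrets = parsed;
    return parsed;
}

// in a handler: const { dbPassword, stripeKey } = await getSecrets();
```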
To VPN or not to VPN
The single most sacred part of most apps is the database - and Arcentry is no different. To ensure maximum security we made our database inaccessible from the outside and moved it - along with all Lambda functions and API Gateways - into a Virtual Private Cloud (VPC). This worked well - but suddenly our image upload seemed broken. It turns out that AWS treats its storage solution (S3) as an external service that requires internet access from within the VPC - which in turn means configuring multiple subnets, internet and NAT gateways in just the right way. This is a surprisingly poorly documented and extremely low-level task that every serverless app that wishes to access the internet has to go through.
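For completeness, here's a sketch of the Lambda side of that setup in the Serverless Framework config - the security group and subnet IDs are placeholders; the real work is making sure the referenced subnets actually route outbound traffic through a NAT gateway:

```typescript
// excerpt of serverless.ts - attaching a service's functions to the VPC
// (security group and subnet IDs are placeholders)
const config = {
    // ...service and functions as before...
    provider: {
        name: 'aws',
        runtime: 'nodejs18.x',
        vpc: {
            securityGroupIds: ['sg-0123456789abcdef0'],
            // private subnets whose route tables send 0.0.0.0/0 to a NAT gateway,
            // which in turn lives in a public subnet behind an internet gateway -
            // without that routing, calls to S3 (or anything else) simply time out
            subnetIds: ['subnet-0123456789abcdef0', 'subnet-0fedcba9876543210']
        }
    }
};

module.exports = config;
```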
So - would we do it all again?
These are just some of the challenges we had to overcome - naturally, there were more. But we overcame them - and learned an awful lot along the way. Today, Arcentry is humming along nicely. It's only been public for two weeks - and while there are days with only a handful of users signing up, there are others, such as the day we made the Producthunt frontpage, when hundreds of users poured into the app. In either case our backend scaled up and down nicely and without any monitoring or intervention on our side - and with that we couldn't be happier.