DevOps Anonymous – Lessons Learned from Building Cloud Infrastructure from Scratch



I was sitting at yet another DevOps Anonymous (DA) meeting. I was not actually a DevOps (yes, I know that having a DevOps role misses the point of the DevOps approach, but I’m trying to tell a story here, so be quiet). I was a software engineer at that time, mostly working with Java. But I enjoyed listening to DevOps people having problems. They all loved me because they thought I also received alerts in the middle of the night; I didn’t.

It was Bob’s turn:
“I was pushing to automate things from the start, but nobody listened. Everyone was making ad hoc changes, solving the problem at hand. Nobody was documenting anything. Until one day we had to replicate the environment for a different customer. It took us three months to untangle everything and set it up again.” – Bob’s eyes glistened with tears.

I was enjoying these meetings. That was my vacation. Until recently. Two years ago I started being more involved in how/where/when my code was being deployed. And a year ago I began deploying other people’s code. So, attending DA meetings stopped being my guilty pleasure; I became a full member with my own struggles. And this is my story.

My Turn

“I’m Milan, and I’d like to share an experience from our newest endeavor. The idea for the project was to automate all the infrastructure creation right from the beginning. The AWS infrastructure was deceptively simple: one master region with a MongoDB replica set, CodePipeline for CI, and CloudWatch for monitoring and alerting (plus several smaller supporting components and services); and three satellite regions with ECS services running a Node.js app in Docker containers. The app services in all regions communicate with each other using RabbitMQ (running in separate containers). Quite simple, right?” – it was the first time I had shared anything at a DA meeting after three years of attending.
“We estimated the work to be done in six or seven weeks… How wrong we were. But here are the lessons we learned along the way.”

1. Even simple and well-documented services in AWS take a long time to automate… Or do they?

Yes, they do.

But let me first introduce the tools we were using.

CloudFormation is AWS’s native language for describing the resources that need to be created. Since AWS develops it, the assumption is that it supports the majority of the services and configuration options we need. However, since we were also using Ansible to provision the database component (we already had the roles we needed), we decided to wrap CloudFormation in Ansible. That turned out to be the right combination, especially when we had trouble making a cross-region VPC peering connection through CloudFormation (probably because some parameter of the CloudFormation template was not documented well). Anyway, it worked out with Ansible’s ec2_vpc_peer module.

So, CloudFormation and Ansible (with ec2 dynamic inventory). Why is it so time-consuming to automate everything from the start?

  1. Creating VPC and EC2 resources in AWS takes a lot of time. It takes 5-10 minutes for AWS to create the VPC(s), the bastion server and all the EC2 instances that we need, and almost as long to delete them. Making and testing changes while developing often involves deleting the stack and recreating it, which results in 10-20 minutes of waiting per attempt. And if I make an error, like passing the wrong reference, the creation fails (sometimes after 10 minutes) and a rollback starts, which, again, takes some time. All in all, experimenting can be very time-consuming.
  2. When you’re creating AWS resources through the web console UI, AWS often creates dependent resources behind the scenes. For example, if you want to create an ECS service, load balancers will be created and configured for you automatically by AWS. On the other hand, when you’re creating ECS through CloudFormation, you have to create and correctly configure all the backing resources (such as load balancers) yourself. It’s not rocket science, but it takes time to read the docs about a resource, try it out, decide on a proper configuration and integrate it with the rest of the stack; then test and repeat.
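
To give a feel for what “all the backing resources” can look like, here is a rough sketch, not our actual template. It assumes an Application Load Balancer in front of the service; every logical name, port and parameter is made up for illustration. The script writes the fragment to a temporary file and then lists the resource types that have to be declared explicitly.

```shell
#!/bin/sh
# Write a hypothetical CloudFormation fragment to a temp file. Every logical
# name, port, and parameter below is illustrative, not the project's real one.
template=$(mktemp)
cat > "$template" <<'EOF'
Parameters:
  VpcId: {Type: String}
  SubnetIds: {Type: CommaDelimitedList}
  ClusterName: {Type: String}
  TaskDefinitionArn: {Type: String}
Resources:
  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: application
      Subnets: !Ref SubnetIds
  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      VpcId: !Ref VpcId
      Protocol: HTTP
      Port: 3000
  Listener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref LoadBalancer
      Protocol: HTTP
      Port: 80
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref TargetGroup
  Service:
    Type: AWS::ECS::Service
    DependsOn: Listener   # the listener must exist before the service attaches
    Properties:
      Cluster: !Ref ClusterName
      TaskDefinition: !Ref TaskDefinitionArn
      DesiredCount: 1
      LoadBalancers:
        - ContainerName: app        # must match a container in the task definition
          ContainerPort: 3000
          TargetGroupArn: !Ref TargetGroup
EOF

# List every AWS resource type the template has to spell out by hand.
resource_types=$(grep -o 'AWS::[A-Za-z0-9:]*' "$template")
echo "$resource_types"
rm -f "$template"
```

Three of the four resources exist only to put a load balancer in front of the service, which is roughly the work the console wizard hides.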

2. Make it work – then automate it

You are probably familiar with that picture of how Spotify builds products. In hindsight, the same principle applies to building and automating cloud infrastructure (and probably to everything business-related). We should focus on providing value from the start. First, let’s build the infrastructure by hand, by clicking through the AWS web console UI.
That way, we will provide real, customer-facing value from the start. Then, we automate the creation of one component at a time. There is an overhead when you have to maintain infrastructure that is built half by scripts and half by hand, of course. You have to manually edit the scripts and insert the IDs of the resources created from the UI. It’s not as simple as it sounds (if it sounds simple at all). But I think that getting the product/feature to the market as early as possible is worth the struggle.

3. Optimize for debugging

Since experimenting with cloud resources takes a lot of time (because creation, changes and rollback all take time), Ansible/CloudFormation scripts should be optimized for debugging and experimenting. This means that, if you’re experimenting with one resource and there’s no need for the rest of the stack to be running, comment out the rest of the stack. An astonishing observation, I know!
But too many times I was thinking like this: ‘Okay, let me try out a different parameter for MongoDB instances. I’m sure that all will be well and I’ll probably leave it as it is.’
Then I spin up all the resources to avoid losing time on commenting out the unnecessary ones. And then, 4 hours later I’m still making changes and experimenting and always waiting for the whole stack to be created/updated/deleted.
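
To put rough numbers on that four-hour session: with the 10-20 minute full-stack cycle mentioned earlier, there is room for only a handful of attempts. The two-minute “trimmed” cycle below is a made-up figure for a stack with everything else commented out, so treat the comparison as a sketch rather than a measurement.

```shell
#!/bin/sh
# Numbers from the text: a 4-hour session and a 10-20 minute full-stack cycle.
session_minutes=$((4 * 60))
full_cycle_best=10
full_cycle_worst=20
# Hypothetical: cycle time when only the resource under test is created.
trimmed_cycle=2

# Integer division gives the number of complete attempts that fit in the session.
full_min=$((session_minutes / full_cycle_worst))
full_max=$((session_minutes / full_cycle_best))
trimmed=$((session_minutes / trimmed_cycle))

echo "full stack:   $full_min to $full_max attempts"
echo "trimmed down: $trimmed attempts"
```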

The same goes for Ansible. If I’m experimenting with a module that I haven’t used before, I should comment out the rest of the playbook. Even though usually there won’t be any effect on the tasks that haven’t been changed, Ansible will have to run them all, and I have to wait for the part of the code that I’m testing to be executed. I always think: “It will probably succeed this time, I don’t want to lose more time commenting out the rest of the playbook.” But it never succeeds on the first try (unless I optimize for debugging and comment out some code, of course, then it works flawlessly, and I have to uncomment the code again).

4. grep is your BFF
Tooling in the DevOps ecosystem is not as advanced as the heavyweight IDEs of the Java world (like Eclipse or IntelliJ).

However, doing DevOps work means that we’re very close to the Unix command line. Therefore, utilizing the command line programs should be part of our workflow. And that is awesome! grep is your best friend forever. Let that sink in. It is your best friend because it is the fastest, most flexible way to search for a place where, for example, an Ansible variable is used and/or defined. Once you learn to grep – it is forever. grep was created almost half a century ago, and it’s here to stay. Can you say that about Eclipse, IntelliJ, Atom, Sublime, or other IDEs and text editors? Not so sure. Investing in learning several grep parameters and use cases definitely pays off.

Want to check where the “satellite_regions” variable is used and defined? Run:
grep -r satellite_regions .
which results in:

./infra.yml:        regions: "{{ [ master_region ] + satellite_regions }}"
./infra.yml:      with_items: "{{ satellite_regions + [master_region] }}"
./infra.yml:      with_items: "{{ satellite_regions }}"
./inventory/aws-dev/group_vars/all/vars/all.yml:satellite_regions: ['sa-east-1']
./inventory/aws-production/group_vars/all/vars/all.yml:satellite_regions: ['ap-southeast-1', 'ap-southeast-2', 'sa-east-1']
./inventory/aws-sandbox/group_vars/all/vars/all.yml:satellite_regions: ['sa-east-1']
./inventory/aws-staging/group_vars/all/vars/all.yml:satellite_regions: ['sa-east-1']
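
A few more grep options tend to pay off quickly. The sketch below builds a throwaway directory with made-up files that loosely mirror the layout above, then tries the options out. It assumes GNU or BSD grep, since `-r` and `--include` are not part of plain POSIX grep.

```shell
#!/bin/sh
# Build a throwaway directory that mimics a tiny Ansible repo (hypothetical files).
demo=$(mktemp -d)
mkdir -p "$demo/inventory/aws-dev"
printf 'satellite_regions: ["sa-east-1"]\n' > "$demo/inventory/aws-dev/all.yml"
printf 'regions: "{{ [ master_region ] + satellite_regions }}"\n' > "$demo/infra.yml"
printf 'satellite_regions are listed in the inventory\n' > "$demo/README.txt"
cd "$demo" || exit 1

# -n adds line numbers, so you can jump straight to the match in an editor.
grep -rn satellite_regions .

# -l prints only the names of the files that contain a match.
files_with_match=$(grep -rl satellite_regions .)

# --include limits the recursive search to files matching a glob.
yaml_only=$(grep -rl --include='*.yml' satellite_regions .)

# Anchoring with ^ finds where the variable is defined rather than used.
definitions=$(grep -rn --include='*.yml' '^satellite_regions:' .)

echo "$files_with_match"
echo "$yaml_only"
echo "$definitions"

# Clean up the throwaway directory.
cd / && rm -rf "$demo"
```

The anchored pattern is the one I would reach for most: it separates the one place a variable is defined from the many places it is used.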

By the way, here’s a really lovely illustration of grep use cases (all credits to Julia Evans):

[Illustration: grep use cases, by Julia Evans]

Since we’re already at the command line almost all the time, you can start using Vim and embrace the best text editor ever… Actually, let’s move on.

5. Hardcoding is fine and DRY means DO REPEAT YOURSELF
Imagine this very common scenario:
We need to create several (or dozens of) EC2 instances which are almost the same. They probably differ in just a few parameters. In Java (or any other high-level programming language), that problem would probably be solved by creating a method with input parameters (the config options that have to differ). So I always tended toward this kind of ‘generic’ approach with CloudFormation/Ansible. Unfortunately, it does not apply here, because CloudFormation is not a Turing-complete language (it does not have a notion of loops).

Therefore, we have to repeat ourselves. This goes for hardcoding as well. We tried too hard to make everything as generic as possible, but that led to hate, which led to suffering, and so on. In the end, maybe my CF/Ansible scripts do not have to create an SSL certificate dynamically. Why would I want to automate the creation of it? Do I need thousands of certificates? If I need only a few and I can reuse them, I’ll create them manually, hardcode the ARN reference into the scripts/inventory and save a lot of time.

Conclusion – is automating an infrastructure worth it?
Short answer: yes, but we have to be smart(er) about it.

A bit longer answer
When we automate infrastructure we get all the advantages (and disadvantages) of maintaining and running the code:

  • ease of execution
  • errors (bugs) are systematically fixed
  • far less documentation
  • people are interchangeable on the project
  • easier to do experiments (this might be the strongest point. My next blog post will be about this.)

But we have to make sure to provide business value as fast as possible. Automating everything from the start can slow us down significantly (maybe even stop us). Therefore, having a fully automated infrastructure is the ultimate goal, not something we start from. We have to be aware that a compromise between automation and providing value is necessary, as depicted in this diagram (taught at business schools):

The Conjoined Triangles of Success (Barker, Jack, 2015)

To me, it is also about the feeling that I’m creating something great with a single command. Think about it. Just a few years ago it was unthinkable to run a single command and spin up a complex, distributed infrastructure with database and application services spanning 3 or more continents with everything being highly available and fault tolerant. If there’s an outage in one region, the requests will automatically be rerouted to another AWS region.

All of that is just beautiful… And we can use this technological marvel to entertain our customers with funny cat videos.