Auto-Scaling Build Agents for GoCD

Aaron Gorka
engineering @ amaysim
7 min read · Mar 5, 2018


At amaysim we use GoCD as our tool of choice for orchestrating our Continuous Integration and Continuous Delivery pipelines. Like many other CI/CD tools, it works well for a few small builds or projects, but can hit challenges once you reach a certain level of scale. With multiple teams working across many different projects and a growing ecosystem of microservices, we quickly hit a choke-point and had to employ a few interesting techniques to allow us to continue to scale effectively.

Why GoCD?

Before we delve into these techniques, let’s take a step back and look at why we chose GoCD.

GoCD has been an effective addition to our toolset since it was introduced here last year (as a replacement for the sadly now defunct Snap-CI). In fact, it is now used by all of our delivery teams to build, test and deploy a wide variety of applications and software. When choosing it, we saw three key advantages compared with other similar products in the market at the time:

  • Everything natively runs as pipelines
  • It’s free and open source
  • It’s highly compatible with our 3 musketeers methodology

It was the right tool at the right time for us in our ongoing improvement cycle.

The Problem: Deployment Velocity and Scaling Agents

CI/CD tools provide huge amounts of value to any organisation serious about delivering high quality software on a frequent basis. But left unchecked and with high growth, they can quickly become bottlenecks and contention-points for teams and organisations.

Continuous Delivery and Continuous Improvement are two of our key tenets at amaysim and looking at how we can improve our deployment velocity is always an area of focus for us. To this end, we even have a dashboard dedicated to tracking deployment velocity which we use to monitor and help track down and isolate bottlenecks, builds that are candidates for optimisation and the overall health of our delivery pipelines.

Production deployments dashboard

GoCD operates using a fairly common server/agent model. Each agent can run one stage (a step in the pipeline) at a time. For simplicity and performance, we run 1 agent per EC2 instance, with instances residing in an Autoscaling Group (ASG). Agents are idle when no stages are assigned to them. During times of low activity, we still pay for those idle agents, and at peak we can use more than 10 agents at a time.

Without enough agents, there will be contention for build time and deployment velocity will suffer. The solution? Run exactly as many agents as we need. However, there are two complications:

  1. We need some kind of metric to scale from. Average CPU utilisation across the ASG is often used as a scaling metric, but in this case it doesn’t necessarily correlate with demand on the platform. Builds and deployments are quite often not CPU-bound, but are instead limited by the speed of external systems (CloudFormation, NPM repositories, etc.). Because of this, CPU utilisation can be quite low even when there are no free agents left to start new builds.
  2. Agents cannot be terminated during a build. This is a problem because we need to scale in at some point and reduce the number of running EC2 instances. AWS does use some logic to determine which EC2 instance to terminate; however, it does not know which instances are in the middle of building or deploying something. If we were frequently terminating instances mid-build, it would be disruptive to the engineers who rely on the platform to deploy.

Scaling Metrics

The metric I chose to scale on was “number of builds currently running”. Luckily, the GoCD server exposes this through an API:

https://api.gocd.org/current/#get-all-agents

By counting the number of running and idle agents (with a little bit of curl and jq) and then emitting those counts to CloudWatch with aws cloudwatch put-metric-data, we now have the metrics we need to drive scaling of the ASG.
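Here is a rough sketch of how that metric collection could look. It is not our exact script: the namespace, metric names and credentials handling are placeholders, and the Accept header version may differ depending on your GoCD server version.

```bash
#!/usr/bin/env bash
# Sketch of the metric-collection script (placeholders, not our exact code).
# Assumes GOCD_URL, GOCD_USER and GOCD_PASSWORD are set in the environment.
set -euo pipefail

# Fetch all agents from the GoCD server; the Accept header version may
# differ for your GoCD version.
agents=$(curl -fsSL -u "${GOCD_USER}:${GOCD_PASSWORD}" \
  -H 'Accept: application/vnd.go.cd.v4+json' \
  "${GOCD_URL}/go/api/agents")

# Count agents that are currently building and agents that are idle.
building=$(echo "${agents}" | jq '[._embedded.agents[] | select(.build_state == "Building")] | length')
idle=$(echo "${agents}" | jq '[._embedded.agents[] | select(.agent_state == "Idle")] | length')

# Publish both counts as custom CloudWatch metrics; the namespace and
# metric names here are illustrative only.
aws cloudwatch put-metric-data --namespace 'GoCD' --metric-name 'BuildingAgents' --value "${building}"
aws cloudwatch put-metric-data --namespace 'GoCD' --metric-name 'IdleAgents' --value "${idle}"
```

A CloudWatch alarm on metrics like these (for example, scaling out when there are no idle agents left) can then drive the ASG’s scaling policies.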

Preventing Scale-in

We now have a system that can bring the number of running agents in line with the number of required agents. But AWS has no idea which instances it’s allowed to terminate. If it starts terminating agents at random, there’s a chance it will terminate an agent that is part-way through a build, which would cause confusion and uncertainty for the engineering team.

There were a few possible solutions to this problem:

  • Let the ASG terminate instances at random. If it happens to pick an agent that is building, remove it from the ASG and then terminate it later once we’ve confirmed it’s no longer building (similar to connection draining). A combination of CloudWatch Events and Lambda could potentially be used here.
  • Enable termination protection on any agents that are building and let the ASG terminate any instance that isn’t protected.

At this point, I still did not know whether it was going to be feasible to scale GoCD agents. The benefits of elastic scaling in the cloud are many, but scaling an application that isn’t cloud-native is a huge hurdle — often seen as impossible. And so, here is my key takeaway from implementing this:

Some things are worth doing even if the solution isn’t perfect. Even if something looks impossible or unrealistic at first glance, running a spike with the simplest possible implementation can have surprising and rewarding results.

Implementing the second solution was trivial, consisting of two API calls in a while true loop.
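The original script isn’t reproduced here, but a minimal sketch of the idea looks roughly like the following. The environment variables, the Accept header version and the way the agent identifies itself (by hostname) are assumptions; the real script is in the repository linked below.

```bash
#!/usr/bin/env bash
# Sketch of the scale-in protection loop (assumptions noted above; the real
# script lives in the docker-gocd-scaler repository linked later in this post).
set -euo pipefail

# The ID of the EC2 instance this agent runs on (IMDSv1 endpoint; IMDSv2
# would additionally require a session token).
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

while true; do
  # API call 1: ask the GoCD server what this agent is currently doing.
  build_state=$(curl -fsSL -u "${GOCD_USER}:${GOCD_PASSWORD}" \
    -H 'Accept: application/vnd.go.cd.v4+json' \
    "${GOCD_URL}/go/api/agents" \
    | jq -r --arg host "$(hostname)" \
        '._embedded.agents[] | select(.hostname == $host) | .build_state')

  # API call 2: toggle scale-in protection on this instance accordingly.
  if [ "${build_state}" = "Building" ]; then
    aws autoscaling set-instance-protection \
      --instance-ids "${INSTANCE_ID}" \
      --auto-scaling-group-name "${ASG_NAME}" \
      --protected-from-scale-in
  else
    aws autoscaling set-instance-protection \
      --instance-ids "${INSTANCE_ID}" \
      --auto-scaling-group-name "${ASG_NAME}" \
      --no-protected-from-scale-in
  fi

  sleep 30
done
```

With protection toggled like this, the ASG is still free to scale in, but it will only ever pick idle agents.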

I then dockerised it, set it to run on every agent, and enabled scaling.

It Works!

The result was surprisingly effective, enabling reasonably aggressive scaling with almost zero contention for agents. We’ve been running it for several months without any issues. You can find the full code here:

https://github.com/amaysim-au/docker-gocd-scaler

It’s not perfect, so there are a few things that could be improved:

  • The current scale-in protection script polls the server’s API from every agent, putting a fair bit of load on the server. At one point we had to scale up the server to deal with the additional load. Running this process centrally on the server, polling the API just once for all agents, would require a rewrite with concurrency.
  • It’s written in bash, and I’d like to rewrite it in Python. This would improve performance, enable more complex functionality such as async processing, allow us to add tests and make the solution more maintainable.

What Else?

Improving the engineering experience and our engineers’ workflow tools, allowing them to seamlessly push features to production, is a core component of our DevOps function at amaysim. There are many other improvements and tweaks we’ve made to our deployment pipelines to make them quicker, easier to use and more reliable:

  • A Lambda to clean up agents in a disconnected state
  • Using EFS to maintain state on the server, enabling the server to be terminated at any point without losing data, which improves reliability
  • On any projects that use NPM, zipping up the node_modules directory before uploading it as an artifact, to prevent slowdown of EFS due to lots of small files (sketched after this list)
  • Github Pull Requests Builder and Github Pull Requests status notifier for PR builds
  • Build-Watcher Notification Plugin. This plugin is particularly great because it will directly message those involved with the latest commits rather than spamming a channel.
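For the node_modules item above, the idea is simply to turn thousands of small files into one archive before GoCD stores it as an artifact. A minimal, hypothetical example; the actual task definitions and paths depend on your pipeline:

```bash
# In the stage that produces the artifact: bundle node_modules into a single
# file so EFS only has to handle one object instead of thousands.
zip -rq node_modules.zip node_modules/

# In a downstream stage, after fetching the artifact:
unzip -q node_modules.zip
```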

Going Forwards

As you’d imagine, there’s still plenty more optimisations and improvements we’d like to make to our GoCD toolset. A few ideas we’ve been considering include:

  • Pipeline as Code: I’d like to be able to store the configuration of each pipeline within the repository of the related project. As of writing, the current version of the pipeline-as-code plugin for GoCD forces you to store all environment variables in code, which makes it a step backwards for us.
  • Building in Docker: running agents as Docker containers is a challenge for us as the 3 musketeers approach requires direct access to the Docker daemon. Docker-in-Docker potentially solves this problem.

GoCD has been a very useful tool in our ongoing push towards Continuous Delivery across the entire amaysim stack. Like many similar tools, it comes with some limitations, but we’ve been able to work around or refine them. This has allowed us to continue to improve the developer experience and deliver on one of the core DevOps objectives: providing reliable, scalable self-service tooling that allows for the smooth flow of features and updates into production.

Has any of the above piqued your interest? Does amaysim sound like the sort of place where you could make an impact? Do you thrive in organisations where you are empowered to bring change and constant improvement? Why not take a few minutes to learn more about our open roles and opportunities, and if you like what you see, say hi; we’d love to hear from you.

Shout-out to all the lawyers…

The views expressed on this blog post are mine alone and do not necessarily reflect the views of my employer, amaysim Australia Ltd.
