Managing Kubernetes clusters at scale: automation to the rescue

Automation, DevOps, Kubernetes

2020-10-28

Kubernetes Clusters

At Skyscrapers we manage multiple Kubernetes clusters for many customers. These clusters are based on our Kubernetes Reference Solution, a fully battle-tested and proven runtime platform for SaaS workloads.

Managing these clusters is not only about operations, but also about maintaining them: performing upgrades, rolling out new components and features, etc. And we’re not talking about just one cluster, but many, most of them running production workloads.

Our key principles require us to be scalable as an organisation while, at the same time, offering a consistently high-quality service. How do you do that as a growing company? Well … at Skyscrapers, we invested quite a bit in automating a lot of this maintenance work.

In this article we’d like to share some of that experience with you.

Infrastructure as Code

At the very heart of it all we use Infrastructure as Code to package our solutions. Our technology of choice for years has been Terraform.

Focussing on the Kubernetes platform, we divide our Terraform code into several “stacks”. Building on the Terraform modules concept, we define a stack as a full “deployable” unit. For more information about the concept, you can consult our documentation.

At the minimum, for a full setup we have the following:

networking: Provides the VPC, generic subnets, NAT, route tables, etc.
eks-cluster or aks-cluster: Sets up the basic AWS EKS or Azure AKS cluster and some related resources like auth providers, configuration of core kube-system components etc. For AKS the Node Groups are also defined here.
eks-workers: This stack is EKS specific and instantiates one or many Kubernetes Node Groups (basically an AWS AutoScalingGroup). Thanks to new Terraform features in 0.12 and 0.13 we plan to merge this back into the cluster stack.
addons: This is the largest beast, responsible for managing all addon features. It deploys things like Ingress, our monitoring and logging stack, cert-manager, external-dns and so on.

Cluster definition file

To provide a single source of truth for the parameters of the several Terraform stacks, we define a single YAML file that is fed into Terraform. An example of such a file:

meta:
  CLUSTER_NAME: production.eks.example.com
  CLUSTER_TYPE: eks
  RELEASE_MODEL: stable

tfvars:
  common:
    aws_region: us-east-1
    teleport_token: ""

  cluster:
    k8s_base_domain: eks.example.com
    enabled_cluster_log_types: ["audit"]
    nodelocal_dns_enabled: true

  workers:
    - name: spotworkers
      instance_type: m5a.xlarge
      autoscaling: true
      min_size: 3
      max_size: 6
      spot_price: 0.1

  addons:
    sla: production
    kubernetes_dashboard_enabled: false
    vertical_pod_autoscaler_enabled: true
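To illustrate how one definition file can feed several stacks, here is a minimal Python sketch. It is not the actual tooling. It assumes the YAML has already been parsed into a dictionary, that the `common` values are merged into every stack's variables, and that each stack receives its variables as a JSON variable file (Terraform automatically loads files named `*.auto.tfvars.json`). The function names `stack_tfvars` and `render_tfvars_json`, the merge rule and the file naming are hypothetical.

```python
import json

# Parsed form of a shortened cluster definition file. In practice this
# dictionary would come from a YAML parser; it is inlined here so the
# sketch stays self-contained.
DEFINITION = {
    "meta": {
        "CLUSTER_NAME": "production.eks.example.com",
        "CLUSTER_TYPE": "eks",
        "RELEASE_MODEL": "stable",
    },
    "tfvars": {
        "common": {"aws_region": "us-east-1"},
        "cluster": {"k8s_base_domain": "eks.example.com"},
        "workers": [{"name": "spotworkers", "min_size": 3, "max_size": 6}],
        "addons": {"sla": "production"},
    },
}


def stack_tfvars(definition, stack):
    """Return the variables for one stack: common values plus stack-specific ones.

    Assumptions: stack-specific keys win over common keys, and a list-valued
    section (such as workers) is passed through as a single variable named
    after the section.
    """
    tfvars = definition["tfvars"]
    merged = dict(tfvars.get("common", {}))
    section = tfvars.get(stack, {})
    if isinstance(section, list):
        merged[stack] = section
    else:
        merged.update(section)
    return merged


def render_tfvars_json(definition, stack):
    """Serialise one stack's variables in Terraform's JSON variable-file syntax."""
    return json.dumps(stack_tfvars(definition, stack), indent=2, sort_keys=True)


if __name__ == "__main__":
    for stack in ("cluster", "workers", "addons"):
        print(f"# {stack}.auto.tfvars.json")
        print(render_tfvars_json(DEFINITION, stack))
```

Keeping the merge in one small function means every stack sees the same `common` values, which is the point of having a single source of truth.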

Concourse CI/CD

Now that we have the basic building blocks with Terraform and the Definition File, it’s time to piece everything together. Our choice here was to leverage Concourse CI to both generate and run pipelines for each of our customers.

When I say we use Concourse to generate its own pipelines, I mean exactly that. Going one step further, you could say we do this Inception-style, with multiple layers:

We have a pipeline-generator task that loops over all defined customers and generates a new customer-pipeline-generator job for each of our customers.
The customer-pipeline-generator, in turn, generates specific pipelines for each of the enabled components, like Kubernetes, Teleport, Vault and so on.
Each of these specific pipelines is then responsible for managing deploys of changes on the related components.
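The layering above can be sketched in a few lines of Python. This is only an illustration of the nesting, not the real generator code: the customer names, the `plan-`/`apply-` job names and the dictionary layout are assumptions, and a real Concourse pipeline also carries resources and job plans.

```python
# Hypothetical input: which components are enabled for which customer.
CUSTOMERS = {
    "acme": ["kubernetes", "teleport"],
    "globex": ["kubernetes", "vault"],
}


def generate_customer_generators(customers):
    """Layer 1: one customer-pipeline-generator job per customer."""
    return [f"{name}-pipeline-generator" for name in sorted(customers)]


def generate_component_pipelines(customer, components):
    """Layer 2: one pipeline per enabled component of a single customer.

    The layout only loosely mirrors a Concourse pipeline (a list of named
    jobs); the job names are made up for the example.
    """
    return {
        f"{customer}-{component}": {
            "jobs": [
                {"name": f"plan-{component}"},
                {"name": f"apply-{component}"},
            ]
        }
        for component in components
    }


def generate_all(customers):
    """Run layer 2 for every customer and collect the pipelines by name."""
    pipelines = {}
    for customer in sorted(customers):
        pipelines.update(
            generate_component_pipelines(customer, customers[customer])
        )
    return pipelines


if __name__ == "__main__":
    print(generate_customer_generators(CUSTOMERS))
    for pipeline_name in generate_all(CUSTOMERS):
        print(pipeline_name)
```

Adding a customer or enabling a component then only changes the input data; the generators themselves stay untouched.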

Since Concourse pipeline definitions are all written in YAML, we use spruce in our generators to piece all the bits together.

A simple example:

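# Merge the shared resource and job parts with each customer's meta file, pruning the meta key from the output.
find ./customers/ -name "*-meta.yaml" -exec sh -c 'spruce merge --prune meta customer-resources-part.yaml customer-jobs-part.yaml "$1" > "$1.final"' _ {} \;
# Merge all per-customer results into the base pipeline.
spruce merge base-pipeline.yaml customers/*.final > "$BASE/concourse-stacks-pipeline/pipeline.yaml"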

As an end result we get something like this:

Kubernetes platform lifecycle management

Release process

Now, how do we get from commit to production deploy?

We use 2 channels in our process: insiders and stable.

The insiders channel just follows the master branch in our git repositories. Every commit to master, usually through small, reviewed PRs, triggers the pipeline for all insiders clusters. These are our own internal production cluster and, at times, some extra test clusters.

These changes then get tested and validated on the insiders cluster(s), via a combination of K8s conformance tests and manual validation.

Once we deem the change ready, we create a new GitHub “release” of our packages. This in turn triggers all the pipelines following the stable channel.

As a plus, most changes are safe to roll out, since we put them behind feature flags that have to be explicitly enabled per customer or cluster.
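The two channels and the feature flags can be pictured with a small Python sketch. It assumes a hypothetical cluster inventory derived from the definition files (the `RELEASE_MODEL` value and the addons flags), made-up event names, and that a flag which is not set counts as disabled. None of this is taken from the real pipelines.

```python
# Hypothetical inventory; in practice these values would come from each
# cluster's definition file. The first cluster name is invented.
CLUSTERS = [
    {
        "name": "internal.eks.example.com",
        "release_model": "insiders",
        "addons": {"vertical_pod_autoscaler_enabled": True},
    },
    {
        "name": "production.eks.example.com",
        "release_model": "stable",
        "addons": {"kubernetes_dashboard_enabled": False},
    },
]

# Assumed event names; the mapping follows the two channels described above.
EVENT_TO_CHANNEL = {"master_commit": "insiders", "github_release": "stable"}


def clusters_to_trigger(event, clusters):
    """Return the names of the clusters whose channel matches the event."""
    channel = EVENT_TO_CHANNEL[event]
    return [c["name"] for c in clusters if c["release_model"] == channel]


def feature_enabled(cluster, flag):
    """Assumption: a flag that is not set for a cluster counts as disabled."""
    return bool(cluster["addons"].get(flag, False))


if __name__ == "__main__":
    print(clusters_to_trigger("master_commit", CLUSTERS))
    print(clusters_to_trigger("github_release", CLUSTERS))
```

With flags defaulting to off, a release can reach every stable cluster while a new feature only becomes active where it was switched on.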

The future

Unfortunately, at this moment the actual deployment of new features, bug fixes, etc. still needs manual intervention: a human validates the planned changes for each cluster before running the deploy jobs.
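To give an idea of what such a validation looks at, the Python sketch below summarises a Terraform plan in the JSON form produced by `terraform show -json <planfile>`, where each entry in `resource_changes` lists its `actions`. The sample plan, the resource addresses and the review policy in `needs_close_review` are assumptions made for the example, not a description of the current process.

```python
import json
from collections import Counter

# Shortened, hand-written sample in the shape of `terraform show -json`.
# The resource addresses are invented for the example.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "helm_release.cert_manager",
     "change": {"actions": ["update"]}},
    {"address": "aws_autoscaling_group.workers",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_vpc.main",
     "change": {"actions": ["no-op"]}}
  ]
}
""")


def summarise_plan(plan):
    """Count planned actions, ignoring resources that do not change."""
    counts = Counter()
    for resource in plan.get("resource_changes", []):
        for action in resource["change"]["actions"]:
            if action != "no-op":
                counts[action] += 1
    return dict(counts)


def needs_close_review(plan):
    """Assumed policy: any delete (including replacements) needs a careful look."""
    return summarise_plan(plan).get("delete", 0) > 0


if __name__ == "__main__":
    print(summarise_plan(SAMPLE_PLAN))
    print("needs close review:", needs_close_review(SAMPLE_PLAN))
```

A summary like this does not replace the human check, but it shows quickly which clusters have plans that destroy or replace resources.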

This system has served us quite well, but our growing number of customers, combined with a growing maturity in testing and releasing, is making us look towards further automation from commit to production, with as little human intervention as possible.

Some improvements we are already thinking about: