2025-09-15, Bierstadt Lagerhaus Stage
Our platform team began with the same goal as most platform teams: to unify and streamline deployments for all the teams in our organization. Then life threw us a curveball, as it always does. We were acquired, and that kicked off a cloud migration. Migrating with K8s made things easier, but a more sinister problem was lurking: how do we create, destroy, or retire clusters without disrupting teams?
This talk tells the story of how we replaced a patchwork of Terragrunt-based workflows with Cluster API (CAPI), a Kubernetes-native framework for creating, updating, and deleting clusters at scale. You will hear about the ups, the downs, and the cultural shifts that came with migrating infrastructure typically written in Terraform to YAML.
This talk follows the Shakespearean approach and is divided into four acts:
Act 1 - How did we get here? (The Terraform nightmare)
- Our platform team began with a humble GitOps approach in which each "service domain" got its own clusters. We generated Terragrunt HCL files with a tool called boilerplate, driven by values files similar to those used with Helm charts, and the platform team used them to create clusters.
- This worked fine for a couple of service domains, but as soon as we scaled up, cracks started to appear. State management became a nightmare, and working with K8s resources through Terraform manifests grew more and more complex.
- On top of this, we were pounded with cluster upgrades. We used a blue/green strategy for them, which meant even more clusters had to be created through Terraform.
- The cloud migration added another problem: we had to run clusters in two different clouds at the same time, which meant maintaining two sets of Terraform code to create them.
Act 2 - Enter the protagonist - CAPI
- Cluster API began as an experiment in our platform team to solve the problems described above. The goal of the PoC was to show that we could spin up clusters deterministically from a single hub cluster, without Terraform.
- At the same time, we also wanted to check out cluster upgrade and destruction strategies and see whether they aligned with what our platform team already supported for our tenants.
Act 3 - The curtain opens - CAPI in production
- The management cluster is brought up with Terraform, with Argo CD bootstrapped on it. But gone are the days of managing Terraform code for every cluster. Clusters are now literally cattle, and their definitions are stored as YAML in a Git repo, in a form determined by the platform team.
- As we adopted CAPI, we also realized that it helped us keep our platform code cloud agnostic. We could now manage clusters across providers without worrying about provider-specific Terraform modules and their gotchas.
- As we moved CAPI into our platform, we also discovered that we could use health checks to make sure the platform as a whole stays healthy and that the platform team gets paged if something goes wrong.
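To make the idea of clusters defined as data a little more concrete, here is a minimal sketch of what such definitions might look like. It is not our production setup. It assumes the health checks mentioned above map to CAPI's `MachineHealthCheck` resource, and it uses the Docker infrastructure provider (`DockerCluster`) with a `KubeadmControlPlane` purely as illustrative choices. The names `team-a` and `clusters`, the pod CIDR, and the helper functions are hypothetical. Paging is not shown, since it would depend on whatever alerting stack watches these resources.

```python
import json

# Core Cluster API group/version (assumed: v1beta1).
CAPI_API_VERSION = "cluster.x-k8s.io/v1beta1"


def cluster_manifest(name, namespace, pod_cidr="192.168.0.0/16"):
    """Build a CAPI Cluster object as a plain dict.

    The infrastructure and control plane references are illustrative:
    a real setup would point at the provider kinds it actually uses.
    """
    return {
        "apiVersion": CAPI_API_VERSION,
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterNetwork": {"pods": {"cidrBlocks": [pod_cidr]}},
            "controlPlaneRef": {
                "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
                "kind": "KubeadmControlPlane",
                "name": f"{name}-control-plane",
            },
            "infrastructureRef": {
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
                "kind": "DockerCluster",
                "name": name,
            },
        },
    }


def machine_health_check(cluster_name, namespace, timeout="300s", max_unhealthy="40%"):
    """Build a MachineHealthCheck that flags machines whose Node is not Ready."""
    return {
        "apiVersion": CAPI_API_VERSION,
        "kind": "MachineHealthCheck",
        "metadata": {"name": f"{cluster_name}-node-unhealthy", "namespace": namespace},
        "spec": {
            "clusterName": cluster_name,
            # Stop remediating if too many machines are unhealthy at once.
            "maxUnhealthy": max_unhealthy,
            "selector": {
                "matchLabels": {"cluster.x-k8s.io/cluster-name": cluster_name}
            },
            "unhealthyConditions": [
                {"type": "Ready", "status": "Unknown", "timeout": timeout},
                {"type": "Ready", "status": "False", "timeout": timeout},
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical tenant cluster; the output would be committed to Git
    # and synced to the management cluster by a GitOps tool.
    for obj in (cluster_manifest("team-a", "clusters"),
                machine_health_check("team-a", "clusters")):
        print(json.dumps(obj, indent=2))
```

The objects are printed as JSON only to keep the sketch within the standard library. In a Git repo they would normally be written as YAML, alongside the provider-specific objects that the references point to.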
Act 4 - How's it going?
- One of the main goals we achieved with CAPI was streamlining our cluster processes, specifically cluster creation. Not only did we simplify the process, we also cut the time taken to create a cluster by 60%.
- This also meant cultural shifts for the platform team. As we all know, change is hard. Our platform team was used to writing boilerplate-generated HCL files to create clusters and define node groups in Terraform. It took a while, plus a huge load of internal demos and Confluence docs, for us to move to the CAPI approach. In our opinion, this may be the most significant decision for a platform team, since it involves not only learning a new tool but also moving away from traditional IaC.
- Audit: We as a platform team highly recommend CAPI for its auditability. Because cluster definitions live in Git under our GitOps workflow, the platform team can always review them.
Senior DevOps engineer at Walmart, focused on building a K8s-based platform for cloud-agnostic, GitOps-based infrastructure.