Rollback-by-Design: Safe Infrastructure Migrations at Scale
2026-05-05, Inspiration A/B

We migrated our production Kubernetes clusters from Cluster Autoscaler to Karpenter, but with thousands of nodes and hundreds of services, unknown problems were inevitable. This talk covers how we built a rollout strategy where "pause" and "rollback" were as routine as "continue": explicit go/no-go criteria, staged learning with hand-picked canary batches, and tiered rollout with gated alarms. The result: double-digit cost reductions, and a pattern we've reused for many more high-risk migrations.


Description

Karpenter is a newer node autoscaler that promises faster scheduling and better cost efficiency. But this kind of migration touches everything: scheduling latency, scale-down behavior, and disruption semantics across every workload. Just one of our production clusters has thousands of nodes and hundreds of services, which meant unknown unknowns were inevitable.

We needed two things: a way to collapse months of unknowns into days, and a rollout strategy where stopping felt as safe as continuing.

The approach:

First, explicit go/no-go criteria defined before starting, not discovered mid-rollout. The talk covers what those criteria looked like for an autoscaler migration, and how we got stakeholder alignment on "bail is a valid outcome."
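To make that concrete, here is a minimal sketch of what "criteria defined before starting" can look like when captured as data rather than tribal knowledge. The metric names, thresholds, and must-have/negotiable split below are hypothetical placeholders, not the actual criteria from the migration:

```python
# Hypothetical sketch: go/no-go criteria captured as data before the rollout starts.
# Metric names and thresholds are illustrative, not the values used in the migration.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str          # signal being watched
    max_value: float   # worst acceptable reading
    must_have: bool    # True = any breach is an automatic no-go; False = negotiable

CRITERIA = [
    Criterion("p99_pod_scheduling_latency_s", 30.0, must_have=True),
    Criterion("pending_pods_sustained_5m", 50, must_have=True),
    Criterion("involuntary_disruptions_per_day", 5, must_have=False),  # negotiable
]

def go_no_go(observed: dict) -> str:
    """Return 'no-go' if any must-have criterion is breached; bail is a valid outcome."""
    for c in CRITERIA:
        if c.must_have and observed.get(c.name, float("inf")) > c.max_value:
            return "no-go"
    return "go"
```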

Second, staged learning: synthetic tests for the behaviors we already knew were problematic, a staging pass to rule out the obvious risks, then a canary production cluster where all the real iteration happened. Within that cluster, we used a hand-picked batch of services chosen to surface different classes of problems fast. The talk walks through how we selected that batch, and what each category was designed to catch.
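As a rough illustration of the batch structure, the three categories below mirror the ones covered in the talk; the service names and groupings are invented for this sketch:

```python
# Hypothetical canary batch, grouped by the class of problem each slice is meant to surface.
# Service names are invented; the three categories mirror the ones discussed in the talk.
CANARY_BATCH = {
    # Low blast radius: establishes a baseline and proves out basic scheduling behavior.
    "non_consequential_baseline": ["svc-internal-dashboard", "svc-nightly-reports"],
    # High-visibility services whose owning teams agreed to watch closely and flag regressions fast.
    "critical_with_aligned_teams": ["svc-core-api-shadow", "svc-payments-readonly"],
    # Intentional stressors for known risk areas (e.g., disruption handling,
    # pods that expect to live a long time, requests large enough to force new node launches).
    "intentional_stressors": ["svc-gpu-batch", "svc-stateful-cache-warmer"],
}
```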

Third, tiered rollout with explicit gates: expansion by service criticality, percentage-based phases, and dedicated alarms tied to our go/no-go signals. When a threshold was crossed, on-call was notified and the decision to revert was obvious.
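A minimal sketch of how phases and gate alarms can be wired together so the on-call action is mechanical rather than a judgment call; the tier names, percentages, and alarm names here are assumptions for illustration:

```python
# Hypothetical sketch of a tiered rollout plan where every phase has an explicit gate.
# Tier names, percentages, and alarm names are illustrative.
PHASES = [
    {"name": "canary-batch",      "fleet_pct": 1,   "gate_alarms": {"canary-p99-scheduling", "canary-pending-pods"}},
    {"name": "non-critical-tier", "fleet_pct": 10,  "gate_alarms": {"tier1-p99-scheduling", "tier1-disruptions"}},
    {"name": "critical-tier",     "fleet_pct": 50,  "gate_alarms": {"tier2-p99-scheduling", "tier2-disruptions"}},
    {"name": "full-fleet",        "fleet_pct": 100, "gate_alarms": {"fleet-p99-scheduling", "fleet-disruptions"}},
]

def next_action(firing_alarms: set, phase: dict) -> str:
    """If any gate alarm for the current phase fires, the runbook action is to pause and
    roll the phase back; otherwise keep expanding. On-call responds instead of debating."""
    if firing_alarms & phase["gate_alarms"]:
        return "pause-and-rollback"
    return "proceed-to-next-phase"
```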

What we found:

Karpenter was supposed to be faster, but we found consolidation scenarios where it was actually slower than Cluster Autoscaler. We found workloads that silently assumed pods would live "long enough." These weren't edge cases; they would have broken at scale. The hand-picked batch surfaced these issues early, and we fixed them before the broader rollout even started. The talk covers these discoveries, how the batch selection strategy caught them, and what we did about them.

The result: double-digit percentage reductions in idle costs, and a pattern we've since reused for other high-blast-radius migrations.

Attendees Will Take Away

  • Go/no-go criteria: How to define must-haves and negotiables before starting—and get stakeholder alignment that "bail" is a valid outcome
  • Canary batch selection: How to pick services that surface different classes of problems fast (non-consequential baseline, critical services with aligned teams, intentional stressors)
  • Gating signals: Which metrics to watch (scheduling latency P99, sustained pending pods, service disruptions) and how to tie them to batch-specific alarms—so on-call isn't guessing, just responding

I'm a Staff Software Engineer at Lyft in NYC, where I lead a compute infrastructure team managing Kubernetes on AWS and supporting ~1000 services across infrastructure, data platform, and product. Previously, I was a Principal Software Engineer at Red Hat, where I built Kubernetes control-plane components and installers used to manage fleets of OpenShift clusters across cloud and on-prem environments.