2026-05-06 –, Ballroom
Kubernetes clusters generate huge volumes of metrics, logs, and traces even when nothing is wrong. I wanted to see whether anomaly detection could help separate real problems from background noise, so I built a small Kubernetes cluster and intentionally broke it in different ways. The results were mixed: some failures were detected early, while normal behavior often triggered false alarms. This lightning talk covers what anomaly detection is good at, where it struggles, and what DevOps teams can realistically expect when applying it in Kubernetes environments.
This talk describes a small personal experiment combining chaos engineering and anomaly detection in Kubernetes. I instrumented a lightweight cluster with OpenTelemetry and used Chaos Mesh to inject controlled failures: pod deletions, CPU and memory stress, network latency, DNS failures, and forced autoscaling bursts. These failures mirror issues many teams see in real production environments. I then applied simple statistical anomaly detection techniques to the telemetry coming out of the cluster to see what stood out as unusual. Some methods detected real issues quickly; others raised alerts when nothing was wrong. By comparing what the system flagged as anomalous with what was actually happening in the cluster, I could see where anomaly detection was genuinely helpful and where it failed without proper context. The takeaway: anomaly detection can be useful, but only when combined with good observability practices and human understanding of system behavior.
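The talk does not show the exact Chaos Mesh manifests used; as one illustration of how such a failure is injected, a minimal PodChaos resource that kills a random pod matching a label selector might look like this (the namespace and `app: demo` label are assumptions, not the experiment's actual setup):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
spec:
  # Kill one randomly selected matching pod.
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default          # assumed namespace
    labelSelectors:
      app: demo          # assumed target label
```

Applying a manifest like this with `kubectl apply` triggers a single controlled pod deletion, which the observability pipeline should then surface as a disturbance in the metrics and traces.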
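The "simple statistical techniques" are not specified in the abstract; as one representative example, a rolling z-score detector over a single metric series could be sketched as follows (a minimal illustration, not the experiment's actual code; the window size and threshold are assumed defaults):

```python
from collections import deque

def rolling_zscore_anomalies(values, window=30, threshold=3.0):
    """Flag indices whose z-score against a trailing window exceeds threshold.

    A cheap baseline detector for CPU, memory, or latency series. Note the
    failure mode discussed in the talk: legitimately bursty behavior such as
    an autoscaling event also produces large z-scores, so this method alone
    cannot distinguish real incidents from expected spikes.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, v in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            std = var ** 0.5
            if std > 0 and abs(v - mean) / std > threshold:
                anomalies.append(i)
        history.append(v)
    return anomalies

# A mildly noisy series with one injected spike: only the spike is flagged.
series = [100 + (i % 5) for i in range(60)]
series[45] = 500
print(rolling_zscore_anomalies(series))  # prints [45]
```

Once the spike enters the trailing window it inflates the standard deviation, temporarily suppressing further detections, one concrete reason such detectors need human context around them.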
DevOps engineer with experience designing, building, and optimizing cloud infrastructure. I work extensively with Kubernetes, infrastructure as code, CI/CD pipelines, and open source observability tools to improve system reliability, scalability, and operational efficiency in production environments.