06-23, 16:10–16:35 (Europe/Amsterdam), Grote Zaal
At Cloudflare, we use Prometheus heavily. We have points of presence (PoPs) in more than 285 cities, and each PoP has its own Prometheis. All of these Prometheis send alerts to a central Alertmanager. We have various integrations to route the alerts, and we also route every alert to a datastore for alert analytics.
The talk covers:
- What is alert analytics?
- Why is alert analytics important and what problems does it solve?
Here we will discuss:
* The importance of trending alerts.
* Identifying noisy components and why they are noisy.
* Hardware and application bottlenecks.
* Correlation with releases/incidents.
* How analysing the time to resolve alerts reveals opportunities for improvement.
- How to do alert analysis using open source tools.
- Conclusion.
Monika works as an SRE at Cloudflare. She is passionate about observability, k8s, and databases. Outside of work, she tries to keep up with the energy of her toddler.