Monika
Monika works as an SRE at Cloudflare. She is passionate about observability, k8s and databases. Outside of work, she tries to keep up with the energy of her toddler.
Session
At Cloudflare, we use Prometheus heavily. We have points of presence (POPs) in more than 285 cities, and each POP has its own Prometheis. All of these Prometheis send alerts to a central Alertmanager. We have various integrations to route the alerts, and we also route all of them to a datastore for alert analytics.
The talk covers
- What is alert analytics?
- Why is alert analytics important and what problems does it solve?
In this part we will discuss
* The importance of trending alerts.
* Identifying noisy components and why they are noisy.
* Bottlenecks - hardware/application.
* Correlation with releases/incidents.
* How analysing the time to resolve alerts reveals opportunities for improvement.
- How to do alert analysis using open source tools.
- Conclusion.