From Chaos to Consistency: Implementing a system to manage 1M+ SLOs in a Large Enterprise | DevOpsDays Kerala

2024-09-28 11:00–11:25, AI/SRE

Discover how we turned chaos into consistency by standardizing and scaling SLO management for over a million SLOs at a large enterprise known for its professional networking platform. Learn about our journey from inconsistent availability measurements to adopting the OpenSLO standard and achieving reliable 99.9% availability for end users across thousands of microservices and diverse client applications.

This session explores how we standardized Service Level Objectives (SLOs) within a large organization known for its professional networking platform. We begin with the initial state: disparate and inconsistent availability measurements across services. We then cover the early challenges we faced, including a significant drop in our availability metrics. We discuss our adoption of OpenSLO for defining SLOs and the subsequent improvements that restored our user-facing availability to 99.9%. Our system now scales to millions of SLOs, thousands of microservices, and diverse client applications, processing billions of metrics to derive accurate Service Level Indicators (SLIs). Attendees will come away with strategies and lessons learned from this journey, along with practical guidance for standardizing SLOs in their own organizations.

Anoop Nayak

Anoop is a Staff Software Engineer at LinkedIn HQ in California, where he solves engineering challenges at the scale of a large organization. Before this role, he worked as an SRE on LinkedIn's web and mobile apps and frontend APIs, and he contributed to scaling LinkedIn from 500M to 1B+ users. Anoop is particularly interested in client-side observability, including web, mobile, and frontend APIs, and in distributed systems.
