SLA vs. SLO vs. SLI

Service Level Agreement (SLA)

SLA stands for Service Level Agreement. An SLA is a contract between a service provider and a customer that defines the level of service expected from the service provider.

Some key elements of an SLA include:

• Uptime guarantee: This specifies the minimum percentage of time the service is expected to be available. For example, 99.9% uptime allows roughly 8.76 hours of downtime per year (0.1% of the 8,760 hours in a year).

• Response/resolution time: This specifies how quickly the service provider will respond to and resolve issues and outages. It could be measured in minutes, hours, or days.

• Credits/compensation: If the service provider fails to meet the agreed-upon SLA metrics, they may have to provide credits or compensation to the customer.

• Exclusions: Events beyond the service provider's control that may affect uptime, such as natural disasters or government actions, are typically excluded from the SLA.

• Monitoring and reporting: The SLA will specify how uptime and performance will be monitored and reported to the customer.

• Dispute resolution process: In case of disputes regarding SLA compliance, the SLA defines how issues will be escalated and resolved.
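The uptime guarantee in the list above translates directly into a downtime allowance. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime_hours` and the choice of a 365-day year are assumptions made for illustration, not terms from any particular SLA.

```python
# Hypothetical helper: converts an uptime percentage into permitted downtime.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime, in hours, permitted over the period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


# Compare a few common uptime targets over one year.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime -> {hours:.2f} hours of downtime per year")
```

A real SLA usually states its own measurement period (often monthly), so `period_hours` would be set to match the contract.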

Service Level Objective (SLO)

SLO stands for Service Level Objective. An SLO is a target value, or range of values, for a particular aspect of a service that an organization aims to achieve.

SLOs are often used in conjunction with SLAs. While an SLA defines the contractual obligations between a service provider and a customer, SLOs define internal targets that the service provider sets for itself to meet the SLA commitments.

Some key points about SLOs:

• SLOs are more specific and technical than SLAs. They focus on measurable aspects of a service.

• Common SLOs for software services set targets for metrics such as uptime percentage, error rate, latency, and throughput.

• SLOs have thresholds that define when the objective is met or missed. For example, a request-based SLO could be "99.9% of requests will succeed within 500ms".

• SLOs help optimize operational and engineering decisions for the defined objectives.

• SLOs are often monitored continuously and reported internally. Any misses are investigated to determine the root cause.

• SLOs should be ambitious but also realistic and achievable given the current capabilities and resources.

• SLOs typically change and evolve as the service and capabilities improve.
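To make the threshold idea concrete, here is a rough sketch that checks the example objective "99.9% of requests will succeed within 500ms" against a batch of requests. The `Request` type, the `slo_met` function, and the decision to treat an empty batch as compliant are all assumptions chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request, as a monitoring system might log it."""
    succeeded: bool
    latency_ms: float


def slo_met(requests: list[Request],
            target_ratio: float = 0.999,
            latency_threshold_ms: float = 500.0) -> bool:
    """Return True if enough requests succeeded within the latency threshold."""
    if not requests:
        return True  # assumption: no traffic counts as meeting the objective
    good = sum(
        1 for r in requests
        if r.succeeded and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests) >= target_ratio


# 998 fast successes, one slow success, and one failure out of 1,000 requests.
batch = [Request(True, 120.0)] * 998 + [Request(True, 750.0), Request(False, 90.0)]
print(slo_met(batch))  # 998 good requests out of 1,000 is below the 99.9% target
```

Note that a slow success counts against the objective just as a failure does, because the objective requires both success and a response within the threshold.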

Service Level Indicator (SLI)

SLI stands for Service Level Indicator. An SLI is a metric that is used to measure whether a Service Level Objective (SLO) is being met.

So while an SLO defines the target service level for a particular aspect of a service, SLIs are the concrete metrics that are tracked to determine if the SLO is being achieved.

Some examples of SLIs:

For an SLO of 99.9% uptime:

• Number of requests succeeded

• Number of requests failed

For an SLO of <500ms latency:

• Average response time

• Percentile response times (e.g. 95th percentile latency)

For an SLO of <1% error rate:

• Number of errors

• Error rate
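The SLIs listed above can all be derived from the same raw request data. The sketch below assumes each sample is a `(succeeded, latency_ms)` tuple and uses the nearest-rank method for percentiles. Both choices, along with the names `percentile` and `compute_slis`, are illustrative assumptions rather than a standard API.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def compute_slis(samples: list[tuple[bool, float]]) -> dict[str, float]:
    """Compute example SLIs from (succeeded, latency_ms) samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    total = len(samples)
    failed = sum(1 for ok, _ in samples if not ok)
    latencies = [latency for _, latency in samples]
    return {
        "requests_succeeded": total - failed,
        "requests_failed": failed,
        "error_rate": failed / total,
        "avg_latency_ms": mean(latencies),
        "p95_latency_ms": percentile(latencies, 95),
    }


# 19 fast successes and one slow failure.
samples = [(True, 100.0)] * 19 + [(False, 900.0)]
for name, value in compute_slis(samples).items():
    print(f"{name}: {value}")
```

In this sample the single slow request raises the average latency but leaves the 95th percentile unchanged, which is one reason average and percentile latencies are often tracked side by side.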

SLIs have the following characteristics:

• They are technical, concrete, and measurable.

• They are tracked automatically using monitoring systems.

• They are closely tied to a particular SLO. Multiple SLIs may map to a single SLO.

• They are compared against the thresholds set by the SLO to determine whether the SLO is met or missed.

• They are reported regularly to determine if corrective action needs to be taken.

Conclusion

An SLA defines the obligations of a service provider to deliver a defined level of service and specifies remedies if those obligations are not met. It helps set clear expectations and resolve issues that may arise during a business relationship.

SLAs focus on the customer-facing commitments, and SLOs help service providers implement the internal processes, tools, and engineering required to meet those commitments. SLOs also provide transparency and accountability for internal service optimization efforts.

SLIs help determine if the SLOs that an organization has set for its services and applications are being achieved in practice. By monitoring the right SLIs, service teams can optimize their systems to meet the defined objectives.