SLA vs SLO vs SLI: a practical guide for engineers

Uptimera team · 9 min read

SLA, SLO, and SLI sound like three flavors of the same thing. They aren't. They are three rungs of a single ladder, and most reliability conversations get confused because someone mixes them up. Once you can place each one on the ladder you can read any vendor contract or SRE book without translating in your head.

The three words in one paragraph

An SLI is a number you measure, e.g. "99.93% of homepage requests returned a 2xx in the last 30 days." An SLO is the target you set for that number, e.g. "the homepage SLI should be ≥ 99.9% over any 30-day window." An SLA is a contract that says what you owe the customer if a committed level is missed, e.g. "if homepage availability drops below 99.5% in a month, the customer gets a 10% credit." In this example the contractual threshold (99.5%) is looser than the internal SLO (99.9%), so the team misses its own target well before any credit is owed. SLI is measurement, SLO is intent, SLA is a promise with consequences.

SLI: pick something a customer would notice

The biggest mistake teams make at the SLI stage is choosing metrics that are easy to measure rather than metrics that matter. CPU utilization is easy. "Login success rate" is what an actual user cares about.

A good SLI has three properties:

  • It expresses something a user experiences. Latency of the checkout API, success rate of file uploads, freshness of the most recent dashboard data.
  • It's a ratio with a clear numerator and denominator. "Good events / total events", where you can defend what counts as "good."
  • It's measurable continuously, not just during incidents. A metric you only collect when things break is useless for setting a target.

Examples of useful SLIs

  • Availability: count of requests returning 2xx or 3xx / total non-cancelled requests, over a 5-minute window.
  • Latency: count of requests completed within 500ms / total completed requests.
  • Freshness: count of dashboard loads where the most recent data is < 1 minute old / total dashboard loads.
  • Correctness: count of payment webhooks processed without retries / total webhooks received.
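The availability and latency examples above can be sketched as plain "good events / total events" ratios. This is a minimal illustration, not a monitoring implementation: the `Request` record, the function names, and the choice to treat an empty window as 100% good are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int            # HTTP status code
    duration_ms: float     # time taken to complete the request
    cancelled: bool = False


def ratio(good: int, total: int) -> float:
    """Return good/total. Assumption: an empty window counts as fully good."""
    return 1.0 if total == 0 else good / total


def availability_sli(requests: list[Request]) -> float:
    # Denominator: non-cancelled requests. Numerator: 2xx or 3xx responses.
    counted = [r for r in requests if not r.cancelled]
    good = sum(1 for r in counted if 200 <= r.status < 400)
    return ratio(good, len(counted))


def latency_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    # Denominator: completed (non-cancelled) requests.
    # Numerator: requests that finished within the threshold.
    completed = [r for r in requests if not r.cancelled]
    good = sum(1 for r in completed if r.duration_ms <= threshold_ms)
    return ratio(good, len(completed))


# Hypothetical sample window.
window = [
    Request(200, 120.0),
    Request(302, 80.0),
    Request(500, 40.0),
    Request(200, 900.0),
    Request(200, 50.0, cancelled=True),
]
print(availability_sli(window))  # 3 good of 4 counted -> 0.75
print(latency_sli(window))       # 3 within 500 ms of 4 completed -> 0.75
```

Note that the fast 500 response counts as "good" for latency and "bad" for availability. Each SLI answers one question, which is why you need to be able to defend the numerator and denominator of each separately.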

SLO: a target, an error budget, and a policy

An SLO is three things bundled together. The first is obvious: a number, e.g. 99.9%. The second is the error budget: the complement of the SLO (100% minus the target), usually expressed as the time you are allowed to be down. 99.9% over 30 days leaves 0.1% of 43,200 minutes, an error budget of about 43 minutes. The third, and the one most teams skip, is the policy: what happens when you spend the budget? Common policies:

  • Spend < 50% of budget: ship freely, take risks.
  • Spend 50–100%: all engineering work must improve reliability or be explicitly blessed.
  • Spend > 100%: halt feature work until budget recovers. No new releases except fixes.

Without the policy, an SLO is just a wish. With it, you have a mechanism that shifts engineering time to reliability the moment it starts to slip.
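To make the arithmetic and the policy thresholds concrete, here is a rough sketch. The function names and the returned policy strings are illustrative assumptions, and it measures budget spend in minutes of downtime. A request-based SLO would count bad events instead.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: (1 - SLO) * window length."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_spent(downtime_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget consumed so far (1.0 means fully spent)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


def policy(spent: float) -> str:
    # Thresholds mirror the three example policies listed above.
    if spent < 0.5:
        return "ship freely"
    if spent <= 1.0:
        return "reliability work only, unless explicitly approved"
    return "halt feature work until the budget recovers"


print(round(error_budget_minutes(0.999), 1))  # 43.2
print(policy(budget_spent(20, 0.999)))        # ship freely
print(policy(budget_spent(50, 0.999)))        # halt feature work until the budget recovers
```

Twenty minutes of downtime against a 43.2-minute budget is about 46% spent, so the team is still in the first tier. Fifty minutes is over budget, and the policy (not a debate in a meeting) decides what happens next.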

SLA: the contract version, with money attached

An SLA is what you put in a contract. It typically includes:

  • The target (e.g. 99.9% monthly availability).
  • The measurement window (calendar month or rolling 30 days).
  • What counts as downtime and what doesn't (planned maintenance, force majeure, customer-caused outages — see our SLA math post for the typical exclusions).
  • The remedy when targets are missed — usually a service credit on the next invoice, capped at some percentage of the monthly fee.

Two SLA realities to know. First: service credits are almost always capped (often at 25-50% of one month's spend). They're a gesture, not real compensation. Second: SLAs are reactive — they're what you owe the customer after the bad month. SLOs are what your team chases to avoid the bad month in the first place.
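A capped service credit might be computed along these lines. The tier table, the fee, and the default cap are hypothetical values chosen for illustration. Real contracts define their own schedule and exclusions, and this sketch assumes downtime exclusions have already been applied to the availability figure.

```python
# Hypothetical credit schedule: (availability threshold in %, credit as a
# fraction of the monthly fee). Ordered most severe first so the first
# matching tier wins.
CREDIT_TIERS = [
    (95.0, 0.50),
    (99.0, 0.25),
    (99.5, 0.10),
]


def service_credit(
    availability_pct: float,
    monthly_fee: float,
    cap_fraction: float = 0.50,
) -> float:
    """Credit owed for one month, never more than cap_fraction of the fee."""
    rate = 0.0
    for threshold, tier_rate in CREDIT_TIERS:
        if availability_pct < threshold:
            rate = tier_rate
            break
    return min(rate, cap_fraction) * monthly_fee


print(service_credit(99.7, 1000.0))                     # 0.0, target met
print(service_credit(99.3, 1000.0))                     # 100.0, the 10% tier
print(service_credit(94.0, 1000.0, cap_fraction=0.25))  # 250.0, capped at 25%
```

The last call shows why credits are a gesture: a month at 94% availability is roughly 43 hours of downtime, and with a 25% cap the customer still gets back only a quarter of one month's fee.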

Who in your org cares about each one

  • SLI is engineering's internal tool. It lives in monitoring dashboards such as Grafana boards.
  • SLO is a leadership decision. It connects engineering to product priorities. Setting an SLO is committing to a tradeoff between speed and reliability.
  • SLA is a legal/sales artifact. Engineering helps draft it, but it lives in contracts and procurement reviews.

Quick reference

If you remember nothing else from this post:

  • SLI — what is true (a number you measure)
  • SLO — what should be true (a target with a budget and a policy)
  • SLA — what happens if it isn't (a contract clause with a remedy)

If you don't yet have any of the three, start with one SLI for the most customer-visible surface and one SLO for it. The SLA conversation will come up when you sell to enterprise — by then you'll have data to defend whatever number you commit to.