Error budgets explained (Google SRE in 10 minutes)

Uptimera team · 9 min read

Error budgets are Google SRE's most exportable idea. They translate an abstract reliability target into a concrete engineering decision: how much downtime are we allowed this month, and what do we do when we run out? Done right, they end the perennial argument about whether to ship faster or invest in reliability — the budget tells you. This post explains what they are, how to set one without overthinking it, and what policies actually make them useful.

What an error budget is, exactly

An error budget is the complement of an SLO: 100% minus the target. If your SLO says "99.9% of requests succeed," your error budget is "0.1% of requests are allowed to fail." Expressed as time over a 30-day window (43,200 minutes), 99.9% availability leaves an error budget of about 43 minutes per month. That's your downtime allowance; spend it however you want.
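The conversion from an SLO to a time budget is plain arithmetic. Here is a minimal sketch of it; the function name `error_budget_minutes` and its parameters are illustrative, not part of any library, and it assumes the SLO is given as a fraction (0.999 rather than 99.9).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo is a fraction such as 0.999 (assumption: not a percentage).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    for slo in (0.995, 0.999, 0.9995, 0.9999):
        minutes = error_budget_minutes(slo)
        print(f"{slo:.2%} over 30 days -> {minutes:.1f} minutes of budget")
```

Changing `window_days` shows how the same SLO yields a larger absolute budget over a longer window.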

The crucial property: it's a budget, not a target. You don't want to spend zero of it (that usually means you're more reliable than customers need, and probably shipping too slowly). You don't want to spend more than the budget either (that means you're unreliable enough to risk losing customers). The whole point is to operate somewhere in between.

Why budgets are better than "targets"

A reliability target like "99.9% uptime" is descriptive. An error budget is prescriptive — it tells you what to do. The difference:

  • Target: "We want 99.9% uptime." (Now what?)
  • Budget: "We have 43 minutes/month to spend. We spent 35 in week 1. Therefore, for weeks 2-4, we cannot risk more than 8 more minutes. Therefore, the planned database migration must happen in a maintenance window with a guaranteed rollback path."

Targets are about whether you hit the number at month-end. Budgets are about decision-making throughout the month.
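The week-one reasoning above (43 minutes of budget, 35 already spent) can be sketched as a small check to run before a risky change. The helper names and the idea of a "worst-case downtime" estimate are assumptions for illustration; how you estimate that number is up to your team.

```python
def remaining_minutes(budget_minutes: float, spent_minutes: float) -> float:
    """Minutes of budget left in the window; negative means it is overspent."""
    return budget_minutes - spent_minutes


def fits_in_budget(worst_case_minutes: float, budget_minutes: float,
                   spent_minutes: float) -> bool:
    """True if a change's estimated worst-case downtime fits in what is left."""
    return worst_case_minutes <= remaining_minutes(budget_minutes, spent_minutes)


if __name__ == "__main__":
    budget, spent = 43.0, 35.0
    print(remaining_minutes(budget, spent))     # 8.0
    print(fits_in_budget(20.0, budget, spent))  # False: needs a safer plan
    print(fits_in_budget(5.0, budget, spent))   # True
```

A migration with a 20-minute worst case does not fit in the remaining 8 minutes, which is the argument for moving it to a maintenance window with a rollback path.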

Setting an error budget without overthinking it

Don't try to set budgets for every service simultaneously. Start with one: the most customer-facing surface (the API, the login flow, the checkout). Three decisions:

1. What's the SLI?

The thing you're measuring. "Successful HTTP requests (2xx or 3xx) / total non-cancelled requests." See our SLI/SLO/SLA guide for picking a good one.
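As a rough sketch, the SLI quoted above could be computed from a list of response codes like this. It assumes a cancelled request is represented as `None`; your logs will likely mark cancellations differently, so treat that convention as hypothetical.

```python
from typing import Iterable, Optional


def availability_sli(statuses: Iterable[Optional[int]]) -> float:
    """Fraction of non-cancelled requests that returned 2xx or 3xx.

    statuses holds HTTP status codes; None marks a cancelled request
    (an assumption of this sketch) and is excluded from the total.
    """
    good = 0
    total = 0
    for status in statuses:
        if status is None:
            continue  # cancelled: counts neither for nor against
        total += 1
        if 200 <= status < 400:
            good += 1
    if total == 0:
        raise ValueError("no countable requests in this sample")
    return good / total


if __name__ == "__main__":
    sample = [200, 301, 500, None, 404]
    print(availability_sli(sample))  # 0.5
```

Note that this sketch counts 4xx responses as failures, exactly as the SLI definition reads; whether client errors should count against you is a decision to make explicitly.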

2. What's the target (SLO)?

Start with something credible. For most SaaS:

  • 99.5%: internal tools, dashboards. ~3.6 hours/month budget.
  • 99.9%: customer-facing APIs, the standard for most B2B SaaS. ~43 minutes/month.
  • 99.95%: high-stakes surfaces (payments, auth). ~22 minutes/month.
  • 99.99%: only if you've already mastered 99.9%. ~4 minutes/month is a very hard budget to stay within.

Picking too high a number is the most common mistake. A team currently measuring 99.5% does not benefit from setting a 99.99% target: they'll blow the budget every month and the policy will lose credibility.

3. What's the window?

A rolling 30-day window is standard. A quarterly window is fine for less critical services (smoother numbers, fewer false alarms). Hourly or daily windows only make sense for very high-volume services, where even a short window accumulates millions of requests.
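One simple way to implement a rolling window, assuming you already aggregate good and total request counts per day, is to keep the last N daily totals and recompute the SLI from them. The `RollingSLI` class below is a hypothetical sketch, not a reference to any monitoring product's API.

```python
from collections import deque
from typing import Optional


class RollingSLI:
    """Keeps the most recent window_days of (good, total) daily counts."""

    def __init__(self, window_days: int = 30) -> None:
        # deque with maxlen drops the oldest day automatically.
        self._days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        if good < 0 or total < good:
            raise ValueError("need 0 <= good <= total")
        self._days.append((good, total))

    def sli(self) -> Optional[float]:
        """SLI over the current window, or None if no requests were seen."""
        good = sum(g for g, _ in self._days)
        total = sum(t for _, t in self._days)
        return good / total if total else None


if __name__ == "__main__":
    window = RollingSLI(window_days=2)
    window.record_day(90, 100)   # a bad day
    window.record_day(100, 100)
    print(window.sli())          # 0.95
    window.record_day(100, 100)  # the bad day rolls out of the window
    print(window.sli())          # 1.0
```

Summing counts across days, rather than averaging daily percentages, keeps low-traffic days from being weighted the same as busy ones.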

The policy is the entire point

Setting an SLO with an implicit budget is the easy part. The hard part — the part that actually changes engineering behavior — is the policy. What happens at different levels of budget consumption?

A policy that works in practice for a 30-day window:

  • 0-50% consumed: Green. Ship freely. Take risks. Run chaos experiments. Lower-risk deploys can skip senior review.
  • 50-75% consumed: Yellow. Engineering work that improves reliability is prioritized over feature work. Deploys require explicit rollback plans. No experimental infrastructure changes.
  • 75-100% consumed: Orange. Feature work is paused unless approved by the product and engineering leads jointly. All deploys go through senior review.
  • 100%+ consumed: Red. Stop. No feature releases. Engineering effort is fully allocated to reliability improvements until budget recovers. Postmortems for everything that contributed.
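The tiers above map naturally onto a small lookup. This sketch takes the consumed share of the budget as a fraction and returns the tier name. One assumption to flag: the ranges as written share their endpoints (50% appears in both Green and Yellow), so the sketch assigns each boundary value to the stricter tier.

```python
def policy_status(consumed: float) -> str:
    """Map consumed budget (0.5 means 50%) to a policy tier.

    Assumption: boundary values fall into the stricter tier, so exactly
    50% is yellow, exactly 75% is orange, and exactly 100% is red.
    """
    if consumed < 0:
        raise ValueError("consumed cannot be negative")
    if consumed < 0.50:
        return "green"
    if consumed < 0.75:
        return "yellow"
    if consumed < 1.00:
        return "orange"
    return "red"


if __name__ == "__main__":
    # 35 of 43 minutes spent is roughly 81% of the budget.
    print(policy_status(35 / 43))  # orange
    print(policy_status(0.10))     # green
    print(policy_status(1.20))     # red
```

Wiring a function like this into a dashboard or deploy pipeline makes the current tier visible without anyone having to do the arithmetic.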

What happens in reality

Three patterns from teams that have run error budgets for a year or more:

  • Most months, the budget is underspent. Teams targeting 99.9% often measure 99.95%-99.97% in practice. The budget is a ceiling, not a goal, and that's fine.
  • One month a quarter, budget gets tested. A bad deploy, a third-party outage, a rare bug. The policy kicks in for a week or two; reliability work gets prioritized; budget recovers.
  • Once a year, the policy gets challenged. A critical feature must ship despite Red status. This is where leadership earns credibility — either by holding the line (and shipping the feature next sprint) or by explicitly suspending the policy for a documented reason. Both are OK. What's not OK is pretending the budget doesn't exist.

Multiple SLOs: when and how

Once you have one error budget running well, expand. Common patterns at the second-budget stage:

  • Latency SLO alongside availability. "99% of requests complete in < 500ms." Catches the "up but slow" failure mode that pure availability misses.
  • Per-component SLOs when your app has obviously different reliability tiers. Login at 99.95%, dashboard at 99.9%, exports at 99.5%.
  • Per-tier SLOs for B2B customers on different plans. Enterprise contracts that specify 99.95%; free tier on best-effort. Different budgets for each.
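For the latency SLO in the first pattern above, the SLI is the share of requests faster than the threshold. A minimal sketch, assuming you have per-request latencies in milliseconds available as a list (the function names are illustrative):

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float],
                threshold_ms: float = 500.0) -> float:
    """Fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in this sample")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(latencies_ms: Sequence[float], target: float = 0.99,
                      threshold_ms: float = 500.0) -> bool:
    """True if at least `target` of requests beat the threshold."""
    return latency_sli(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    sample = [120.0] * 99 + [900.0]   # one slow request out of 100
    print(latency_sli(sample))        # 0.99
    print(meets_latency_slo(sample))  # True
```

The latency budget is then handled like the availability one: with a 99% target, 1% of requests may be slow before the budget is spent.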

Anti-patterns to avoid

  • Setting the SLO to match what you're already achieving. An SLO is a commitment to a tradeoff, not a description of current reality. Set it based on what customers need, not what the engineering team will hit without effort.
  • Calling product KPIs "SLOs." "Daily active users should be > 10k" is not an SLO. SLOs are about service behavior, not business outcomes.
  • Gaming the measurement. Excluding too many categories of failure ("customer misuse," "upstream issues," etc.) makes the number better but makes the budget meaningless. Your customers don't exclude these from their experience.

Where to go from here

Pick one service, one SLI, one SLO. Run it for a quarter. Decide the policy in advance, write it down, and follow it the first time the budget is threatened. The credibility comes from the first test — once the team sees the budget actually drive prioritization, it becomes a tool. Until then, it's just another number on a dashboard.