Error Budget Policy Template (with Example)

Uptimera team · 8 min read

An error budget without a written policy is a wishbone: everyone makes a quiet wish about what should happen when reliability slips, and nobody agrees. You only find out the wishes diverge during an outage, which is the worst possible time to negotiate. This post is the fix. It's a copy-paste error budget policy you can adopt this quarter — thresholds, actions, owners, and an exceptions clause — plus notes on how to tune and enforce it. If you want the concept first, read error budgets explained; this post assumes you're past that and want the template.

What an error budget policy is

An error budget policy is a short, pre-agreed document that says what the team does at each level of budget consumption, and who gets to decide. The budget itself is just a number: the complement of your SLO (100% minus the target), expressed as allowed downtime (99.9% over 30 days is ~43 minutes). The policy is what turns that number into behavior. It maps "we've burned 80% of the budget" to a concrete change in how you deploy, what you build, and who has to approve it. Without the policy, the budget is a dashboard tile nobody acts on. If the SLO/SLI/SLA vocabulary is fuzzy, our SLA vs SLO vs SLI guide clears it up in a page.
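The arithmetic behind that ~43 minutes is worth seeing once. This is a minimal sketch, assuming a time-based availability SLO expressed as a fraction; the function name `error_budget_minutes` is ours, not from any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60   # 30 days = 43,200 minutes
    return (1.0 - slo) * window_minutes      # the budget is the complement of the SLO


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 1))  # 4.3
```

One extra nine cuts the budget by a factor of ten, which is why the target you pick matters more than any threshold in the policy.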

The error budget policy template

Here it is. Copy it into a doc in your repo, replace the bracketed values, and delete anything that doesn't apply. It's deliberately short — a policy nobody reads is a policy nobody follows.

# Error Budget Policy — [Service Name]

## Scope
- Service: [checkout-api]
- SLI: successful requests / total non-cancelled requests
       (success = HTTP 2xx or 3xx, measured at the load balancer)
- SLO: 99.9% over a rolling 30-day window
- Error budget: 0.1% of requests, ≈ 43 minutes of downtime / 30 days
- Owner (status changes): [Eng lead, checkout team]
- Approver (exceptions): [Product lead] + [Eng lead], jointly

## Budget states (% of the 30-day error budget consumed)

GREEN — 0-50% consumed
  - Ship freely. Normal review process.
  - Chaos experiments and game days are allowed and encouraged.
  - Risky-but-reversible changes are fine during business hours.

YELLOW — 50-75% consumed
  - Reliability work is prioritized ahead of new feature work in
    planning.
  - Every deploy must have a written, tested rollback plan.
  - No experimental infrastructure changes (new DB, new region, etc.).

ORANGE — 75-100% consumed
  - Feature work is PAUSED unless the Product lead AND Eng lead
    approve the specific change, in writing.
  - All deploys go through senior review.
  - A reliability workstream is stood up to recover budget.

RED — over 100% consumed (budget exhausted)
  - Feature releases are FROZEN. No exceptions without the
    exceptions clause below.
  - All available engineering effort goes to reliability until
    rolling budget consumption is back under 100%.
  - A postmortem is required for every incident that contributed to
    the overspend.

## Who decides
- The Owner declares the current state, based on the live budget
  dashboard, at the start of each week and whenever it crosses a
  threshold mid-week.
- The state is posted where the team already looks (the team
  channel + the sprint board header).

## Exceptions clause
- The freeze may be suspended for a single named change when a
  documented business reason justifies the risk (e.g. a contractual
  deadline, a security fix, a revenue-critical launch).
- A suspension requires written sign-off from the Approver(s), a
  stated reason, and an expiry (max 1 sprint).
- Every suspension is logged in this doc's changelog, in the open.
- Suspending the policy is allowed. Ignoring it is not.

## Review
- Revisit the SLO and thresholds once per quarter, using the last
  90 days of actual burn.

Notice what the exceptions clause does: it makes suspension a legitimate, visible move rather than a quiet override. Leadership can say "we're shipping despite Red, here's why, here's who signed it, it expires in one sprint" — and the policy survives. It's the undocumented override that kills it.
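If you want the changelog entry to be checkable rather than free-form, the clause's requirements fit in a small record. This is a rough sketch, not a tool we're prescribing: the `Suspension` type, the role labels, and the 14-day sprint length are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: role labels for the two joint Approvers named in the template.
REQUIRED_APPROVERS = frozenset({"product_lead", "eng_lead"})


@dataclass(frozen=True)
class Suspension:
    change: str            # the single named change being let through
    reason: str            # the documented business reason
    signed_by: frozenset   # roles that signed off in writing
    granted_on: date
    expires_on: date


def suspension_problems(s: Suspension, sprint_days: int = 14) -> list[str]:
    """Return the ways a suspension falls short of the exceptions clause."""
    problems = []
    if not s.change.strip():
        problems.append("no single named change")
    if not s.reason.strip():
        problems.append("no stated reason")
    missing = REQUIRED_APPROVERS - s.signed_by
    if missing:
        problems.append("missing sign-off: " + ", ".join(sorted(missing)))
    if s.expires_on <= s.granted_on:
        problems.append("expiry must be after grant date")
    elif s.expires_on - s.granted_on > timedelta(days=sprint_days):
        problems.append("expiry exceeds one sprint")
    return problems


ok = Suspension("rotate TLS certificate", "security fix",
                frozenset({"product_lead", "eng_lead"}),
                date(2024, 1, 8), date(2024, 1, 19))
print(suspension_problems(ok))  # []
```

An empty list means the suspension is legitimate under the policy. Anything else is the "quiet override" the clause exists to prevent.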

A few things people leave out and regret. Spell out the SLI measurement point: "at the load balancer" versus "from a synthetic probe" produces different numbers, and if you don't settle which one counts in advance, that argument happens mid-incident. State the window explicitly (a rolling 30 days behaves very differently from a calendar month that resets on the 1st). And name what's excluded: scheduled maintenance windows, if any, and failures that are clearly on the customer's side. Keep exclusions short. Every category you carve out makes the number prettier and the budget less honest, and your customers don't exclude those minutes from their experience.
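To make the template's SLI concrete, here is a small sketch of how the ratio and the consumed share of the budget could be computed from raw request counts. It assumes you can pull three counters for the window (total, cancelled, successful) from wherever you measure; the function names are illustrative.

```python
def eligible_requests(total: int, cancelled: int = 0) -> int:
    """Requests that count toward the SLI: everything except cancelled ones."""
    eligible = total - cancelled
    if eligible <= 0:
        raise ValueError("no eligible requests in the window")
    return eligible


def sli(successes: int, total: int, cancelled: int = 0) -> float:
    """Successful requests / total non-cancelled requests."""
    return successes / eligible_requests(total, cancelled)


def budget_consumed(successes: int, total: int, cancelled: int = 0,
                    slo: float = 0.999) -> float:
    """Fraction of the error budget used so far (1.0 means exhausted)."""
    eligible = eligible_requests(total, cancelled)
    failures = eligible - successes
    allowed_failures = (1.0 - slo) * eligible
    return failures / allowed_failures


# 1,000,500 requests, 500 cancelled, 999,400 successful:
# 600 failures against an allowance of about 1,000.
print(round(sli(999_400, 1_000_500, 500), 4))              # 0.9994
print(round(budget_consumed(999_400, 1_000_500, 500), 2))  # 0.6
```

Note how the exclusion works: every request you move into `cancelled` shrinks both the denominator and the failure count, which is exactly why the exclusion list deserves scrutiny.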

Setting the thresholds

The 50/75/100% bands aren't sacred, but they're sensible defaults, and here's the reasoning so you can tune them for your team.

  • 50% (green→yellow) is the halfway warning. If it took you two weeks or less to get here, you're on pace to spend the whole budget inside the 30-day window. That's not an emergency, but it's the moment to stop taking gratuitous risk. Set it lower (say 40%) if your incidents tend to arrive in clusters and you want more runway.
  • 75% (yellow→orange) is where feature velocity should visibly bend. A quarter of the budget left is not much — one bad deploy can burn it. Moving this to 66% is reasonable for high-stakes surfaces like payments or auth.
  • 100% (orange→red) is the hard line. The budget is gone. If you've set your SLO honestly, hitting Red should be rare — a few times a year at most. If you hit Red every single month, the problem is almost never the policy; it's that the SLO is set higher than the service can currently deliver. Lower the target, earn it back, then raise it.
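The bands above reduce to a few comparisons, with the thresholds as tunable parameters. This is a sketch under one assumption the template leaves open: a value sitting exactly on a boundary (say, exactly 50%) takes the stricter state, while Red starts strictly above 100%, matching the template's "over 100%" wording.

```python
def budget_state(consumed: float,
                 yellow: float = 0.50,
                 orange: float = 0.75,
                 red: float = 1.00) -> str:
    """Map the fraction of budget consumed (1.0 = 100%) to a policy state."""
    if consumed < 0:
        raise ValueError("consumed cannot be negative")
    if not yellow < orange < red:
        raise ValueError("thresholds must be strictly increasing")
    if consumed > red:       # template: RED is "over 100% consumed"
        return "RED"
    if consumed >= orange:   # assumption: a boundary value takes the stricter state
        return "ORANGE"
    if consumed >= yellow:
        return "YELLOW"
    return "GREEN"


for pct in (0.30, 0.50, 0.80, 1.00, 1.20):
    print(f"{pct:.0%} -> {budget_state(pct)}")
# 30% -> GREEN
# 50% -> YELLOW
# 80% -> ORANGE
# 100% -> ORANGE
# 120% -> RED
```

Tuning for a payments or auth surface is then a one-argument change: `budget_state(0.70, orange=0.66)` returns `"ORANGE"` where the defaults would still say Yellow.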

Two knobs to consider beyond the bands. First, add a burn-rate alert on top of the consumption bands: burning budget 10x faster than sustainable over the last hour means something is actively on fire right now, regardless of how much budget remains. Second, pick the window deliberately — 30-day rolling is the default, but if your numbers whipsaw, a longer window smooths them. The exact downtime each target buys is worked out in 99.9% vs 99.99% explained if you want the minutes-per-window table.
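For the burn-rate knob, the usual definition is the observed error rate divided by the error rate the SLO allows, so 1x means you'd spend exactly the whole budget over the window. A minimal sketch, assuming you can count failures and eligible requests over the last hour; the 10x threshold is the one from the paragraph above, and the function names are ours.

```python
def burn_rate(failures: int, eligible: int, slo: float = 0.999) -> float:
    """Observed error rate as a multiple of the rate the SLO allows."""
    if eligible <= 0:
        return 0.0  # no traffic in the interval, nothing burned
    return (failures / eligible) / (1.0 - slo)


def is_fast_burn(failures: int, eligible: int,
                 slo: float = 0.999, threshold: float = 10.0) -> bool:
    """True when the recent interval burns budget at or above the threshold."""
    return burn_rate(failures, eligible, slo) >= threshold


# Last hour: 20,000 requests, 300 failed. That's 1.5% against an allowed 0.1%.
print(round(burn_rate(300, 20_000), 1))  # 15.0
print(is_fast_burn(300, 20_000))         # True
```

At a sustained 10x, a 30-day budget is gone in about three days, which is why this alert should fire even when the consumption band still reads green.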

Who decides and how to enforce it

The policy is theater unless someone holds the line. Two roles make or break it. The Owner — usually the service's engineering lead — declares the current state off the live dashboard and posts it where the team already looks. This has to be a named person, not "the team," or the state never gets declared and the policy quietly lapses. The Approver — product and engineering leads jointly — is the only one who can wave a change through Orange or suspend a Red freeze.

Enforcement doesn't require a heavy tool. It requires the state to be visible and the two roles to actually play their part. A budget dashboard, a channel post at the start of each week, and a one-line sign-off in the doc when someone invokes the exceptions clause are the entire machinery. The failure mode is never "we didn't have the right software." It's "nobody wanted to be the person who said no to the launch."

How to roll it out

Do not roll this out across every service at once. That's how you end up with twelve policies nobody follows. Start with one service and one SLO — the most customer-facing surface you have (the API, login, checkout) — and run it for a full quarter.

  • Week 1: fill in the template, agree the SLO, and get the Owner and Approver named out loud in a meeting. Put the doc in the repo.
  • Weeks 2-12: declare the state weekly, even when it's boring green. The habit of declaring is what you're building, not the drama.
  • The first time the budget is tested (a bad deploy, a third-party outage), run the policy exactly as written. This is the trial. If Yellow or Orange actually reorders the sprint, the team learns the budget has teeth.

Only expand after that first real test. Add a latency SLO to catch the "up but slow" failures, split budgets per component, or set per-tier budgets for different customer plans. If your rollout includes deliberately spending budget to learn — a game day, a failover drill — the same green-state permission that allows chaos engineering on a small budget is exactly the mechanism that makes those experiments safe to schedule.

The credibility test

A policy earns its authority the first time it costs someone something. Until the budget has actually paused a feature, delayed a launch, or reordered a sprint, it's just a number on a dashboard that everyone nods at and ignores. The moment the team watches Orange genuinely block work — and watches leadership either hold the line or suspend the freeze in the open, with a reason — the policy becomes a tool the team trusts. Write it down this quarter, pick one service, and follow it the first time it bites. That first enforcement is the entire point.

Uptimera team

We build Uptimera — multi-region uptime monitoring, SSL and DNS checks, and branded status pages. These guides come from running the same monitoring and on-call practices we write about.