Runbooks: a template on-call engineers will actually use

Uptimera team · 9 min read

Most runbooks rot the moment they're written. They're too long, too generic, and date-stamp themselves by referring to infrastructure that's been replaced twice since the doc was last touched. This post is about writing runbooks that survive contact with a real 3am page — short enough to actually read, specific enough to actually use.

What a runbook is for (and isn't)

A runbook is a one-page document that gets a tired on-call engineer from "something is wrong" to "I've contained the impact" in under 10 minutes. It is not:

  • A design doc.
  • A complete architectural overview.
  • The single source of truth about how the service works.
  • A training manual for new hires.

Those are other documents. The runbook's only audience is the on-call engineer who just got paged for this specific alert.

When to write one

Don't pre-write runbooks for every possible alert. Most of those alerts never fire, and the docs go stale. Write a runbook when:

  • The same incident happens twice. The second time you fix it, write down what you did. The third time someone hits it, they should not need to dig through Slack history.
  • The fix involves more than two steps that aren't obvious from the alert text.
  • You're launching new infrastructure with a non-trivial recovery procedure. Database failover, queue migration, secret rotation — situations where the on-call won't have time to learn on the fly.

Don't write runbooks for situations that won't happen again because you've fixed the underlying issue. Document the fix in the postmortem; close the runbook gap by deleting the bug, not by writing a recovery doc.
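The "happens twice" rule can be checked mechanically if you can export your recent pages. Here is a minimal sketch in Python. It assumes a hypothetical export in which each incident is a dict with an `alert` key, plus a set of alert names that already have runbooks. Neither structure comes from any particular paging tool, so adapt the field names to whatever yours produces.

```python
from collections import Counter


def runbook_candidates(incidents, documented, min_repeats=2):
    """Return (alert, count) pairs that fired at least `min_repeats` times
    and have no runbook yet.

    `incidents` is a hypothetical export: a list of dicts with an "alert" key.
    `documented` is the set of alert names that already have a runbook.
    """
    counts = Counter(incident["alert"] for incident in incidents)
    candidates = [
        (alert, n)
        for alert, n in counts.items()
        if n >= min_repeats and alert not in documented
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Illustrative data only.
incidents = [
    {"alert": "db-pool-exhausted"},
    {"alert": "disk-full"},
    {"alert": "db-pool-exhausted"},
    {"alert": "queue-lag"},
    {"alert": "queue-lag"},
    {"alert": "db-pool-exhausted"},
]
print(runbook_candidates(incidents, documented={"queue-lag"}))
# → [('db-pool-exhausted', 3)]
```

Alerts that fired once are left out on purpose: the signal for writing a runbook is repetition, not coverage.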

The one-page template

Everything below fits on one screen. If it doesn't, you're writing the wrong document.

# [Alert name] — Runbook
Last reviewed: 2026-04-12  •  Owner: @team-or-person

## What this alert means
One sentence. What is broken / what fired.

## Impact (if not addressed)
One sentence. What customers see / what data is at risk.

## Quick checks (60 seconds)
- Link to the relevant dashboard
- Link to the relevant log query
- Anything that disambiguates this alert from related ones

## Mitigation (do this first)
1. Step that stops the bleeding (e.g. rollback last deploy, drain node, increase rate limit)
2. Verification step (e.g. dashboard X should show Y within 2 minutes)
3. Stop here if the impact is contained.

## Root cause investigation
- Where to look next (log query, trace query, specific component)
- Common causes (with one-line description each)
- Escalation contact if stuck after 30 minutes

## Notes
- Anything counterintuitive
- Recent changes to this runbook
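
A template this small is easy to check mechanically. The following is a minimal sketch in Python, not part of any existing tool. It assumes runbooks are stored as text using the exact `##` headings above, and it uses a hypothetical limit of 60 lines as a stand-in for "one screen".

```python
REQUIRED_SECTIONS = [
    "What this alert means",
    "Impact (if not addressed)",
    "Quick checks (60 seconds)",
    "Mitigation (do this first)",
    "Root cause investigation",
    "Notes",
]


def lint_runbook(text, max_lines=60):
    """Return a list of problems found in a runbook that follows the template."""
    lines = text.splitlines()
    problems = []

    # Collect "## " headings in the order they appear.
    headings = [line[3:].strip() for line in lines if line.startswith("## ")]

    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")

    # Required sections that are present must appear in template order,
    # which also enforces that mitigation comes before investigation.
    present = [h for h in headings if h in REQUIRED_SECTIONS]
    expected = [s for s in REQUIRED_SECTIONS if s in present]
    if present != expected:
        problems.append("sections out of order")

    if "Last reviewed:" not in text:
        problems.append("missing 'Last reviewed:' line")

    # max_lines is an assumed proxy for "fits on one screen".
    if len(lines) > max_lines:
        problems.append(f"too long: {len(lines)} lines (limit {max_lines})")

    return problems


# A deliberately broken example: investigation before mitigation, no Notes.
sample = "\n".join([
    "# Example alert runbook",
    "Last reviewed: 2026-04-12",
    "## What this alert means",
    "## Impact (if not addressed)",
    "## Quick checks (60 seconds)",
    "## Root cause investigation",
    "## Mitigation (do this first)",
])
print(lint_runbook(sample))
# → ['missing section: Notes', 'sections out of order']
```

Extra sections are ignored, so a runbook that adds its own heading still passes as long as the required ones are present and in order.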

Anti-patterns that kill runbooks

The wall of text

A runbook that takes 15 minutes to read is a runbook nobody reads. The 3am on-call will skim the first paragraph, give up, and either fix it from memory (works for incidents they've seen before) or page senior engineering (which the runbook was supposed to prevent).

The shopping list of links

A runbook that reads "Step 1: look at this dashboard. Step 2: look at this other dashboard. Step 3: look at this log query." and never tells the reader what to do with what they see is a list of links, not a runbook. Pages tell you what to look at; runbooks tell you what to look for and what to do when you see it.

Mitigation buried after investigation

Mitigation comes first. Always. Stop the bleeding, then diagnose. A runbook that walks through five debugging steps before getting to "just roll back the deploy" will get someone to page the wrong person while customers are suffering.

Commands that are 12 months old

A runbook that includes a command like kubectl exec -it pod-abc-123 -- ... with a hardcoded pod name is out of date almost immediately: the pod may be replaced an hour after the runbook is written. Use parameterized examples or the deploy tool's canonical commands, not screenshots of someone's working shell.
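
Hardcoded pod names can be caught before they reach a runbook. The sketch below is a heuristic in Python, offered as an illustration rather than a complete parser. It only looks at `kubectl exec` and `kubectl logs`, it does not handle global flags placed before the verb, and the function name and the list of value-taking flags are assumptions made for this example.

```python
import shlex

# Flags that take a separate value, so the token after them is not the target.
VALUE_FLAGS = {"-n", "--namespace", "-c", "--container", "-l", "--selector"}


def hardcoded_pod_target(command):
    """Return the pod name if a `kubectl exec` or `kubectl logs` command
    targets a literal pod, else None.

    Heuristic only: targets such as deploy/api, $POD or <pod> are treated
    as parameterized and are not reported.
    """
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] != "kubectl" or tokens[1] not in {"exec", "logs"}:
        return None

    skip_next = False
    for token in tokens[2:]:
        if skip_next:
            skip_next = False
            continue
        if token == "--":  # everything after this runs inside the container
            break
        if token.startswith("-"):
            skip_next = token in VALUE_FLAGS
            continue
        # The first positional argument is the target.
        if "/" in token or token.startswith(("$", "<")):
            return None
        return token
    return None


print(hardcoded_pod_target("kubectl exec -it pod-abc-123 -- sh"))
# → pod-abc-123
print(hardcoded_pod_target("kubectl logs deploy/api --tail=50"))
# → None
```

Targeting `deploy/api` or a label selector lets kubectl resolve a current pod at run time, which is why those forms are not flagged.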

Where to store them

Three options, in order of preference:

  • In the alert itself. Most modern paging systems (PagerDuty, Opsgenie) let you attach a runbook URL to each alert. The on-call clicks it straight from the page. Highest discoverability.
  • In a Git repo next to the code. A runbooks/ directory at the service root, markdown files per alert. Edits go through code review, so the runbook stays close to the code that triggers it.
  • In Notion or Confluence. Acceptable if your team uses it heavily. Risk: search rots faster than Git history; pages move; the URL in the alert link breaks.

Keeping them current

Stale runbooks are worse than missing runbooks. A doc that tells you to run a command that doesn't exist anymore actively misleads. Three practices that keep them honest:

  • Touch the runbook every time you use it. Add a one-line note: "Used 2026-04-12, works as written." or "Used 2026-04-12: step 3 no longer applies, removed." This compounds: every page makes the runbook better for the next reader.
  • Quarterly review. Top 5 most-referenced runbooks get re-read by the team. Anything stale gets fixed or deleted.
  • Add the "last reviewed" date to the top. A runbook last reviewed 18 months ago needs a once-over before you trust it. The date signals whether to read it confidently or suspiciously.
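
The "last reviewed" date is also machine-readable, so a script can flag runbooks that are due for a review. Here is a minimal sketch in Python, assuming the `Last reviewed: YYYY-MM-DD` header format from the template. The 180-day threshold is an arbitrary assumption; pick whatever matches your review cadence.

```python
import re
from datetime import date

LAST_REVIEWED = re.compile(r"Last reviewed:\s*(\d{4})-(\d{2})-(\d{2})")


def days_since_review(text, today):
    """Return days since the runbook's 'Last reviewed' date, or None if absent."""
    match = LAST_REVIEWED.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return (today - date(year, month, day)).days


def is_stale(text, today, max_age_days=180):
    """Treat a runbook as stale if it has no review date or the date is too old."""
    age = days_since_review(text, today)
    return age is None or age > max_age_days


header = "Last reviewed: 2026-04-12  •  Owner: @platform"
print(days_since_review(header, date(2026, 7, 11)))  # → 90
print(is_stale(header, date(2026, 7, 11)))           # → False
print(is_stale(header, date(2027, 10, 12)))          # → True
```

A runbook with no date at all is reported as stale, on the view that an undated document deserves the same suspicion as an old one.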

A worked example

Here is what a runbook looks like for a concrete alert. Imagine the alert is "database connection pool exhausted":

# DB pool exhausted — Runbook
Last reviewed: 2026-04-12  •  Owner: @platform

## What this alert means
The application can't get a database connection from the pool. New requests are failing.

## Impact (if not addressed)
Customer-facing 503s. Within 2 minutes, sign-ups, checkouts, and dashboard loads all fail.

## Quick checks (60 seconds)
- Dashboard: Grafana > "DB connections" — is total connections at the cap?
- Log query: kubectl logs deploy/api --tail=50 | grep "pool exhausted"
- Is there a slow query running? Check pg_stat_activity in the DB.

## Mitigation (do this first)
1. Increase pool size: kubectl set env deploy/api DB_POOL_SIZE=40 (current is 20)
2. Watch the dashboard — connection count should normalize within 60 seconds.
3. If it doesn't, restart the API pods: kubectl rollout restart deploy/api
4. Stop here if impact is contained.

## Root cause investigation
- Long-running query? Check the slow query log.
- Recent deploy that introduced a connection leak? Check git log on the API repo.
- Spike in legitimate traffic? Check traffic dashboard.

## Escalation
If the pool is still exhausted after step 3, page @db-team via Slack /pagerduty trigger.

## Notes
- DB_POOL_SIZE is capped at 50 (Postgres max_connections is 100, leaving headroom for migrations).
- This alert fired 3x in March 2026; all 3 were caused by a specific report query — see PR #4521.

Where to go from here

Look at your most recent five paged incidents. For each: was there a runbook? If yes, did it help? If no, is this alert one of those repeating ones worth documenting? Write the highest-value runbook this week. Don't try to cover everything — coverage comes from incidents repeating, and runbooks should follow that signal.