How to set up an on-call rotation without burning out your team

Uptimera team · 10 min read

On-call is the engineering practice most likely to cause attrition. It is also unavoidable for any team running a real service. The gap between a humane on-call rotation and a soul-grinding one is almost entirely about policy choices the team makes before anyone gets paged. This post walks through those choices.

Why on-call quality is a retention problem

Two truths the industry doesn't talk about enough. First: engineers who interview elsewhere disproportionately cite on-call as the reason. Second: the engineers who stay through a bad on-call culture are the ones least likely to push back on it. By the time leadership realizes there's a problem, the people who would have raised it have left.

Treat on-call as an explicit operational program with policies, not as a thing engineering just "does." Below is the minimum viable set of policies.

Shift length: weekly is the sweet spot

The three common patterns and their tradeoffs:

  • Daily shifts. Lowest stress per shift, but switching in and out of on-call mode every day carries a high cognitive cost. Workable for very small teams (3-4 engineers), where a weekly rotation would put each person on call one week in every three or four.
  • Weekly shifts. The standard for most teams. Long enough that the on-call engineer gets familiar with the current state of the system; short enough that one bad week doesn't consume someone's month. Strongly preferred.
  • Two-week shifts. Common at large companies because rotation math is easier. Brutal on the individual. Avoid unless you have an unusually large rotation (15+ engineers) where it's the only way to keep frequency tolerable.
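
The tradeoff between shift length and team size is mostly arithmetic. A rough sketch of how often each engineer ends up as primary; the function names and the example team sizes are illustrative, not taken from any scheduling tool:

```python
def weeks_between_shifts(team_size: int, shift_days: int) -> float:
    """Weeks from the start of one primary shift to the start of the same engineer's next."""
    if team_size < 1 or shift_days < 1:
        raise ValueError("team_size and shift_days must be positive")
    return team_size * shift_days / 7


def on_call_fraction(team_size: int) -> float:
    """Share of calendar time each engineer spends as primary."""
    if team_size < 1:
        raise ValueError("team_size must be positive")
    return 1 / team_size


# Example team sizes are assumptions chosen to match the three patterns above.
for team_size, shift_days in [(4, 1), (8, 7), (15, 14)]:
    gap = weeks_between_shifts(team_size, shift_days)
    share = on_call_fraction(team_size)
    print(f"{team_size} engineers, {shift_days}-day shifts: "
          f"primary every {gap:.1f} weeks, {share:.0%} of the time")
```

Note that `on_call_fraction` does not depend on shift length at all: changing the shift length only changes how the same total burden is chunked, which is why the choice is about recovery time and context rather than total load.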

Primary and secondary, always

Solo on-call is an anti-pattern. There must always be a backup. The secondary doesn't need to be at the keyboard within a minute. Their job is to:

  • Take over if the primary doesn't respond within X minutes (typical: 15 minutes for SEV1, 30 for SEV2).
  • Provide a second brain during long incidents. The primary debugs; the secondary watches for things the primary missed, coordinates with other teams, posts status updates.
  • Cover transient unavailability — the primary lost cell signal, is driving, etc.

A common mistake: making the secondary the same person every week (typically the tech lead). This burns one person out instead of two. Rotate the secondary independently of the primary.
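
One simple way to keep the secondary from sticking to a single person is to draw both roles from the same roster with an offset. This is a minimal sketch, assuming weekly shifts; the roster names, the `build_schedule` helper, and the default offset of half the team are all hypothetical choices, and most paging tools can express the same thing as two schedule layers.

```python
from datetime import date, timedelta


def build_schedule(engineers, start, weeks, secondary_offset=None):
    """Return a list of (week_start, primary, secondary) tuples for weekly shifts."""
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    if secondary_offset is None:
        # Assumption: offset by half the roster so backup duty lands
        # far away from each person's own primary week.
        secondary_offset = max(1, n // 2)
    if secondary_offset % n == 0:
        raise ValueError("offset would make someone their own backup")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + secondary_offset) % n]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


roster = ["ana", "ben", "chen", "dee"]  # hypothetical names
for week_start, primary, secondary in build_schedule(roster, date(2024, 1, 1), 4):
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

With this layout every engineer serves as secondary exactly as often as they serve as primary, and nobody backs themselves up.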

The handoff: a written note plus a 10-minute meeting

Every shift change needs a structured handoff. Five questions the outgoing on-call answers, in writing, that the incoming on-call reads and acknowledges:

  • What broke this week? Brief list with links to alerts/Slack threads.
  • What is currently degraded or unresolved? Anything ongoing.
  • What deploys happened or are planned? Particularly database migrations, infra changes, and dependency upgrades.
  • What alerts are noisy and need tuning? Flag them so the incoming on-call doesn't get paged for the same known noise.
  • Anything to watch for? Customer escalations, planned outages on dependencies, etc.

This usually fits in a Slack message. A weekly 10-minute live handoff (sync, not async) is worth scheduling — it catches the things people forget to write down.
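
If it helps to make the written part harder to skip, the five questions can be turned into a small template. The sketch below is one possible shape, not a prescribed format: the `HandoffNote` class, its field names, and the sample entries are all made up for illustration. It renders plain text that could be pasted into a Slack message.

```python
from dataclasses import dataclass, field

# Section titles mirror the five handoff questions, in order.
SECTIONS = [
    ("broke", "What broke this week?"),
    ("unresolved", "What is currently degraded or unresolved?"),
    ("deploys", "What deploys happened or are planned?"),
    ("noisy_alerts", "What alerts are noisy and need tuning?"),
    ("watch_for", "Anything to watch for?"),
]


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    broke: list[str] = field(default_factory=list)
    unresolved: list[str] = field(default_factory=list)
    deploys: list[str] = field(default_factory=list)
    noisy_alerts: list[str] = field(default_factory=list)
    watch_for: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render plain text; empty sections are stated explicitly, never omitted."""
        lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
        for attr, title in SECTIONS:
            lines.append(title)
            items = getattr(self, attr) or ["(nothing to report)"]
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


note = HandoffNote(
    outgoing="ana",  # hypothetical names and entries
    incoming="ben",
    unresolved=["Search latency elevated since Tuesday"],
    deploys=["Database migration planned for Thursday"],
)
print(note.render())
```

Printing "(nothing to report)" for empty sections is deliberate: it distinguishes "nothing happened" from "the outgoing on-call forgot to fill this in".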

A clear severity ladder

Don't let "everything is urgent" happen. A three-tier ladder is usually enough:

  • SEV1 — Critical. Service is down or severely degraded for most users. Page primary immediately. Escalate to secondary in 15 minutes if no ack. Engineering manager looped in.
  • SEV2 — Major. A subset of users or a non-critical feature is affected. Page primary during business hours; quieter alerting overnight (Slack message, no SMS) unless escalated.
  • SEV3 — Minor. Something to look at but not urgent. No page; create a ticket. Should never wake anyone up.

Define the boundary between SEV2 and SEV3 carefully: it's where alert fatigue starts. If your team thinks of half of SEV2 as "could have been SEV3," you have a tuning problem — see our alert fatigue post.
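
The ladder is easier to audit when it is written down as a routing rule rather than left as tribal knowledge. A minimal sketch of the three tiers as described above; the business-hours window (09:00 to 18:00), the `route` function, and the returned channel names are assumptions, and weekends and time zones are ignored for brevity.

```python
from datetime import time
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Assumption: business hours are 09:00-18:00 in the on-call engineer's local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def route(severity: Severity, local_time: time, escalated: bool = False) -> str:
    """Return 'page', 'slack', or 'ticket' for an alert of the given severity."""
    if severity is Severity.SEV1:
        return "page"  # always page, day or night
    if severity is Severity.SEV2:
        in_hours = BUSINESS_START <= local_time < BUSINESS_END
        # Overnight SEV2 stays quiet unless someone has escalated it.
        return "page" if in_hours or escalated else "slack"
    return "ticket"  # SEV3 never wakes anyone up


print(route(Severity.SEV2, time(14, 30)))
print(route(Severity.SEV2, time(3, 0)))
print(route(Severity.SEV3, time(3, 0)))
```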

Compensation: pay for it

Carrying a pager is work. Treating it as "just part of the job" is how teams accumulate quiet resentment. The cleanest models:

  • Stipend per shift. A flat amount for being on-call (e.g. $200/week). Pays for the burden of carrying the phone whether or not you're paged.
  • Per-page compensation. An additional amount for each actual page outside business hours. Aligns incentives: noisy alerting costs the company more, so it gets attention faster.
  • Time-in-lieu. Comp time for hours spent dealing with overnight pages. Less clean financially but works in regions where stipends raise legal questions.
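
To see how the first two models combine, here is a small worked example. The $200 weekly stipend is the figure used above; the $25 per-page rate is purely an assumption for illustration, as is the function name.

```python
def weekly_on_call_pay(after_hours_pages: int,
                       stipend: float = 200.0,
                       per_page: float = 25.0) -> float:
    """Flat stipend plus a per-page amount for pages outside business hours.

    The per_page default is a made-up illustrative rate, not a recommendation.
    """
    if after_hours_pages < 0:
        raise ValueError("after_hours_pages cannot be negative")
    return stipend + per_page * after_hours_pages


quiet_week = weekly_on_call_pay(0)
noisy_week = weekly_on_call_pay(6)
print(f"quiet week: ${quiet_week:.2f}, noisy week: ${noisy_week:.2f}")
```

The gap between the quiet and noisy week is the part that shows up in a budget, which is what makes the per-page component useful as a signal.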

Escalation paths must be clear

At 3am, the on-call should never have to wonder who to call next. The escalation chain — written, in the runbook — looks like:

  • Primary on-call (15 min ack)
  • Secondary on-call (15 min ack)
  • Engineering manager (30 min ack)
  • Director / on-call lead (60 min ack)
  • VP / CTO (only for SEV1 not resolving in 2+ hours)

Two rules. First: everyone named above the primary must have explicitly opted in. Don't volunteer your manager without asking. Second: skipping levels is allowed for serious incidents. A SEV1 affecting most customers should page everyone simultaneously rather than waiting out each tier's ack window.
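
The chain can be written as data plus a small function that answers "who should have been paged by now?". This is a sketch under one reading of the list above: each ack window starts when that level is paged, so the windows accumulate. The `paged_so_far` helper is hypothetical, and the VP / CTO step is left out because it is triggered by incident duration rather than by a missed ack.

```python
# (level, minutes to acknowledge before the next level is paged)
CHAIN = [
    ("primary on-call", 15),
    ("secondary on-call", 15),
    ("engineering manager", 30),
    ("director / on-call lead", 60),
]


def paged_so_far(minutes_unacked: int, page_everyone: bool = False) -> list[str]:
    """Levels that should have been paged after this many minutes with no ack."""
    if page_everyone:
        # Serious SEV1: skip the ladder and page every level at once.
        return [name for name, _ in CHAIN]
    paged = []
    paged_at = 0  # minute at which the current level gets paged
    for name, ack_window in CHAIN:
        if minutes_unacked < paged_at:
            break
        paged.append(name)
        paged_at += ack_window
    return paged


for minute in (0, 15, 30, 60):
    print(minute, paged_so_far(minute))
```

Under this reading the engineering manager is paged at minute 30 and the director at minute 60; if your paging tool measures windows differently, adjust the data rather than the prose in the runbook.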

Onboarding new engineers to the rotation

Don't throw new hires into the pager. A reasonable ramp:

  • Weeks 0-4: Shadow only. Not paged for anything. Read past postmortems. Run the runbooks against a staging environment to make sure they work.
  • Weeks 5-8: Secondary on-call. Always paired with a senior primary. Gets all the pages but isn't responsible for resolving them solo.
  • Weeks 9+: Primary on-call, once the team has explicitly confirmed they're ready. Senior secondary for the first few shifts.

Tooling: what you need and don't

For teams up to about 20 engineers, the stack is:

  • A paging tool. PagerDuty, Opsgenie, Better Stack — all fine. Don't roll your own.
  • A monitoring system that can route to the paging tool via signed webhooks. Don't pick a monitoring system that lacks an integration with your paging tool: you will end up re-implementing escalation yourself, and doing it badly is a classic mistake.
  • A shared inbox / Slack channel for alerts that should be visible but not page anyone. The kitchen sink of "something looks off" goes here.
  • A runbook system. Could be Notion, Confluence, a Git repo. The format matters less than the discipline to keep runbooks current. See our runbooks template.
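
For context on what "signed webhooks" buys you: the receiver can check that an alert really came from the monitoring system before acting on it. The sketch below assumes one common scheme, an HMAC-SHA256 of the raw request body with a shared secret, sent as a hex string. Real vendors differ in header names, encodings, and timestamp handling, so treat the `sign` and `verify` helpers as hypothetical and follow your tool's documentation for the actual format.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signing scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_hex.encode("utf-8"))


secret = b"example-shared-secret"  # hypothetical; load from a secret store in practice
body = b'{"alert": "api latency high", "severity": "SEV2"}'
signature = sign(secret, body)

print(verify(secret, body, signature))
print(verify(secret, body + b" ", signature))
```

The comparison uses `hmac.compare_digest` rather than `==` so that the check does not leak timing information about how much of the signature matched.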

Where to go from here

Audit your current rotation against the checklist: shift length, primary/secondary, handoff, severity tiers, escalation chain, compensation, onboarding, tooling. Almost every team has at least one gap. Closing two of them in the next quarter is likely to do more than most reliability engineering projects, because it's what lets the engineers doing reliability engineering stay healthy long enough to do it.