How to set up an on-call rotation without burning out your team
On-call is the engineering practice most likely to cause attrition. It is also unavoidable for any team running a real service. The gap between a humane on-call rotation and a soul-grinding one is almost entirely about policy choices the team makes before anyone gets paged. This post walks through those choices.
Why on-call quality is a retention problem
Two truths the industry doesn't talk about enough. First: engineers who interview elsewhere disproportionately cite on-call as the reason. Second: the engineers who stay through a bad on-call culture are the ones least likely to push back on it. By the time leadership realizes there's a problem, the people who would have raised it have left.
Treat on-call as an explicit operational program with policies, not as a thing engineering just "does." Below is the minimum viable set of policies.
Shift length: weekly is the sweet spot
The three common patterns and their tradeoffs:
- Daily shifts. Lowest stress per shift, but the cognitive overhead of context-switching to on-call mode every day is high. Workable for very small teams (3-4 engineers) who'd otherwise rotate too rarely.
- Weekly shifts. The standard for most teams. Long enough that the on-call engineer gets familiar with the current state of the system; short enough that one bad week doesn't consume someone's month. Strongly preferred.
- Two-week shifts. Common at large companies because rotation math is easier. Brutal on the individual. Avoid unless you have an unusually large rotation (15+ engineers) where it's the only way to keep frequency tolerable.
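The rotation math behind these tradeoffs is easy to check for your own team. The sketch below assumes a single rotation in which every engineer takes an equal turn. The function name and the six-person weekly example are illustrative, not part of any tool.

```python
def weeks_between_shifts(team_size: int, shift_days: int) -> float:
    """Weeks from the start of one engineer's shift to the start of their next."""
    if team_size < 1 or shift_days < 1:
        raise ValueError("team_size and shift_days must be positive")
    # One full cycle through the team takes team_size * shift_days days.
    return team_size * shift_days / 7


# The three patterns above, with example team sizes (assumed for illustration).
for label, team_size, shift_days in [
    ("daily, 4 engineers", 4, 1),
    ("weekly, 6 engineers", 6, 7),
    ("two-week, 15 engineers", 15, 14),
]:
    gap = weeks_between_shifts(team_size, shift_days)
    print(f"{label}: a shift every {gap:.1f} weeks")
```

Running it with your real team size shows quickly whether a pattern leaves people on call too often or too rarely to stay familiar with the system.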
Primary and secondary, always
Solo on-call is an anti-pattern. There must always be a backup. The secondary doesn't need to be at the keyboard within a minute. Their job is:
- Take over if the primary doesn't respond within the agreed ack window (typically 15 minutes for SEV1, 30 for SEV2).
- Provide a second brain during long incidents. The primary debugs; the secondary watches for things the primary missed, coordinates with other teams, posts status updates.
- Cover transient unavailability — the primary lost cell signal, is driving, etc.
A common mistake: making the secondary the same person every week (typically the tech lead). This burns one person out instead of two. Rotate the secondary independently of the primary.
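One simple way to rotate the secondary independently is to run it through the same list of engineers at a fixed offset from the primary. This is a minimal sketch under that assumption. The `secondary_offset` parameter and the names are hypothetical, and most paging tools can express the same thing as two schedule layers.

```python
def build_schedule(engineers: list[str], weeks: int, secondary_offset: int = 2):
    """Return (week, primary, secondary) tuples for a weekly rotation."""
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    if secondary_offset % n == 0:
        raise ValueError("offset would make one person primary and secondary at once")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        # The secondary walks the same list, shifted, so the role rotates too.
        secondary = engineers[(week + secondary_offset) % n]
        schedule.append((week, primary, secondary))
    return schedule


for week, primary, secondary in build_schedule(["ana", "ben", "chi", "dev"], weeks=4):
    print(f"week {week}: primary={primary} secondary={secondary}")
```

An offset of 1 means this week's secondary is next week's primary, which puts each person on duty two weeks running. A larger offset spreads the two roles apart.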
The handoff: a written note and a 10-minute meeting
Every shift change needs a structured handoff. Five questions the outgoing on-call answers, in writing, that the incoming on-call reads and acknowledges:
- What broke this week? Brief list with links to alerts/Slack threads.
- What is currently degraded or unresolved? Anything ongoing.
- What deploys happened or are planned? Particularly database migrations, infra changes, and dependency upgrades.
- What alerts are noisy and need tuning? Flag them so the new on-call doesn't re-page on the same thing.
- Anything to watch for? Customer escalations, planned outages on dependencies, etc.
The written handoff usually fits in a Slack message. A weekly 10-minute live handoff (synchronous, not async) is still worth scheduling on top of it, because it catches the things people forget to write down.
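If you want the written handoff to be consistent from week to week, the five questions can be turned into a small template. The sketch below is one possible shape. The class and field names are made up for illustration, and the output is plain text you could paste into Slack.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    """One field per handoff question; each holds short free-text items."""
    broke: list[str] = field(default_factory=list)
    unresolved: list[str] = field(default_factory=list)
    deploys: list[str] = field(default_factory=list)
    noisy_alerts: list[str] = field(default_factory=list)
    watch_for: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("What broke this week?", self.broke),
            ("What is currently degraded or unresolved?", self.unresolved),
            ("What deploys happened or are planned?", self.deploys),
            ("What alerts are noisy and need tuning?", self.noisy_alerts),
            ("Anything to watch for?", self.watch_for),
        ]
        lines = []
        for title, items in sections:
            lines.append(title)
            # Print an explicit placeholder so an empty answer is still an answer.
            lines.extend(f"  - {item}" for item in (items or ["(nothing to report)"]))
        return "\n".join(lines)


note = HandoffNote(broke=["Checkout latency alert, Tuesday"], deploys=["DB migration Thursday"])
print(note.render())
```

Rendering empty sections explicitly makes it obvious when the outgoing on-call skipped a question rather than had nothing to say.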
A clear severity ladder
Don't let everything become urgent. A three-tier ladder is usually enough:
- SEV1 — Critical. Service is down or severely degraded for most users. Page primary immediately. Escalate to secondary in 15 minutes if no ack. Engineering manager looped in.
- SEV2 — Major. A subset of users or a non-critical feature is affected. Page primary during business hours; quieter alerting overnight (Slack message, no SMS) unless escalated.
- SEV3 — Minor. Something to look at but not urgent. No page; create a ticket. Should never wake anyone up.
Define the boundary between SEV2 and SEV3 carefully: it's where alert fatigue starts. If your team thinks of half of SEV2 as "could have been SEV3," you have a tuning problem — see our alert fatigue post.
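The ladder is easier to keep honest when the routing rules are written down as code or config rather than held in people's heads. Here is a minimal sketch of the three tiers as a routing function. The action names and the 09:00 to 17:00 business hours are assumptions, and weekends are ignored for brevity.

```python
from datetime import time
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Assumed business hours; adjust to your team's local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def route_alert(severity: Severity, local_time: time) -> list[str]:
    """Return the notification actions for an alert at the given local time."""
    if severity is Severity.SEV1:
        # Always page, and loop in the engineering manager.
        return ["page_primary", "notify_engineering_manager"]
    if severity is Severity.SEV2:
        in_hours = BUSINESS_START <= local_time < BUSINESS_END
        # Overnight SEV2s get a quieter channel unless someone escalates.
        return ["page_primary"] if in_hours else ["slack_message"]
    # SEV3 never pages anyone.
    return ["create_ticket"]


print(route_alert(Severity.SEV2, time(3, 0)))
```

In practice this logic usually lives in the paging tool's routing rules. The point of the sketch is that each tier maps to exactly one behavior.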
Compensation: pay for it
Carrying a pager is work. Treating it as "just part of the job" is how teams accumulate quiet resentment. The cleanest models:
- Stipend per shift. A flat amount for being on-call (e.g. $200/week). Pays for the burden of carrying the phone whether or not you're paged.
- Per-page compensation. Additional amount per actual page outside business hours. Aligns incentives: noisy alerting now costs the company money, so it gets attention faster.
- Time-in-lieu. Comp time for hours spent dealing with overnight pages. Less clean financially but works in regions where stipends raise legal questions.
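The first two models combine naturally, and the arithmetic is worth showing to whoever owns the budget. A rough sketch, using the $200/week stipend from the example above and a per-page amount of $50 that is purely hypothetical:

```python
def weekly_oncall_pay(after_hours_pages: int,
                      stipend: float = 200.0,
                      per_page: float = 50.0) -> float:
    """Flat stipend for carrying the pager plus a bonus per after-hours page."""
    if after_hours_pages < 0:
        raise ValueError("after_hours_pages cannot be negative")
    return stipend + per_page * after_hours_pages


# A quiet week versus a noisy one (page counts are made-up examples).
for pages in (0, 3, 10):
    print(f"{pages} after-hours pages: ${weekly_oncall_pay(pages):.2f}")
```

The gap between the quiet week and the noisy week is the figure that gets alert tuning prioritized.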
Escalation paths must be clear
At 3am, the on-call should never have to wonder who to call next. The escalation chain — written, in the runbook — looks like:
- Primary on-call (15 min ack)
- Secondary on-call (15 min ack)
- Engineering manager (30 min ack)
- Director / on-call lead (60 min ack)
- VP / CTO (only for SEV1 not resolving in 2+ hours)
Two rules. First: every level above primary must have explicit opt-in. Don't volunteer your manager without asking. Second: skipping levels is allowed for serious incidents. A SEV1 affecting most customers should page everyone simultaneously rather than waiting out each tier's ack window.
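The chain above can be sketched as a small function that answers "who should have been paged by now?". This is an illustration, not a replacement for your paging tool's escalation policy. The role names and the `page_all` flag are assumptions, and for simplicity it treats time without an ack and time unresolved as the same clock. The four ack windows add up to 120 minutes, which lines up with the 2-hour threshold for the VP / CTO.

```python
# (role, minutes that role has to ack before the next tier is paged)
CHAIN = [
    ("primary", 15),
    ("secondary", 15),
    ("engineering_manager", 30),
    ("director", 60),
]
VP_AFTER_MINUTES = 120  # VP / CTO only for SEV1s open for 2+ hours


def paged_roles(minutes_open: int, is_sev1: bool = False, page_all: bool = False) -> list[str]:
    """Roles that should have been paged after `minutes_open` with no ack."""
    if page_all:
        # Widespread SEV1: skip the waiting and page every tier at once.
        roles = [role for role, _ in CHAIN]
    else:
        roles = []
        tier_start = 0
        for role, ack_window in CHAIN:
            if minutes_open >= tier_start:
                roles.append(role)
            tier_start += ack_window
    if is_sev1 and minutes_open >= VP_AFTER_MINUTES:
        roles.append("vp_or_cto")
    return roles


print(paged_roles(20))
print(paged_roles(5, is_sev1=True, page_all=True))
```

Writing the chain as data makes it easy to review in a pull request when someone joins or leaves a tier.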
Onboarding new engineers to the rotation
Don't throw new hires into the pager. A reasonable ramp:
- Weeks 0-4: Shadow only. The new hire is not paged for anything. They read past postmortems and run the runbooks against a staging environment to make sure they work.
- Weeks 4-8: Secondary on-call. Always paired with a senior primary. Gets all the pages but isn't responsible for resolving them solo.
- Weeks 8+: Primary on-call, with the team's explicit confidence they're ready. Senior secondary for the first few shifts.
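If your scheduling is scripted, the ramp can be enforced as a guard so nobody lands on the rotation early by accident. A minimal sketch: the function name is hypothetical, and it assumes week 4 counts as the start of the secondary stage and week 8 as the earliest primary shift.

```python
def rotation_role(weeks_on_team: int, team_signed_off: bool = False) -> str:
    """Most senior on-call role a new engineer may hold right now."""
    if weeks_on_team < 0:
        raise ValueError("weeks_on_team cannot be negative")
    if weeks_on_team < 4:
        return "shadow"
    # Time served is not enough for primary: the team must also agree.
    if weeks_on_team < 8 or not team_signed_off:
        return "secondary"
    return "primary"


for weeks, signed_off in [(2, False), (6, False), (10, False), (10, True)]:
    print(weeks, signed_off, rotation_role(weeks, signed_off))
```

The sign-off flag matters more than the week count. Someone at week 10 without the team's confidence stays secondary.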
Tooling: what you need and don't
For teams up to about 20 engineers, the stack is:
- A paging tool. PagerDuty, Opsgenie, Better Stack — all fine. Don't roll your own.
- A monitoring system that can route to the paging tool via signed webhooks. Don't pick a monitoring system that lacks this integration: you end up re-implementing escalation yourself, badly, which is a classic mistake.
- A shared inbox / Slack channel for alerts that should be visible but not page anyone. The kitchen sink of "something looks off" goes here.
- A runbook system. Could be Notion, Confluence, a Git repo. The format matters less than the discipline to keep runbooks current. See our runbooks template.
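For readers who haven't met signed webhooks: the sender computes a keyed hash of the request body with a shared secret, and the receiver recomputes it before trusting the alert. The sketch below shows the general idea with HMAC-SHA256 and a hex-encoded signature. Real vendors differ in header names, encodings, and what exactly is signed, so treat this scheme as an assumption and follow your tool's documentation.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """True if the signature matches; uses a constant-time comparison."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"example-shared-secret"  # hypothetical; load from a secret store in practice
body = b'{"alert": "checkout latency", "severity": "SEV2"}'
signature = sign(secret, body)

print(verify_signature(secret, body, signature))
print(verify_signature(secret, body + b" ", signature))
```

The hosted paging tools handle this for you. It only becomes your code if you put a custom receiver between monitoring and paging.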
Where to go from here
Audit your current rotation against the policies above: shift length, primary/secondary, handoff, severity tiers, compensation, escalation chain, onboarding, tooling. Almost every team has at least one gap. Closing two of them in the next quarter is more impactful than any reliability engineering project, because it's what lets the engineers doing reliability engineering stay healthy long enough to do it.