
On-Call Rotation: Set One Up Without Burnout

Uptimera team · May 9, 2026 · 10 min read · Updated June 30, 2026

On-call is the engineering practice most likely to cause attrition. It is also unavoidable for any team running a real service. The gap between a humane on-call rotation and a soul-grinding one is almost entirely about policy choices the team makes before anyone gets paged. This post walks through those choices.

Why on-call quality is a retention problem

Two truths the industry doesn't talk about enough. First: engineers who interview elsewhere disproportionately cite on-call as the reason. Second: the engineers who stay through a bad on-call culture are the ones least likely to push back on it. By the time leadership realizes there's a problem, the people who would have raised it have left.

Treat on-call as an explicit operational program with policies, not as a thing engineering just "does." Below is the minimum viable set of policies.

Shift length: weekly is the sweet spot

The three common patterns and their tradeoffs:

Daily shifts. Lowest stress per shift, but the cognitive overhead of context-switching to on-call mode every day is high. Workable for very small teams (3-4 engineers) who'd otherwise rotate too rarely.
Weekly shifts. The standard for most teams. Long enough that the on-call engineer gets familiar with the current state of the system; short enough that one bad week doesn't consume someone's month. Strongly preferred.
Two-week shifts. Common at large companies because rotation math is easier. Brutal on the individual. Avoid unless you have an unusually large rotation (15+ engineers) where it's the only way to keep frequency tolerable.
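The rotation math behind these tradeoffs is simple enough to sketch. The helper names below are ours, for illustration only. The point it makes: shift length changes how often your turn comes around, but the total share of the year you spend on call depends only on team size.

```python
def weeks_between_turns(team_size: int, shift_days: int) -> float:
    """Weeks from the start of one primary shift to the start of the
    same engineer's next one, assuming a simple round-robin."""
    if team_size < 1 or shift_days < 1:
        raise ValueError("team_size and shift_days must be positive")
    return team_size * shift_days / 7


def share_of_year_on_call(team_size: int) -> float:
    """Fraction of the year each engineer is primary. Independent of
    shift length: longer shifts are rarer but the total is the same."""
    if team_size < 1:
        raise ValueError("team_size must be positive")
    return 1 / team_size


# Six engineers on weekly shifts: a turn every 6 weeks.
print(weeks_between_turns(team_size=6, shift_days=7))
# Fifteen engineers on two-week shifts: a turn every 30 weeks.
print(weeks_between_turns(team_size=15, shift_days=14))
```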

Primary and secondary, always

Solo on-call is an anti-pattern. There must always be a backup. The secondary doesn't need to be at the keyboard within a minute; their job is to:

Take over if the primary doesn't respond within X minutes (typical: 15 minutes for SEV1, 30 for SEV2).
Provide a second brain during long incidents. The primary debugs; the secondary watches for things the primary missed, coordinates with other teams, posts status updates.
Cover transient unavailability — the primary lost cell signal, is driving, etc.

A common mistake: making the secondary the same person every week (typically the tech lead). This burns one person out instead of two. Rotate the secondary independently of the primary.
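One simple way to rotate the secondary through the whole team is to run a second round-robin at an offset from the primary one. This is a minimal sketch, not the only scheme: `build_schedule` and the default offset of half the team size are our assumptions, chosen so that nobody is secondary right before or after their primary week on teams of four or more.

```python
from datetime import date, timedelta
from typing import Optional


def build_schedule(engineers: list[str], start: date, weeks: int,
                   secondary_offset: Optional[int] = None):
    """Return a list of (week_start, primary, secondary) tuples.

    Both roles cycle through the full team; the secondary is the
    engineer `secondary_offset` places ahead of the primary.
    """
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for a backup")
    if secondary_offset is None:
        secondary_offset = max(1, n // 2)  # assumption: spread the load
    if secondary_offset % n == 0:
        raise ValueError("offset would make someone their own backup")

    schedule = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + secondary_offset) % n]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical four-person team, starting on a Monday.
for week_start, primary, secondary in build_schedule(
        ["ana", "ben", "chi", "dev"], date(2026, 5, 11), weeks=4):
    print(week_start, primary, secondary)
```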

The handoff: a written note plus a 10-minute meeting

Every shift change needs a structured handoff. Five questions the outgoing on-call answers, in writing, that the incoming on-call reads and acknowledges:

What broke this week? Brief list with links to alerts/Slack threads.
What is currently degraded or unresolved? Anything ongoing.
What deploys happened or are planned? Particularly database migrations, infra changes, and dependency upgrades.
What alerts are noisy and need tuning? Flag them so the new on-call doesn't re-page on the same thing.
Anything to watch for? Customer escalations, planned outages on dependencies, etc.

This usually fits in a Slack message. A weekly 10-minute live handoff (sync, not async) is worth scheduling — it catches the things people forget to write down.
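If it helps to make the written part hard to skip, the five questions can live in a small template. This is an illustrative sketch: the `HandoffNote` class and its field names are ours, and the output is plain text you could paste into Slack or any other channel.

```python
from dataclasses import dataclass, field

# The five handoff questions, paired with the attribute that answers each.
QUESTIONS = [
    ("broke", "What broke this week?"),
    ("unresolved", "What is currently degraded or unresolved?"),
    ("deploys", "What deploys happened or are planned?"),
    ("noisy_alerts", "What alerts are noisy and need tuning?"),
    ("watch_for", "Anything to watch for?"),
]


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    broke: list[str] = field(default_factory=list)
    unresolved: list[str] = field(default_factory=list)
    deploys: list[str] = field(default_factory=list)
    noisy_alerts: list[str] = field(default_factory=list)
    watch_for: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render every question, even the ones with no answer, so an
        empty section is a deliberate statement rather than an omission."""
        lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
        for attr, question in QUESTIONS:
            lines.append(question)
            items = getattr(self, attr) or ["Nothing to report"]
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


note = HandoffNote(outgoing="ana", incoming="ben",
                   noisy_alerts=["disk-usage warning on the batch host"])
print(note.render())
```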

Response-time targets that actually move

The industry-standard reliability metrics (MTTD, MTTA, and MTTR) are largely on-call KPIs. If your MTTA (time from page to human acknowledgment) is over 15 minutes at night, the rotation policy needs work before the alerting policy does. After each incident, capture what happened via the postmortem template so the fixes flow back into the next rotation instead of dying in Slack.
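To check the night-time MTTA figure against your own data, something like the following is enough. It is a sketch under stated assumptions: pages are given as (paged, acknowledged) timestamp pairs, and "night" means 22:00 to 06:00 local time, a window we picked for illustration. Most paging tools can export these timestamps, but the exact export format varies by tool.

```python
from datetime import datetime, time
from statistics import mean

NIGHT_START = time(22, 0)  # assumption: adjust to your team's definition
NIGHT_END = time(6, 0)


def is_night(ts: datetime) -> bool:
    """True if the timestamp falls in the overnight window."""
    t = ts.time()
    return t >= NIGHT_START or t < NIGHT_END


def mtta_minutes(pages):
    """Mean minutes from page to acknowledgment, or None if no pages."""
    if not pages:
        return None
    return mean((acked - paged).total_seconds() / 60
                for paged, acked in pages)


def night_mtta_minutes(pages):
    """MTTA restricted to pages that fired overnight."""
    return mtta_minutes([p for p in pages if is_night(p[0])])
```

A night-time result above 15 is the signal described above: look at the rotation before tuning the alerts.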

A clear severity ladder

Don't let "everything is urgent" happen. A three-tier ladder — explained in full in incident severity levels — is usually enough:

SEV1 — Critical. Service is down or severely degraded for most users. Page primary immediately. Escalate to secondary in 15 minutes if no ack. Engineering manager looped in.
SEV2 — Major. A subset of users or a non-critical feature is affected. Page primary during business hours; quieter alerting overnight (Slack message, no SMS) unless escalated.
SEV3 — Minor. Something to look at but not urgent. No page; create a ticket. Should never wake anyone up.

Define the boundary between SEV2 and SEV3 carefully: it's where alert fatigue starts. If your team thinks of half of SEV2 as "could have been SEV3," you have a tuning problem — see our alert fatigue post.
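Writing the ladder down as a routing function makes the SEV2/SEV3 boundary explicit. The sketch below is illustrative: the action names are placeholders rather than any paging tool's API, and business hours of 09:00 to 18:00 are our assumption. Anything outside that window is treated as "overnight" here.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


BUSINESS_HOURS = range(9, 18)  # assumption: 09:00-17:59 local time


def route(severity: Severity, hour: int, escalated: bool = False) -> list[str]:
    """Return the notification actions for an alert at a given local hour."""
    if severity is Severity.SEV1:
        # Always page, and loop in the engineering manager.
        return ["page_primary", "notify_engineering_manager"]
    if severity is Severity.SEV2:
        if escalated or hour in BUSINESS_HOURS:
            return ["page_primary"]
        return ["slack_message"]  # quiet overnight: no SMS
    # SEV3 never wakes anyone up.
    return ["create_ticket"]
```

The 15-minute escalation to the secondary for an unacknowledged SEV1 is left to the escalation policy rather than encoded here.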

Compensation: pay for it

Carrying a pager is work. Treating it as "just part of the job" is how teams accumulate quiet resentment. The cleanest models:

Stipend per shift. A flat amount for being on-call (e.g. $200/week). Pays for the burden of carrying the phone whether or not you're paged.
Per-page compensation. An additional amount for each page received outside business hours. It aligns incentives: noisy alerting shows up as a direct cost, so it gets attention faster.
Time-in-lieu. Comp time for hours spent dealing with overnight pages. Less clean financially but works in regions where stipends raise legal questions.
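The stipend and per-page models combine naturally. As a quick worked example, the $200 weekly stipend is the figure from the list above, while the $25 per-page rate is a made-up number for illustration, not a recommendation.

```python
def weekly_on_call_pay(off_hours_pages: int,
                       stipend: float = 200.0,
                       per_page: float = 25.0) -> float:
    """Flat stipend for carrying the pager, plus a per-page amount
    for each page received outside business hours."""
    if off_hours_pages < 0:
        raise ValueError("off_hours_pages cannot be negative")
    return stipend + per_page * off_hours_pages


print(weekly_on_call_pay(0))  # quiet week: stipend only
print(weekly_on_call_pay(4))  # four overnight pages
```

Summed across the team, the per-page portion is also a rough price tag on alert noise.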

Escalation paths must be clear

At 3am, the on-call should never have to wonder who to call next. The escalation chain — written, in the runbook — looks like:

Primary on-call (15 min ack)
Secondary on-call (15 min ack)
Engineering manager (30 min ack)
Director / on-call lead (60 min ack)
VP / CTO (only for SEV1 not resolving in 2+ hours)

Two rules. First: every level above primary must have explicit opt-in. Don't volunteer your manager without asking. Second: skipping levels is allowed for serious incidents. A SEV1 affecting most customers should page everyone simultaneously, not wait for 15 minutes per tier.
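The chain and the skip-levels rule can be sketched as a small function that answers "who should have been paged by now?" for an unacknowledged incident. Two assumptions of ours are baked in: the 2-hour VP / CTO threshold is measured from the first page, and "page everyone simultaneously" means the four tiers with ack windows, with the VP / CTO rule left unchanged.

```python
# (role, minutes to acknowledge before the next tier is paged)
CHAIN = [
    ("primary on-call", 15),
    ("secondary on-call", 15),
    ("engineering manager", 30),
    ("director / on-call lead", 60),
]
EXEC_ROLE = "VP / CTO"
EXEC_AFTER_MIN = 120  # only for SEV1 still open after 2+ hours


def paged_roles(minutes_unacked: int, sev1: bool = False,
                widespread: bool = False) -> list[str]:
    """Roles that should have been paged after `minutes_unacked`
    minutes without an acknowledgment."""
    page_all_now = sev1 and widespread  # skip levels for serious incidents
    roles = []
    tier_starts_at = 0
    for role, ack_window in CHAIN:
        if page_all_now or minutes_unacked >= tier_starts_at:
            roles.append(role)
        tier_starts_at += ack_window
    if sev1 and minutes_unacked >= EXEC_AFTER_MIN:
        roles.append(EXEC_ROLE)
    return roles
```

With these windows the director's 60 minutes run out at the 120-minute mark, which is also when the VP / CTO rule kicks in for a SEV1.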

Onboarding new engineers to the rotation

Don't throw new hires straight onto the pager. A reasonable ramp:

Weeks 0-4: Shadow only. Receive no pages. Read past postmortems. Run the runbooks against a staging environment to confirm they work.
Weeks 4-8: Secondary on-call. Always paired with a senior primary. Gets all the pages but isn't responsible for resolving them solo.
Weeks 8+: Primary on-call, with the team's explicit confidence they're ready. Senior secondary for the first few shifts.
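If your scheduling is scripted, the ramp can be encoded as an eligibility check so nobody lands on the pager early by accident. This sketch treats the week ranges as half-open (week 4 counts as the start of the secondary phase), which is our reading of the boundaries, and the role names are placeholders.

```python
def rotation_role(weeks_on_team: int) -> str:
    """Most senior rotation role a new hire is eligible for."""
    if weeks_on_team < 0:
        raise ValueError("weeks_on_team cannot be negative")
    if weeks_on_team < 4:
        return "shadow"     # no pages; read postmortems, test runbooks
    if weeks_on_team < 8:
        return "secondary"  # always paired with a senior primary
    return "primary"        # eligible only; the team still decides


print(rotation_role(2), rotation_role(5), rotation_role(12))
```

Reaching week 8 makes someone eligible, not automatically ready: the team's explicit sign-off is still the gate.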

Tooling: what you need and don't

For teams up to about 20 engineers, the stack is:

A paging tool. PagerDuty, Opsgenie, Better Stack — all fine. Don't roll your own.
A monitoring system that can route to the paging tool via signed webhooks. Don't use a monitoring system without integration — re-implementing escalation badly is a classic mistake.
A shared inbox / Slack channel for alerts that should be visible but not page anyone. The kitchen sink of "something looks off" goes here.
A runbook system. Could be Notion, Confluence, a Git repo. The format matters less than the discipline to keep runbooks current. See our runbooks template.
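On "signed webhooks": the usual idea is that the sender computes an HMAC of the request body with a shared secret and the receiver recomputes it before trusting the payload. The sketch below shows that general pattern with HMAC-SHA256 from Python's standard library. It is not any particular vendor's scheme: each tool defines its own header name, encoding, and any timestamp handling, so check its documentation before relying on this.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(secret, body).encode()
    return hmac.compare_digest(expected, received_signature.encode())


# Hypothetical secret and payload, for illustration only.
secret = b"shared-secret"
body = b'{"check": "api", "status": "down"}'
signature = sign(secret, body)
print(verify(secret, body, signature))         # True
print(verify(secret, body + b" ", signature))  # False: body was altered
```

Verify against the raw bytes as received, before any JSON parsing, since re-serializing the payload can change the bytes and break the signature.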

Where to go from here

Audit your current rotation against the checklist: shift length, primary/secondary, severity tiers, escalation chain, compensation, onboarding. Almost every team has at least one gap. Closing two of them in the next quarter is more impactful than any reliability engineering project — because it's what lets the engineers doing reliability engineering stay healthy long enough to do it.
