operations
How to write an incident postmortem (with template)
A good postmortem is the cheapest insurance you can buy against repeating the same outage. A bad one — vague, defensive, or skipped — guarantees you'll see the same incident again, often with worse timing.
This is the template we use at Uptimera, with the rationale for every section. Steal it. The format matters less than the discipline of actually filling it in within 48 hours of resolution.
Three principles before the template
Before you fill out a postmortem, the team should agree on the operating posture:
- Blame the system, not the person. If a single engineer's mistake could take production down, that's a system problem — missing guardrails, missing review, missing automation. Always.
- Optimize for learning, not blame. Read How Complex Systems Fail by Richard Cook (it's 4 pages). Real outages are almost never one root cause; they're a chain of small contributing factors that finally lined up.
- Ship the action items. A postmortem with no follow-up actions is a journal entry, not a postmortem. Every action item gets an owner and a due date.
The template
Copy the headers verbatim. The sequence matters — it forces facts first, interpretation second, action third.
# Incident — <one-line summary> **Date:** YYYY-MM-DD **Duration:** Hh Mm **Severity:** SEV-1 / SEV-2 / SEV-3 **Customer impact:** <plain English> **Status:** Resolved · postmortem complete · actions tracked --- ## Summary One paragraph. What happened, who was affected, when it started, when it resolved. This is what an exec or customer sees first. ## Timeline (UTC) - 14:02 Deploy of #1234 starts - 14:04 Error rate climbs to 12% on /checkout - 14:07 PagerDuty alert fires; on-call ack - 14:11 Rollback initiated - 14:18 Error rate returns to baseline - 14:25 Incident declared resolved ## Impact - Affected service: <which surface> - Affected users: <count or %> - Failed requests: <count> - Revenue impact: <if known> - SLA impact: <minutes burned this month> ## Detection How did we find out? Alert? Customer? Twitter? Was the alert noisy, timely, actionable? If detection was slow, that's an action item. ## Root cause(s) What actually broke. Be specific. "A bug" is not a root cause; "the new code path treated null as 0 in the line item total" is. ## Contributing factors What else was lining up that made this worse than it should have been? Examples: missing test coverage, no canary deploy, alert was muted, new on-call had no runbook. ## What went well Yes, really. The team needs to know what to keep doing. ## What didn't The honest list. No sandbagging. ## Action items | # | Action | Owner | Due | Type | |---|--------|-------|-----|------| | 1 | Add null-guard in totals helper | @alice | 2026-MM-DD | Fix | | 2 | Add canary deploy step for /checkout | @bob | 2026-MM-DD | Prevent | | 3 | Document rollback in runbook | @carol | 2026-MM-DD | Detect | Types: Fix (closes this gap), Prevent (stops similar in future), Detect (we'd catch it faster next time), Mitigate (smaller blast radius).
The five questions that drive real learning
Postmortems get hollow when the meeting becomes a status update. These five questions, asked in order, surface the lessons that matter:
1. What did we believe about the system that turned out to be wrong?
Almost every outage has a hidden assumption at its core. "We thought the timeout was 5s; it was 30s." "We thought this endpoint was idempotent; it wasn't." Naming the wrong belief is the postmortem's most valuable artifact.
2. What information would have changed the outcome?
Not what would have prevented it — what would have made the responder act sooner or differently. This question generates observability action items.
3. Where was the closest near-miss recently?
Often the same failure mode caused a smaller alert a week earlier that got dismissed. Asking this out loud finds it.
4. What would have made the rollback faster?
Most production outages are bounded by mean time to recovery, not by the bug itself. Rollback speed is a feature.
5. What would a competent stranger need to know to handle this next time?
The answer is your runbook. Write it now, while the context is fresh.
A simple severity scale
Pick a scale and stick to it. Ours:
- SEV-1: Production is down or severely degraded for most users. Pages on-call immediately. Public status page must be updated.
- SEV-2: A meaningful subset of users is affected, or a critical workflow is broken (auth, payments, deploys). Pages on-call. Status page if external.
- SEV-3: Limited impact; fixable during business hours. Tracked, doesn't page.
Status page communication
During an active incident, the status page is doing two jobs: telling customers what's happening, and signaling to your team that the incident is real. Update it at three points minimum:
- When the incident is acknowledged. You don't need a root cause to post "Investigating elevated error rates on checkout."
- When you have an identified cause or mitigation. "Identified — rollback in progress."
- When resolved. Brief summary, confirmation that the underlying issue is fixed, and that a postmortem will follow.
Make it a ritual
Postmortems work because they're a habit. Three rules to keep the habit:
- Run a postmortem within 48 hours of resolution, while context is fresh.
- Track action items in your normal issue tracker — not a separate spreadsheet that nobody reads.
- Review old postmortems quarterly. Patterns across incidents are where the highest-leverage systemic fixes live.
If you take only one thing from this article: the next outage you have, run the template above. The team will be better at handling the one after that, and the one after that.