Incident Severity Levels: SEV-1, SEV-2, SEV-3
Severity is the first decision you make in an incident, and it's the one that drives everything after it. Before anyone knows the root cause, before the first status update, before the postmortem — someone has to say "this is a SEV-1" or "this is a SEV-3." That single call decides who gets paged, whether the status page updates, whether leadership gets pulled in, and whether you owe a postmortem when it's over. Get the scale right and the rest of your response runs on rails. Get it wrong — or skip it — and every incident becomes an argument about how much to panic.
What incident severity levels are
An incident severity level is a shared scale that makes "urgent" mean the same thing to everyone. Without one, urgency is negotiated per incident by whoever happens to be in the channel: the engineer who found it thinks it's minor, the support lead thinks it's a five-alarm fire, and the two of them burn ten minutes calibrating instead of fixing. A severity scale replaces that negotiation with a label.
The failure mode a scale prevents is the one where everything is urgent, so nothing is. If your only two states are "fine" and "emergency," your team learns that emergency doesn't actually mean drop-everything, because half the emergencies were a slow report or a cosmetic bug. A working scale is calibrated so that the top level is rare and unambiguous. When a SEV-1 fires, nobody asks "is this real?" That single label carries all the context someone needs to decide whether to open their laptop.
SEV-1, SEV-2, SEV-3 defined
Here's a three-tier scale that works for most teams. The boundaries are about blast radius and urgency, not about how hard the bug is to fix — a one-line typo that takes down checkout is a SEV-1; a gnarly race condition that slows one internal report is a SEV-3.
SEV-1 — production down or severely degraded
Most users can't do the core thing your product exists to do. The site won't load, checkout is fully down, the API errors across the board, or data is being lost or corrupted. A SEV-1 pages immediately — day, night, weekend, it doesn't matter. It updates the public status page, pulls in an incident commander, and usually notifies leadership. Example: the payment service is returning 500s and no customer can complete a purchase.
SEV-2 — a subset of users or one critical workflow affected
Something important is broken, but not for everyone and not the whole product. One workflow is down, or one segment of users is affected. It pages during business hours; overnight it can be handled more quietly — acknowledged, watched, and escalated to a SEV-1 if the blast radius grows. Example: file uploads are failing for about 20% of users on a specific browser, or a background job is backed up and reports are hours stale for one region.
SEV-3 — limited impact, fixable in business hours
A real problem, but small and containable. It creates a ticket, not a page. Nobody's sleep is interrupted; someone picks it up during the next working day. Example: a minor report renders slowly, a non-critical export is broken, or a UI element is misaligned on one page. It matters enough to track and fix, but not enough to wake anyone.
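The three definitions above can be sketched as a small function that proposes a level from blast radius. This is an illustration, not a standard: the `Impact` fields, the `propose_severity` name, and the 50% cut-off for "most users" are all assumptions you would replace with terms from your own product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Impact:
    # Fraction of users (0.0 to 1.0) who cannot do the product's core action.
    core_blocked_fraction: float
    # True if data is being lost or corrupted.
    data_loss: bool = False
    # True if one important workflow (uploads, signup, ...) is broken.
    critical_workflow_broken: bool = False


# Hypothetical threshold for "most users"; the article does not give a number.
MOST_USERS = 0.5


def propose_severity(impact: Impact) -> Severity:
    """Propose a level from blast radius, never from how hard the fix is."""
    if impact.data_loss or impact.core_blocked_fraction >= MOST_USERS:
        return Severity.SEV1
    if impact.critical_workflow_broken or impact.core_blocked_fraction > 0:
        return Severity.SEV2
    return Severity.SEV3
```

Note that nothing in `Impact` describes the bug itself. A one-line typo and a race condition get the same level if they block the same users.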
Where to draw the SEV-2 / SEV-3 line
The SEV-1 boundary is usually easy: everything is on fire, or it isn't. The boundary that actually causes trouble is between SEV-2 and SEV-3 — because that's the line between "this pages someone" and "this waits until morning." Draw it too low and you page for slow reports; draw it too high and you sleep through a broken signup flow.
This boundary is also exactly where alert fatigue starts. If half of your SEV-2s feel like SEV-3s to the person who gets paged — if the honest reaction to most SEV-2 pages is "this could have waited until 9am" — you don't have a severity problem, you have a tuning problem. Either your SEV-2 definition is too broad, or alerts that should be SEV-3 are being declared at SEV-2 out of caution. The fix is to audit a month of SEV-2s and ask, for each one, whether paging overnight actually changed the outcome. The ones where it didn't are your SEV-3s.
A useful heuristic: if a competent engineer can't meaningfully act on it before business hours — the fix needs a deploy that shouldn't happen at 3am, or it depends on a vendor who's asleep — then paging overnight just trades sleep for nothing. That's a SEV-3 dressed as a SEV-2.
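The monthly audit described above can be as simple as a list and two counts. The sketch below assumes you have already reviewed each SEV-2 by hand and recorded a yes/no answer; the `PagedIncident` record and its field names are hypothetical, not something exported by any particular incident tool.

```python
from dataclasses import dataclass


@dataclass
class PagedIncident:
    incident_id: str
    paged_overnight: bool
    # Reviewer's judgment after the fact: did paging overnight change the outcome?
    paging_changed_outcome: bool


def audit_sev2_pages(incidents):
    """Return (ids of overnight pages that could have waited, their share)."""
    overnight = [i for i in incidents if i.paged_overnight]
    could_have_waited = [
        i.incident_id for i in overnight if not i.paging_changed_outcome
    ]
    # Share of overnight SEV-2 pages that were really SEV-3s; 0.0 if none paged.
    share = len(could_have_waited) / len(overnight) if overnight else 0.0
    return could_have_waited, share
```

The returned ids are the candidates for reclassification as SEV-3. If the share is around half, that matches the tuning problem described above and points at the SEV-2 definition rather than at the people declaring it.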
Who declares severity
The rule is simple: anyone can raise an incident; the incident commander confirms the severity. Whoever spots the problem (an on-call engineer, a support agent, an automated alert) proposes a level. They don't need permission, and they don't need to be right. Once someone takes command of the incident, they confirm or adjust the severity based on what's actually known.
Declaring severity has to be cheap and fast. If raising a SEV-1 requires a manager's sign-off, people will hesitate at exactly the moment hesitation is most expensive. Make it a one-line message in a channel, or a button in your incident tool, that anyone can hit.
And when it's genuinely ambiguous, err high and downgrade later. Under-calling a real SEV-1 costs you customers and trust; over-calling a SEV-2 costs you one person's slightly-interrupted evening and a thirty-second downgrade. Those are not symmetric. Downgrading is painless — you just note it in the timeline and let the extra responders stand down. The one caveat: don't declare everything a SEV-1, or you'll re-teach the team that SEV-1 means nothing. Err high on the ambiguous ones, not on the obvious SEV-3s.
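To make the raise, confirm, and downgrade flow concrete, here is a minimal in-memory sketch. It assumes a three-level integer scale and a plain list as the timeline; `Incident`, `raise_incident`, and `confirm_severity` are illustrative names, and a real incident tool would persist this and notify responders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

VALID_LEVELS = (1, 2, 3)


@dataclass
class Incident:
    title: str
    severity: int
    confirmed: bool = False
    timeline: list = field(default_factory=list)

    def log(self, who: str, note: str) -> None:
        # Each entry is (UTC timestamp, author, note).
        self.timeline.append((datetime.now(timezone.utc), who, note))


def raise_incident(title: str, proposed: int, raised_by: str) -> Incident:
    """Anyone can raise; no approval step, only a sanity check on the level."""
    if proposed not in VALID_LEVELS:
        raise ValueError(f"unknown severity: {proposed}")
    incident = Incident(title=title, severity=proposed)
    incident.log(raised_by, f"raised as SEV-{proposed}")
    return incident


def confirm_severity(incident: Incident, commander: str, severity: int) -> None:
    """The incident commander confirms or adjusts, and the change is logged."""
    if severity not in VALID_LEVELS:
        raise ValueError(f"unknown severity: {severity}")
    previous = incident.severity
    incident.severity = severity
    incident.confirmed = True
    if severity == previous:
        incident.log(commander, f"confirmed SEV-{severity}")
    else:
        # A larger number is a lower severity, so it counts as a downgrade.
        direction = "downgraded" if severity > previous else "upgraded"
        incident.log(commander, f"{direction} from SEV-{previous} to SEV-{severity}")
```

A downgrade here is one call and one timeline entry, which is the property that makes erring high affordable.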
How severity drives the response
The whole point of a severity level is that it maps to concrete actions, so nobody has to improvise the response playbook mid-incident. Here's a mapping that works well as a default:
- Paging. SEV-1 pages the on-call immediately, any hour. SEV-2 pages during business hours and notifies quietly overnight. SEV-3 files a ticket and pages no one. This only works if your on-call rotation has clear primary, secondary, and escalation paths.
- Status page. SEV-1 always gets a public status page update. SEV-2 gets one if customers are noticeably affected. SEV-3 usually stays internal.
- Customer comms. A SEV-1 that runs long warrants a proactive outage email to affected customers. SEV-2 is judgment-dependent. SEV-3 rarely needs outbound comms at all.
- Postmortem. SEV-1 always gets a written postmortem. SEV-2 gets one if it was customer-visible or is likely to recur. SEV-3 doesn't need a formal one, though a one-liner in the incident log is cheap insurance.
Write this mapping down and put it where the on-call can find it at 3am. The value of the scale is that it turns "what do we do now" into a lookup instead of a debate.
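One way to make that lookup literal is to keep the mapping as data. The sketch below copies the defaults from the list above into a dictionary and adds a paging check. The business-hours window (09:00 to 17:00) is an assumption, weekends and time zones are ignored for brevity, and `ResponsePlan` and `should_page` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class ResponsePlan:
    paging: str
    status_page: str
    customer_comms: str
    postmortem: str


RESPONSE = {
    1: ResponsePlan(
        paging="page on-call immediately, any hour",
        status_page="always update the public status page",
        customer_comms="proactive outage email if it runs long",
        postmortem="always written",
    ),
    2: ResponsePlan(
        paging="page during business hours, notify quietly overnight",
        status_page="update if customers are noticeably affected",
        customer_comms="judgment call",
        postmortem="written if customer-visible or likely to recur",
    ),
    3: ResponsePlan(
        paging="file a ticket, page no one",
        status_page="usually internal only",
        customer_comms="rarely needed",
        postmortem="one-liner in the incident log",
    ),
}

# Assumed business hours; adjust to your team's working day.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def should_page(severity: int, local_time: time) -> bool:
    """Decide whether to page right now, per the default mapping."""
    if severity == 1:
        return True
    if severity == 2:
        return BUSINESS_START <= local_time < BUSINESS_END
    return False
```

Keeping the plan as plain data means the same dictionary can back both the on-call runbook and whatever automation sends the page.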
Do you need SEV-4 or SEV-5?
Almost certainly not, at least not yet. Three tiers cover the decisions you actually make: page now, page during business hours, or file a ticket. SEV-4 and SEV-5 usually end up being bug-tracker priorities wearing an incident costume, and every level you add is one more boundary people argue about while the clock is running.
Large orgs sometimes genuinely need more granularity — a SEV-4 for "degraded but not user-visible," a SEV-5 for "cosmetic, no rush." But that's a problem you grow into, and you'll know you have it when a specific, recurring decision has no clean home on the three-tier scale. Until then, a fourth level is complexity you pay for and don't use. The best test: if your team can't instantly name a recent incident that needed a SEV-4, you don't need a SEV-4.
Pick a scale and make declaring it cheap
The severity scale you write down and use beats the more elaborate one you keep in your head. Pick three levels, define each in one sentence with concrete examples from your own product, and map each to paging, status page, comms, and postmortem expectations. Then do the thing that matters most: make declaring a severity take seconds, not a meeting.
The teams that respond well to incidents aren't the ones with the cleverest taxonomy — they're the ones where anyone can say "this is a SEV-1" without hesitation and the whole response machine spins up automatically. Start with three levels, err high when you're unsure, and downgrade without shame. That's the entire discipline.
Frequently asked questions
- What is a SEV-1 incident?
- A SEV-1 is your highest severity: production is down or so degraded that most users can't do the thing they came to do. Checkout is failing, the app won't load, the API is returning errors across the board. A SEV-1 pages immediately, day or night, updates the public status page, and pulls in an incident commander and usually leadership. It's the level that justifies waking people up.
- What is the difference between SEV-1, SEV-2, and SEV-3?
- It's a scale of blast radius and urgency. SEV-1 is production down or severely degraded for most users — page immediately. SEV-2 affects a subset of users or one critical workflow — page during business hours, handle it more quietly overnight. SEV-3 is limited impact that can wait until business hours — a ticket, no page. The level is a shorthand that tells everyone how hard to run before they read the details.
- Who decides the severity of an incident?
- Anyone can raise an incident at a proposed severity — the on-call engineer, a support agent, an alert. The incident commander confirms or adjusts it once someone owns the incident. Declaring severity is not a committee decision; you want it to happen in seconds. When in doubt, start high and downgrade later.
- Should you always start at the highest severity?
- You should err toward the higher level when it's genuinely ambiguous, because under-calling a real SEV-1 is far more expensive than over-calling. But don't reflexively declare everything a SEV-1 — that's how you train people to ignore SEV-1s. The rule is: if you can't tell whether it's a SEV-1 or SEV-2 in the first minute, treat it as a SEV-1 and downgrade once you have signal.
- How many severity levels should you have?
- Three is enough for almost everyone. SEV-1 (page now), SEV-2 (page during business hours), SEV-3 (ticket) covers the decisions you actually make. SEV-4 and SEV-5 tend to be bug-tracker priorities dressed up as incidents, and every extra level is another boundary people argue about at 3am. Add levels only when a real recurring decision has no home on the existing scale.
Uptimera team
We build Uptimera — multi-region uptime monitoring, SSL and DNS checks, and branded status pages. These guides come from running the same monitoring and on-call practices we write about.