How to fix alert fatigue (without going dark on real incidents)
Alert fatigue is the slow erosion of trust in your monitoring system. It happens when too many alerts fire that don't require action, until the on-call engineer's default response to a page is "probably nothing." That is the moment monitoring stops working — not when an alert is missed, but when a real alert is dismissed as noise. This post is about how to diagnose and fix the noise without going dark on real incidents.
The symptoms (in order of severity)
- Stage 1: The on-call mutes specific alert channels when sleeping.
- Stage 2: The team has a shared Slack channel where someone routinely says "ignore this one."
- Stage 3: A new engineer asks "is this real?" about a page and the team can't answer without investigating.
- Stage 4: A real incident goes unacknowledged for >30 minutes because it looked like the usual noise.
Stage 4 has happened to almost every engineering org over five years old. It is the textbook outcome of alert fatigue: the page that mattered looked exactly like the 200 pages before it that didn't.
Categorize the noise before you fix it
Noisy alerts come in four flavors. The fix is different for each; you have to triage before you tune.
1. False positives (the alert was wrong)
The monitor said something failed; nothing was actually broken. Common causes: a check that runs from a single region, no quorum requirement across regions, a time-of-day pattern the monitor doesn't know about (your nightly batch job spikes latency), or a threshold set too tight.
2. True but not actionable
The alert is technically accurate, but nothing the on-call can do will fix it within their shift. Examples: a third-party API is slow, a customer's misconfigured webhook is failing, disk on a non-critical worker is filling. These shouldn't page.
3. True, actionable, but self-healing
Something broke, but the system fixed itself in under 5 minutes. An auto-scaling event, a brief network blip, a retry that succeeded. The page woke someone up to read "all clear."
4. Symptom alerts when you have cause alerts
Your "500 rate is high" alert fires every time your "database connection pool exhausted" alert fires. You get paged twice for the same incident, once for the symptom and once for the cause, which doubles the noise without adding information.
The three tactics that actually reduce volume
Tactic 1: multi-region quorum (kills most false positives)
Require N-of-M regions to confirm a failure before opening an incident. If your monitor uses 5 probe regions and requires 3 to agree, single-region transit issues stop generating pages. The false-positive rate typically drops by 80% or more as soon as quorum is enabled. If your monitoring tool doesn't support this, that gap alone can justify the cost of switching tools.
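A minimal sketch of the N-of-M rule, assuming each probe reports a simple pass/fail per region. The `ProbeResult` type, the region names, and the `quorum_failed` helper are hypothetical illustrations, not any particular monitoring tool's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProbeResult:
    region: str  # hypothetical region label
    ok: bool     # True if the check succeeded from this region


def quorum_failed(results: List[ProbeResult], required_failures: int) -> bool:
    """Return True only when at least `required_failures` regions saw a failure."""
    failures = sum(1 for r in results if not r.ok)
    return failures >= required_failures


# One region failing looks like a transit issue, not an outage.
probes_blip = [
    ProbeResult("us-east", False),
    ProbeResult("us-west", True),
    ProbeResult("eu-west", True),
    ProbeResult("ap-south", True),
    ProbeResult("sa-east", True),
]

# Three of five regions failing meets the 3-of-5 quorum.
probes_outage = [
    ProbeResult("us-east", False),
    ProbeResult("us-west", False),
    ProbeResult("eu-west", False),
    ProbeResult("ap-south", True),
    ProbeResult("sa-east", True),
]

print(quorum_failed(probes_blip, required_failures=3))    # False
print(quorum_failed(probes_outage, required_failures=3))  # True
```

The quorum size is a trade-off: a higher N suppresses more regional noise but could hide an outage that genuinely affects only one or two regions.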
Tactic 2: dependency-aware suppression
When alert B is always caused by alert A, suppress B when A is active. Most modern monitoring systems support this via "depends on" relationships or service maps. The simpler version: route symptom alerts to a Slack channel while only the root-cause alert pages.
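As a rough sketch of the idea, suppression can be modeled as a lookup from each symptom alert to the cause alert that explains it. The alert names and the `DEPENDS_ON` map below are assumptions made up for illustration; real tools express this through their own configuration.

```python
from typing import Dict, Set

# Hypothetical map: symptom alert -> the cause alert that explains it.
DEPENDS_ON: Dict[str, str] = {
    "http_500_rate_high": "db_connection_pool_exhausted",
}


def alerts_to_page(firing: Set[str], depends_on: Dict[str, str]) -> Set[str]:
    """Drop any firing alert whose declared cause is also firing."""
    return {
        alert
        for alert in firing
        # Alerts with no declared cause map to None, which is never firing.
        if depends_on.get(alert) not in firing
    }


# Cause and symptom fire together: only the cause pages.
both = {"http_500_rate_high", "db_connection_pool_exhausted"}
print(alerts_to_page(both, DEPENDS_ON))  # {'db_connection_pool_exhausted'}

# The symptom fires on its own: it still pages, so nothing goes dark.
print(alerts_to_page({"http_500_rate_high"}, DEPENDS_ON))  # {'http_500_rate_high'}
```

The second case is the important one: a symptom with no active cause still pages, so suppression never hides an incident whose cause alert failed to fire.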
Tactic 3: minimum-duration thresholds
Don't page on transient failures. A common rule: only page if the failure persists for at least 3 consecutive checks. With a 30-second check interval, the third failed check lands 60 seconds after the first, so a page fires after roughly 60 to 90 seconds of sustained failure. This filters out failures that heal themselves within that window without meaningfully delaying pages for real incidents (which last hours, not minutes).
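One way to sketch the consecutive-checks rule is a small counter that resets on any passing check. The `ConsecutiveFailureGate` class is a hypothetical helper written for this post, assuming checks arrive one at a time as pass/fail results.

```python
class ConsecutiveFailureGate:
    """Signal a page only after `required` failed checks in a row."""

    def __init__(self, required: int = 3) -> None:
        self.required = required
        self.streak = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one check result; return True when the page should fire."""
        if check_ok:
            self.streak = 0  # any success resets the streak
            return False
        self.streak += 1
        # Fire exactly once, on the check that completes the streak.
        return self.streak == self.required


# A blip that recovers after two failed checks never pages.
blip_gate = ConsecutiveFailureGate(required=3)
print([blip_gate.observe(ok) for ok in [False, False, True, True]])
# [False, False, False, False]

# A sustained failure pages on the third consecutive failed check.
outage_gate = ConsecutiveFailureGate(required=3)
print([outage_gate.observe(ok) for ok in [False, False, False, False]])
# [False, False, True, False]
```

Firing only when the streak first reaches the threshold avoids re-paging on every later failed check of the same incident.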
How to audit an existing alert set
A two-week exercise that pays for itself. Each time the on-call gets paged, they answer four questions in a shared spreadsheet:
- What was the alert? Name / monitor / link.
- Was it real? Yes / no / partially.
- Was it actionable? Yes / no.
- If both yes, what did you do? One sentence.
After two weeks, sort by alert name and count. Patterns become obvious:
- Alerts where most rows are "not real" → false positives. Tune the threshold or add quorum.
- Alerts where most rows are "not actionable" → demote to non-paging.
- Alerts where the response is always the same → automate the response or create a runbook so the next person doesn't have to figure it out.
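The tally step can be sketched in a few lines once the spreadsheet rows are loaded. This sketch makes several assumptions: the `PageLog` row type is hypothetical, "partially real" is recorded as real, and "most rows" is read as a strict majority. It does not cover the third pattern (identical responses), which needs a human to read the free-text answers.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class PageLog(NamedTuple):
    alert: str        # alert name as logged by the on-call
    real: bool        # "Was it real?" (assumption: "partially" is logged as True)
    actionable: bool  # "Was it actionable?"


def recommend(rows: List[PageLog]) -> Dict[str, str]:
    """Group log rows by alert name and suggest a follow-up for each alert."""
    by_alert = defaultdict(list)
    for row in rows:
        by_alert[row.alert].append(row)

    recommendations = {}
    for alert, entries in by_alert.items():
        total = len(entries)
        not_real = sum(1 for e in entries if not e.real)
        not_actionable = sum(1 for e in entries if not e.actionable)
        # Assumption: "most rows" means a strict majority of that alert's rows.
        if not_real * 2 > total:
            recommendations[alert] = "tune threshold or add quorum"
        elif not_actionable * 2 > total:
            recommendations[alert] = "demote to non-paging"
        else:
            recommendations[alert] = "keep paging"
    return recommendations


rows = [
    PageLog("disk_latency", real=False, actionable=False),
    PageLog("disk_latency", real=False, actionable=False),
    PageLog("disk_latency", real=True, actionable=True),
    PageLog("vendor_api_slow", real=True, actionable=False),
    PageLog("vendor_api_slow", real=True, actionable=False),
    PageLog("checkout_5xx", real=True, actionable=True),
]

for alert, action in sorted(recommend(rows).items()):
    print(f"{alert}: {action}")
# checkout_5xx: keep paging
# disk_latency: tune threshold or add quorum
# vendor_api_slow: demote to non-paging
```

The false-positive check runs first because a false positive is also logged as "not actionable", and tuning it is the better fix than demoting it.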
Never delete an alert — demote it
When you decide an alert isn't worth paging on, demote it instead of deleting it. Three useful tiers:
- Page — wakes someone up.
- Slack ping — goes to a channel during business hours, no overnight delivery.
- Ticket — creates a low-priority issue for someone to look at within the week.
The lower tiers preserve the signal. A "something feels off" hunch can be useful for debugging next week's real incident; deleting the alert throws away that future value.
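The three tiers can be sketched as a routing table in which demoting an alert moves it one tier down instead of removing it. The alert names, the `ROUTING` table, and the default-to-page behavior for unlisted alerts are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict


class Tier(Enum):
    PAGE = "page"      # wakes someone up
    SLACK = "slack"    # channel ping during business hours
    TICKET = "ticket"  # low-priority issue for later in the week


# Ordered from loudest to quietest; demotion walks down this list.
TIER_ORDER = [Tier.PAGE, Tier.SLACK, Tier.TICKET]

# Hypothetical routing table: alert name -> current tier.
ROUTING: Dict[str, Tier] = {
    "checkout_5xx": Tier.PAGE,
    "vendor_api_slow": Tier.SLACK,
}


def route(alert: str, routing: Dict[str, Tier]) -> Tier:
    """Look up an alert's tier; unlisted alerts page (assumption: fail loud)."""
    return routing.get(alert, Tier.PAGE)


def demote(alert: str, routing: Dict[str, Tier]) -> Tier:
    """Move an alert one tier quieter; the lowest tier stays where it is."""
    current = route(alert, routing)
    next_index = min(TIER_ORDER.index(current) + 1, len(TIER_ORDER) - 1)
    routing[alert] = TIER_ORDER[next_index]
    return routing[alert]


print(demote("checkout_5xx", ROUTING))     # Tier.SLACK
print(demote("vendor_api_slow", ROUTING))  # Tier.TICKET
print(demote("vendor_api_slow", ROUTING))  # Tier.TICKET
```

There is deliberately no delete operation here: the quietest an alert can get is a ticket, so the signal is always recorded somewhere.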
A cultural rule worth adopting
After every overnight page, the on-call gets to choose: tune the alert, demote it, or pin it as "keep paging." The point is to make the decision explicit. If they choose "keep paging" three weeks in a row, the alert is real. If they choose "tune", they have the authority to do so without asking permission.
The pernicious form of alert fatigue is "we can't turn that one off because what if it's real this time." The cure is to make the decision belong to one person: whoever got paged.
Where to go from here
Run the two-week audit. Categorize what comes out of it. Apply quorum, dependency suppression, and minimum-duration thresholds to the noisiest alerts first. Most teams cut their page count by more than half in a month — and the resulting trust in the remaining alerts is worth more than the time saved.