reliability

Multi-region monitoring: why quorum matters more than coverage

Uptimera team · 8 min read

Almost every uptime monitoring vendor advertises "multi-region checks." Very few do the part that actually matters. Five probe regions don't help you if any one of those regions can wake you up at 3am over its own transit issue. The feature that turns false-positive monitoring into trustworthy monitoring is quorum — the requirement that N-of-M regions agree before an incident opens. This post is about why quorum is the single biggest false-positive killer in the category.

The single-region problem

A monitor that checks from one location is asking exactly one question: "can the probe reach the service from where the probe sits?" That's not the same as "is the service up." The gap between those two questions causes most false positives in uptime monitoring:

  • A transit provider between the probe and the service has a routing flap. The probe times out. The service is fine — every other customer can reach it.
  • The probe's data center has a brief outbound networking blip. Every monitor it runs reports failure for 90 seconds. None of them are real.
  • A regional ISP between the probe and the service starts packet-shaping or rate-limiting. The probe sees timeouts; real users on different networks see normal performance.

Each of these scenarios produces a page that wakes someone up to confirm the service is fine. After three or four such pages in a month, the team starts ignoring the alerts. That's the beginning of alert fatigue — see our alert fatigue post for the rest of that story.

Multi-region without quorum is barely better

Adding a second probe region without quorum doesn't fix the problem. For false positives, it makes things worse: now you have two regions that can each independently page you over a local issue, instead of one.

The naive multi-region setup: probe from A, B, and C, and alert if any region reports failure. Each region can now page you on its own, so the false-positive rate roughly triples. This is how most monitoring products are sold ("coverage from 5 regions!") without explaining the cost.
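To put a rough number on "roughly triples", here is a small sketch. It assumes each region has the same small, independent chance of a purely local false failure per check; the value of `p` is made up for illustration, not measured.

```python
def any_region_false_positive(p_local: float, regions: int) -> float:
    """Chance that at least one of `regions` probes reports a purely
    local failure in one check interval, assuming independence."""
    return 1.0 - (1.0 - p_local) ** regions


# Assumed per-region chance of a local false failure per check (illustrative).
p = 0.001

for m in (1, 2, 3, 5):
    rate = any_region_false_positive(p, m)
    print(f"{m} region(s), alert-on-any: {rate:.6f} ({rate / p:.2f}x single-region)")
```

For small `p`, the alert-on-any rate grows almost linearly with the number of regions, which is the cost that "coverage from 5 regions" leaves out.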

What quorum actually does

Quorum is the rule that an incident only opens when N of M regions confirm failure. With M=5 and N=3, an alert fires only when at least three regions independently see the failure within the same window. The probabilistic effect is dramatic:

  • Single-region transit issues affect one probe region. With 3-of-5 quorum, they never trigger an incident.
  • Bilateral routing issues between your service and one specific transit provider might affect two probe regions (the two whose default routes use that provider). Still doesn't trigger.
  • Real failures — your service is down, your DNS is broken, your TLS certificate expired — affect all five probe regions simultaneously. The quorum threshold is met within seconds.

The math is intuitive: real outages are uncorrelated with probe geography (everyone's probe sees them), while transit issues are correlated with probe geography (only some probes are on the affected path). Quorum exploits the difference.
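The N-of-M rule itself fits in a few lines. The sketch below is one possible way to evaluate it, not a description of any vendor's implementation; `ProbeResult` and `quorum_failed` are hypothetical names, and the choice to count only each region's most recent result inside the window is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ProbeResult:
    region: str
    ok: bool
    timestamp: float  # seconds since epoch


def quorum_failed(results: Iterable[ProbeResult], required: int,
                  window_seconds: float, now: float) -> bool:
    """True when at least `required` distinct regions report failure
    as their most recent result inside the window ending at `now`."""
    latest: Dict[str, ProbeResult] = {}
    for r in results:
        if now - window_seconds <= r.timestamp <= now:
            prev = latest.get(r.region)
            if prev is None or r.timestamp > prev.timestamp:
                latest[r.region] = r
    failing = {region for region, r in latest.items() if not r.ok}
    return len(failing) >= required
```

Using only the latest result per region means a region that failed and then recovered inside the window no longer counts toward quorum, which keeps a single flapping probe from lingering in the tally.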

Picking N and M

The numbers matter. Common configurations and their tradeoffs:

2-of-3

Three regions, alert when two agree. Cheapest config. Adequate for non-critical services. The weakness: if one of your three regions has a sustained issue (say, a 30-minute provider problem), you're effectively running 1-of-2 for that window, and either remaining region can open an incident on its own. That is no better than single-region monitoring.

3-of-5

Five regions, alert when three agree. The sweet spot for most production monitoring. It tolerates one bad region without collapsing to single-region behavior (with one region stuck failing, two others still have to agree), and the cost of two extra probes per check is small.
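A back-of-the-envelope comparison of the two configurations, including the "one region is stuck failing" case. This is a sketch under a strong simplifying assumption: local false failures are independent across regions with the same probability `p`, which real transit issues will not strictly obey. The value of `p` is illustrative only.

```python
from math import comb


def quorum_false_positive(p_local: float, required: int, regions: int) -> float:
    """Binomial tail: chance that `required` or more of `regions`
    independent probes report a local false failure in the same window."""
    return sum(
        comb(regions, k) * p_local**k * (1.0 - p_local) ** (regions - k)
        for k in range(required, regions + 1)
    )


p = 0.01  # assumed per-region false-failure chance per window (illustrative)

# Healthy fleets.
print("2-of-3:", quorum_false_positive(p, 2, 3))
print("3-of-5:", quorum_false_positive(p, 3, 5))

# One region stuck failing: it already contributes one vote, so the
# remaining regions only need to supply required - 1 failures.
print("2-of-3, one region down (1-of-2):", quorum_false_positive(p, 1, 2))
print("3-of-5, one region down (2-of-4):", quorum_false_positive(p, 2, 4))
```

Under these assumptions the degraded 2-of-3 case is worse than a single healthy probe, while degraded 3-of-5 still needs two regions to agree.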

2-of-3 with auto-failover to 3-of-5

Some vendors offer this: by default 2-of-3, but if the failure signal is ambiguous (some regions saying yes, some no), expand to 5 regions and wait for stronger confirmation. Best of both; more vendor logic to trust.
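One way to read that escalation logic, as a sketch only. The function name, the `probe_extra` callback, and the rule that any split vote among the first three regions counts as "ambiguous" are all assumptions for illustration; vendors that offer this may define ambiguity differently.

```python
from typing import Callable, Dict


def evaluate_with_escalation(
    primary: Dict[str, bool],
    probe_extra: Callable[[], Dict[str, bool]],
    expanded_required: int = 3,
) -> bool:
    """Return True if an incident should open.

    `primary` maps region -> check passed, for the default three regions.
    `probe_extra` is a hypothetical callback that runs the check from two
    more regions and returns their results in the same shape.
    """
    failures = sum(1 for ok in primary.values() if not ok)
    if failures == 0:
        return False  # unanimous success: nothing to do
    if failures == len(primary):
        return True  # unanimous failure: no need to escalate
    # Split vote: pull in the extra regions and apply the stricter quorum.
    combined = {**primary, **probe_extra()}
    total_failures = sum(1 for ok in combined.values() if not ok)
    return total_failures >= expanded_required
```

The extra probes only run when the first three disagree, so the common cases (everything up, everything down) stay as cheap as plain 2-of-3.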

Quorum vs latency: a real tradeoff

Quorum has a cost: detection takes slightly longer. If you check every 30 seconds and require 3 regions to confirm within a 90-second window, time-to-detect can stretch to around 90 seconds (vs ~30 seconds for single-region). That's typically an acceptable tradeoff:

  • Lost time: 60-90 seconds per real incident. Negligible compared to the typical 5-15 minute response time from page to first action.
  • Saved time: hours per month not investigating false positives. Teams typically see 80-95% reduction in false-positive volume.

If your service is critical enough that an extra 60 seconds of detection matters, you almost certainly need active-active multi-region infrastructure — and at that scale, you're running custom monitoring with synthetic checks at sub-10-second intervals anyway. For everyone else, quorum is worth it.

Picking probe regions

Geography matters less than network diversity. Five probes that all happen to use the same transit provider give you network diversity of 1. Better picks:

  • Regional diversity: US-East, US-West, EU-Central, AP-South or AP-East, and something less common (SA-East or AF-South).
  • Network diversity: confirm the probes don't all run on the same cloud provider. A monitor that runs all five probes on AWS sees the same outage when AWS has a bad day.
  • Customer-aligned diversity: if 80% of your customers are in EU and US, weight your probe placement there. The remaining probes are there to catch regional CDN/DNS issues; you don't need parity.
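The diversity check above can be made mechanical. The sketch below assumes you can label each probe with its cloud and transit provider; the field names and the provider labels (`cloud-a`, `transit-x`, and so on) are placeholders, not real inventory.

```python
from collections import Counter
from typing import Dict, List


def provider_counts(probes: List[Dict[str, str]], key: str) -> Counter:
    """Count probes per provider for the given attribute (e.g. 'cloud')."""
    return Counter(p[key] for p in probes)


def single_provider_can_trip_quorum(probes: List[Dict[str, str]],
                                    key: str, required: int) -> bool:
    """True if one provider hosts enough probes to satisfy quorum alone."""
    counts = provider_counts(probes, key)
    return max(counts.values(), default=0) >= required


# Hypothetical probe inventory.
probes = [
    {"region": "us-east", "cloud": "cloud-a", "transit": "transit-x"},
    {"region": "us-west", "cloud": "cloud-a", "transit": "transit-y"},
    {"region": "eu-central", "cloud": "cloud-b", "transit": "transit-x"},
    {"region": "ap-south", "cloud": "cloud-c", "transit": "transit-z"},
    {"region": "sa-east", "cloud": "cloud-a", "transit": "transit-y"},
]

print("distinct clouds:", len(provider_counts(probes, "cloud")))
print("one cloud can trip 3-of-5:", single_provider_can_trip_quorum(probes, "cloud", 3))
print("one transit can trip 3-of-5:", single_provider_can_trip_quorum(probes, "transit", 3))
```

If any single provider hosts N or more of your M probes, a bad day at that provider can satisfy quorum by itself, and the quorum is weaker than the region count suggests.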

What quorum doesn't fix

Quorum kills regional false positives. It doesn't kill:

  • Service-side flakiness — a slow database query that fires every 3 minutes will be seen by all 5 probes equally. Quorum confirms it; doesn't make it less real.
  • Self-healing transients — a real 60-second outage might affect all 5 probes and trigger quorum, even though no one's in pain by the time you wake up. Pair quorum with minimum-duration thresholds.
  • Wrong-target monitoring — quorum on the wrong URL is still quorum on the wrong URL. Coverage is upstream of confidence.
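Pairing quorum with a minimum-duration threshold might look like the sketch below. It assumes you already have a time-ordered series of quorum verdicts (one per evaluation); `should_page` and its parameters are hypothetical names.

```python
from typing import List, Tuple


def should_page(quorum_samples: List[Tuple[float, bool]],
                min_duration_seconds: float) -> bool:
    """Page only if quorum failure has held continuously, up to the most
    recent sample, for at least `min_duration_seconds`.

    `quorum_samples` is a list of (timestamp, quorum_failed) pairs,
    sorted by timestamp ascending.
    """
    if not quorum_samples or not quorum_samples[-1][1]:
        return False
    latest_ts = quorum_samples[-1][0]
    streak_start = latest_ts
    # Walk backwards to find where the current failing streak began.
    for ts, failed in reversed(quorum_samples):
        if not failed:
            break
        streak_start = ts
    return latest_ts - streak_start >= min_duration_seconds
```

A 60-second blip that clears before the threshold never pages anyone, at the price of adding the threshold to your time-to-detect for outages that do persist.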

Where to go from here

Check your current monitoring tool. Does it offer quorum? If yes, is it on by default for production monitors? If no, that is a real reason to consider switching tools — the false-positive reduction compounds across your whole team's relationship with alerts. Uptimera does 3-of-5 quorum on every monitor by default; see the features page for the regions and configuration.