What is uptime monitoring? A complete guide

Uptimera team · 9 min read

Uptime monitoring is the practice of continuously checking that a service — a website, an API, a database, a webhook receiver — is reachable and behaving the way it should. It's the smoke detector for your production systems: cheap to install, easy to ignore until you need it, and devastating to be without when something catches fire.

This guide explains what uptime monitoring actually does under the hood, what a service level agreement really promises, and how to choose a tool you won't outgrow after three months.

How uptime monitoring works

At its core, an uptime monitor does one boring thing on a schedule: send a request to a URL and check the response. A modern monitor adds layers on top of that loop:

  • Probes from multiple regions. Checks run from worker pools in different geographies so a regional network blip doesn't look like a global outage.
  • Quorum-based incident triggering. Most products only open an incident if multiple regions confirm the failure. This kills the "false positive at 3am" problem that gives on-call engineers nightmares.
  • Multiple check types. HTTP status checks are the baseline, but production monitors also do TCP port checks, SSL certificate expiry tracking, content-match checks ("the response must contain the word healthy"), and response-time thresholds.
  • Alert fan-out. When an incident is opened, the monitor pushes the alert to the right channels: email, SMS, Slack, signed webhooks for PagerDuty or Opsgenie. Routing rules decide who gets paged.
  • Public status pages. A customer-facing page that shows in real time which of your services are healthy. It is the single highest-leverage trust signal you can ship.
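To make the basic loop concrete, here is a minimal sketch of a single check that layers a status check, a content match, and a response-time threshold. It uses only the Python standard library. The names `evaluate`, `probe`, `must_contain`, and `max_elapsed_s` are made up for illustration and are not any particular product's API.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    ok: bool
    reason: str


def evaluate(status: int, body: str, elapsed_s: float,
             must_contain: Optional[str] = None,
             max_elapsed_s: Optional[float] = None) -> CheckResult:
    """Apply the layered checks to one response: status, content, latency."""
    if not 200 <= status < 300:
        return CheckResult(False, f"unexpected status {status}")
    if must_contain is not None and must_contain not in body:
        return CheckResult(False, f"body does not contain {must_contain!r}")
    if max_elapsed_s is not None and elapsed_s > max_elapsed_s:
        return CheckResult(False, f"slow response: {elapsed_s:.2f}s")
    return CheckResult(True, "ok")


def probe(url: str, timeout_s: float = 10.0, **rules) -> CheckResult:
    """Fetch the URL once and evaluate it; network errors count as down."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; the status code is still available.
        status, body = exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        return CheckResult(False, f"request failed: {exc}")
    return evaluate(status, body, time.monotonic() - start, **rules)
```

A scheduler would call something like `probe("https://example.com/health", must_contain="healthy", max_elapsed_s=2.0)` on every interval. Keeping `evaluate` separate from the network call makes the rules easy to test without touching a real endpoint.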

What you can actually monitor

The mental model most teams start with is "monitor my website". That's the table-stakes case. A few less-obvious surfaces that cause real outages when left unmonitored:

  • Login and signup endpoints. A 200 OK on the landing page tells you nothing if the actual flow new users care about is broken. Monitor the auth surface explicitly.
  • Payment and webhook receivers. The endpoints Stripe, GitHub, and your CI provider hit. If they start 500'ing, the failure is silent until invoices, deploys, or notifications dry up.
  • CDN and edge caches. A regional CDN failure shows up as "up" in single-region monitors. Multi-region uptime monitoring catches it.
  • SSL certificates. An expired cert is a self-inflicted P1. Modern monitors track expiry and warn you 30, 14, 7, and 1 day out.
  • Cron jobs and scheduled tasks. Via heartbeat monitoring: the job calls a webhook each successful run, and the monitor alerts when the heartbeat goes silent.
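Heartbeat monitoring inverts the usual direction: the job reports in, and silence is the failure signal. A rough sketch of that logic follows, assuming a hypothetical `HeartbeatMonitor` class with an expected interval and a grace period. Real tools expose this as a ping URL rather than an in-process object.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class HeartbeatMonitor:
    """Tracks the last ping from a scheduled job and flags silence."""

    def __init__(self, expected_every: timedelta, grace: timedelta):
        self.expected_every = expected_every
        self.grace = grace
        self.last_ping: Optional[datetime] = None

    def ping(self, now: datetime) -> None:
        """Called when the job reports a successful run."""
        self.last_ping = now

    def is_overdue(self, now: datetime) -> bool:
        """True once the job is later than its schedule plus the grace period."""
        if self.last_ping is None:
            # Never pinged yet; this sketch stays quiet, a real tool may not.
            return False
        return now - self.last_ping > self.expected_every + self.grace


# Usage: an hourly job with five minutes of slack.
monitor = HeartbeatMonitor(timedelta(hours=1), timedelta(minutes=5))
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
monitor.ping(start)
late = monitor.is_overdue(start + timedelta(hours=1, minutes=6))
```

The grace period is there because cron jobs rarely finish at exactly the same second each run; without it, a slightly slow run would page someone.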

What an SLA actually means

SLAs (Service Level Agreements) are usually expressed in "nines." The number of nines determines how much downtime you're allowed each month.

  • 99% uptime: ~7 hours of downtime per month allowed.
  • 99.9% (three nines): ~43 minutes per month.
  • 99.99% (four nines): ~4 minutes 22 seconds per month.
  • 99.999% (five nines): ~26 seconds per month — effectively zero.

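Each additional nine cuts the allowed downtime by a factor of 10, and is typically much harder to achieve than the previous one. Many SaaS products target around 99.9%, which is in line with what large providers such as AWS, Stripe, and GitHub are generally understood to commit to in their public SLAs, though exact figures vary by service and plan. Promising 99.99% to customers is a serious operational claim that typically requires redundant, often active/active multi-region, infrastructure, not just a faster monitor.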
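The downtime figures for each level of nines are simple arithmetic, and it helps to be able to reproduce them for your own billing period. The sketch below assumes a 30-day month, which is an assumption on our part; published figures differ by a few seconds depending on the month length used.

```python
def allowed_downtime_seconds(sla_percent: float, period_days: float = 30.0) -> float:
    """Downtime budget, in seconds, for an SLA over a period of the given length."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (100 - sla_percent) / 100


for sla in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_seconds(sla)
    print(f"{sla}% -> {budget / 60:.1f} minutes per 30-day month")
```

For a 30-day month this works out to roughly 7.2 hours at 99%, 43.2 minutes at 99.9%, about 4.3 minutes at 99.99%, and about 26 seconds at 99.999%.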

How to pick an uptime monitoring tool

The bar isn't high to start; it's high not to outgrow. The questions that matter when evaluating tools:

1. How often does it check?

Free plans on most tools check every 5 minutes. That's fine for a side project; for production, you want 30-second to 1-minute intervals. A 5-minute interval means an outage runs for up to 5 minutes before you're even notified — and another minute or two before anyone wakes up. That's a customer-noticing outage every time.
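The detection delay can be put into numbers. This is a back-of-the-envelope sketch, assuming the outage starts right after a successful check and that a hypothetical `confirmations` setting requires that many consecutive failed checks before an incident opens. Real tools may retry faster than the regular interval.

```python
def worst_case_detection_seconds(check_interval_s: float,
                                 request_timeout_s: float = 0.0,
                                 confirmations: int = 1) -> float:
    """Upper bound on the time from failure start to the incident being opened.

    Assumes the outage begins just after a passing check, and that
    `confirmations` consecutive failed checks are required before alerting.
    """
    if confirmations < 1:
        raise ValueError("confirmations must be at least 1")
    # Each required failure costs one full interval; the last one may also
    # take up to the request timeout to be declared failed.
    return check_interval_s * confirmations + request_timeout_s


five_minute_plan = worst_case_detection_seconds(300)   # 300 seconds
thirty_second_plan = worst_case_detection_seconds(30)  # 30 seconds
```

The gap between those two results, before any human has even looked at a phone, is the practical argument for shorter intervals in production.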

2. Where does it check from?

Multi-region matters, but the quorum logic matters more. Look for tools that explicitly require N-of-M regions to confirm before opening an incident. Without that, you'll get false alerts whenever one of your provider's probe regions has a transit issue.
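An N-of-M rule is small enough to write down. The sketch below is illustrative only: `should_open_incident` and `min_failures` are hypothetical names, and a real product would also weigh things like how recent each region's result is.

```python
from typing import Mapping


def should_open_incident(region_ok: Mapping[str, bool], min_failures: int) -> bool:
    """Open an incident only if at least `min_failures` regions saw a failure."""
    if min_failures < 1:
        raise ValueError("min_failures must be at least 1")
    failures = sum(1 for ok in region_ok.values() if not ok)
    return failures >= min_failures


# One region with a transit problem does not page anyone under a 2-of-3 rule.
single_blip = should_open_incident(
    {"us-east": True, "eu-west": False, "ap-south": True}, min_failures=2
)
# Two regions agreeing on the failure does.
real_outage = should_open_incident(
    {"us-east": False, "eu-west": False, "ap-south": True}, min_failures=2
)
```

Setting `min_failures` to 1 reduces this to a plain multi-region monitor, which is exactly the configuration that produces the false alerts described above.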

3. How does it alert?

Email is the baseline. SMS, Slack, and signed webhooks are non-negotiable for production. If you already pay for PagerDuty or Opsgenie, the monitor should send signed webhooks to them — not re-implement on-call rotations badly.
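"Signed" here usually means the sender attaches a signature computed from the request body and a shared secret, so the receiver can reject forged alerts. The sketch below shows a generic HMAC-SHA256 scheme. It is not the format used by PagerDuty, Opsgenie, or any specific monitor; header names and encodings vary by tool, so check your vendor's documentation.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


# Hypothetical alert payload and secret, for illustration only.
secret = b"shared-secret-from-settings"
body = json.dumps({"monitor": "login", "status": "down"}).encode()
signature = sign(secret, body)

accepted = verify(secret, body, signature)
rejected = verify(secret, body + b" ", signature)
```

The receiver should verify against the raw bytes it received, not a re-serialized copy of the JSON, since any change in whitespace or key order produces a different signature.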

4. Does it have public status pages?

Branded, themeable, on a subdomain or your own custom domain. A status page is the single highest-leverage public artifact your reliability practice produces.

5. Does it have an API?

Eventually you will want to create monitors during deploys, pull results into your own dashboards, or sync the monitor list with infrastructure-as-code. If the API is anemic, that day is going to be painful.
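Syncing monitors with infrastructure-as-code mostly comes down to diffing a desired list against what already exists. Here is a minimal sketch of that planning step, with a hypothetical `plan_sync` helper. The actual create and delete calls are left out on purpose, because they depend entirely on the tool's API.

```python
from typing import Iterable, Set, Tuple


def plan_sync(desired: Iterable[str], existing: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (urls_to_create, urls_to_delete) to make existing match desired."""
    desired_set, existing_set = set(desired), set(existing)
    return desired_set - existing_set, existing_set - desired_set


# Hypothetical URLs: the desired list would live in your repository,
# the existing list would come from the monitoring tool's API.
to_create, to_delete = plan_sync(
    desired=["https://example.com/login", "https://example.com/api/health"],
    existing=["https://example.com/login", "https://example.com/old-page"],
)
```

An API that can list, create, and delete monitors is enough to support this pattern; one that only supports reads is not.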

Where to go from here

Two practical next steps. First: write down the 5 most important URLs/endpoints in your product. Login, signup, the main API, your webhook receiver, your status page itself. Those are the first monitors to create. Second: decide who gets paged when each one fails — and which channel that page lands in. Most production outages are either "nobody noticed" or "the wrong person got paged." Both are solved before you write a line of code.

Uptimera's free plan gives you 5 monitors with multi-region checks and email/Slack alerts — enough to do all of the above in about ten minutes.