Monitoring a SaaS application: the 12-endpoint starter kit

Uptimera team · 10 min read

Most SaaS teams set up uptime monitoring with one check on their homepage and feel covered. They aren't. A homepage that returns 200 tells you almost nothing about whether anyone can actually use the product. This is a concrete starter kit: twelve endpoints every SaaS application should monitor on day one, with the failure mode each one catches. Most teams can stand up the first four to six in an afternoon.

Why twelve, not five or fifty

Five isn't enough, because you'll miss whole classes of failures. Fifty is overkill, because you'll spend more time tuning monitors than fixing bugs. Twelve is roughly the right surface area for a SaaS team of up to about 50 engineers: complete enough to catch real outages, small enough to maintain.

The twelve

1. The homepage / marketing site

Catches: total outage, DNS misconfiguration, edge/CDN failures. The most basic check, and the one you actually need least — most other things would fail first. Include it anyway; it's the canonical "is anything up" signal.

2. The signup endpoint

Catches: broken signup flow, captcha provider outage, email validation service down. New-user acquisition stops when this breaks, and you usually don't find out until a day of signups is missing from your data. Use a synthetic check that submits to the signup endpoint with a known monitor-only email pattern.
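
One way to build a monitor-only email pattern is sketched below. The `monitor+<id>` prefix and the `monitoring.example.com` domain are assumptions for illustration, not a convention your signup flow already has; any pattern works as long as it is unique per run and easy to recognize later.

```python
import re
import uuid

# Hypothetical convention: every synthetic signup uses this prefix and domain.
MONITOR_DOMAIN = "monitoring.example.com"
MONITOR_EMAIL_RE = re.compile(r"monitor\+[0-9a-f]{32}@monitoring\.example\.com")


def new_monitor_email() -> str:
    """Return a unique address that matches the monitor-only pattern."""
    return f"monitor+{uuid.uuid4().hex}@{MONITOR_DOMAIN}"


def is_monitor_email(address: str) -> bool:
    """True for addresses created by the synthetic signup check."""
    return MONITOR_EMAIL_RE.fullmatch(address.lower()) is not None
```

A recognizable pattern lets you exclude synthetic signups from analytics and purge them on a schedule with a filter like `is_monitor_email`.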

3. The login endpoint

Catches: auth provider outages, broken session creation, password hashing failures. Run an authenticated check with a dedicated monitoring account, and assert that the response contains the expected session token or redirect.
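
The assertion itself can be small. This sketch assumes a login endpoint that either returns JSON with a `session_token` field or redirects to `/dashboard`; both names are hypothetical, so substitute whatever your login response actually looks like.

```python
from typing import Mapping, Optional


def login_succeeded(
    status: int,
    headers: Mapping[str, str],
    body: Optional[dict],
    expected_redirect: str = "/dashboard",  # hypothetical post-login path
) -> bool:
    """Decide whether a login response looks like a successful sign-in."""
    # Header names are case-insensitive, so normalize before looking up Location.
    location = {k.lower(): v for k, v in headers.items()}.get("location", "")

    # Token-style login: 200 with a non-empty session token in the JSON body.
    if status == 200 and body and body.get("session_token"):
        return True

    # Redirect-style login: a redirect to the expected post-login page.
    if status in (302, 303) and location.startswith(expected_redirect):
        return True

    return False
```

Checking the redirect target matters because a failed login often redirects too, just back to the login page.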

4. The OAuth callback URL

Catches: social login failures, provider API changes, callback URL mismatches in production after a config change. This is often the silent breakage nobody notices, because most users already have remembered sessions. Use a synthetic check that completes a full OAuth round-trip against a test app.

5. The main dashboard / app shell

Catches: frontend bundle errors, broken API loading, missing assets. Run an authenticated check that loads the post-login page and asserts the presence of a specific element, such as the user's name or a known nav item.
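
A minimal sketch of the element assertion, assuming the elements you care about are present in the server-rendered HTML and carry stable `id` attributes (the ids in the usage note are made up). If the shell is rendered entirely client-side, a plain HTTP fetch won't see these elements and you'd need a browser-based check instead.

```python
from html.parser import HTMLParser
from typing import Iterable, Set


class _IdCollector(HTMLParser):
    """Collects every id attribute seen in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_elements(html: str, required_ids: Iterable[str]) -> Set[str]:
    """Return the required ids that do not appear in the page."""
    collector = _IdCollector()
    collector.feed(html)
    collector.close()
    return set(required_ids) - collector.ids
```

For example, `missing_elements(page_html, {"main-nav", "user-name"})` returning an empty set means the check passes; anything else names exactly which element disappeared.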

6. The primary API read endpoint

Catches: API server outages, database read failures, broken auth on the API. Most SaaS products have a "list resources" endpoint that's the first thing any client calls. Monitor it.

7. The primary API write endpoint

Catches: writes failing while the read replica still works, broken validation logic, locked tables, full disk. Hit a write endpoint, create a synthetic record, then delete it. Tear-down matters.
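
The create-then-delete shape is worth getting right, so here is a rough sketch. The `api` object, its `create`/`fetch`/`delete` methods, and the record fields are all hypothetical stand-ins for your own client; the read-back step is an addition beyond plain create-and-delete, included because it also catches writes that are accepted but not persisted.

```python
import itertools
from typing import Dict

SYNTHETIC_NAME = "uptime-synthetic-record"  # hypothetical marker value


class InMemoryApi:
    """Stand-in for a real API client, used only to exercise the sketch."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}
        self._ids = itertools.count(1)

    def create(self, payload: dict) -> str:
        record_id = f"rec-{next(self._ids)}"
        self.records[record_id] = dict(payload)
        return record_id

    def fetch(self, record_id: str) -> dict:
        return self.records[record_id]

    def delete(self, record_id: str) -> None:
        self.records.pop(record_id, None)


def run_write_check(api) -> bool:
    """Create a synthetic record, read it back, and always delete it."""
    record_id = api.create({"name": SYNTHETIC_NAME, "synthetic": True})
    try:
        # Reading back catches writes that were accepted but never persisted.
        return api.fetch(record_id).get("name") == SYNTHETIC_NAME
    finally:
        # Tear-down runs whether the read-back passed, failed, or raised.
        api.delete(record_id)
```

Putting the delete in `finally` is the point: a failed or crashing check is exactly when leftover records tend to accumulate.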

8. The payment webhook receiver

Catches: broken signature verification, queue backups, dropped payment events. Hit your own webhook URL with a synthetic signed payload and assert the receiver processed it. See our webhook monitoring guide for the deeper version.
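
To build the synthetic signed payload, the monitor needs to sign the body the same way your provider does. The sketch below assumes an HMAC-SHA256 signature over `"<timestamp>.<body>"` with a shared secret and a five-minute replay tolerance; that scheme is an assumption for illustration, so match the exact signing format and header names from your provider's documentation.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes, timestamp: int) -> str:
    """Return a hex HMAC-SHA256 signature over "<timestamp>.<body>"."""
    signed = str(timestamp).encode() + b"." + body
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify_signature(
    secret: bytes,
    body: bytes,
    timestamp: int,
    signature: str,
    now: int,
    tolerance_seconds: int = 300,  # assumed replay window
) -> bool:
    """Mirror of what a receiver would do: reject stale or mis-signed payloads."""
    if abs(now - timestamp) > tolerance_seconds:
        return False
    expected = sign_payload(secret, body, timestamp)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

The monitor calls `sign_payload` with a test-mode secret, posts the body and signature to your webhook URL, and then asserts on whatever evidence of processing your receiver exposes.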

9. The status page itself

Catches: the irony of your status page going down during an outage. Monitor your status page from a different region/provider than your app, so the failure modes don't correlate. See status page best practices.

10. The SSL certificate on the main domain

Catches: expired or broken certificates, chain issues, hostname mismatches. Continuous monitoring is critical because auto-renewal failures are silent until the cert expires. See our SSL monitoring guide.
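
If you want to see what such a check does under the hood, here is a rough standard-library sketch. The 14-day warning threshold is an assumption; pick whatever lead time your renewal process needs. With the default context, chain problems and hostname mismatches surface as an `ssl.SSLCertVerificationError` during the handshake, before the expiry date is even read.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect with full verification and return the cert's notAfter string."""
    context = ssl.create_default_context()  # verifies chain and hostname
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before the certificate expires (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def should_warn(not_after: str, warn_days: float = 14, now: Optional[float] = None) -> bool:
    """True when the certificate is inside the warning window."""
    return days_until_expiry(not_after, now) < warn_days
```

In a monitor you would call `should_warn(fetch_not_after("app.example.com"))`, where the hostname is a placeholder for your main domain.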

11. DNS records for the main domain

Catches: nameserver outages, propagation issues, accidental record changes, registrar problems. Check that A and AAAA records resolve to the expected IPs from multiple resolvers. See our DNS monitoring guide.
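
The comparison half of that check can be sketched as follows. Querying specific resolvers needs a DNS client library and is left out here; the function only assumes you already have each resolver's answers as strings. The resolver names and addresses in the usage note are placeholders from documentation ranges.

```python
import ipaddress
from typing import Dict, Iterable, Set


def normalize(addresses: Iterable[str]) -> Set[str]:
    """Canonicalize IPs so equivalent IPv6 spellings compare equal."""
    return {str(ipaddress.ip_address(a)) for a in addresses}


def dns_mismatches(
    answers_by_resolver: Dict[str, Iterable[str]],
    expected: Iterable[str],
) -> Dict[str, Dict[str, Set[str]]]:
    """Return, per resolver, which expected IPs are missing or unexpected."""
    want = normalize(expected)
    problems: Dict[str, Dict[str, Set[str]]] = {}
    for resolver, answers in answers_by_resolver.items():
        got = normalize(answers)
        if got != want:
            problems[resolver] = {"missing": want - got, "unexpected": got - want}
    return problems
```

An empty result means every resolver agrees with the expected A and AAAA records. A mismatch on only one resolver usually points at propagation or caching; a mismatch everywhere points at the records themselves.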

12. A heartbeat from the primary background worker

Catches: workers crashed, queue depth growing, scheduled jobs not running. Have your worker ping a heartbeat URL after each successful job cycle. Alert when the heartbeat stops. See our heartbeat monitoring guide.
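
On the worker side this is only a few lines. The heartbeat URL below is hypothetical, and the staleness rule (expected interval plus a grace period) is one reasonable assumption about how the alerting side decides a heartbeat has stopped.

```python
import urllib.request
from typing import Callable

# Hypothetical heartbeat URL; use the one your monitoring tool gives you.
HEARTBEAT_URL = "https://heartbeat.example.com/ping/primary-worker"


def send_heartbeat(url: str = HEARTBEAT_URL, timeout: float = 5.0) -> bool:
    """Ping the heartbeat URL; never let a monitoring failure crash the worker."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def run_cycle(job: Callable[[], None], ping: Callable[[], bool] = send_heartbeat) -> bool:
    """Run one job cycle and ping only if it finished without raising."""
    job()
    return ping()


def heartbeat_is_stale(last_ping_at: float, now: float,
                       expected_interval: float, grace: float) -> bool:
    """Alerting-side rule: stale once interval plus grace has elapsed."""
    return now - last_ping_at > expected_interval + grace
```

Pinging after the job rather than before is deliberate: a worker that starts cycles but never finishes them should look dead to the monitor.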

Configuration for each

Reasonable defaults for the twelve:

  • Check interval: 30 seconds for items 1-7; 1-5 minutes for items 8-12 (lower frequency is fine because each check is more expensive and less time-sensitive).
  • Regions: 5 probe regions with 3-of-5 quorum for items 1-7. A single region is fine for items 8-12.
  • Timeouts: 10 seconds for HTTP checks; 30 seconds for OAuth and write-path checks (they're inherently slower).
  • Alerts: page on items 1-9, Slack on items 10-12 (cert and DNS issues have warning windows; you don't need a 3am page).
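
If you keep monitor configuration in code, the defaults above could be encoded roughly like this. The field names are made up for the sketch, 60 seconds is picked from the 1-5 minute range for items 8-12, and the quorum rule is read as "declare the target down only when at least 3 of 5 regions report a failure".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MonitorDefaults:
    interval_seconds: int
    regions: int
    quorum: int
    timeout_seconds: int
    alert: str


def defaults_for(item: int) -> MonitorDefaults:
    """Defaults for one of the twelve items, numbered as in the list above."""
    if not 1 <= item <= 12:
        raise ValueError("item must be between 1 and 12")
    core = item <= 7
    return MonitorDefaults(
        interval_seconds=30 if core else 60,  # 60s chosen from the 1-5 minute range
        regions=5 if core else 1,
        quorum=3 if core else 1,
        timeout_seconds=30 if item in (4, 7) else 10,  # OAuth and write path
        alert="page" if item <= 9 else "slack",
    )


def is_down(region_failed: Sequence[bool], quorum: int) -> bool:
    """True when at least `quorum` regions reported a failure."""
    return sum(region_failed) >= quorum
```

Requiring a quorum keeps a single flaky probe region from paging anyone, at the cost of missing outages that really are limited to one or two regions.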

When to add more (and what to add first)

Once the twelve are running and not noisy, the natural additions are:

  • More webhook receivers (GitHub, Twilio, anything else important).
  • Tenant-specific health endpoints if you have a multi-region or multi-cluster deployment.
  • Browser-based synthetic checks for multi-step flows (signup → onboarding, checkout flow).
  • Performance budgets on the same endpoints — not just "200 OK" but "200 OK within 500ms."
  • Per-feature monitors as you launch new product surfaces. Each new major feature gets its own canary endpoint.
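
A performance budget is just a second condition on the same check. In this sketch `fetch` is a hypothetical callable that performs the request and returns the HTTP status code; the clock is injectable so the logic can be tested without real waiting.

```python
import time
from typing import Callable, Tuple


def check_with_budget(
    fetch: Callable[[], int],
    budget_ms: float = 500.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bool, float]:
    """Pass only if fetch() returns 200 and finishes within the budget."""
    started = clock()
    status = fetch()
    elapsed_ms = (clock() - started) * 1000
    return (status == 200 and elapsed_ms <= budget_ms), elapsed_ms
```

Returning the elapsed time alongside the verdict lets you chart latency over time instead of only learning about it when the budget is blown.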

Anti-patterns we see often

  • Monitoring only the homepage. Your homepage can be up while your product is unusable. Half the items above will catch failures the homepage check misses entirely.
  • Using a real user's credentials for the monitoring account. Eventually that user's password rotates or their account gets disabled, and the monitor breaks at the worst time. Create a dedicated synthetic account.
  • Forgetting tear-down on write checks. A synthetic write that creates records forever pollutes data, increases costs, and sometimes triggers your own rate limits. Always delete what you create.
  • Skipping the SSL/DNS monitors because "they don't change often." They don't — until they do, catastrophically, and silently.

Where to go from here

Print this list. Check off the ones you have. The unchecked ones are your starter set for the rest of the week. Most teams find that adding 4-6 monitors from this list takes a single afternoon and catches the next major outage they would have otherwise missed.

Uptimera's free plan covers 5 monitors with multi-region quorum, SSL, and Slack alerts — enough to do items 1, 2, 3, 6, and 10 above. Most users upgrade once they hit the limit, which is usually within the first week.