
How to Monitor a REST API: 9-Point Checklist

Uptimera team | May 6, 2026 | 10 min read | Updated June 30, 2026

Most teams "monitor their API" by pinging /health every minute and calling it done. That tells you whether your process is running. It tells you nothing about whether the API still does what it promises. This is a working checklist of nine checks that turn a token health probe into actual coverage.

1. The shallow health endpoint

Hit /health or /healthz. Assert: HTTP 200, response body contains "ok" or similar. This catches the simplest failures: process crashed, port unreachable, deploy in flight. Run it every 30 seconds from multiple regions, and alert only when a quorum of regions agrees the endpoint is failing, so a network blip in one region doesn't page anyone. (For the layer this operates at, see TCP, ICMP, and HTTP checks.)

What a shallow probe should check: the process is running. What it shouldn't check: downstream dependencies. A shallow probe that's actually deep flaps every time a dependency hiccups and causes the load balancer to kill healthy instances.
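The quorum rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the ProbeResult type, the region names, and the default quorum of 2 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    region: str   # hypothetical region label, e.g. "us-east"
    status: int   # HTTP status code; 0 if the connection failed
    body: str     # raw response body


def probe_ok(result: ProbeResult) -> bool:
    """Shallow check: HTTP 200 and a body that contains "ok" (case-insensitive)."""
    return result.status == 200 and "ok" in result.body.lower()


def is_down(results: list[ProbeResult], quorum: int = 2) -> bool:
    """Declare the endpoint down only if at least `quorum` regions failed."""
    failures = sum(1 for r in results if not probe_ok(r))
    return failures >= quorum
```

With three regions and a quorum of 2, one failing region is treated as noise and two failing regions raise an alert.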

2. The deep health endpoint

Hit /health/deep or equivalent. This endpoint should check that the API can talk to its dependencies: database, cache, queue, and any third-party APIs whose failure blocks functionality. It should return a JSON breakdown of each dependency's status, so a failure names the dependency that is down.

Important: the deep endpoint should never be wired into your load balancer's health check (that's what the shallow one is for). Deep checks are for monitoring; load balancer health checks are for routing.
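On the monitoring side, the deep check mostly comes down to reading that JSON breakdown. The sketch below assumes a response shaped like {"dependencies": {"database": "up", "cache": "down"}}; that shape and the "up" value are illustrative assumptions, so adapt the keys to whatever your endpoint actually returns.

```python
import json


def failing_dependencies(body: str) -> list[str]:
    """Return the names of dependencies not reported as "up".

    Assumes a hypothetical body like:
    {"dependencies": {"database": "up", "cache": "down"}}
    Raises ValueError (json.JSONDecodeError) if the body is not JSON.
    """
    report = json.loads(body)
    dependencies = report.get("dependencies", {})
    return sorted(name for name, state in dependencies.items() if state != "up")
```

Alerting on the returned names, rather than on a bare pass/fail, puts the broken dependency in the alert itself.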

3. The authenticated round-trip

Most APIs require auth. A health probe that doesn't test auth misses an entire failure mode (the auth subsystem broke). Create a long-lived monitoring API token, hit a protected endpoint, assert the response.

Two things to get right:

Scope the monitor token narrowly. Read-only on a synthetic resource. Never use a production user token for monitoring.
Rotate the monitor token through the same system as other secrets. If you rotate the auth signing key, the monitor token rotates too — automatically.
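A minimal sketch of the authenticated round-trip, assuming the API accepts the token as a Bearer credential in the Authorization header (your scheme may differ). The function names are made up for this example. Separating 401/403 from other failures lets the alert say "auth is broken" rather than just "something failed".

```python
import urllib.request


def build_auth_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the monitor token (Bearer scheme assumed)."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def classify_status(status: int) -> str:
    """Map the response status of a protected endpoint to a monitor outcome."""
    if status == 200:
        return "ok"
    if status in (401, 403):
        # The endpoint answered but rejected the token: auth subsystem,
        # token scope, or a missed rotation.
        return "auth_failure"
    return "other_failure"
```

In a real monitor the request would be sent with urllib.request.urlopen, with the token read from your secret store rather than hard-coded.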

4. The write path (idempotently)

Read-only monitoring misses bugs that only show up on writes. Create a synthetic resource, then delete it. Or update a designated "canary" resource and read back the updated value.

Patterns that work:

PUT a known key with the current timestamp, GET it back, assert the value matches.
POST a synthetic record, capture its ID, DELETE it. Tear down what you create.
Use a dedicated "monitoring" tenant/account so synthetic writes don't pollute real customer data.
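The PUT-then-GET pattern above can be sketched against a small client interface. Everything named here is hypothetical: KeyValueApi stands in for your own API client, InMemoryApi is a fake used only to make the example runnable, and "monitoring/canary" is an arbitrary key in the monitoring tenant.

```python
import time
from typing import Optional, Protocol


class KeyValueApi(Protocol):
    """Hypothetical client interface; replace with your real API client."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class InMemoryApi:
    """Stand-in for the real API so the sketch can run without a network."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._store[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)


def canary_round_trip(api: KeyValueApi, key: str = "monitoring/canary") -> bool:
    """Write the current timestamp to a fixed key and confirm it reads back."""
    stamp = str(time.time_ns())
    api.put(key, stamp)
    return api.get(key) == stamp
```

Because the check always overwrites the same key, rerunning it is safe and leaves nothing to clean up.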

5. Latency budgets per endpoint

Don't just monitor "did it return 200." Monitor how long it took to return 200. Set a budget per endpoint based on your SLO:

Read endpoints: typically 200-500ms P95.
Write endpoints: 500-1500ms P95, depending on what they do.
Long operations (export, batch): either monitor differently (queue depth, age of oldest job) or set the budget to seconds, not milliseconds.

Alert when the rolling P95 over 5 minutes breaches the budget for more than 3 windows in a row. This catches gradual degradation before it becomes an outage.
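The alert rule above is easy to get subtly wrong, so here is a minimal sketch. It uses the nearest-rank definition of P95, which is one of several common percentile definitions, and it reads "more than 3 windows in a row" as a streak of at least 4 breaching windows ending at the most recent one. The function names are illustrative.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples."""
    if not samples_ms:
        raise ValueError("no samples in window")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(window_p95s: list[float], budget_ms: float, max_breaches: int = 3) -> bool:
    """True when the latest consecutive breaching windows number more than `max_breaches`."""
    streak = 0
    for value in reversed(window_p95s):
        if value <= budget_ms:
            break
        streak += 1
    return streak > max_breaches
```

Note that a window holding only a handful of samples makes P95 equal to the maximum, so windows fed by a once-a-minute check are noisy. Feeding the windows from real request latencies gives a steadier signal.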

6. Response schema validation

A 200 OK with a malformed body breaks clients silently. This is the 200 Of Doom: the status code lies and the body is junk (see our HTTP status codes field guide for the full taxonomy). If you have an OpenAPI schema, validate against it. Even a lightweight check, such as "response must be JSON with field users as an array", catches API contract regressions that an HTTP-only monitor misses entirely.
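The lightweight check quoted above fits in a few lines of standard-library Python. This is a sketch of that one assertion, not a substitute for full OpenAPI validation; the function name is made up for the example.

```python
import json


def check_users_body(body: str) -> list[str]:
    """Return a list of contract problems; an empty list means the body passed."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    if not isinstance(data.get("users"), list):
        return ["field 'users' is missing or not an array"]
    return []
```

Extra fields are ignored on purpose, so a non-breaking field addition does not fail the monitor.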

7. Rate limit headers

If your API exposes X-RateLimit-Remaining or a similar header, record it on every monitor request and alert when the value trends toward zero for your monitor token. Two early-warning signals come out:

A misbehaving client is hammering an endpoint — you see your own rate limit being consumed.
The rate limiter is broken — the header returns nonsense or stops updating.
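A minimal sketch of classifying the header on each monitor response. The low-water mark of 10 is an arbitrary assumption, and this only covers the "nonsense" half of a broken limiter; detecting a value that has stopped updating needs a comparison across successive responses, which is left out here.

```python
def rate_limit_signal(headers: dict[str, str], low_water_mark: int = 10) -> str:
    """Classify X-RateLimit-Remaining as "ok", "low", "missing", or "nonsense"."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-ratelimit-remaining")
    if raw is None:
        return "missing"
    try:
        remaining = int(raw)
    except ValueError:
        return "nonsense"
    if remaining < 0:
        return "nonsense"
    return "low" if remaining <= low_water_mark else "ok"
```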

8. Pagination and large responses

A common production bug: pagination breaks under load. Monitor an endpoint that returns a paginated list. Assert that page 1 contains records, that next_cursor is present, and that fetching page 2 returns a different set.

Variant for cursor-based APIs: assert that no record appears on two consecutive pages. This catches off-by-one bugs that only show up at scale.
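The pagination assertions can be sketched as one function over two already-fetched pages. The response shape is an assumption for illustration: each page is taken to be a dict with an "items" list of records that carry an "id", plus the next_cursor field mentioned above.

```python
def check_pages(page1: dict, page2: dict) -> list[str]:
    """Return pagination problems found across two consecutive pages."""
    problems = []
    items1 = page1.get("items", [])
    items2 = page2.get("items", [])
    if not items1:
        problems.append("page 1 is empty")
    if not page1.get("next_cursor"):
        problems.append("page 1 has no next_cursor")
    # A record on both pages means the cursor boundary is off.
    overlap = {item["id"] for item in items1} & {item["id"] for item in items2}
    if overlap:
        problems.append(f"records on both pages: {sorted(overlap)}")
    return problems
```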

9. Error response shape

Trigger an intentional error: GET /users/00000000-bad-id should return 404 with a structured error body, not a stack trace. Assert:

Status code is the expected error code.
Body is valid JSON, not HTML.
Body contains an expected error code field (e.g. error.code === "not_found").
Body does NOT contain stack traces, internal hostnames, SQL queries, or other leakage.

This catches regressions where an exception handler stops working and the framework starts returning HTML error pages from a JSON endpoint. More importantly, it catches information disclosure: production stack traces in user-facing responses.
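The four assertions can be combined into one check. This sketch assumes the error body looks like {"error": {"code": "not_found", ...}}, matching the error.code example above. The leak patterns are rough heuristics chosen for illustration, so expect to tune them for your stack.

```python
import json
import re

# Heuristic leak detectors (illustrative, not exhaustive).
LEAK_PATTERNS = {
    "python traceback": re.compile(r"Traceback \(most recent call last\)"),
    "java stack frame": re.compile(r"\bat [\w.$]+\([\w.]+:\d+\)"),
    "sql statement": re.compile(
        r"\b(select|insert|update|delete)\b.+\b(from|into|set)\b", re.IGNORECASE
    ),
}


def check_error_response(
    status: int, body: str, expected_status: int = 404, expected_code: str = "not_found"
) -> list[str]:
    """Return problems with an intentionally triggered error response."""
    problems = []
    if status != expected_status:
        problems.append(f"status {status}, expected {expected_status}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return problems + ["body is not valid JSON"]
    error = data.get("error") if isinstance(data, dict) else None
    code = error.get("code") if isinstance(error, dict) else None
    if code != expected_code:
        problems.append(f"error code {code!r}, expected {expected_code!r}")
    for label, pattern in LEAK_PATTERNS.items():
        if pattern.search(body):
            problems.append(f"possible leak: {label}")
    return problems
```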

Check frequency: not all checks deserve 30 seconds

Running every check every 30 seconds is wasteful. A sensible cadence:

30 seconds: shallow health, deep health.
1 minute: authenticated read, latency budgets.
5 minutes: write path, schema validation, pagination.
15 minutes: error shape, rate limit headers.

Increase frequency for endpoints under active investigation; reduce for endpoints that haven't failed in 90 days.
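The cadence above can be written down as a small table that a scheduler loop consults. The check names are labels invented for this sketch, and the function assumes it is called on a tick that advances in whole multiples of 30 seconds.

```python
# Interval in seconds for each check, taken from the cadence list above.
CADENCE_SECONDS = {
    "shallow_health": 30,
    "deep_health": 30,
    "authenticated_read": 60,
    "latency_budgets": 60,
    "write_path": 300,
    "schema_validation": 300,
    "pagination": 300,
    "error_shape": 900,
    "rate_limit_headers": 900,
}


def due_checks(elapsed_seconds: int) -> list[str]:
    """Return the checks due at this tick (elapsed time since the scheduler started)."""
    return [
        name
        for name, every in CADENCE_SECONDS.items()
        if elapsed_seconds % every == 0
    ]
```

Raising or lowering a check's frequency is then a one-line change to the table.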

Anti-patterns to avoid

Monitoring against production-with-real-data. Use a dedicated monitoring tenant. Writes to a real customer account during a test will eventually go wrong.
Asserting exact response bodies. Schema and key fields, yes. Exact JSON match, no — too brittle. The first non-breaking field addition breaks your monitors.
Hard-coding monitor IPs in your API logic. Tempting ("skip rate limiting for monitor traffic") but creates a bypass that's easy to abuse. Use a header or token instead, and make sure it is authenticated: an unauthenticated marker header is just another bypass.

Where to go from here

Look at your current API monitoring and check off how many of the nine you actually do. Most teams do 1-3. The next two you add catch the next class of bugs you've been blind to. The cheapest wins are #5 (latency budgets) and #6 (schema validation) — both can be added without changing your API code. If your API also emits webhooks, monitor them with the same rigor — that's a separate checklist worth reading next.
