How to monitor a REST API (with a 9-point checklist)
Most teams "monitor their API" by pinging /health every minute and calling it done. That tells you whether your process is running. It tells you nothing about whether the API still does what it promises. This is a working checklist of nine checks that turn a token health probe into actual coverage.
1. The shallow health endpoint
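Hit /health or /healthz. Assert: HTTP 200, response body contains "ok" or similar. This catches the simplest failures: process crashed, port unreachable, deploy in flight. Run it every 30 seconds from multiple regions and alert on quorum, that is, only when most regions agree the check failed, so one flaky vantage point doesn't page anyone.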
What a shallow probe should check: the process is running. What it shouldn't check: downstream dependencies. A shallow probe that's actually deep flaps every time a dependency hiccups and causes the load balancer to kill healthy instances.
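The quorum decision can be sketched as a small pure function. This is a minimal sketch, not a definitive implementation: ProbeResult, is_healthy, and quorum_failed are hypothetical names, the region labels are made up, and "quorum" is assumed here to mean a strict majority of regions.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    region: str  # where the probe ran (hypothetical region label)
    status: int  # HTTP status code, or 0 if the connection failed
    body: str    # raw response body


def is_healthy(result: ProbeResult) -> bool:
    """Shallow check only: HTTP 200 and a body that contains "ok"."""
    return result.status == 200 and "ok" in result.body.lower()


def quorum_failed(results: list[ProbeResult]) -> bool:
    """Assumption: alert only when a strict majority of regions see a failure."""
    failures = sum(1 for r in results if not is_healthy(r))
    return failures > len(results) / 2
```

With three regions, one failing probe stays quiet and two failing probes alert. Nothing here looks at dependencies, which keeps the probe shallow.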
2. The deep health endpoint
Hit /health/deep or equivalent. This endpoint should check that the API can talk to its dependencies: database, cache, queue, and any third-party APIs that block functionality. It should return a JSON breakdown showing the status of each dependency.
Important: the deep endpoint should never be wired into your load balancer's health check (that's what the shallow one is for). Deep checks are for monitoring; load balancer health checks are for routing.
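On the monitoring side, the JSON breakdown is what makes the alert actionable. A minimal sketch of reading it, assuming a hypothetical payload shape of {"dependencies": {"database": "up", "cache": "down"}} (your endpoint's field names will differ); failing_dependencies is an illustrative name:

```python
import json


def failing_dependencies(body: str) -> list[str]:
    """Return the names of dependencies not reported as "up".

    Assumes a hypothetical payload shape:
    {"dependencies": {"database": "up", "cache": "down"}}
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ["<unparseable deep-health response>"]
    deps = payload.get("dependencies") if isinstance(payload, dict) else None
    if not isinstance(deps, dict) or not deps:
        return ["<missing dependency breakdown>"]
    return sorted(name for name, state in deps.items() if state != "up")
```

An empty list means every dependency reported "up"; anything else names what to put in the alert text.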
3. The authenticated round-trip
Most APIs require auth. A health probe that doesn't test auth misses an entire failure mode (the auth subsystem broke). Create a long-lived monitoring API token, hit a protected endpoint, assert the response.
Two things to get right:
- Scope the monitor token narrowly. Read-only on a synthetic resource. Never use a production user token for monitoring.
- Rotate the monitor token through the same system as other secrets. If you rotate the auth signing key, the monitor token rotates too — automatically.
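A sketch of the authenticated probe using only the standard library. The path /v1/monitoring/canary, the Bearer scheme, and both function names are assumptions for illustration; substitute whatever protected, read-only synthetic resource your API actually has.

```python
import urllib.request


def build_auth_probe(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET against a protected synthetic resource (hypothetical path)."""
    return urllib.request.Request(
        f"{base_url}/v1/monitoring/canary",
        headers={
            "Authorization": f"Bearer {token}",  # assumes bearer-token auth
            "Accept": "application/json",
        },
        method="GET",
    )


def classify_auth_result(status: int) -> str:
    """Separate "auth is broken" from "something else is broken"."""
    if status == 200:
        return "ok"
    if status in (401, 403):
        return "auth_failure"
    return "other_failure"
```

To send it, call urllib.request.urlopen(request, timeout=5). That call raises urllib.error.HTTPError on 4xx and 5xx responses, so a caller would catch the exception and pass its code attribute to classify_auth_result. Load the token from your secret store at run time rather than embedding it in the monitor's config.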
4. The write path (idempotently)
Read-only monitoring misses bugs that only show up on writes. Create a synthetic resource, then delete it. Or update a designated "canary" resource and read back the updated value.
Patterns that work:
- PUT a known key with the current timestamp, GET it back, assert the value matches.
- POST a synthetic record, capture its ID, DELETE it. Tear down what you create.
- Use a dedicated "monitoring" tenant/account so synthetic writes don't pollute real customer data.
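The PUT-then-GET pattern above can be sketched with the HTTP calls injected as callables, so the comparison logic stays testable without a network. The function name, the default key "monitoring/canary", and the accepted status codes are assumptions for illustration.

```python
import time
from typing import Callable


def canary_write_check(
    put: Callable[[str, str], int],              # (key, value) -> HTTP status
    get: Callable[[str], tuple[int, str]],       # key -> (HTTP status, value)
    key: str = "monitoring/canary",              # hypothetical canary key
) -> list[str]:
    """PUT the current timestamp to a known key, GET it back, compare."""
    expected = str(int(time.time()))
    put_status = put(key, expected)
    if put_status not in (200, 201, 204):
        return [f"PUT returned {put_status}"]
    get_status, actual = get(key)
    if get_status != 200:
        return [f"GET returned {get_status}"]
    if actual != expected:
        return [f"read back {actual!r}, expected {expected!r}"]
    return []
```

Because the write is a PUT to a fixed key, repeating it never accumulates data, which is what makes this variant safe to run on a schedule.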
5. Latency budgets per endpoint
Don't just monitor "did it return 200." Monitor how long it took to return 200. Set a budget per endpoint based on your SLO:
- Read endpoints: typically 200-500ms P95.
- Write endpoints: 500-1500ms P95, depending on what they do.
- Long operations (export, batch): either monitor differently (queue depth, age of oldest job) or set the budget to seconds, not milliseconds.
Alert when the rolling P95 over 5 minutes breaches the budget for more than 3 windows in a row. This catches gradual degradation before it becomes an outage.
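A sketch of that alert rule. Each window is a list of latency samples in milliseconds from one 5-minute period; p95 uses the nearest-rank method (one of several percentile definitions, chosen here for simplicity), and should_alert fires only when more than max_breaches of the most recent windows breach in a row. Both function names and the parameter names are illustrative.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of one window's latency samples."""
    if not samples_ms:
        raise ValueError("window has no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_alert(
    recent_windows: list[list[float]],  # oldest first, newest last
    budget_ms: float,
    max_breaches: int = 3,
) -> bool:
    """True when more than max_breaches trailing windows exceed the budget."""
    streak = 0
    for window in reversed(recent_windows):
        if p95(window) <= budget_ms:
            break
        streak += 1
    return streak > max_breaches
```

With the default of 3, the fourth consecutive breaching window triggers the alert, and a single healthy window resets the streak.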
6. Response schema validation
A 200 OK with a malformed body breaks clients silently. Call it the 200 of Doom: the status code lies and the body is junk. If you have an OpenAPI schema, validate against it. Even a lightweight check, such as "response must be JSON with field users as an array", catches API contract regressions that an HTTP-only monitor misses entirely.
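The lightweight version of that check fits in a few lines of standard-library Python. This is a sketch of the "users must be an array" example only; check_users_response is a hypothetical name, and full OpenAPI validation would need a dedicated validator library.

```python
import json


def check_users_response(status: int, body: str) -> list[str]:
    """Return a list of contract problems; empty means the check passed."""
    problems = []
    if status != 200:
        problems.append(f"status {status}, expected 200")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return problems + ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return problems + ["top-level value is not a JSON object"]
    if not isinstance(payload.get("users"), list):
        problems.append('field "users" is missing or not an array')
    return problems
```

Note that it checks structure, not exact content, so adding a new field to the response does not break the monitor.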
7. Rate limit headers
If your API exposes X-RateLimit-Remaining or a similar header, monitor it and alert when it trends toward zero across your monitor token's requests. Two early-warning signals come out:
- A misbehaving client is hammering an endpoint. Where the limit is shared (per tenant or global), you see your own budget being consumed by someone else's traffic.
- The rate limiter is broken — the header returns nonsense or stops updating.
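Both signals can be derived from the header values seen on the monitor's last few requests. A sketch under stated assumptions: the caller knows the configured limit, "low" means under 10% remaining, and three identical readings in a row count as "stuck" (a window reset could legitimately produce that, so treat it as a hint). The function name and the warning labels are illustrative.

```python
from typing import Optional


def rate_limit_warnings(
    observed: list[Optional[str]],  # raw X-RateLimit-Remaining values, oldest first
    limit: int,                     # the configured limit for the monitor token
    low_fraction: float = 0.1,      # assumption: warn under 10% remaining
) -> list[str]:
    parsed = []
    for raw in observed:
        try:
            value = int(raw)
        except (TypeError, ValueError):
            return ["invalid"]      # header missing or not a number
        if value < 0 or value > limit:
            return ["invalid"]      # header returns nonsense
        parsed.append(value)
    warnings = []
    if parsed and parsed[-1] < limit * low_fraction:
        warnings.append("low")      # trending toward zero
    if len(parsed) >= 3 and len(set(parsed)) == 1:
        warnings.append("stuck")    # header stopped updating
    return warnings
```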
8. Pagination and large responses
A common production bug: pagination breaks under load. Monitor an endpoint that returns a paginated list. Assert that page 1 contains records, that next_cursor is present, and that fetching page 2 returns a different set.
Variant for cursor-based APIs: assert that no record appears on two consecutive pages. This catches off-by-one bugs that only show up at scale.
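A sketch of those pagination assertions over two already-parsed pages. The next_cursor field comes from the text above; the "items" list and the "id" field are assumptions about the response shape, and check_pagination is a hypothetical name.

```python
def check_pagination(page1: dict, page2: dict, id_field: str = "id") -> list[str]:
    """Compare two consecutive pages; an empty result means the check passed."""
    problems = []
    first = page1.get("items") or []   # "items" is an assumed field name
    second = page2.get("items") or []
    if not first:
        problems.append("page 1 contains no records")
    if not page1.get("next_cursor"):
        problems.append("page 1 has no next_cursor")
    # The same record must never show up on both pages.
    overlap = {r[id_field] for r in first} & {r[id_field] for r in second}
    if overlap:
        problems.append(f"records repeated across pages: {sorted(overlap)}")
    return problems
```

For this to be meaningful, the monitored list needs enough synthetic records in the monitoring tenant to span at least two pages.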
9. Error response shape
Trigger an intentional error: GET /users/00000000-bad-id should return 404 with a structured error body, not a stack trace. Assert:
- Status code is the expected error code.
- Body is valid JSON, not HTML.
- Body contains an expected error code field (e.g. error.code === "not_found").
- Body does NOT contain stack traces, internal hostnames, SQL queries, or other leakage.
Catches: regressions where an exception handler stops working and the framework starts returning HTML error pages from a JSON endpoint. It also catches something worse: information disclosure (production stack traces in user-facing responses).
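The four assertions can be sketched as one function. The leak patterns below are deliberately crude examples (a Python traceback header, SQL keywords, and a hypothetical ".internal" hostname suffix); they are assumptions to replace with signatures from your own stack, and the SQL pattern in particular can false-positive on ordinary prose.

```python
import json
import re

# Crude, illustrative leak signatures; tune these for your own stack.
LEAK_PATTERNS = {
    "python traceback": re.compile(r"Traceback \(most recent call last\)"),
    "sql query": re.compile(
        r"\b(SELECT|INSERT|UPDATE|DELETE)\b.+\b(FROM|INTO|SET)\b", re.IGNORECASE
    ),
    "internal hostname": re.compile(r"\b[\w-]+\.internal\b"),
}


def check_error_response(
    status: int,
    body: str,
    expected_status: int = 404,
    expected_code: str = "not_found",
) -> list[str]:
    problems = []
    if status != expected_status:
        problems.append(f"status {status}, expected {expected_status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        problems.append("body is not valid JSON")
    else:
        error = payload.get("error") if isinstance(payload, dict) else None
        code = error.get("code") if isinstance(error, dict) else None
        if code != expected_code:
            problems.append(f"error.code is {code!r}, expected {expected_code!r}")
    # Scan the raw body so leaks are caught even when it is not JSON.
    for label, pattern in LEAK_PATTERNS.items():
        if pattern.search(body):
            problems.append(f"possible leak: {label}")
    return problems
```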
Check frequency: not all checks deserve 30 seconds
Running every check every 30 seconds is wasteful. A sensible cadence:
- 30 seconds: shallow health, deep health.
- 1 minute: authenticated read, latency budgets.
- 5 minutes: write path, schema validation, pagination.
- 15 minutes: error shape, rate limit headers.
Increase frequency for endpoints under active investigation; reduce for endpoints that haven't failed in 90 days.
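One way to encode that cadence is a plain mapping from check name to interval, driven by a scheduler that ticks every 30 seconds. The check names and the checks_due helper are hypothetical; most monitoring tools let you set the interval per check directly, in which case only the numbers carry over.

```python
# Interval in seconds for each check, mirroring the cadence listed above.
CADENCE_SECONDS = {
    "shallow_health": 30,
    "deep_health": 30,
    "authenticated_read": 60,
    "latency_budgets": 60,
    "write_path": 300,
    "schema_validation": 300,
    "pagination": 300,
    "error_shape": 900,
    "rate_limit_headers": 900,
}


def checks_due(tick_seconds: int) -> list[str]:
    """Checks to run at a given tick, assuming ticks fire every 30 seconds."""
    return [
        name
        for name, every in CADENCE_SECONDS.items()
        if tick_seconds % every == 0
    ]
```

Raising the frequency for an endpoint under investigation is then a one-line change to its interval.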
Anti-patterns to avoid
- Monitoring against production-with-real-data. Use a dedicated monitoring tenant. Writes to a real customer account during a test will eventually go wrong.
- Asserting exact response bodies. Schema and key fields, yes. Exact JSON match, no — too brittle. The first non-breaking field addition breaks your monitors.
- Hard-coding monitor IPs in your API logic. Tempting ("skip rate limiting for monitor traffic") but creates a bypass that's easy to abuse. Use a header or token instead.
Where to go from here
Look at your current API monitoring and check off how many of the nine you actually do. Most teams do 1-3. The next two you add catch the next class of bugs you've been blind to. The cheapest wins are #5 (latency budgets) and #6 (schema validation) — both can be added without changing your API code.