Webhook monitoring: the silent-failure surface no one checks
Your webhook receiver is the API endpoint your business depends on most and your monitoring covers least. Stripe sends you a payment, GitHub sends you a deploy, Twilio sends you an inbound SMS — and if your receiver returns a 500, those vendors retry for a while, then give up. The customer never knew anything went wrong. You never knew anything was missing. This post is about treating webhook receivers as the production-critical infrastructure they are.
Why webhook failures stay hidden
Three structural reasons:
- No client to complain. When your homepage 500s, a user sees a sad page and emails support. When your webhook 500s, Stripe sees a 500, logs it, retries with exponential backoff for up to three days (in live mode), and stops. Nobody on your side is involved.
- Retry behavior masks the problem. Most providers retry with exponential backoff. A transient blip is invisible because attempt 2 succeeded. A sustained outage looks like a slow trickle of late events, not a flat failure.
- The receiver may be returning 200 even when broken. Common bug: catch all exceptions, return 200, log the error. Webhooks "succeed" from the provider's view, but the work never happened. This is the deepest version of the trap, because the provider never retries and its delivery log looks healthy.
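To make the last point concrete, here is a minimal sketch of a handler that logs the error and still surfaces it to the provider. The function name, the `process_event` callback, and the choice of 400 for unparseable bodies are assumptions for illustration, not any particular framework's API.

```python
import json
import logging

logger = logging.getLogger("webhooks")


def handle_webhook(raw_body, process_event):
    """Return the HTTP status code the receiver should send back."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        # Malformed payload: a retry will not fix it, so reject it outright.
        return 400
    try:
        process_event(event)
    except Exception:
        # Log the failure AND report it, so the provider retries and its
        # delivery log shows the error instead of a misleading 200.
        logger.exception("webhook processing failed")
        return 500
    return 200
```

The only difference from the buggy version is the `return 500` in the second `except` branch; the logging stays.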
What to monitor
1. Synthetic webhook delivery from outside
Most providers offer a sandbox or test event mechanism. Set up a scheduled synthetic event (e.g. via the provider's "send test webhook" API) once per hour. Assert your receiver processed it: a row written to a known "monitor" table, a value updated in a known cache key.
If the provider doesn't support synthetic events, the next best thing is to monitor the public webhook URL directly with a signed test payload. Your receiver should recognize the monitor's signature and treat it as a no-op (no real side effects).
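A minimal sketch of the signed-test-payload approach, assuming a dedicated monitor secret and a made-up `X-Monitor-Signature` header and `monitor.ping` event type. The `heartbeat` dict stands in for the "monitor" table or cache key. All of these names are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical names: a dedicated secret and header used only by the monitor.
MONITOR_SECRET = b"replace-with-a-dedicated-monitor-secret"
MONITOR_HEADER = "X-Monitor-Signature"

# Stand-in for the "monitor" table or cache key the check reads back.
heartbeat = {"last_seen": 0.0}


def build_monitor_request(now):
    """Build the body and headers the external monitor would POST."""
    body = json.dumps({"type": "monitor.ping", "sent_at": int(now)}).encode()
    signature = hmac.new(MONITOR_SECRET, body, hashlib.sha256).hexdigest()
    return body, {MONITOR_HEADER: signature}


def handle_if_monitor_ping(raw_body, headers, now):
    """Record a heartbeat and return True if this is a valid monitor ping."""
    supplied = headers.get(MONITOR_HEADER, "")
    expected = hmac.new(MONITOR_SECRET, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return False  # not a monitor ping: fall through to normal handling
    heartbeat["last_seen"] = now  # the only side effect of a ping
    return True


def round_trip_ok(now, max_age_seconds=120):
    """The monitor's assertion: the receiver saw a ping recently."""
    return now - heartbeat["last_seen"] <= max_age_seconds
```

The monitor sends the request, waits briefly, then calls something equivalent to `round_trip_ok` against the shared store. A ping that fails the signature check is handled like any other webhook rather than silently dropped.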
2. Volume anomaly detection
You receive ~200 Stripe webhooks per hour. Suddenly you're receiving 20. That's the symptom. Causes range from legitimate (it's a holiday weekend) to catastrophic (your signature verification broke and you're returning 401 to all of Stripe).
Alert when volume drops below X% of the 24-hour-trailing average for > Y minutes. Tune X and Y per provider — Stripe payment events vs GitHub push events have wildly different baselines.
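One way to read that rule is sketched below. It assumes you already store per-minute event counts, oldest first, and it compares the average rate over the last Y minutes against the average over the preceding 24 hours. The function and parameter names are illustrative.

```python
def volume_dropped(counts_per_minute, threshold_pct, window_minutes,
                   baseline_minutes=24 * 60):
    """True if the recent rate is below threshold_pct of the trailing baseline.

    counts_per_minute: per-minute event counts, oldest first.
    threshold_pct:     X in the rule above (e.g. 20 for "below 20%").
    window_minutes:    Y in the rule above.
    """
    needed = window_minutes + baseline_minutes
    if len(counts_per_minute) < needed:
        return False  # not enough history to judge

    recent = counts_per_minute[-window_minutes:]
    baseline = counts_per_minute[-needed:-window_minutes]

    baseline_rate = sum(baseline) / baseline_minutes
    if baseline_rate == 0:
        return False  # nothing to compare against
    recent_rate = sum(recent) / window_minutes
    return recent_rate < baseline_rate * threshold_pct / 100
```

Averaging over the whole window, rather than testing each minute, keeps low-volume sources from alerting on a single quiet minute. A 24-hour average also ignores time-of-day patterns, so a same-hour-last-week baseline may suit some providers better.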
3. Receiver response codes
Most webhook providers expose a dashboard or API showing recent delivery attempts: status codes, retry counts, payloads. Monitor these. If 5% of Stripe webhook deliveries are returning 5xx for the last hour, you have an outage you haven't noticed yet.
Stripe, GitHub, Twilio, and most reputable providers expose webhook delivery logs in a dashboard, and many also expose them via API. A nightly cron that pulls the last 24 hours of attempts and alerts on anything > 1% failure rate is the cheapest version of this.
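The alerting half of that cron can be sketched as follows. Fetching the attempts is provider-specific and omitted here. The input is assumed to be a list of HTTP status codes, with `None` standing for an attempt that got no response at all, which is a convention of this sketch rather than of any provider.

```python
def failure_rate(status_codes):
    """Fraction of delivery attempts that failed with 5xx or no response."""
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if code is None or code >= 500)
    return failed / len(status_codes)


def should_alert(status_codes, max_failure_rate=0.01):
    """True when more than 1% (by default) of attempts failed."""
    return failure_rate(status_codes) > max_failure_rate
```

Counting only 5xx and timeouts follows the text above. Whether 4xx responses belong in the same bucket is a judgment call per provider.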
4. Time from receipt to processing
If your webhook receiver enqueues for async processing (common pattern), monitor the lag between "received" and "processed." If the queue is backing up, you're not processing in time — and providers will retry, leading to duplicate processing if your dedupe logic is anything less than perfect.
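A simple lag signal is the age of the oldest event that has been received but not yet processed. The sketch below assumes you can read the "received" timestamps of pending events as epoch seconds. The multiplier of 3 mirrors the alert threshold suggested later in this post, and the names are illustrative.

```python
def oldest_pending_age(pending_received_at, now):
    """Age in seconds of the oldest unprocessed event, or 0 if the queue is empty."""
    return max((now - received for received in pending_received_at), default=0)


def queue_is_lagging(pending_received_at, now, expected_processing_seconds,
                     multiplier=3.0):
    """True when the oldest pending event has waited longer than expected."""
    age = oldest_pending_age(pending_received_at, now)
    return age > expected_processing_seconds * multiplier
```

Measuring the oldest pending item, not the average, catches a single stuck event as well as a general backlog.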
Signature verification: get it right
Most providers sign webhook payloads with an HMAC of the request body, keyed with a shared secret. Verifying the signature is essential to prevent attackers from sending fake events. Three failure modes we see:
- Verifying the parsed body instead of the raw body. JSON re-serialization can change key order, whitespace, or escaping, so the recomputed HMAC will not match. Always verify against the raw request body before any parsing.
- Constant-time comparison. Use a library function (crypto.timingSafeEqual in Node, hmac.compare_digest in Python). String === leaks timing information attackers can exploit.
- Tolerance window for timestamps. Providers include a timestamp in the signature to prevent replay attacks. Reject events > 5 minutes old, but make sure your server's clock is in sync, or you'll reject everything.
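The three points fit in a few lines. This is a sketch of a generic scheme, assuming the provider signs `"<timestamp>.<raw body>"` with HMAC-SHA256 and sends the hex digest. Real providers differ in header format and signed-payload layout, so prefer the provider's official library where one exists.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # reject events more than 5 minutes old


def sign(secret, timestamp, raw_body):
    """Hex HMAC-SHA256 over b"<timestamp>.<raw_body>" (assumed scheme)."""
    signed_payload = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()


def verify(secret, timestamp, signature, raw_body, now=None):
    """Check freshness, then compare signatures in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old (or clock skew): possible replay
    expected = sign(secret, timestamp, raw_body)
    # compare_digest avoids the timing leak of == on strings.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

`raw_body` must be the bytes exactly as received. Passing a re-serialized copy of the parsed JSON fails verification even when the content is logically identical.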
Idempotency: not optional
Every webhook provider retries. Your receiver must handle the same event being delivered more than once without doing the work twice. The pattern:
- Extract the provider's idempotency key from the payload (Stripe's event.id, GitHub's X-GitHub-Delivery header).
- Check a dedupe table: have we processed this key already? If yes, return 200 immediately without doing the work.
- If no, do the work in a transaction that includes inserting the key into the dedupe table. Both succeed together or both roll back.
- Garbage-collect old dedupe keys (provider retry windows are finite — 30 days for Stripe is enough).
Alerting strategy
Page-worthy webhook alerts:
- Receiver 5xx rate > 1% sustained for 5 minutes. Real customers depend on these.
- Synthetic webhook didn't complete the round-trip for 2 consecutive runs.
- Payment webhook volume drops > 80% for 10+ minutes. Billing data integrity issue — escalate.
Lower-priority alerts (Slack, business hours):
- 4xx rate spike (you probably broke something yourself, such as signature verification after a secret rotation or a deploy).
- Queue lag > expected processing time × 3.
- Volume anomaly on non-critical webhooks (analytics, etc).
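The two tiers can be written down as a small routing function. This sketch assumes the metrics are already aggregated over the windows named above (5 minutes for 5xx, 10 minutes for payment volume). The field names are hypothetical, and the thresholds are the ones from the lists.

```python
from dataclasses import dataclass


@dataclass
class WebhookMetrics:
    error_rate_5xx_5m: float            # fraction of 5xx, sustained over 5 minutes
    synthetic_failures_in_a_row: int    # consecutive failed synthetic runs
    payment_volume_drop_10m: float      # fractional drop vs. baseline over 10+ minutes
    error_rate_4xx_spike: bool
    queue_lag_seconds: float
    expected_processing_seconds: float
    noncritical_volume_anomaly: bool = False


def classify(m):
    """Return "page", "notify" (Slack, business hours), or "ok"."""
    if (
        m.error_rate_5xx_5m > 0.01
        or m.synthetic_failures_in_a_row >= 2
        or m.payment_volume_drop_10m > 0.80
    ):
        return "page"
    if (
        m.error_rate_4xx_spike
        or m.queue_lag_seconds > 3 * m.expected_processing_seconds
        or m.noncritical_volume_anomaly
    ):
        return "notify"
    return "ok"
```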
Recovery: replaying missed events
When you have an outage and the provider stops retrying, you need a way to backfill. Two paths:
- The provider's replay API. Stripe, GitHub, and others let you list recent events or deliveries and re-deliver the ones you missed, though retention is limited, so don't wait. Have a runbook with the API call ready before you need it at 3am.
- A reconciliation job. A daily cron that pulls the provider's authoritative state and compares to yours. If their list of payments yesterday doesn't match yours, fill the gaps. Worth building before your first major outage.
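The core of a reconciliation job is a set comparison. This sketch assumes you can list object IDs (for example, yesterday's payment IDs) from both the provider and your own database. Fetching them and filling the gaps are provider-specific and left out, and the function name is illustrative.

```python
def find_gaps(provider_ids, local_ids):
    """Compare the provider's authoritative IDs with ours.

    Returns (missing, unexpected):
      missing    - IDs the provider has that we never recorded (backfill these)
      unexpected - IDs we have that the provider does not list (investigate)
    """
    provider, local = set(provider_ids), set(local_ids)
    return sorted(provider - local), sorted(local - provider)
```

For example, `find_gaps(["pi_1", "pi_2", "pi_3"], ["pi_1", "pi_3"])` returns `(["pi_2"], [])`, meaning one payment needs backfilling.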
Where to go from here
List every webhook source you depend on: Stripe, GitHub, Twilio, internal services. For each one, how would you know within 10 minutes if it stopped working? If the answer for any of them is "I wouldn't," that's the next monitor to add. The 30 minutes it takes to set up a synthetic webhook pay for themselves the first time it catches a broken receiver you would otherwise have missed.