Cron job and heartbeat monitoring: catching silent failures

Uptimera team · 8 min read

Cron jobs and background workers fail silently more often than any other part of your infrastructure. A web request that 500s shows up in the logs and probably triggers an alert. A cron job that didn't run at 2am didn't do anything — including sending you a failure notification. The absence of a successful run is the loudest silence in your stack, and it's also the easiest to miss.

Why silent failures are uniquely dangerous

A failed HTTP request is loud — at minimum it returns an error to the user, who may complain. A failed cron job is silent — by the time anyone notices the backup wasn't taken, it's been three weeks of missed backups. Common categories of jobs that fail this way:

  • Database backups. The backup didn't run, but you only find out when you need to restore.
  • Billing and invoicing. The monthly invoice batch silently skipped 12 customers; you find out at quarter-end reconciliation.
  • Email digests. Customers didn't get their weekly digest. Some notice; most don't, but engagement quietly drops.
  • Data syncs. The sync to your data warehouse paused. Reports look fine until someone runs a YoY query and the numbers stop making sense.
  • Cleanup tasks. Old data isn't being purged. Disk usage climbs. Three months later, a database fills up at 2am.

Heartbeat monitoring in one paragraph

The job phones home. You give your cron a URL; the cron pings that URL each time it completes successfully. The monitoring system tracks the time since the last ping. If too long passes without one, you get alerted that the job didn't run. The pattern is sometimes called "dead man's switch" monitoring — the alert fires when the heartbeat stops, not when an error happens.

Implementing it: the simplest version

Conceptually, all you need is a URL per job and a curl at the end of each successful run:

# in your crontab (cron does not support backslash line continuation, so the whole entry stays on one line)
0 2 * * * /usr/local/bin/backup.sh && curl -fsS -m 10 --retry 3 https://monitor.example.com/ping/abc-123 > /dev/null

The && is the critical part: curl only runs if the backup succeeded (exit code 0). Otherwise the ping doesn't fire and the alert eventually triggers.

Refinements to add as you mature:

  • Ping at start and end. Hit /start/abc-123 when the job begins and /ping/abc-123 when it finishes. The monitor can then tell you that the job started but didn't finish, which is a different incident from "never started."
  • Ping on failure with a code. Hit /fail/abc-123 in your error path. Now you get three signals: success, failure, silence. Each one is a different alert priority.
  • Include exit code or duration. Most heartbeat services accept a payload — exit code, runtime, log output. Lets you trend duration over time and catch jobs that are getting slower.
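The three refinements above can be combined in a small wrapper. This is a minimal sketch, not a definitive implementation: the function names (send_ping, run_with_heartbeat) are hypothetical, the /start, /ping and /fail paths follow the example URLs in this post, and sending the payload as a POST body is an assumption that your monitoring service may not share.

```shell
#!/bin/sh
# Hypothetical heartbeat wrapper. Endpoint layout and payload format are
# assumptions based on the example URLs in this post.
PING_BASE="${PING_BASE:-https://monitor.example.com}"

# send_ping <start|ping|fail> <check-id> [payload]
# A monitoring outage must never fail the job, so curl errors are ignored.
send_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null \
    --data-raw "${3:-}" "$PING_BASE/$1/$2" || true
}

# run_with_heartbeat <check-id> <command> [args...]
# Pings at start, then reports success or failure with exit code and duration.
run_with_heartbeat() {
  hb_id="$1"; shift
  send_ping start "$hb_id"
  hb_begin=$(date +%s)
  hb_status=0
  "$@" || hb_status=$?
  hb_elapsed=$(( $(date +%s) - hb_begin ))
  if [ "$hb_status" -eq 0 ]; then
    send_ping ping "$hb_id" "exit=0 duration=${hb_elapsed}s"
  else
    send_ping fail "$hb_id" "exit=$hb_status duration=${hb_elapsed}s"
  fi
  return "$hb_status"
}

# Usage, e.g. from a script that cron calls:
#   run_with_heartbeat abc-123 /usr/local/bin/backup.sh
```

The wrapper returns the job's own exit status, so anything else that inspects the result (cron mail, a parent script) still sees the real outcome.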

Grace periods: the parameter most teams get wrong

Every heartbeat monitor has a schedule ("expect this every 24h") and a grace period ("don't alert until N minutes past due"). Two failure modes from getting it wrong:

  • Grace too short: A job that normally finishes at 2:00am but occasionally runs until 2:35am pages you every other week, even though nothing is actually broken.
  • Grace too long: A daily backup with a 12-hour grace period means you don't hear about a missing backup until evening. The window of risk extended itself by half a day.

A reasonable rule: grace period = 50% of the expected interval between runs, capped at 1 hour. Daily jobs get 60 minutes of grace; hourly jobs get 15-30 minutes; jobs that run every few minutes get 60-90 seconds.
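As a worked example, here is the rule as arithmetic, reading it as 50% of the interval between runs (the reading that matches the daily and hourly figures above). The function name grace_seconds is hypothetical.

```shell
#!/bin/sh
# grace_seconds <interval-seconds>
# Half the interval between runs, capped at one hour (3600 s).
grace_seconds() {
  gs_half=$(( $1 / 2 ))
  if [ "$gs_half" -gt 3600 ]; then
    gs_half=3600
  fi
  echo "$gs_half"
}

grace_seconds 86400   # daily job             -> 3600 (60 minutes, hits the cap)
grace_seconds 3600    # hourly job            -> 1800 (30 minutes)
grace_seconds 120     # runs every 2 minutes  -> 60
```

Treat the result as an upper bound: if a job's finish time barely varies, a shorter grace period gets you the alert sooner.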

Modeling the schedule

Heartbeat services accept two ways to express expected cadence:

  • Interval mode: "expect a ping every N seconds." Simpler; doesn't care about wall-clock alignment. Right for jobs that run every X.
  • Cron expression mode: "expect a ping at 0 2 * * *." Lets the monitor understand the actual schedule. Right for jobs that run at specific times. Required if you want to alert on "the 2am job didn't run" rather than "it's been 25 hours since the last run."
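To make the difference concrete, this is roughly what a monitor evaluates in interval mode. It is a sketch with a hypothetical function name (is_overdue), not any particular service's logic. The deadline is measured from the last ping, which is why interval mode can only say "it's been too long" rather than "the 2am run is missing"; cron expression mode would derive the deadline from the schedule instead.

```shell
#!/bin/sh
# is_overdue <last-ping-epoch> <now-epoch> <interval-seconds> <grace-seconds>
# Succeeds (exit 0) once interval + grace has elapsed since the last ping.
is_overdue() {
  [ "$2" -gt $(( $1 + $3 + $4 )) ]
}

# Daily job (86400 s) with one hour of grace, last ping at epoch 0.
# The deadline is 90000 s (25 hours), so one second later the job is overdue.
if is_overdue 0 90001 86400 3600; then
  echo "overdue"
else
  echo "on time"
fi
```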

Heartbeats from queue workers (BullMQ, Sidekiq, etc.)

Background workers consuming from queues need slightly different treatment. The job doesn't fail to start; it fails to process. Pattern:

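  • Heartbeat on job completion. Each time the worker finishes a job, ping. If pings stop while jobs are still waiting in the queue, the worker hung or crashed; if the queue is empty, the silence just means there was nothing to do.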
  • Monitor queue depth separately. Heartbeats tell you workers are alive; queue depth tells you they're keeping up. Both matter.
  • Track oldest pending job age. If your queue depth is healthy but a specific job has been pending for 6 hours, something is wrong with that job class.
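The three signals can be folded into one status check. The sketch below is illustrative only: the function name queue_health and all thresholds are hypothetical (the 6-hour pending age echoes the example above), and it assumes you can already obtain the three numbers from your queue system. It also assumes a silent worker counts as stalled only when work is waiting.

```shell
#!/bin/sh
# Hypothetical thresholds; tune them per queue.
MAX_HEARTBEAT_AGE="${MAX_HEARTBEAT_AGE:-300}"    # seconds without a completion ping
MAX_DEPTH="${MAX_DEPTH:-1000}"                   # jobs waiting
MAX_PENDING_AGE="${MAX_PENDING_AGE:-21600}"      # 6 hours, in seconds

# queue_health <seconds-since-last-heartbeat> <queue-depth> <oldest-pending-age-seconds>
queue_health() {
  if [ "$1" -gt "$MAX_HEARTBEAT_AGE" ] && [ "$2" -gt 0 ]; then
    echo "worker-stalled"     # work is waiting but nothing completes
  elif [ "$1" -gt "$MAX_HEARTBEAT_AGE" ]; then
    echo "idle"               # no pings, but also nothing to process
  elif [ "$2" -gt "$MAX_DEPTH" ]; then
    echo "falling-behind"     # workers alive but not keeping up
  elif [ "$3" -gt "$MAX_PENDING_AGE" ]; then
    echo "stuck-job"          # depth is fine, one job has waited too long
  else
    echo "ok"
  fi
}

queue_health 600 5 10
queue_health 10 20 30000
```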

Alerting hygiene

  • Page on the second consecutive miss, not the first. A daily backup that misses one run is concerning; missing two in a row is an actual incident.
  • Separate channels for missed-start vs missed-completion. A job that never started is one bug; a job that started but hung is another. Different runbooks, different urgency.
  • For non-critical batch jobs, route to an engineering queue rather than to on-call. A failed analytics export at 3am doesn't need to wake anyone up; it needs a ticket for the next morning.
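A routing rule along these lines might look like the sketch below. The function name alert_action and the action labels are hypothetical, and sending a non-paging notification on the first miss of a critical job is an assumption added here, since a single miss is described above as concerning but not yet an incident.

```shell
#!/bin/sh
# alert_action <consecutive-misses> <critical: yes|no>
# Prints the action to take: none, ticket, notify, or page.
alert_action() {
  if [ "$1" -le 0 ]; then
    echo "none"
  elif [ "$2" != "yes" ]; then
    echo "ticket"     # non-critical: handle it the next morning
  elif [ "$1" -ge 2 ]; then
    echo "page"       # second consecutive miss on a critical job
  else
    echo "notify"     # first miss: visible, but nobody gets woken up
  fi
}

alert_action 1 yes
alert_action 2 yes
alert_action 3 no
```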

Anti-patterns we see often

  • Logging success and calling it done. "The cron logged 'completed' — I'll know if it stops logging." You won't. Log analysis is a separate problem; logs going quiet is not an alertable event in most stacks.
  • Heartbeats inside the retry loop. If your job retries internally and pings on each attempt, a job that retries 5 times and ultimately fails will look like 5 successful runs. Ping at the outer level, after retries have terminated.
  • Ignoring partial success. A job that processes 100 records, succeeds on 95, and silently drops 5 looks like success at the heartbeat level. Pair heartbeats with success-count assertions for batch jobs.
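The last two anti-patterns have a simple structural fix: keep the retry loop free of pings, and decide which signal to send only once the final outcome and record counts are known. The following is a sketch under assumptions. The helper names (signal, retry, report_batch) are hypothetical, the /ping and /fail paths follow the example URLs in this post, and backoff between attempts is left out for brevity.

```shell
#!/bin/sh
PING_BASE="${PING_BASE:-https://monitor.example.com}"

# signal <ping|fail> <check-id> [detail]   (endpoint layout is an assumption)
signal() {
  curl -fsS -m 10 --retry 3 -o /dev/null \
    --data-raw "${3:-}" "$PING_BASE/$1/$2" || true
}

# retry <max-attempts> <command> [args...]
# Returns 0 on the first success, otherwise the last exit status.
# Deliberately sends no heartbeat: attempts are not runs.
retry() {
  rt_max="$1"; shift
  rt_n=1
  while :; do
    rt_status=0
    "$@" || rt_status=$?
    [ "$rt_status" -eq 0 ] && return 0
    [ "$rt_n" -ge "$rt_max" ] && return "$rt_status"
    rt_n=$(( rt_n + 1 ))
  done
}

# report_batch <check-id> <records-succeeded> <records-expected>
# Sends the success ping only when every expected record was processed.
report_batch() {
  if [ "$2" -eq "$3" ]; then
    signal ping "$1"
  else
    signal fail "$1" "succeeded=$2 expected=$3"
  fi
}

# Usage with a hypothetical job script and counts gathered by that job:
#   if retry 5 /usr/local/bin/sync.sh; then
#     report_batch abc-123 "$ok_count" "$total_count"
#   else
#     signal fail abc-123 "retries exhausted"
#   fi
```

With this shape, a job that needs five attempts produces at most one success ping, and a run that drops 5 of 100 records reports a failure instead of looking healthy.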

Where to go from here

List your cron jobs and background workers. For each: is it monitored? Would you know in under an hour if it stopped? Adding heartbeats to your three most important silent jobs (typically backups, billing, and security log shipping) is usually a half-day of work and can prevent at least one career-shaping outage.