MTTR, MTBF, MTTD, MTTF: incident metrics that actually matter

Uptimera team · 8 min read

Reliability acronyms have a way of multiplying. MTTR, MTBF, MTTD, MTTF, MTTA, MTTI — at some point every team gets asked which numbers they track, and the honest answer is usually "we've heard of them." This post is the short, opinionated version: what each metric actually measures, the formula in one line, and whether it's worth the effort to track.

The four you'll actually be asked about

Almost every conversation about incident metrics narrows down to four:

  • MTTD — Mean Time To Detect. From the moment something breaks to the moment your monitoring notices.
  • MTTA — Mean Time To Acknowledge. From detection to a human pressing the "acknowledged" button on the page.
  • MTTR — Mean Time To Recover. From the start of the incident to full service restoration. Sometimes also expanded as Mean Time To Repair or Resolve — the three are usually used interchangeably, but watch out: a tool that measures "repair" (until the fix ships) gives a different number than one that measures "recovery" (until users are happy again).
  • MTBF — Mean Time Between Failures. The average uptime between incidents. A high MTBF means rare outages; it doesn't say anything about how bad each one is.

The formulas, in one line each

These all assume you have a list of incidents over some time window (a quarter is typical):

  • MTTD = sum of (detected − started) / number of incidents
  • MTTA = sum of (acknowledged − detected) / number of incidents
  • MTTR = sum of (resolved − started) / number of incidents
  • MTBF = total uptime in the period / number of incidents
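As a rough illustration, the four formulas above can be computed from a list of incident timestamps. This is a minimal sketch, not any tool's implementation: the Incident record, its field names, and the 30-day window are assumptions made for the example, and "total uptime" is taken to be the window length minus the summed incident durations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; the field names are assumptions,
    # not the schema of any particular incident tracker.
    started: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def mean_delta(deltas):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents, window_start, window_end):
    """Return MTTD, MTTA, MTTR and MTBF as timedeltas for one window."""
    if not incidents:
        raise ValueError("no incidents in the window: the means are undefined")
    # Assumption: uptime = window length minus total incident duration.
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return {
        "MTTD": mean_delta([i.detected - i.started for i in incidents]),
        "MTTA": mean_delta([i.acknowledged - i.detected for i in incidents]),
        "MTTR": mean_delta([i.resolved - i.started for i in incidents]),
        "MTBF": uptime / len(incidents),
    }


# Made-up example data: two incidents in a 30-day window.
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=30)
first = window_start + timedelta(days=1)
second = window_start + timedelta(days=10)
incidents = [
    Incident(first, first + timedelta(minutes=2),
             first + timedelta(minutes=5), first + timedelta(minutes=30)),
    Incident(second, second + timedelta(minutes=4),
             second + timedelta(minutes=9), second + timedelta(minutes=60)),
]

metrics = incident_metrics(incidents, window_start, window_end)
for name, value in metrics.items():
    print(name, value)
```

With this sample data, MTTD comes out at 3 minutes, MTTA at 4 minutes, and MTTR at 45 minutes.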

Two warnings about the formulas. First: "mean" is a brittle statistic for incidents because incident durations are heavily right-skewed — one 12-hour outage will dominate three months of 15-minute ones. Most mature teams report the median (P50) and 95th percentile (P95) alongside the mean. Second: dividing by "number of incidents" over a quarter implies you have enough incidents for the average to be meaningful. If you had two outages last quarter, you don't have an MTTR — you have two data points.
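To make the skew concrete, here is a small sketch comparing the mean, the median, and P95 for a made-up quarter: eleven 15-minute incidents plus one 12-hour outage. The sample data is an assumption, and the percentile helper uses the nearest-rank method, which is one of several common definitions; monitoring tools may interpolate instead.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value that has at least
    p percent of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up durations in minutes: eleven short incidents, one 12-hour outage.
durations_min = [15] * 11 + [720]

mean = statistics.mean(durations_min)
median = statistics.median(durations_min)
p95 = percentile(durations_min, 95)

print(f"mean={mean} median={median} p95={p95}")
```

Here the mean is 73.75 minutes even though eleven of the twelve incidents lasted 15 minutes, which is what the median reports. With only twelve samples, P95 is simply the worst incident (720 minutes), which is another reminder that small incident counts limit what any of these statistics can say.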

A word on MTTF (Mean Time To Failure)

MTTF is the manufacturing counterpart of MTBF, used for components that don't get repaired: they fail once and are replaced. In software you'll almost never need it, because SaaS services don't have a "final" failure; they have outages followed by recoveries. If a vendor or an interviewer asks about MTTF in a software context, they probably mean MTBF and you can answer accordingly.

Which metrics are actually worth tracking

Pragmatically: MTTD and MTTR are the two that respond to effort. They're where engineering investment shows up. MTBF is largely a function of how complex your system is and how often you ship, which makes it useful for trend-watching but not for improvement targets.

MTTD: where monitoring earns its keep

If your MTTD is high (10+ minutes), you're either checking too infrequently, alerting on the wrong things, or both. The biggest improvements come from cheap changes: shorter check intervals (5-minute → 30-second), monitoring the surfaces customers actually use (the signup flow, not just the homepage), and requiring a multi-region quorum before an alert fires, so that false positives drop and the team stops dismissing alerts as flaky.
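The multi-region quorum idea can be sketched in a few lines. This is an illustration only: the region names, the should_alert helper, and the default quorum of two failing regions are assumptions, not the behaviour of any specific monitoring product.

```python
def should_alert(region_results, quorum=2):
    """Decide whether to page based on per-region check results.

    region_results maps a region name to True if the check passed there.
    An alert fires only when at least `quorum` regions report a failure,
    so a single flaky probe location does not page anyone.
    """
    failures = [region for region, ok in region_results.items() if not ok]
    return len(failures) >= quorum


# One region failing alone is treated as probe noise.
single_failure = should_alert({"us-east": False, "eu-west": True, "ap-south": True})

# Two regions failing at once is treated as a real outage.
double_failure = should_alert({"us-east": False, "eu-west": False, "ap-south": True})

print(single_failure, double_failure)
```

The trade-off is that a quorum can hide a genuinely regional outage, so the threshold is worth choosing per check rather than globally.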

MTTR: where culture and tooling meet

Once an incident has been detected, the rest of MTTR has two stages: getting the right person looking, and letting that person fix it. The first is escalation and on-call discipline (covered in our on-call setup post). The second is runbooks, observability, and the ability to roll back deploys. Teams with high MTTR almost always have one of three problems: nobody knows what changed, nobody knows where to look, or the rollback path is theoretical.

Reasonable targets for a small team

Rough numbers for a B2B SaaS with one or two on-call engineers:

  • MTTD: < 2 minutes for customer-facing surfaces, < 10 minutes for background jobs.
  • MTTA: < 5 minutes during business hours, < 10 minutes overnight.
  • MTTR: P50 under 30 minutes, P95 under 4 hours. Anything past 4 hours is usually a dependency failure outside your control or a database problem requiring restore.
  • MTBF: Treat as a lagging indicator. Use it to validate that reliability work is helping, not as a target.

A simple monthly reliability report

Most teams over-engineer this. A two-page monthly report is plenty:

  • Total uptime % vs your stated SLO (see our SLA vs SLO vs SLI guide).
  • Number of incidents by severity (SEV1/SEV2/SEV3).
  • MTTD and MTTR — median and P95 — month over month.
  • Top three contributors to downtime, with a one-line summary each.
  • One callout: what changed this month, what we're changing next.
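Several of the report lines above are simple arithmetic over the month's incident list. The sketch below shows one way to compute uptime against an SLO, incident counts by severity, and the top three downtime contributors. The dictionary keys, the sample incidents, and the choice to count every incident's duration as downtime are assumptions for the example; many teams exclude low-severity incidents from uptime.

```python
from collections import Counter


def monthly_summary(incidents, minutes_in_month, slo_percent):
    """Summarise one month of incidents.

    incidents is a list of dicts with the hypothetical keys
    'name', 'severity' and 'downtime_min'.
    """
    downtime = sum(i["downtime_min"] for i in incidents)
    uptime_percent = 100 * (minutes_in_month - downtime) / minutes_in_month
    by_severity = Counter(i["severity"] for i in incidents)
    top = sorted(incidents, key=lambda i: i["downtime_min"], reverse=True)[:3]
    return {
        "uptime_percent": round(uptime_percent, 3),
        "slo_met": uptime_percent >= slo_percent,
        "incidents_by_severity": dict(by_severity),
        "top_contributors": [i["name"] for i in top],
    }


# Made-up incidents for a 30-day month (43,200 minutes) and a 99.9% SLO.
incidents = [
    {"name": "db-failover", "severity": "SEV1", "downtime_min": 40},
    {"name": "cdn-errors", "severity": "SEV2", "downtime_min": 12},
    {"name": "cron-stall", "severity": "SEV3", "downtime_min": 5},
    {"name": "dns-blip", "severity": "SEV2", "downtime_min": 3},
]

summary = monthly_summary(incidents, minutes_in_month=30 * 24 * 60, slo_percent=99.9)
print(summary)
```

In this example 60 minutes of downtime gives roughly 99.861% uptime, which misses a 99.9% SLO. The median and P95 figures for MTTD and MTTR would be added alongside these numbers.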

Where to go from here

If you don't have any of these numbers yet, start with two: track MTTD via your monitoring tool (most expose this in their alert log) and MTTR via your incident tracker. Three months in, you'll have enough data to spot trends — and you'll know which way to push them.