MTTR, MTBF, MTTD & MTTF: Incident Metrics

Uptimera team · May 20, 2026 · 8 min read · Updated June 30, 2026

Reliability acronyms have a way of multiplying. MTTR, MTBF, MTTD, MTTF, MTTA, MTTI — at some point every team gets asked which numbers they track, and the honest answer is usually "we've heard of them." This post is the short, opinionated version: what each metric actually measures, the formula in one line, and whether it's worth the effort to track.

The four you'll actually be asked about

Almost every conversation about incident metrics narrows down to four:

MTTD — Mean Time To Detect. From the moment something breaks to the moment your monitoring notices.
MTTA — Mean Time To Acknowledge. From detection to a human pressing the "acknowledged" button on the page. It's the metric most often confused with MTTR — see MTTA vs MTTR for the distinction.
MTTR — Mean Time To Recover. From the start of the incident to full service restoration. The R is also expanded as Repair or Resolve, and the three are usually used interchangeably. Watch out, though: a tool that measures "repair" (until the fix ships) gives a different number than one that measures "recovery" (until the service works for users again).
MTBF — Mean Time Between Failures. The average uptime between incidents. A high MTBF means rare outages; it doesn't say anything about how bad each one is.

Figure: how MTTD, MTTA, and MTTR line up across a single incident's timeline. MTTR spans the whole event, from the moment the incident starts to full recovery.

The formulas, in one line each

These all assume you have a list of incidents over some time window (a quarter is typical):

MTTD = sum of (detected − started) / number of incidents
MTTA = sum of (acknowledged − detected) / number of incidents
MTTR = sum of (resolved − started) / number of incidents
MTBF = total uptime in the period / number of incidents
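As a rough sketch, the four formulas translate almost line for line into code. The `Incident` record, its field names, and the sample timestamps are assumptions made up for illustration, not any particular tracker's schema. Uptime is taken here as the window length minus total incident duration, which is one reasonable reading of "total uptime in the period".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for this sketch.
    started: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def incident_metrics(incidents, window_start, window_end):
    """Return MTTD, MTTA, MTTR and MTBF in minutes for one reporting window."""
    if not incidents:
        raise ValueError("no incidents in the window, so the means are undefined")
    # Assumption: uptime = window length minus the time spent in incidents.
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return {
        "mttd": mean_minutes([i.detected - i.started for i in incidents]),
        "mtta": mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "mttr": mean_minutes([i.resolved - i.started for i in incidents]),
        "mtbf": uptime.total_seconds() / 60 / len(incidents),
    }


# Made-up sample data: two incidents in a 30-day window.
t0 = datetime(2026, 4, 1)
sample = [
    Incident(
        started=t0 + timedelta(days=3),
        detected=t0 + timedelta(days=3, minutes=2),
        acknowledged=t0 + timedelta(days=3, minutes=5),
        resolved=t0 + timedelta(days=3, minutes=20),
    ),
    Incident(
        started=t0 + timedelta(days=17),
        detected=t0 + timedelta(days=17, minutes=4),
        acknowledged=t0 + timedelta(days=17, minutes=11),
        resolved=t0 + timedelta(days=17, minutes=40),
    ),
]
metrics = incident_metrics(sample, t0, t0 + timedelta(days=30))
```

With this sample, MTTD comes out at 3 minutes, MTTA at 5, MTTR at 30, and MTBF at 21,570 minutes (just under 15 days).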

Two warnings about the formulas. First: "mean" is a brittle statistic for incidents because incident durations are heavily right-skewed — one 12-hour outage will dominate three months of 15-minute ones. Most mature teams report the median (P50) and 95th percentile (P95) alongside the mean. Second: dividing by "number of incidents" over a quarter implies you have enough incidents for the average to be meaningful. If you had two outages last quarter, you don't have an MTTR — you have two data points.
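To make the first warning concrete, here is a small sketch with made-up durations: eleven 15-minute incidents plus a single 12-hour outage. The percentile helper uses the nearest-rank definition, which is only one of several conventions; your reporting tool may interpolate and give a slightly different P95.

```python
import math
import statistics

# Made-up durations in minutes: eleven short incidents and one 12-hour outage.
durations = [15] * 11 + [720]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


mean = statistics.mean(durations)              # 73.75
median = statistics.median(durations)          # 15.0
p95 = percentile_nearest_rank(durations, 95)   # 720
```

The mean of 73.75 minutes describes no incident that actually happened. The median shows the typical case, and the P95 surfaces the tail that the mean smears across everything else.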

A word on MTTF (Mean Time To Failure)

MTTF is the counterpart to MTBF for components that don't get repaired: it comes from hardware and manufacturing reliability, where a failed part is simply replaced. In software you'll almost never need it, because SaaS services don't have a "final" failure; they have outages followed by recoveries. If a vendor questionnaire or an interviewer asks about MTTF, they probably mean MTBF, and you can answer accordingly.

Which metrics are actually worth tracking

Pragmatically: MTTD and MTTR are the two that respond directly to the work you put in. They're where engineering investment shows up. MTBF is largely a function of how complex your system is and how often you ship, which makes it useful for trend-watching but a poor improvement target.

MTTD: where monitoring earns its keep

If your MTTD is high (10+ minutes), you're either checking too infrequently, alerting on the wrong things, or both. The biggest improvements come from cheap changes: shorter check intervals (5-minute → 30-second), monitoring the surfaces customers actually use (the signup flow, not just the homepage), and requiring agreement from several regions before alerting (a multi-region quorum), so single-location blips stop paging you and you stop dismissing alerts as flaky.
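A back-of-the-envelope sketch of why the check interval matters, plus what a quorum rule looks like. Both functions are simplifications written for this post, not any monitoring product's API: the delay bound ignores request timeouts and notification latency, and the region names are placeholders.

```python
def worst_case_detection_seconds(interval_s, failures_required=1):
    """Upper bound on time-to-detect for a polling check.

    Assumes the outage begins just after a passing check and that the alert
    fires on the Nth consecutive failed check.
    """
    return interval_s * failures_required


def quorum_down(region_results, quorum):
    """Treat the target as down only if at least `quorum` regions saw a failure.

    `region_results` maps a region name to True when that region's check failed.
    """
    return sum(region_results.values()) >= quorum


# Two consecutive failures required before alerting, at both intervals.
five_minute_checks = worst_case_detection_seconds(300, failures_required=2)  # 600 s
thirty_second_checks = worst_case_detection_seconds(30, failures_required=2)  # 60 s

# One region failing out of three does not meet a quorum of two.
single_blip = quorum_down({"us-east": True, "eu-west": False, "ap-south": False}, quorum=2)
real_outage = quorum_down({"us-east": True, "eu-west": True, "ap-south": False}, quorum=2)
```

Under these assumptions, moving from 5-minute to 30-second checks cuts the worst-case detection delay from 10 minutes to 1 minute without touching anything else.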

MTTR: where culture and tooling meet

MTTR has two stages: get the right person looking and let that person fix it. The first is escalation and on-call discipline (covered in our on-call setup post). The second is runbooks, observability, and the ability to roll back deploys. Teams with high MTTR almost always have one of three problems: nobody knows what changed, nobody knows where to look, or the rollback path is theoretical.

Reasonable targets for a small team

Rough numbers for a B2B SaaS with one or two on-call engineers:

MTTD: < 2 minutes for customer-facing surfaces, < 10 minutes for background jobs.
MTTA: < 5 minutes during business hours, < 10 minutes overnight.
MTTR: P50 under 30 minutes, P95 under 4 hours. Anything past 4 hours is usually a dependency failure outside your control or a database problem that requires a restore.
MTBF: Treat as a lagging indicator. Use it to validate that reliability work is helping, not as a target.
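If you want to check a month's numbers against these targets automatically, a sketch like the following is enough. The dictionary keys and the observed values are made up for illustration; the thresholds are the ones listed above, in minutes.

```python
# Targets from the list above, in minutes. Key names are invented for this sketch.
TARGETS = {
    "mttd_customer_facing": 2,
    "mttd_background_jobs": 10,
    "mtta_business_hours": 5,
    "mtta_overnight": 10,
    "mttr_p50": 30,
    "mttr_p95": 240,
}


def missed_targets(observed, targets=TARGETS):
    """Return the names of metrics that are not strictly under their target."""
    return sorted(name for name, value in observed.items() if value >= targets[name])


# Made-up observations for one month, in minutes.
observed = {
    "mttd_customer_facing": 1.5,
    "mtta_overnight": 14,
    "mttr_p50": 22,
    "mttr_p95": 310,
}
missed = missed_targets(observed)  # ["mtta_overnight", "mttr_p95"]
```

MTBF is left out on purpose, since it is meant to be watched as a trend rather than held to a threshold.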

A simple monthly reliability report

Most teams over-engineer this. A two-page monthly report is plenty:

Total uptime % vs your stated SLO (see our SLA vs SLO vs SLI guide).
Number of incidents by severity (SEV1/SEV2/SEV3).
MTTD and MTTR — median and P95 — month over month.
Top three contributors to downtime, with a one-line summary each.
One callout: what changed this month, what we're changing next.
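The arithmetic behind the first, second, and fourth items is small enough to script. This is a sketch with made-up incidents and a hypothetical 99.9% SLO; it assumes every incident counts fully as downtime, which will overstate downtime if some of your incidents are partial degradations.

```python
from collections import Counter

# Made-up month: each incident is (severity, downtime in minutes).
incidents = [("SEV2", 20), ("SEV3", 5), ("SEV1", 95), ("SEV3", 10)]
period_minutes = 30 * 24 * 60  # a 30-day month
slo_percent = 99.9             # hypothetical SLO for this example


def uptime_percent(period, downtime):
    """Share of the period the service was up, as a percentage."""
    return 100 * (period - downtime) / period


downtime_minutes = sum(minutes for _, minutes in incidents)
uptime = uptime_percent(period_minutes, downtime_minutes)
slo_met = uptime >= slo_percent

# Incident counts per severity, and the three largest contributors to downtime.
by_severity = Counter(severity for severity, _ in incidents)
top_contributors = sorted(incidents, key=lambda item: item[1], reverse=True)[:3]
```

Here the 130 minutes of downtime work out to roughly 99.70% uptime, which misses a 99.9% SLO (that target allows about 43 minutes in a 30-day month).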

Where to go from here

If you don't have any of these numbers yet, start with two: track MTTD via your monitoring tool (most expose this in their alert log) and MTTR via your incident tracker. Three months in, you'll have enough data to spot trends — and you'll know which way to push them. Feed the results back through your postmortem process so each incident makes the next one cheaper.

Frequently asked questions

What do MTTR, MTBF, MTTD, and MTTA stand for and measure?

MTTD is Mean Time To Detect, the time from something breaking to your monitoring noticing. MTTA is Mean Time To Acknowledge, the time from detection to a human acknowledging the alert. MTTR is Mean Time To Recover (sometimes Repair or Resolve), the time from the start of an incident to full service restoration. MTBF is Mean Time Between Failures, the average uptime between incidents.

How is MTTR calculated?

MTTR is the sum of (resolved time minus started time) across all incidents divided by the number of incidents in your time window, typically a quarter. Because incident durations are heavily right-skewed and one long outage can dominate the average, most mature teams report the median (P50) and 95th percentile (P95) alongside the mean.

What's the difference between MTTR and MTBF?

MTTR measures how long it takes to recover once an incident starts, so it reflects your response and remediation. MTBF measures the average uptime between incidents, so it reflects how often things break, not how bad each outage is. A high MTBF means rare outages but says nothing about how quickly you recover from them.

Which of these metrics are actually worth tracking?

MTTD and MTTR are the two that move and where engineering investment shows up, so they are worth tracking. MTBF is largely a function of system complexity and how often you ship, so treat it as a lagging indicator for trend-watching rather than an improvement target. If you have none of these yet, start with MTTD from your monitoring tool and MTTR from your incident tracker.

What are reasonable MTTR, MTTD, and MTTA targets for a small team?

For a B2B SaaS with one or two on-call engineers: MTTD under 2 minutes for customer-facing surfaces and under 10 minutes for background jobs; MTTA under 5 minutes during business hours and under 10 minutes overnight; and MTTR with P50 under 30 minutes and P95 under 4 hours. Anything past 4 hours is usually a dependency failure outside your control or a database problem that requires a restore.

Uptimera team

We build Uptimera — multi-region uptime monitoring, SSL and DNS checks, and branded status pages. These guides come from running the same monitoring and on-call practices we write about.