The five whys: a postmortem technique that actually works

Uptimera team · 8 min read

The Five Whys is the most over-praised and most misapplied postmortem technique. Done well, it's a clean way to peel back the layers of an incident and arrive at a root cause that matters. Done badly, it's a blame circle that ends at "a person made a mistake" and produces no useful change. This post is about how to use it well, with a worked example from a credentials bug that took down checkout.

What it is

The technique comes originally from Toyota manufacturing, and it is exactly what it sounds like: start with the symptom, ask "why," ask "why" again about the answer, and keep going. Five is a guideline, not a magic number. The goal is to push past the symptom and the proximate cause to a systemic factor you can actually fix.

A worked example: the checkout outage

Real-shaped incident (composite of patterns we've seen). Walk it through one why at a time.

Symptom

Checkout was returning 500 errors for 18 minutes between 14:32 and 14:50 UTC. Approximately 40% of attempted purchases failed.

Why #1: Why did checkout return 500s?

Because the application couldn't reach the Stripe API. Every request to Stripe timed out, and our exception handler returned 500 to the user.

Why #2: Why couldn't we reach Stripe?

Because we were sending requests with an expired API key. Stripe returned 401, but our HTTP client retried indefinitely on 401 (because we'd implemented retries for "any error" years ago and never refined them), so the requests appeared to time out from our app's perspective.
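The retry behavior described above can be sketched as a status classification. This is a minimal illustration under stated assumptions, not the actual client: `should_retry`, `call_with_retries`, and `MAX_ATTEMPTS` are hypothetical names, and `send` stands in for whatever HTTP call the application makes. The point is that a 401 comes back to the caller immediately and the number of attempts is bounded.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3  # assumed bound; the client in the incident retried indefinitely


def should_retry(status_code: int) -> bool:
    """Retry only server-side failures; a 4xx such as 401 will not fix itself.

    429 (rate limiting) is a common exception to this rule and is left out
    here to keep the sketch short.
    """
    return 500 <= status_code <= 599


def call_with_retries(send: Callable[[], int], base_delay: float = 0.1) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one request and returns the
    HTTP status code, raising ConnectionError on a network failure.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            status = send()
        except ConnectionError:
            if attempt == MAX_ATTEMPTS:
                raise
        else:
            if not should_retry(status) or attempt == MAX_ATTEMPTS:
                return status
        # Exponential backoff between attempts: base_delay, 2x, 4x, ...
        time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable: the last attempt always returns or raises")
```

With a policy like this, the expired key would have surfaced as a single 401 per request rather than as an apparent timeout.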

Why #3: Why was the API key expired?

Because we'd rotated the key in the Stripe dashboard the previous day for a security review. We updated the production config, but not the second config source (the staging mirror deployed to prod) that a recent change had introduced.

Why #4: Why did we have two config sources?

Because three weeks earlier, we'd migrated to a new secrets system. The migration was "complete" according to the ticket, but a fallback path in the old code still read from the legacy secrets store when the new one returned null. The legacy store had the old key.
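The fallback path can be sketched roughly as follows. Everything here is hypothetical: the two stores are modeled as plain dictionaries, `STRIPE_API_KEY` is an assumed secret name, and the values are placeholders. The sketch shows how a null from the new store silently serves the old key, next to a stricter lookup that treats the miss as an error.

```python
from typing import Mapping, Optional


def get_secret_with_fallback(
    name: str,
    new_store: Mapping[str, Optional[str]],
    legacy_store: Mapping[str, str],
) -> Optional[str]:
    """The risky pattern: a miss in the new store silently falls through."""
    value = new_store.get(name)
    if value is None:
        # Nothing signals that the legacy store was used, so a stale key
        # looks exactly like a valid one to the caller.
        return legacy_store.get(name)
    return value


def get_secret_strict(name: str, new_store: Mapping[str, Optional[str]]) -> str:
    """The post-cleanup pattern: a miss is an error, not a reason to fall back."""
    value = new_store.get(name)
    if value is None:
        raise KeyError(f"secret {name!r} missing from the new store")
    return value


# Placeholder values, not real keys.
legacy = {"STRIPE_API_KEY": "old-rotated-key"}
new = {"STRIPE_API_KEY": None}  # simulates the new store returning null

print(get_secret_with_fallback("STRIPE_API_KEY", new, legacy))  # old-rotated-key
```

The strict variant fails loudly at lookup time, which moves the failure from "payments time out" to "a named secret is missing".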

Why #5: Why was the fallback still in the code?

Because the migration plan included a 30-day shadow period where both stores were maintained, and the cleanup task to remove the fallback was scheduled but not done. The original engineer left the company before completing it; the ticket sat in the backlog and didn't resurface.

The root cause

The systemic factor: infrastructure migrations don't have an owner for the cleanup phase after the original implementer leaves. The proximate cause was an expired key; the underlying cause was a hand-off failure in the migration process.

From root cause to action items

The Five Whys is worthless if it doesn't produce action. From the example above, the actions might be:

  • Specific to this incident: remove the legacy secrets fallback path. (Owner: platform team. Due: this week.)
  • Specific to the retry logic: stop retrying 4xx responses such as 401; retry only 5xx and network errors (429 rate-limit responses are the usual exception), and cap the number of attempts. Add monitoring on 401 rate.
  • Systemic: every infrastructure migration gets a written cleanup phase with an explicit owner who is not the same person as the implementer. Owner is named at migration kickoff.
  • Process: migration tickets get a 60-day check-in regardless of whether they're closed.

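Each action item needs an owner, a due date, and a single deliverable. The first item above shows the full shape; the others need the same details filled in before the postmortem is closed. "Improve our migration process" is not an action item; it's a sentiment.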
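One way to hold action items to that standard is to make the required fields explicit. The sketch below is illustrative only: `ActionItem` and `missing_fields` are hypothetical names, and the due date is a placeholder rather than a date from the incident.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ActionItem:
    deliverable: str
    owner: Optional[str] = None
    due: Optional[date] = None


def missing_fields(item: ActionItem) -> List[str]:
    """List what an item still lacks before it counts as actionable."""
    missing = []
    if not item.deliverable.strip():
        missing.append("deliverable")
    if not (item.owner and item.owner.strip()):
        missing.append("owner")
    if item.due is None:
        missing.append("due date")
    return missing


vague = ActionItem("Improve our migration process")
concrete = ActionItem(
    "Remove the legacy secrets fallback path",
    owner="platform team",
    due=date(2025, 1, 10),  # placeholder date
)

print(missing_fields(vague))     # ['owner', 'due date']
print(missing_fields(concrete))  # []
```

A check like this cannot tell whether a deliverable is specific enough; it only catches items that are missing an owner or a date.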

How Five Whys fails in practice

Failure mode 1: stopping at "human error"

Why #1: because Alice deployed without testing. Why #2: because Alice was tired. Why #3: because Alice has been on call for two weeks. (Stop.) Conclusion: Alice needs to be more careful.

This is the most common abuse. Every Five Whys that ends at a person is a Five Whys that hasn't been pushed far enough. Why was Alice on call for two weeks? Why was the rotation broken? Why was untested code allowed to deploy in the first place? Keep pushing until you reach a process or a system, not a person.

Failure mode 2: the linear narrative trap

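Real incidents have multiple causes, not one chain. The Five Whys produces a single linear story, which can be misleading. The checkout example above had at least two parallel issues: the config mismatch and the overly broad retry logic. The stale key alone would still have failed payments, but the retries turned a fast, explicit 401 into apparent timeouts, which made the cause harder to see. A single chain captures only one of these.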

The fix: don't insist on one chain. Run multiple Five Whys starting from different framings of the symptom. "Why did checkout fail" gives one chain; "why didn't we notice for 5 minutes" gives another; "why did it affect 40% and not 100%" gives a third. The interesting action items often come from the secondary chains.

Failure mode 3: blame ratification

Done in a culture that punishes mistakes, Five Whys becomes a bureaucratic justification for assigning blame. The questions all aim at the engineer; the conclusions all involve discipline or training. The process is the alibi for a punitive culture. No technique can fix this; only the culture can.

When not to use Five Whys

Five Whys works best for incidents with a clear causal chain. It works less well for:

  • Performance regressions. The cause is usually distributed across many small things; a single linear chain misrepresents what happened.
  • Multi-system failures. When three systems failed simultaneously due to a common upstream issue, you have a fan-out problem, not a chain. Use a different technique (causal diagram, timeline analysis).
  • Cultural or organizational incidents. When the postmortem is about how the team responded rather than what broke, "why" questions quickly become accusatory. Use a different framing.

Combining Five Whys with other techniques

Three pairings that work:

  • Timeline + Five Whys. Build the full timeline first (every event with a timestamp). Then run Five Whys on each pivot point. The timeline ensures you don't skip steps; the Five Whys ensures you go deep.
  • Five Whys + counterfactuals. After arriving at the root cause, ask: "what would have prevented this?" and "what would have detected this faster?" The first gives prevention work; the second gives detection work. Both matter.
  • Five Whys + impact analysis. Separate "why did the bug exist" from "why did it affect customers." The bug always existed; the customer impact required something else to also be true. Both chains have separate root causes.

Where to go from here

Pick your most recent incident with a clear causal chain. Try the technique: write down the symptom, then five whys, then the action items. Force yourself to keep going if you stop at a person. The first three times you do this, it feels slow and contrived. By the fifth time, you'll find yourself running the chain in your head during the incident itself, which is when it's most valuable. For the broader postmortem structure that this fits into, see our incident postmortem template.