Status page best practices: what to show during an outage

Uptimera team · 10 min read

A status page is the most-read document your engineering org will ever publish. During an outage, it gets more traffic than your marketing site. The choices you make about what to display, how often to update, and how to talk to customers in the middle of a fire have outsized impact on whether they trust you afterwards. This post is a practical guide to running a status page that works for you instead of against you.

What a status page is actually for

A status page serves three audiences, and each one wants something different from it. You can't design for all of them until you know which is which:

  • Customers in pain. They want to know: is it just me? When will it be fixed? Can I tell my users?
  • Customer support. They want a link they can paste into 200 tickets that answers questions before they're asked.
  • Prospects evaluating your product. They want to see history. A status page with no past incidents is a status page with no history of telling the truth.

The components of a good status page

1. Overall status, prominently

A single banner at the top — green "All systems operational," yellow "Some systems experiencing issues," red "Major outage." This is the only thing 90% of visitors read. Make sure it's accurate within seconds of an incident opening.

2. Per-component breakdown

Group your service into 5-12 components that map to how customers think — not how your team thinks. "API," "Dashboard," "Webhooks," "Login," "Reporting." Not "us-east-1 ECS cluster" or "Kafka broker 4."
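One way to keep the top banner consistent with the component list is to derive it from the components instead of setting it by hand. The sketch below assumes a "worst component wins" rule. That rule, the `Status` enum, and the function names are illustrative choices, not something any particular status page product prescribes.

```python
from enum import IntEnum


class Status(IntEnum):
    """Component states, ordered so that a higher value means worse."""
    OPERATIONAL = 0
    DEGRADED = 1
    MAJOR_OUTAGE = 2


# Banner wording comes from the post; the mapping to Status is an assumption.
BANNER_TEXT = {
    Status.OPERATIONAL: "All systems operational",
    Status.DEGRADED: "Some systems experiencing issues",
    Status.MAJOR_OUTAGE: "Major outage",
}


def overall_status(components: dict[str, Status]) -> Status:
    """Return the worst status across all components (operational if empty)."""
    return max(components.values(), default=Status.OPERATIONAL)


def banner(components: dict[str, Status]) -> str:
    """Return the banner text for the current component states."""
    return BANNER_TEXT[overall_status(components)]


if __name__ == "__main__":
    # Customer-facing component names, as suggested in the post.
    print(banner({"API": Status.OPERATIONAL, "Webhooks": Status.DEGRADED}))
```

You may prefer a softer rule, for example showing "Major outage" only when a core component such as the API is down. Encoding the rule in one place means the banner and the component list cannot disagree.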

3. Incident timeline

For the active incident: an updates log, newest at top, with a timestamp for each entry. After resolution: a summary with start and end times, components affected, and a link to the postmortem when it's published.

4. Historical uptime

A 90-day strip for each component. Green days, yellow for degraded, red for outages. This is what prospects compare. A 90-day strip with three red squares is more trustworthy than a 90-day strip with none.
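As a rough illustration, the strip for one component can be computed from its incident history. The tuple layout, the color names, and the rule that the worst incident of a day decides its color are all assumptions made for this sketch.

```python
from datetime import date, timedelta

# Hypothetical severity ranks: the worst incident of the day wins.
RANK = {"green": 0, "yellow": 1, "red": 2}


def uptime_strip(incidents, today, days=90):
    """Return one color per day, oldest first, for the last `days` days.

    `incidents` is a list of (start_date, end_date, color) tuples with
    inclusive dates, where color is "yellow" (degraded) or "red" (outage).
    """
    first = today - timedelta(days=days - 1)
    strip = ["green"] * days
    for start, end, color in incidents:
        day = max(start, first)          # clip to the start of the window
        last = min(end, today)           # clip to the end of the window
        while day <= last:
            i = (day - first).days
            if RANK[color] > RANK[strip[i]]:
                strip[i] = color
            day += timedelta(days=1)
    return strip


if __name__ == "__main__":
    history = [(date(2024, 3, 29), date(2024, 3, 30), "yellow")]
    print(uptime_strip(history, today=date(2024, 3, 30))[-3:])
```

Incidents that fall entirely outside the window are ignored because the clipped range is empty.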

5. Subscribe options

Email, SMS, RSS/Atom, webhook, Slack. Don't force visitors to create an account. The point is letting them get notified of future incidents without checking back.

Writing incident updates: a four-part structure

Every update — even the first one, written 90 seconds in — should follow the same skeleton. Templates reduce panic-writing mistakes.

Part 1: What we're seeing

One sentence, customer-facing terms. "Some users are seeing 500 errors when uploading files." Not "Storage backend is returning malformed responses for keys starting with app-."

Part 2: Impact

Who's affected and how. "This affects approximately 20% of upload attempts; affected users will see an error and can retry." Quantify when you can. "Approximately" is honest; "a small number of" is not.

Part 3: What we're doing

Present tense. "Engineers are investigating" is fine early. "We've rolled back the last deploy and are monitoring recovery" is better when you have it. Never write "the issue has been identified" unless it has.

Part 4: Next update

When to expect the next update. "Next update in 30 minutes." This single sentence is the difference between a calm support queue and a panicked one. People are patient if they know when to check back.
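The four parts map naturally onto a small template. The following is a minimal sketch, assuming a hypothetical `IncidentUpdate` class whose field names are made up here. It refuses to render an update if any of the first three parts is blank, and it always appends the "next update" sentence.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IncidentUpdate:
    seeing: str               # Part 1: what we're seeing
    impact: str               # Part 2: who is affected and how
    action: str               # Part 3: what we're doing
    next_update_minutes: int  # Part 4: when to expect the next update
    posted_at: datetime       # assumed to be a UTC timestamp

    def render(self) -> str:
        """Return the update as one timestamped paragraph."""
        for name in ("seeing", "impact", "action"):
            if not getattr(self, name).strip():
                raise ValueError(f"missing part: {name}")
        stamp = self.posted_at.strftime("%Y-%m-%d %H:%M UTC")
        return (
            f"[{stamp}] {self.seeing} {self.impact} {self.action} "
            f"Next update in {self.next_update_minutes} minutes."
        )


if __name__ == "__main__":
    update = IncidentUpdate(
        seeing="Some users are seeing 500 errors when uploading files.",
        impact="This affects approximately 20% of upload attempts; "
               "affected users will see an error and can retry.",
        action="Engineers are investigating.",
        next_update_minutes=30,
        posted_at=datetime.now(timezone.utc),
    )
    print(update.render())
```

Making the next-update time a required field is the point of the design: an update without it cannot be constructed.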

Update cadence: how often is right

A rule that works: update every 30 minutes during an active incident, even if there's nothing new to say. "Still investigating; next update in 30 minutes" is a valid update. Silence is worse than "no new information."
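A cadence rule is easy to forget mid-incident, so it can help to have a bot or dashboard nag whoever owns the page. This sketch assumes you record the time of the last posted update. The function names are hypothetical, and the 30-minute interval is the one suggested above.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)  # cadence suggested in the post


def update_overdue(last_update_at, now, interval=UPDATE_INTERVAL):
    """Return True once the promised next-update time has been reached."""
    return now - last_update_at >= interval


def minutes_until_due(last_update_at, now, interval=UPDATE_INTERVAL):
    """Return whole minutes left before the next update is due (0 if overdue)."""
    remaining = (last_update_at + interval) - now
    return max(0, int(remaining.total_seconds() // 60))


if __name__ == "__main__":
    last = datetime.now(timezone.utc) - timedelta(minutes=25)
    now = datetime.now(timezone.utc)
    print(update_overdue(last, now), minutes_until_due(last, now))
```

A reminder a few minutes before the deadline leaves time to write "Still investigating; next update in 30 minutes" before the promise lapses.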

After resolution, post one more update with the resolution time and a placeholder for the postmortem if you plan one. Link to the full postmortem when it's ready — ideally within a week.

Phrases to avoid (and what to write instead)

  • "A small number of users" → quantify if at all possible. "Around 5% of API requests" or "customers in the EU region."
  • "We are aware" → everyone already assumes you're aware. Say what you're doing instead.
  • "We apologize for any inconvenience" → say it once at the end. Don't open every update with it.
  • "Working with our provider" → fine, but name the provider when you can ("Working with AWS support on a regional networking issue"). Concrete information builds trust.
  • "Should be resolved soon" → never make this promise. A missed estimate breaks trust faster than an honest "we don't know yet; next update in 30 minutes."
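Some of the phrases in the list above can be caught mechanically before an update goes out. The checker below is a deliberately simple sketch: the phrase table and the `lint_update` name are assumptions, and plain substring matching will miss rewordings, so treat it as a nudge for the writer, not a gate.

```python
import re

# Phrases from the list above, each paired with a short hint for the writer.
DISCOURAGED = {
    "a small number of": "quantify the impact if at all possible",
    "we are aware": "say what you're doing instead",
    "working with our provider": "name the provider when you can",
    "should be resolved soon": "don't promise a fix time; give the next update time",
}


def lint_update(text: str) -> list[str]:
    """Return one warning per discouraged phrase found (case-insensitive)."""
    warnings = []
    for phrase, hint in DISCOURAGED.items():
        if re.search(re.escape(phrase), text, flags=re.IGNORECASE):
            warnings.append(f'"{phrase}": {hint}')
    return warnings


if __name__ == "__main__":
    draft = "We are aware of an issue affecting a small number of users."
    for warning in lint_update(draft):
        print(warning)
```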

Host the status page somewhere different

Your status page must not depend on the infrastructure you're reporting on. The textbook example: status.example.com hosted on the same CDN as example.com. When the CDN goes down, both go down. Customers can't reach the status page during the outage they need it for.

Two ways to solve this:

  • Use a third-party status page service (Statuspage, Better Stack, Uptimera). These run on different infrastructure than your service.
  • Self-host on a different provider than your main app. Production on AWS? Status on Fly.io or Vercel. The point is independent failure domains.

Design choices that matter

  • Color-blind safe. Don't rely on red/green alone — pair with icons or text labels. ~8% of men have red-green color vision deficiency.
  • Mobile-first. Most status page reads during an outage come from phones. Test on mobile before launch.
  • Loads fast. Server-render the page; avoid loading a SPA that needs JS to display the banner. People with bad connections (e.g. the connection your service is failing on) need it to render.
  • Branded but restrained. Your logo, your colors. Not your marketing site's hero image. A status page is a document, not a homepage.

Who actually writes updates during an incident

The biggest mistake teams make is having the on-call engineer write status page updates. That engineer is debugging, and their attention is needed there. The cleanest pattern:

  • Incident commander (a named role for the incident, not necessarily the most senior engineer) is responsible for the status page.
  • Engineers report status to the IC. The IC writes the update.
  • A second person (support lead, comms) reviews the wording before posting if time allows. During fast-moving incidents, skip the review.

Templates accelerate this. Have three pre-written first updates ready to go: "investigating," "degraded," and "major outage." Fill in the impact line and post it. You've just bought 30 minutes of customer patience in 60 seconds.

Where to go from here

If you don't have a status page yet, you can ship one this week. If you have one, run a tabletop exercise: pretend the API is down right now, draft the first three updates, and time it. Most teams discover that their first update would take 20 minutes to write. Templates and clear ownership cut that to 2 minutes.

Uptimera includes branded status pages on every plan — public, themeable, on your own domain. Create one in 5 minutes from the dashboard.