Chaos engineering for teams without a Netflix budget
Chaos engineering has a reputation problem. It sounds like something only Netflix and Google can afford — special tools, dedicated teams, a culture of failure-injection at scale. The actual practice is much simpler: deliberately break things, in controlled ways, to learn whether your system handles failure the way you assumed it would. Most teams can run their first useful chaos experiment with no new tools and no extra budget. This post is about how.
Why even bother
Software systems develop blind spots. You assume the cache falls back to the database when it fails. Maybe it does — and maybe the database can't handle the load, but you'll only find out when the cache actually fails at 2am during peak traffic. The function of chaos engineering is to find these blind spots on purpose, during business hours, with the right people in the room, instead of by accident at the worst possible time.
The mental model that helps: chaos engineering does for a running system what unit tests do for code. You wouldn't ship code without testing it. The system has interaction modes (failover, retry, backoff, degraded fallback) that need testing too, and you can't verify them by reading the code.
Four experiments worth running first
Forget the Chaos Monkey for now. Start with experiments that match common failure modes you'll actually see in production.
1. Kill one instance of a load-balanced service
The hypothesis: when one app instance dies, the load balancer routes around it within X seconds and the customer doesn't notice. The experiment: pick a non-peak hour, kill one pod or VM, watch the dashboards. What you usually learn:
- How fast the LB actually removes the dead instance (often slower than expected).
- Whether in-flight requests are drained or dropped.
- Whether the remaining instances can handle the load (usually yes — that's why you have multiple).
- Whether any monitoring you have actually catches the dead instance.
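One way to put a number on "within X seconds" is to probe the service at a fixed interval while the instance is killed, then summarize the failures afterward. The sketch below assumes the service exposes an HTTP endpoint you can poll from outside the load balancer; the function names and the URL in the comment are hypothetical, and the measurement is only as precise as the probe interval.

```python
import time
import urllib.request
from typing import Dict, List, Tuple

Sample = Tuple[float, bool]  # (monotonic timestamp in seconds, request succeeded?)


def probe(url: str, duration_s: float, interval_s: float = 0.5) -> List[Sample]:
    """Request `url` repeatedly for `duration_s` seconds and record each outcome."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        started = time.monotonic()
        try:
            # urlopen raises on 4xx/5xx, timeouts, and connection errors.
            with urllib.request.urlopen(url, timeout=2) as resp:
                ok = 200 <= resp.status < 300
        except Exception:
            ok = False  # any failure counts as a failed probe
        samples.append((started, ok))
        time.sleep(interval_s)
    return samples


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Count failed probes and measure the time between the first and last failure."""
    failed = [ts for ts, ok in samples if not ok]
    return {
        "total": len(samples),
        "failed": len(failed),
        "failure_window_s": failed[-1] - failed[0] if failed else 0.0,
    }


# Example (hypothetical URL): start the probe, kill the instance, read the summary.
# print(summarize(probe("https://staging.example.com/healthz", duration_s=120)))
```

The failure window approximates how long the load balancer kept sending traffic to the dead instance. The failed count gives a rough sense of how many requests were dropped rather than drained.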
2. Black-hole a third-party dependency
The hypothesis: when our payment provider is unreachable, we degrade gracefully (queue the request, show a friendly error, whatever your design says). The experiment: in a non-production environment with a recent prod data copy, point the payment provider URL at a non-routable IP or a black-hole proxy. What you usually learn:
- Whether timeouts are set (often they're not, and the request hangs forever).
- Whether circuit breakers kick in (often they don't exist).
- What error users actually see (often something developer-flavored, not customer-flavored).
- Whether other parts of the system have the same dependency you forgot about.
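If the experiment shows that no circuit breaker exists, the missing piece can be small. The following is a minimal sketch of the pattern, not a production implementation. The class name, thresholds, and injectable clock are all assumptions made for illustration, and the wrapped call is expected to carry its own timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of waiting on a dependency we believe is down.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: allow one trial call; a single failure re-opens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the breaker fully
        return result
```

During the black-hole experiment, the first few calls should fail at their timeout. Later calls should fail immediately with `CircuitOpenError`, which is the point where your code can queue the request or show the friendly error your design calls for.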
3. Add 500ms of latency to the database
The hypothesis: small latency increases degrade UX but don't break anything. The experiment: use a tool like tc (Linux traffic control) or your cloud's network policies to inject latency on the DB connection in staging. What you usually learn:
- Whether connection pools have appropriate timeouts (they often don't).
- How many user-facing operations break the 1-second mark with cumulative DB calls.
- Whether your N+1 query problems are worse than they look.
- Where caching would help most.
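With tc, the usual mechanism for this is the netem queueing discipline. The sketch below only builds the commands so you can review them before running anything. The interface name is an assumption, the commands need root, and a netem qdisc on the root of an interface delays all outgoing traffic on it, not just database traffic. Run it on a staging host where that is acceptable.

```python
import shlex
from typing import List


def add_delay_cmd(interface: str, delay_ms: int) -> List[str]:
    """Build the tc command that adds a fixed delay to egress traffic on `interface`."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def clear_delay_cmd(interface: str) -> List[str]:
    """Build the tc command that removes the netem qdisc again (the kill switch)."""
    return ["tc", "qdisc", "del", "dev", interface, "root", "netem"]


# Print both commands for review; "eth0" is a placeholder interface name.
print(shlex.join(add_delay_cmd("eth0", 500)))
print(shlex.join(clear_delay_cmd("eth0")))

# To apply for real (as root), something like:
# import subprocess; subprocess.run(add_delay_cmd("eth0", 500), check=True)
```

Because the delay is paid on every round trip, an operation that issues three sequential queries picks up at least 1.5 seconds, which is why N+1 patterns show up so clearly in this experiment.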
4. Fill the disk
The hypothesis: disk-fill alerts fire well before anything breaks, so we have time to respond. The experiment: in staging, intentionally fill the disk to 95% and confirm the alerts fire, then keep filling until writes start to fail. What you usually learn:
- Whether your alerts actually fire (sometimes they're on a different filesystem).
- What apps do when they can't write logs (some crash silently, some buffer forever).
- What happens to the database when WAL space fills (the answer is rarely "gracefully").
- Whether your runbook for "disk full" works.
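Filling a disk to a target level needs nothing more than a filler file of the right size. The sketch below is one way to do it, under some assumptions: the file name is hypothetical, the percentage is computed from `shutil.disk_usage` and may differ slightly from what `df` reports because of reserved blocks, and filesystems with compression may not actually consume space for a file of zeros.

```python
import os
import shutil


def bytes_to_reach(total: int, used: int, target_percent: int) -> int:
    """Bytes of filler needed to bring usage up to target_percent (0 if already there)."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be between 1 and 100")
    return max(0, total * target_percent // 100 - used)


def fill_disk(path: str, target_percent: int = 95, chunk: int = 1 << 20) -> int:
    """Write a filler file under `path` until the filesystem reaches target_percent.

    Returns the number of bytes written. Deleting the filler file is the kill switch.
    """
    usage = shutil.disk_usage(path)
    remaining = bytes_to_reach(usage.total, usage.used, target_percent)
    filler = os.path.join(path, "chaos-filler.bin")  # hypothetical file name
    block = b"\0" * chunk
    written = 0
    with open(filler, "wb") as f:
        while written < remaining:
            n = min(chunk, remaining - written)
            f.write(block[:n])
            written += n
    return written
```

Point `path` at the filesystem your alerts are supposed to watch. If the alert stays silent, check whether it is monitoring a different mount.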
How to actually run an experiment safely
The discipline that makes chaos engineering different from recklessness is the experimental method. Each experiment has:
- A hypothesis. Stated in advance, in writing. "We expect X to happen when we do Y."
- A blast radius. What's the worst this experiment could do? Who would be affected?
- A kill switch. How do we stop it instantly if things go sideways? Have this ready before starting.
- A success criterion. What does "the system handled it" look like? Define this before starting so you don't move the goalposts.
- Observers. At least two people watching dashboards. One person can't both run the experiment and notice that something unexpected is happening.
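The checklist above can live in a wiki page, but writing it as a small data structure makes it hard to skip a field. The following is an illustrative sketch; the class and field names are invented here to mirror the five items.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentPlan:
    hypothesis: str
    blast_radius: str
    kill_switch: str
    success_criterion: str
    observers: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the reasons this plan is not ready to run (empty list means ready)."""
        issues = []
        for name in ("hypothesis", "blast_radius", "kill_switch", "success_criterion"):
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        if len(self.observers) < 2:
            issues.append("need at least two observers")
        return issues


plan = ExperimentPlan(
    hypothesis="Killing one app instance causes no user-visible errors",
    blast_radius="Staging only; one of three instances",
    kill_switch="",  # left blank on purpose to show the check
    success_criterion="Error rate stays flat on the dashboard",
    observers=["alice"],
)
print(plan.problems())
```

A plan that comes back with any problems is not ready to run. An empty list means every item was written down before the experiment started.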
What teams overcomplicate on day one
Buying a chaos engineering platform
Tools like Gremlin and Chaos Mesh are excellent. They're also overkill for your first five experiments. A bash script that kills a pod and a stopwatch are enough to start. Spend the tooling budget once you've done five experiments by hand and know what you need.
Trying to break production immediately
Production chaos experiments require a team that has run many staging experiments without surprises, mature monitoring, excellent rollback capability, customer comms ready, and a narrowly scoped initial blast radius. Get any one of these wrong and you have caused a real outage yourself. Build the prerequisites first.
Running "chaos GameDay" theater
A whole-day exercise with 12 engineers where you simulate a big incident sounds impressive. Most teams should run small, frequent experiments (one per week) instead. The discipline comes from repetition, not from drama. Save the big GameDay for when you have a specific large risk to validate.
The cultural piece (which costs nothing)
The hardest part of chaos engineering is not technical. It's the willingness to say "we don't know if X works the way we think it does, so let's find out." That's a leadership posture. The signals that a team has it:
- When an experiment surfaces a bug, the response is "great, we found it before our customers did," not "who authorized this experiment."
- The reliability work that comes out of experiments is prioritized — finding the gap is only useful if you close it.
- Postmortems for experiments that broke things are the same shape as postmortems for real incidents — see our postmortem template.
A starter cadence
For a team that has never done chaos engineering:
- Month 1: Run experiment #1 (kill an instance) in staging. Document the result. Fix anything surprising.
- Month 2: Add experiment #2 (third-party black-hole) in staging. Also repeat #1 to confirm the fixes work.
- Month 3: Add experiment #3 (latency injection). Now you have a rotating set of three monthly experiments.
- Month 6: Consider your first production experiment — typically a kill-one-instance test during low-traffic hours with the team online.
Where to go from here
Pick one hypothesis about your system that you're not 100% sure of. ("The failover works if the primary DB dies." "The site degrades gracefully when Redis is unreachable.") Schedule 90 minutes this week to test it in staging. Bring two observers. Write down what you find. Even if everything works perfectly, you've started the practice. Most teams find something on the first experiment.