Your first incident simulation (a starter recipe)
If you’ve never run a realistic incident simulation, your first one should be simple, common, and measurable.
Here’s a recipe you can run next week.
Pick a common failure mode
Choose one:
- retry storm causing saturation
- consumer lag causing user-facing latency
- dependency degradation causing timeouts
- database connection exhaustion
Pick something your team will recognize.
Define the objective
Examples:
- restore p95 latency under 500 ms within 20 minutes
- stop error budget burn within 15 minutes
- reduce consumer lag back to baseline within 25 minutes
Make it measurable and timeboxed.
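One way to keep an objective honest is to write it down as data and check it against the readings you collect during the run. The sketch below is a minimal illustration, not a prescribed tool: the `Objective` class, the function names, the metric name, and the sample readings are all hypothetical. It treats the objective as met only if the metric drops below the threshold before the deadline and stays there through the last reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    metric: str          # hypothetical metric name, e.g. "p95_latency_ms"
    threshold: float     # the metric must fall below this value
    deadline_min: float  # timebox, in minutes from the start of the simulation


def recovery_minute(objective, samples):
    """Return the minute from which the metric stays below the threshold
    through the last sample, or None if it never settles."""
    recovered_at = None
    for minute, value in sorted(samples):
        if value < objective.threshold:
            if recovered_at is None:
                recovered_at = minute
        else:
            recovered_at = None  # regression: the earlier recovery did not hold
    return recovered_at


def objective_met(objective, samples):
    """True if the metric settled below the threshold within the timebox."""
    recovered_at = recovery_minute(objective, samples)
    return recovered_at is not None and recovered_at <= objective.deadline_min


# Placeholder readings as (minute, p95 latency in ms); not real data.
objective = Objective("p95_latency_ms", threshold=500, deadline_min=20)
held = [(0, 1800), (5, 1500), (10, 900), (15, 480), (20, 450)]
relapsed = [(0, 1800), (10, 450), (15, 700), (25, 400)]

print(objective_met(objective, held))      # True
print(objective_met(objective, relapsed))  # False
```

The second series dips below the threshold at minute 10 but relapses, so it only counts as recovered from minute 25, which is past the 20-minute timebox.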
Prepare the signals
Give responders:
- one dashboard that matters
- one dashboard that is misleading
- a short log snippet that points in a plausible direction
Do not make it a scavenger hunt. The point is decision-making.
Provide two or three mitigation paths
Example for consumer lag:
- scale consumers (quick, might increase DB pressure)
- reduce upstream load (safer, may impact features)
- roll back a deploy (safe if you have confidence)
Force tradeoffs.
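If you write scenarios down in a structured form, a small check can confirm that every mitigation path really carries a tradeoff before you run the exercise. This is a sketch under assumptions: the `Mitigation` and `Scenario` classes and the `problems` method are hypothetical names, and the entries restate the consumer lag example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    action: str
    benefit: str
    risk: str  # the cost or caveat that makes this a real tradeoff


@dataclass(frozen=True)
class Scenario:
    failure_mode: str
    mitigations: tuple

    def problems(self):
        """Return a list of reasons the scenario would not force a tradeoff."""
        issues = []
        if len(self.mitigations) < 2:
            issues.append("need at least two mitigation paths")
        for m in self.mitigations:
            if not m.risk.strip():
                issues.append(f"'{m.action}' has no stated risk")
        return issues


consumer_lag = Scenario(
    failure_mode="consumer lag causing user-facing latency",
    mitigations=(
        Mitigation("scale consumers", "quick", "might increase DB pressure"),
        Mitigation("reduce upstream load", "safer", "may impact features"),
        Mitigation("roll back a deploy", "safe", "only if you have confidence"),
    ),
)

print(consumer_lag.problems())  # [] means every path has a stated tradeoff
```

A scenario with a single path, or a path with an empty `risk`, returns a non-empty list, which is a hint that responders would have nothing to weigh.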
Run the simulation (30–40 minutes)
Rules:
- one person is incident lead
- one person is comms lead
- everyone narrates decisions (“what I’m doing and why”)
This is how you improve thinking, not just clicking.
Debrief (15–20 minutes)
Answer:
- What slowed diagnosis?
- What was unsafe or unclear?
- What runbook steps were missing?
- What telemetry gaps were revealed?
- What changes will we make this week?
Assign owners. The debrief only matters if it produces follow-through.
Repeat
Run the same scenario again in 6–8 weeks and compare the two runs on:
- time-to-diagnose
- time-to-mitigate
- quality of comms
- runbook usability
That’s how readiness becomes real.
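To make the comparison concrete, you can record the same few numbers after every run and diff them. The sketch below assumes a hypothetical `RunResult` record, with the two timings in minutes and the two qualitative items scored on a 1 to 5 rubric of your own choosing. All values shown are placeholders, not measurements.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunResult:
    time_to_diagnose_min: float  # start of simulation until the cause was named
    time_to_mitigate_min: float  # start of simulation until the objective was met
    comms_quality: int           # assumed 1-5 rubric score from the debrief
    runbook_usability: int       # assumed 1-5 rubric score from the debrief


def compare(previous, current):
    """Return current minus previous for every recorded field."""
    return {
        f.name: getattr(current, f.name) - getattr(previous, f.name)
        for f in fields(RunResult)
    }


# Placeholder values for illustration only.
first_run = RunResult(18, 31, comms_quality=2, runbook_usability=3)
second_run = RunResult(11, 22, comms_quality=4, runbook_usability=4)

for name, delta in compare(first_run, second_run).items():
    print(f"{name}: {delta:+}")
```

Negative deltas on the timings and positive deltas on the rubric scores both indicate improvement, so read the signs per field rather than summing them.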
If you want a guided version of this format, DawnOps is built to help teams run repeatable simulations and measure improvement over time.