DawnOps

The first 15 minutes of an incident (a checklist)

Most incidents go sideways for the same reason: the first 15 minutes are unstructured.

People jump into fixes, dashboards multiply, and comms becomes a side quest. Then you lose time, confidence, and trust.

Use this checklist to make the first 15 minutes calmer and faster.

Minutes  Step
0–2      Declare roles
2–5      Define impact
5–8      Choose truth dashboard
8–12     Try one safe stabilization
12–15    Ship the first update

0–2 minutes: declare roles

  • Name an incident lead (one person owns the timeline and decisions).
  • Name a comms lead (one person owns updates, even if they’re brief).
  • Start a single incident thread/doc and pin the link.

2–5 minutes: establish “what’s broken”

Answer these in plain language:

  • What are users experiencing?
  • What is the suspected blast radius (endpoint, region, customers)?
  • What changed recently (deploy, config, dependency)?

If you can’t answer one of these, write “unknown” and keep going. A clear “unknown” now beats a perfect answer later.
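One way to keep the three answers honest is to record them in a small structure where every field starts as “unknown”. The sketch below is only an illustration: the class name ImpactSummary and its field names are made up for this example and are not part of any tool.

```python
from dataclasses import dataclass
from typing import List

UNKNOWN = "unknown"


@dataclass
class ImpactSummary:
    """Hypothetical record of the three impact answers; gaps stay 'unknown'."""
    user_experience: str = UNKNOWN
    blast_radius: str = UNKNOWN
    recent_change: str = UNKNOWN

    def open_questions(self) -> List[str]:
        # Names of the fields nobody has answered yet.
        return [name for name, value in vars(self).items() if value == UNKNOWN]

    def render(self) -> str:
        # Plain-language summary suitable for pasting into the incident thread.
        return "\n".join([
            f"Users are experiencing: {self.user_experience}",
            f"Suspected blast radius: {self.blast_radius}",
            f"Recent changes: {self.recent_change}",
        ])


# Example values are invented for illustration.
summary = ImpactSummary(user_experience="checkout requests time out")
print(summary.render())
print("Still unknown:", summary.open_questions())
```

Posting the rendered summary with its unknowns visible lets others fill the gaps without interrupting the incident lead.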

5–8 minutes: choose one “truth dashboard”

Pick one dashboard or query that represents customer impact. Examples:

  • error rate on the critical endpoint
  • p95 latency for the primary user path
  • queue depth for a key pipeline

Everything else supports that one view. If you don’t choose, you’ll chase noise.
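To make the first two example metrics concrete, here is a minimal sketch of how they could be computed from raw request samples. It assumes an error means an HTTP 5xx status and uses the nearest-rank method for p95; your monitoring system may define both differently, and the function names are hypothetical.

```python
from typing import List


def error_rate(statuses: List[int]) -> float:
    """Fraction of requests that returned a 5xx status (assumed definition of 'error')."""
    if not statuses:
        return 0.0
    errors = sum(1 for status in statuses if status >= 500)
    return errors / len(statuses)


def p95(latencies_ms: List[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Invented sample data: 20 requests, two of which failed.
statuses = [200] * 18 + [500, 503]
latencies_ms = list(range(10, 210, 10))  # 10, 20, ..., 200

print(f"error rate: {error_rate(statuses):.1%}")
print(f"p95 latency: {p95(latencies_ms)} ms")
```

Whichever metric you pick, compute it the same way for the whole incident so that before and after readings stay comparable.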

8–12 minutes: try one safe stabilization

Prioritize actions that are reversible and low-risk:

  • roll back the last deploy
  • disable a feature flag
  • shed load gracefully
  • scale a consumer group (if it won’t melt the DB)

Before you take each action, define how you’ll verify improvement and how long you’ll wait for it.
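The “verify and wait” rule can be sketched as a bounded polling loop: read the truth metric until it reaches a target or the agreed wait time runs out. Everything here is an assumption for illustration, including the function name verify_improvement, the “lower is better” metric, and the default 15-second poll interval.

```python
import time
from typing import Callable


def verify_improvement(
    read_metric: Callable[[], float],
    target: float,
    max_wait_s: float,
    poll_every_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the metric is at or below target, False if max_wait_s passes first."""
    deadline = clock() + max_wait_s
    while True:
        if read_metric() <= target:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_every_s)


# Example with canned readings and a no-op sleep, so it runs instantly.
readings = iter([0.20, 0.10, 0.01])
recovered = verify_improvement(
    read_metric=lambda: next(readings),
    target=0.02,
    max_wait_s=180,
    sleep=lambda seconds: None,
)
print("recovered" if recovered else "no improvement within the wait window")
```

In real use, read_metric would query your truth dashboard’s data source; the sleep and clock parameters exist only so the loop can be exercised without waiting.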

12–15 minutes: ship the first update

Your first update should be short and predictable:

  • What we see (impact + scope)
  • What we’ve tried (one or two items)
  • What’s next (one item)
  • When the next update is coming

A short, predictable update protects responders’ focus and maintains stakeholders’ trust.
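The four-part update above lends itself to a fixed template, so the comms lead only fills in blanks. This is a minimal sketch; the function name format_update, its parameters, and the sample values are hypothetical.

```python
from typing import List


def format_update(impact: str, tried: List[str], next_step: str, next_update_in_min: int) -> str:
    """Build a first incident update with the same four parts every time."""
    tried_text = "; ".join(tried) if tried else "nothing yet"
    return "\n".join([
        f"What we see: {impact}",
        f"What we've tried: {tried_text}",
        f"What's next: {next_step}",
        f"Next update: in {next_update_in_min} minutes",
    ])


# Invented example values.
update = format_update(
    impact="elevated errors on checkout in one region",
    tried=["rolled back the last deploy"],
    next_step="watch the error rate for recovery",
    next_update_in_min=15,
)
print(update)
```

Keeping the order of the parts fixed means readers learn where to look, even when an update says little.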

After the incident: save what you learned

Write down:

  • the key diagnosis clue
  • the mitigation that worked
  • the runbook gap you hit
  • the dashboard/alert that misled you

That becomes training material for the next responder.
