A lightweight knowledge loop after incidents
Incidents are your fastest learning events.
Without a capture loop, the learning leaks away:
- the diagnosis trick lives in someone’s head
- the mitigation lives in a Slack thread
- the “we should fix this later” becomes a forgotten TODO
You don’t need a wiki project to fix this. You need a loop.
The loop (one line)
Incident -> Capture -> Update runbook -> Reuse next time
The loop (15 minutes, once per incident)
Right after the incident (or on the next business day), do a short capture step:
- One-paragraph summary
  - What happened?
  - What did users experience?
  - What fixed it?
- Three concrete artifacts
  - The “truth dashboard” link
  - The command/query that confirmed the diagnosis
  - The mitigation steps that were actually safe
- One runbook update. Add or change one section:
  - fast classification
  - safe mitigations first
  - verification steps
  - stop conditions
- One “gotcha”. Write the sharp edge in plain language:
  - what surprised you
  - why it mattered
  - what to do next time
- One owner + one deadline. If it doesn’t get an owner, it won’t get done.
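If it helps to make the capture step concrete, the checklist above can be sketched as a small record with a completeness check. This is only an illustration, not a required tool: the class name, field names, and every example value (including the dashboard URL and the owner) are hypothetical, and a plain text template would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IncidentCapture:
    """One capture record per incident. All field names are illustrative."""

    # One-paragraph summary
    what_happened: str
    user_impact: str
    what_fixed_it: str
    # Three concrete artifacts
    dashboard_link: str
    diagnosis_command: str
    safe_mitigation_steps: list[str]
    # One runbook update and one gotcha
    runbook_update: str
    gotcha: str
    # One owner + one deadline
    owner: str
    deadline: date

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]

    def to_text(self) -> str:
        """Render the record as plain text to paste next to the runbook."""
        steps = "\n".join(
            f"  {n}. {step}" for n, step in enumerate(self.safe_mitigation_steps, 1)
        )
        return "\n".join([
            f"Summary: {self.what_happened} {self.user_impact} {self.what_fixed_it}",
            f"Dashboard: {self.dashboard_link}",
            f"Diagnosis: {self.diagnosis_command}",
            f"Safe mitigations:\n{steps}",
            f"Runbook update: {self.runbook_update}",
            f"Gotcha: {self.gotcha}",
            f"Owner: {self.owner} (due {self.deadline.isoformat()})",
        ])


# Placeholder values only; none of these come from a real incident.
capture = IncidentCapture(
    what_happened="Checkout API returned 5xx errors after a config push.",
    user_impact="Some users could not complete purchases.",
    what_fixed_it="Rolling back the config restored service.",
    dashboard_link="https://dashboards.example.com/checkout",
    diagnosis_command="<query that confirmed the diagnosis>",
    safe_mitigation_steps=["Roll back the config", "Watch the error rate recover"],
    runbook_update="Added a 'verification steps' section.",
    gotcha="",  # left empty on purpose to show the check
    owner="alice",
    deadline=date(2030, 1, 15),
)
print(capture.missing_fields())  # ['gotcha']
```

The check is deliberately shallow: it only flags empty fields, so whether the content is actually useful still depends on the person filling it in.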
Why this works
This loop is intentionally small. It doesn’t require:
- a long postmortem
- perfect prose
- a new tool rollout
It creates a few durable artifacts that make the next incident faster to resolve and less stressful.
A good definition of “done”
The incident is “done” when:
- the runbook is better than it was yesterday
- the next responder can repeat the fix without guessing