Designing verification steps for runbooks
A runbook step isn’t complete until you know how to verify it. Verification turns action into confidence.
The verification rule
Every mitigation must answer two questions:
- What metric or log should change?
- How long should it take?
If you can’t answer both, the step isn’t ready.
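One way to make this rule enforceable is to record both answers as fields on the step and refuse steps that leave either blank. The sketch below is illustrative only: `RunbookStep`, `expected_signal`, `deadline_minutes`, and `is_ready` are hypothetical names, not part of any runbook tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunbookStep:
    """A mitigation step plus the two answers the verification rule asks for."""
    action: str
    expected_signal: Optional[str] = None      # what metric or log should change
    deadline_minutes: Optional[float] = None   # how long it should take


def is_ready(step: RunbookStep) -> bool:
    """Return True only if the step names a signal and a positive time limit."""
    has_signal = bool(step.expected_signal and step.expected_signal.strip())
    has_deadline = step.deadline_minutes is not None and step.deadline_minutes > 0
    return has_signal and has_deadline
```

A check like this could run in review or CI so that a step without a signal or a time limit is flagged before anyone needs it during an incident.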
Good verification examples
- Error rate on checkout drops below 1 percent within 5 minutes.
- Queue depth stabilizes under 2,000 items within 10 minutes.
- p95 latency in the main region returns to baseline within 10 minutes.
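Checks of this shape (a specific signal, a threshold, a time limit) can also be automated as a polling loop. The following is a minimal sketch under stated assumptions: `verify`, `read_metric`, and `passes` are hypothetical names, and `read_metric` stands in for whatever query your monitoring system actually provides.

```python
import time
from typing import Callable


def verify(read_metric: Callable[[], float],
           passes: Callable[[float], bool],
           timeout_s: float,
           interval_s: float = 15.0,
           clock: Callable[[], float] = time.monotonic,
           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a metric until it passes or the time limit expires.

    read_metric is a placeholder for a real monitoring query.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if passes(read_metric()):
            return True           # signal changed as expected, in time
        if clock() >= deadline:
            return False          # time limit reached without the expected change
        sleep(interval_s)
```

For the first example above, a call might look like `verify(read_checkout_error_rate, lambda pct: pct < 1.0, timeout_s=300)`, where `read_checkout_error_rate` is a function you would supply.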
Avoid these anti-patterns
- “Check the dashboard” with no specific signal.
- “Watch logs” with no expected output.
- “Wait and see” with no time limit.
Add rollback criteria
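Every mitigation should also answer “what if it doesn’t work?” Write the rollback condition into the step itself: which signal, at what value or after how long, means the mitigation has failed and should be reverted.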
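A rollback condition can be written down as explicitly as a success condition. The sketch below assumes a metric where lower is better (such as an error rate); `Criteria`, `decide`, and the threshold fields are hypothetical names chosen for illustration, and the numbers in the usage note are examples, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    success_below: float    # mitigation is verified once the metric is under this
    rollback_above: float   # roll back immediately if the metric exceeds this
    deadline_s: float       # roll back if not verified by this many seconds


def decide(metric: float, elapsed_s: float, c: Criteria) -> str:
    """Return 'verified', 'rollback', or 'wait' for one observation."""
    if metric < c.success_below:
        return "verified"
    if metric > c.rollback_above or elapsed_s >= c.deadline_s:
        return "rollback"
    return "wait"
```

With `Criteria(success_below=1.0, rollback_above=10.0, deadline_s=300)`, an error rate of 3 percent after one minute yields `"wait"`, while the same reading at five minutes yields `"rollback"`.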
Add verification to every runbook
If a runbook doesn’t say how to verify each step, it isn’t complete yet.