On-call onboarding checklist: what to include (and what to skip)
On-call onboarding fails when it tries to cover everything and still leaves new responders frozen on their first real page. The bar is a safe first decision, made fast. If a new responder can’t answer “what’s broken, who’s impacted, and what’s safe to do next” in under 10 minutes, onboarding isn’t done.
This checklist is built for a real page at 2 a.m., not a tidy wiki. Optimize for confidence under pressure, not coverage.
The 10‑minute test
A new responder should be able to answer:
- What is the customer impact right now?
- What’s the most likely cause?
- What is the safest mitigation I can take without permission?
If any of those answers require tribal knowledge, onboarding is incomplete.
What to include (and why)
- Top 5 dashboards + saved log queries. Under stress, people can’t find the “right” graph. Pre‑select it.
- Escalation map with time zones. “Primary, backup, and approvals” is useless unless it also lists each person’s on‑call hours in their local time zone.
- One “first checks” page. Customer impact, SLO burn, recent deploys, primary feature flags.
- A safe mitigation menu. Rollback criteria, feature flag owners, blast radius notes.
- An incident update template. Include cadence and a named comms owner by severity.
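The escalation map item above can be sketched as data plus one lookup. This is a minimal illustration, not a real paging integration: the names, roles, hours, and the `Contact` and `reachable_now` helpers are all hypothetical, and it assumes Python 3.9+ with time zone data available to `zoneinfo`.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


@dataclass(frozen=True)
class Contact:
    name: str    # hypothetical responder name
    role: str    # "primary", "backup", or "approver"
    tz: str      # IANA time zone name
    start: time  # local start of on-call hours
    end: time    # local end of on-call hours (may wrap past midnight)

    def is_reachable(self, now_utc: datetime) -> bool:
        """Return True if now_utc falls inside this contact's local on-call window."""
        local = now_utc.astimezone(ZoneInfo(self.tz)).time()
        if self.start <= self.end:
            return self.start <= local < self.end
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return local >= self.start or local < self.end


# Hypothetical escalation map; replace with your real rotation.
ESCALATION_MAP = [
    Contact("alice", "primary", "America/New_York", time(9), time(21)),
    Contact("bikram", "backup", "Asia/Kolkata", time(8), time(20)),
    Contact("chloe", "approver", "Europe/Berlin", time(22), time(6)),
]


def reachable_now(contacts, now_utc):
    """Names of contacts whose on-call window covers now_utc, in map order."""
    return [c.name for c in contacts if c.is_reachable(now_utc)]


# A page at 07:00 UTC in January: 02:00 in New York, 12:30 in Kolkata, 08:00 in Berlin.
page_time = datetime(2024, 1, 15, 7, 0, tzinfo=timezone.utc)
print(reachable_now(ESCALATION_MAP, page_time))  # ['bikram']
```

Storing hours in each person’s local zone and converting at lookup time keeps the map correct across daylight saving changes, which a table of fixed UTC offsets would not.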
What to skip (for now)
- Full architecture deep dives before the first rotation.
- Tool tours without a real incident or scenario.
- “Read all the docs” assignments with no verification or feedback.
- Long simulations that never lead to a runbook change.
The onboarding arc
Week 0 prep -> Shadow rotation -> First live week -> Post-rotation follow-through
A four-phase checklist
- Week 0 prep: access, escalation map, top runbooks, verified “first checks.”
- Shadow rotation: observe real pages, practice triage, write one clean update.
- First live week: daily 10‑minute check-ins, one safe mitigation decision.
- Post-rotation follow-through: update runbooks, assign owners, retire one obsolete doc.
Proof it’s working
- Time to first independent page keeps shrinking.
- “Unknowns” logged per incident keep trending down.
- At least one runbook update happens after each rotation.
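One way to check these three signals is to compare a recent window against the one before it. The sketch below is an assumption about how you might measure them, not a standard method: the function names, the window size of 3, and the sample numbers are all illustrative.

```python
from statistics import mean


def is_trending_down(values, window=3):
    """True if the mean of the last `window` values is strictly lower
    than the mean of the `window` values before them."""
    if len(values) < 2 * window:
        raise ValueError("need at least 2 * window data points")
    earlier = values[-2 * window:-window]
    recent = values[-window:]
    return mean(recent) < mean(earlier)


def onboarding_health(days_to_first_page, unknowns_per_incident,
                      runbook_updates_per_rotation, window=3):
    """Evaluate the three signals; inputs are ordered oldest to newest."""
    return {
        "time_to_first_page_shrinking": is_trending_down(days_to_first_page, window),
        "unknowns_trending_down": is_trending_down(unknowns_per_incident, window),
        "runbook_updated_every_rotation": all(n >= 1 for n in runbook_updates_per_rotation),
    }


# Illustrative data only: one entry per new responder, incident, or rotation.
report = onboarding_health(
    days_to_first_page=[21, 18, 19, 14, 12, 11],
    unknowns_per_incident=[4, 3, 5, 3, 4, 5],
    runbook_updates_per_rotation=[1, 2, 0, 1],
)
print(report)
```

With this sample data only the first signal passes: unknowns are flat across the two windows, and one rotation ended with no runbook update.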
Keep it alive
Treat the checklist as a living artifact. Every time a responder says “I didn’t know that,” add a bullet, assign an owner, and set a due date.
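A gap log can be as small as a list of records with an owner and a due date. The following is a minimal sketch under stated assumptions: `ChecklistGap`, `log_gap`, `overdue`, and the 14-day default are hypothetical choices, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ChecklistGap:
    note: str   # what the responder didn't know
    owner: str  # who will add it to the checklist
    due: date   # when the checklist update is due
    done: bool = False


def log_gap(gaps, note, owner, today, days_to_fix=14):
    """Record a gap with an owner and a due date (14 days is an assumed default)."""
    gap = ChecklistGap(note, owner, today + timedelta(days=days_to_fix))
    gaps.append(gap)
    return gap


def overdue(gaps, today):
    """Open gaps whose due date has passed."""
    return [g for g in gaps if not g.done and g.due < today]


gaps = []
log_gap(gaps, "Didn't know rollback needs a flag flip first", "dana", date(2024, 3, 1))
print([g.note for g in overdue(gaps, date(2024, 3, 16))])
```

Reviewing the overdue list at the end of each rotation keeps "I didn’t know that" moments from going stale.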