DawnOps

On-call onboarding checklist: what to include (and what to skip)

On-call onboarding fails when it tries to cover everything and still leaves new responders frozen on their first real page. The bar is a safe first decision, made fast. If a new responder can’t answer “what’s broken, who’s impacted, and what’s safe to do next” in under 10 minutes, onboarding isn’t done.

This checklist is built for a real page at 2 a.m., not a tidy wiki. Optimize for confidence under pressure, not coverage.

The 10‑minute test

A new responder should be able to answer:

  • What is the customer impact right now?
  • What’s the most likely cause?
  • What is the safest mitigation I can take without permission?

If any of those answers require tribal knowledge, onboarding is incomplete.

What to include (and why)

  • Top 5 dashboards + saved log queries. Under stress, people can’t find the “right” graph. Pre‑select it.
  • Escalation map with time zones. A list of primary, backup, and approvers is useless without each person’s on‑call hours.
  • One “first checks” page. Customer impact, SLO burn, recent deploys, primary feature flags.
  • A safe mitigation menu. Rollback criteria, feature flag owners, blast radius notes.
  • An incident update template. Include cadence and a named comms owner by severity.
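
One way to make the escalation map usable at 2 a.m. is to store on-call hours next to each contact and ask "who is reachable right now?" The sketch below is illustrative only: the names, roles, offsets, and hours are hypothetical, and it uses fixed UTC offsets for simplicity, so it ignores daylight saving time. A real map would likely pull from your paging tool and use proper time zone names.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass(frozen=True)
class Contact:
    name: str              # hypothetical responder
    role: str              # "primary", "backup", or "approver"
    utc_offset: timedelta  # fixed offset from UTC (assumption: no DST handling)
    start: time            # start of reachable hours, local time
    end: time              # end of reachable hours, local time

    def is_reachable(self, now_utc: datetime) -> bool:
        """Return True if now_utc falls inside this contact's local hours."""
        local = now_utc.astimezone(timezone(self.utc_offset)).time()
        if self.start <= self.end:
            return self.start <= local < self.end
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return local >= self.start or local < self.end


# Hypothetical escalation map; every entry here is made up for illustration.
ESCALATION_MAP = [
    Contact("Asha", "primary", timedelta(hours=5, minutes=30), time(9), time(18)),
    Contact("Ben", "backup", timedelta(hours=-5), time(8), time(17)),
    Contact("Carla", "approver", timedelta(hours=1), time(22), time(6)),
]


def reachable_now(contacts, now_utc):
    """Filter the map down to people inside their on-call hours."""
    return [c for c in contacts if c.is_reachable(now_utc)]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for contact in reachable_now(ESCALATION_MAP, now):
        print(f"{contact.role}: {contact.name}")
```

If this prints nobody for some hour of the day, that gap is worth fixing before the first rotation, not during it.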

What to skip (for now)

  • Full architecture deep dives before the first rotation.
  • Tool tours without a real incident or scenario.
  • “Read all the docs” assignments with no verification or feedback.
  • Long simulations that never lead to a runbook change.

The onboarding arc

Week 0 prep -> Shadow rotation -> First live week -> Post-rotation follow-through

A four-phase checklist

  1. Week 0 prep: access, escalation map, top runbooks, verified “first checks.”
  2. Shadow rotation: observe real pages, practice triage, write one clean update.
  3. First live week: daily 10‑minute check-ins, one safe mitigation decision.
  4. Post-rotation follow-through: update runbooks, assign owners, retire one obsolete doc.

Proof it’s working

  • Time to first independent page keeps shrinking.
  • “Unknowns” logged per incident keep trending down.
  • At least one runbook update happens after each rotation.
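
The three signals above can be checked with a few lines of arithmetic. This is a rough sketch, assuming you record one number per rotation for each signal; the function names, report keys, and sample figures are hypothetical, and a least-squares slope is just one reasonable way to read "trending down."

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points to measure a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def onboarding_report(days_to_first_page, unknowns_per_incident, runbook_updates):
    """Evaluate the three signals; each input holds one value per rotation."""
    return {
        "time_to_first_page_shrinking": slope(days_to_first_page) < 0,
        "unknowns_trending_down": slope(unknowns_per_incident) < 0,
        "runbook_update_every_rotation": all(n >= 1 for n in runbook_updates),
    }


if __name__ == "__main__":
    # Made-up sample data, oldest rotation first.
    report = onboarding_report(
        days_to_first_page=[21, 18, 16, 12],
        unknowns_per_incident=[5, 4, 4, 2],
        runbook_updates=[1, 2, 0, 1],
    )
    for signal, ok in report.items():
        print(f"{signal}: {'ok' if ok else 'needs attention'}")
```

With the sample data, the first two signals pass and the third fails, because one rotation ended with no runbook update.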

Keep it alive

Treat the checklist as a living artifact. Every time a responder says “I didn’t know that,” add a bullet, assign an owner, and set a due date.
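
A small sketch of what "add a bullet, assign an owner, and set a due date" could look like as data, so stale items are easy to spot. The class, the field names, and the 14-day default are assumptions for illustration, not a prescribed process.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ChecklistGap:
    note: str           # what the responder didn't know
    owner: str          # person responsible for fixing the checklist
    due: date           # date the fix is expected by
    done: bool = False


def log_gap(backlog, note, owner, today, days_to_fix=14):
    """Record a gap with an owner and a due date (14 days is an arbitrary default)."""
    gap = ChecklistGap(note=note, owner=owner, due=today + timedelta(days=days_to_fix))
    backlog.append(gap)
    return gap


def overdue(backlog, today):
    """Return open gaps whose due date has already passed."""
    return [g for g in backlog if not g.done and g.due < today]


if __name__ == "__main__":
    backlog = []
    log_gap(backlog, "Didn't know who owns the checkout feature flag", "Ben", date.today())
    print(overdue(backlog, date.today() + timedelta(days=30)))
```

Reviewing the overdue list at the end of each rotation is one way to keep the checklist from drifting back into a tidy but stale wiki page.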
