Runbooks

Operational playbooks for the GPU/Slurm/Kubernetes failure modes this course teaches you to diagnose. Each lesson's alerts and drills link here, and several runbooks are exercised directly in the labs (for example, the observability break-it drill and the Slurm drain drill).

Kubernetes / GPU stack

Observability

Scheduling

Each runbook follows the same shape: symptom, layered triage, verification, and prevention, plus a lab drill where one applies.