Runbooks

Operational playbooks for the GPU, Slurm, and Kubernetes failure modes this course teaches you to diagnose. The alerts and drills in each lesson link here, and the labs exercise several of these runbooks directly (for example, the observability break-it drill and the Slurm drain drill).

Kubernetes / GPU stack

Observability

DCGM exporter has no metrics

Scheduling

Each runbook follows the same shape: symptom, layered triage, verification, and prevention, plus a lab drill where one applies.