Runbooks
Operational playbooks for the GPU, Slurm, and Kubernetes failure modes this course teaches you to diagnose. Each lesson's alerts and drills link here, and several runbooks are exercised directly by the labs (for example, the observability break-it drill and the Slurm drain drill).
Kubernetes / GPU stack
- Device plugin not advertising GPUs
- k3s default runtime / containerd config
- GPU node not ready
- GPU Operator driver pod failing
- CUDA_VISIBLE_DEVICES debugging
- GPU memory pressure
- GPU capacity planning
Observability
Scheduling
Each runbook follows the same shape: symptom, layered triage, verification, and prevention, plus a lab drill where one applies.