Runbook - Slurm Node Drained / Down¶
Severity: Medium-High - a drained/down node removes GPU capacity from the fleet; jobs
needing it pile up Pending (ReqNodeNotAvail / Resources). Sometimes intentional
(maintenance), sometimes a health trigger - tell them apart before resuming.
Applies to: Slurm GPU clusters. Drill exercised in the Lesson 2 simulation
(fake-GRES Slurm lab, make phase3-drain).
Symptom¶
sinfoshows a node indrain,drained,draining, ordown*.- GPU jobs stay Pending with
ReqNodeNotAvail, or the fleet is short on capacity.
Triage order¶
1. Find which nodes, and why¶
sinfo -R # nodes in a down/drain state + the REASON string
sinfo -N -l # full node list with states
scontrol show node <node> | grep -E 'State|Reason|Gres|AllocTRES'
The Reason is the key signal. Distinguish:
| Reason pattern | Likely cause | Action |
|---|---|---|
lab drill, maintenance, an operator name/ticket |
intentional drain | resume when the work is done (step 3) |
Kill task failed, Prolog/Epilog error, Low RealMemory, gres count too low |
a health/config trigger | fix the underlying issue first, then resume |
Not responding / down* |
slurmd unreachable on the node | check the node + slurmd before anything else (step 2) |
2. If the node is down/Not responding¶
scontrol ping # is slurmctld healthy?
# on the node: is slurmd up and can it reach the controller?
systemctl status slurmd
journalctl -u slurmd --since "30 min ago" | tail -30
A gres count mismatch (node advertises fewer GPUs than gres.conf declares) keeps a node
drained - see device-plugin / GRES registration
for the equivalent "node not advertising GPUs" triage.
3. Resume the node (only once it's actually healthy)¶
scontrol update nodename=<node> state=resume
sinfo -N -l | grep <node> # → idle/mixed, REASON cleared
To take a node out for maintenance the same way the drill does:
drain lets running jobs finish but accepts no new ones; resume returns it to service.
Resolution verification¶
sinfo -R # → empty (no drained/down nodes), or only the ones you intend
squeue -l # jobs that were ReqNodeNotAvail now schedule
Prevention¶
- Always set a meaningful
reason=when draining, so the next operator can tell intentional from health-triggered at a glance. - Alert on unexpected drains (a
Reasonthat isn't a known maintenance string). - Track drained-node count as lost GPU capacity; reconcile
gres.confwith real device inventory so health drains aren't config drift.
Drill in this lab¶
Lesson 2 make phase3-drain drains a
compute node, shows work routing to the other node, then resumes it - run this runbook
alongside it to practice the drain → diagnose → resume loop.