Runbook - Slurm Node Drained / Down

Severity: Medium-High. A drained or down node removes GPU capacity from the fleet, and jobs that need it pile up Pending (ReqNodeNotAvail / Resources). A drain is sometimes intentional (maintenance) and sometimes a health trigger; tell the two apart before resuming.

Applies to: Slurm GPU clusters. Drill exercised in the Lesson 2 simulation (fake-GRES Slurm lab, make phase3-drain).

Symptom

  • sinfo shows a node in drain, drained, draining, or down*.
  • GPU jobs stay Pending with ReqNodeNotAvail, or the fleet is short on capacity.

Triage order

1. Find which nodes, and why

sinfo -R                                          # nodes in a down/drain state + the REASON string
sinfo -N -l                                       # full node list with states
scontrol show node <node> | grep -E 'State|Reason|Gres|AllocTRES'

The Reason is the key signal. Distinguish:

| Reason pattern | Likely cause | Action |
| lab drill, maintenance, an operator name/ticket | intentional drain | resume when the work is done (step 3) |
| Kill task failed, Prolog/Epilog error, Low RealMemory, gres count too low | a health/config trigger | fix the underlying issue first, then resume |
| Not responding / down* | slurmd unreachable on the node | check the node + slurmd before anything else (step 2) |
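
The Reason triage above can be sketched as a small shell helper. This is only an illustration: the function name classify_reason, the bucket names, and the exact match patterns are assumptions, not part of Slurm or of the lab, and real Reason strings vary by site.

```shell
# Hypothetical helper: map a Slurm node Reason string to one of the three
# buckets from the Reason table. Patterns are assumptions; tune them per site.
classify_reason() {
  local reason
  # Lowercase first so the match is case-insensitive.
  reason=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$reason" in
    *"not responding"*)
      echo "unreachable" ;;   # check the node + slurmd first (step 2)
    *"kill task failed"*|*prolog*|*epilog*|*"low realmemory"*|*"gres count"*)
      echo "health" ;;        # fix the underlying issue, then resume
    *"lab drill"*|*maintenance*)
      echo "intentional" ;;   # resume when the work is done (step 3)
    *)
      echo "unknown" ;;       # treat as unexpected until a human has looked
  esac
}
```

One way to feed it, assuming sinfo's %N (node list) and %E (reason) format fields: sinfo -R -h -o '%N|%E' | while IFS='|' read -r nodes reason; do echo "$nodes: $(classify_reason "$reason")"; done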

2. If the node is down/Not responding

scontrol ping                                     # is slurmctld healthy?
# on the node: is slurmd up and can it reach the controller?
systemctl status slurmd
journalctl -u slurmd --since "30 min ago" | tail -30

A gres count mismatch (node advertises fewer GPUs than gres.conf declares) keeps a node drained - see device-plugin / GRES registration for the equivalent "node not advertising GPUs" triage.
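
A rough way to spot such a mismatch is to pull the declared GPU count out of the Gres= field and compare it with what the node actually has. The helper below is a sketch under assumptions: gres_gpu_count is a hypothetical name, it expects `scontrol show node` text on stdin, and it assumes the GPU entry looks like gpu:N or gpu:<type>:N.

```shell
# Hypothetical helper: print the GPU count declared in the Gres= field of
# `scontrol show node <node>` output read from stdin.
# Assumes entries shaped like gpu:4 or gpu:a100:4(S:0-1); prints nothing
# when no gpu entry is present (for example Gres=(null)).
gres_gpu_count() {
  sed -n 's/.*Gres=\([^ ]*\).*/\1/p' \
    | head -n 1 \
    | sed 's/([^)]*)//g' \
    | tr ',' '\n' \
    | awk -F: '$1 == "gpu" { print $NF }'
}

# Example comparison on a real GPU node. nvidia-smi is an assumption here;
# the fake-GRES lab has no physical devices to count.
#   declared=$(scontrol show node "$(hostname -s)" | gres_gpu_count)
#   actual=$(nvidia-smi -L | wc -l)
#   [ "$declared" -eq "$actual" ] || echo "GRES mismatch: declared=$declared actual=$actual"
```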

3. Resume the node (only once it's actually healthy)

scontrol update nodename=<node> state=resume
sinfo -N -l | grep <node>                         # → idle/mixed, REASON cleared
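
If the post-resume check needs to be scripted rather than eyeballed, a state test along these lines may help. It is a minimal sketch: node_in_service is a hypothetical name, and treating allocated as healthy alongside idle/mixed is an assumption on top of the check above.

```shell
# Hypothetical helper: succeed only when the given sinfo state string means
# the node is back in service. Any flagged state (for example down* or idle*)
# or a drain state is treated as not in service.
node_in_service() {
  case "$1" in
    idle|mixed|allocated) return 0 ;;
    *) return 1 ;;
  esac
}

# Example, assuming sinfo's %T (extended state) format field:
#   state=$(sinfo -h -n "$node" -o '%T' | head -n 1)
#   node_in_service "$state" || echo "$node still not in service: $state"
```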

To take a node out for maintenance the same way the drill does:

scontrol update nodename=<node> state=drain reason="maintenance: <ticket>"

state=drain lets running jobs finish but accepts no new ones on the node; state=resume returns the node to service.

Resolution verification

sinfo -R                          # → empty (no drained/down nodes), or only the ones you intend
squeue -l                         # jobs that were ReqNodeNotAvail now schedule

Prevention

  • Always set a meaningful reason= when draining, so the next operator can tell intentional from health-triggered at a glance.
  • Alert on unexpected drains (a Reason that isn't a known maintenance string).
  • Track drained-node count as lost GPU capacity; reconcile gres.conf with real device inventory so that health drains are not caused by config drift.
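
The unexpected-drain alert could start from a filter like the one below. It is a sketch, not a finished alert: unexpected_drains is a hypothetical name, the default pattern (reasons starting with "maintenance:" or "lab drill") is an assumed site convention, and the input is assumed to be node|reason lines.

```shell
# Hypothetical filter: read "node|reason" lines on stdin and print only those
# whose reason does NOT match the known-maintenance pattern (lowercase ERE).
# The default pattern is an assumed convention; pass your own as $1.
unexpected_drains() {
  awk -F'|' -v pat="${1:-^(maintenance:|lab drill)}" '
    tolower($2) !~ pat { print $1 ": " $2 }
  '
}

# Example, assuming sinfo's %N (node list) and %E (reason) format fields:
#   sinfo -R -h -o '%N|%E' | unexpected_drains
```

Any output line is a drain nobody labeled as maintenance, which is the case worth paging on.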

Drill in this lab

In Lesson 2, make phase3-drain drains a compute node, shows work routing to the other node, then resumes it. Run this runbook alongside it to practice the drain → diagnose → resume loop.