Runbook - Slurm Job Pending with a GRES/QOS Reason
Severity: Medium - GPU work isn't running. This may be a healthy queue wait or an unsatisfiable request that will never run; the first triage step is telling them apart.
Applies to: Slurm GPU clusters. Exercised in the Lesson 2 simulation (fake-GRES Slurm lab); the scheduling reasons are identical on real hardware.
Symptom
- A GPU job sits PD (Pending) in squeue, or is rejected at submit.
- The REASON column (or scontrol show job) names a GRES/QOS-related cause.
Triage order
Slurm prints the cause directly - read it first, then confirm against the supply side.
1. Read the reason
squeue -l # STATE + REASON for every job
squeue -j <JOBID> -o '%i %T %r' # one job: id, state, reason
scontrol show job <JOBID> # full request: ReqTRES, gres, QOS, partition
Get <JOBID> from the first column of squeue. Map the reason:
| Reason | Meaning | Will it ever run? |
|---|---|---|
| Resources | nothing free satisfies it right now | ✅ yes, when capacity frees |
| Priority | resources exist but higher-priority jobs go first | ✅ yes, eventually |
| QOSMaxGRESPerUser / AssocGrpGRES… | a QoS/association GPU (TRES) limit is hit | ⚠️ only if the limit/usage changes |
| ReqNodeNotAvail | a required node is down/drained/reserved | ⚠️ when the node returns - see node-drained runbook |
| Rejected at submit: Requested node configuration is not available | the request can't fit any node (e.g. --gres=gpu:16 on 8-GPU nodes) | ❌ never - fix the request |
The submit-time rejection is the Slurm-vs-Kubernetes contrast: Slurm refuses an impossible request up front; K8s would accept it and leave the pod Pending forever.
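The reason-to-verdict mapping in the table can be sketched as a small shell helper. This is an illustration only: classify_reason is a hypothetical function name, not a Slurm command, and the prefix patterns (QOS*, Assoc*, ReqNodeNotAvail*) are an assumption about how the reason strings begin on your cluster.

```shell
#!/usr/bin/env bash
# Sketch: map a Slurm pending reason to the triage verdict from the table.
# classify_reason is a hypothetical helper, not part of Slurm.
classify_reason() {
  case "$1" in
    Resources)        echo "wait: runs when capacity frees" ;;
    Priority)         echo "wait: runs after higher-priority jobs" ;;
    QOS*|Assoc*)      echo "limit: runs only if the limit or usage changes" ;;
    ReqNodeNotAvail*) echo "node: runs when the node returns" ;;
    *)                echo "unknown: read scontrol show job" ;;
  esac
}

# Possible usage on a cluster (not executed here):
#   classify_reason "$(squeue -h -j <JOBID> -o '%r')"
```

A submit-time rejection never reaches this helper, because the job gets no job ID and never appears in squeue.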
2. Check the supply side (for Resources / Priority)
sinfo -N -l # nodes: state, idle vs allocated
scontrol show node <node> | grep -E 'Gres|CfgTRES|AllocTRES|State' # GPUs configured vs allocated
If every GPU node is fully allocated (gres/gpu exhausted), the job is a legitimate
queue wait, not a misconfiguration.
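To check "every GPU node is fully allocated" without reading nodes one by one, the per-node CfgTRES and AllocTRES fields can be summed. The sketch below assumes GPUs appear as gres/gpu=N in those fields (which depends on the cluster tracking gres/gpu as a TRES) and that each node record arrives on a single line; gpu_supply is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: total configured vs allocated GPUs from one-line node records.
# Assumes each input line contains CfgTRES=... and AllocTRES=... fields.
gpu_supply() {
  awk '
    function gpus(field) {
      # Pull N out of "gres/gpu=N" inside a TRES string; 0 if absent.
      if (match(field, /gres\/gpu=[0-9]+/)) {
        return substr(field, RSTART + 9, RLENGTH - 9) + 0
      }
      return 0
    }
    {
      cfg = 0; alloc = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^CfgTRES=/)   cfg   = gpus($i)
        if ($i ~ /^AllocTRES=/) alloc = gpus($i)
      }
      total += cfg; used += alloc
    }
    END { printf "configured=%d allocated=%d free=%d\n", total, used, total - used }
  '
}

# Possible usage on a cluster (not executed here):
#   scontrol -o show node | gpu_supply
```

A result with free=0 supports the "legitimate queue wait" reading. A nonzero free count alongside a Resources reason suggests looking at per-node fit (no single node has enough free GPUs) before suspecting misconfiguration.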
3. Check the limit (for a QOS…/Assoc… reason)
sacctmgr show qos format=Name,MaxTRESPU%20,GrpTRES%20 # per-QoS GPU caps
sacctmgr show assoc user=<user> format=User,Account,QOS,GrpTRES # the user's limits
A QOSMaxGRESPerUser reason means quota enforcement is working as designed (the Lesson 1B
analogue) - the fix is a smaller request, a different QoS, or a raised limit, not a bug.
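The choice between the three fixes comes down to simple arithmetic on the cap. The sketch below is a rough aid, not Slurm's own accounting: tres_gpu_cap and quota_verdict are hypothetical helper names, the cap is assumed to be a TRES string such as the MaxTRESPU value printed above, and the in-use count is something you supply yourself (for example by counting the user's running GPU jobs).

```shell
#!/usr/bin/env bash
# Sketch: compare a GPU request against a per-user cap.

# Extract the GPU count from a TRES string such as "cpu=32,gres/gpu=4".
# Prints nothing when there is no gres/gpu entry (treated as: no GPU cap).
tres_gpu_cap() {
  printf '%s\n' "$1" | sed -n 's/.*gres\/gpu=\([0-9][0-9]*\).*/\1/p'
}

# quota_verdict REQUESTED IN_USE CAP
quota_verdict() {
  if [ -z "$3" ]; then
    echo "no GPU cap"
  elif [ "$1" -gt "$3" ]; then
    echo "exceeds cap alone: shrink request, change QoS, or raise limit"
  elif [ $(( $1 + $2 )) -gt "$3" ]; then
    echo "blocked until usage drops"
  else
    echo "within cap"
  fi
}

# Example with made-up numbers: a 2-GPU request, 3 GPUs in use, cap of 4.
#   quota_verdict 2 3 "$(tres_gpu_cap 'cpu=32,gres/gpu=4')"
```

"Blocked until usage drops" is the case that clears on its own once the user's running jobs finish; "exceeds cap alone" needs one of the three fixes.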
Resolution verification
A "fixed" job either transitions to R, or you can state precisely why it waits
(capacity / priority / an intentional quota).
Prevention
- Right-size --gres=gpu:N to the largest node; reject impossible requests in review.
- Document QoS/association GPU limits so QOSMax… reasons are expected, not surprises.
- Alert on jobs Pending longer than a threshold with a terminal reason (ReqNodeNotAvail, a Dependency that can't clear).
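The right-sizing rule in the first bullet can be checked mechanically before submit, for example in a review hook. This is a minimal sketch: gres_gpu_count and gres_fits_node are hypothetical helper names, the largest-node GPU count is passed in by hand, and only a single gpu[:type][:count] value is handled (no comma-separated GRES lists).

```shell
#!/usr/bin/env bash
# Sketch: reject a --gres value that asks for more GPUs than any node has.

# gres_gpu_count GRES
# "gpu:4" -> 4, "gpu:a100:4" -> 4, "gpu" or "gpu:a100" -> 1 (count defaults to 1).
gres_gpu_count() {
  case "$1" in
    gpu)   echo 1 ;;
    gpu:*) n=${1##*:}
           case "$n" in
             ''|*[!0-9]*) echo 1 ;;   # last part is a type name, not a count
             *)           echo "$n" ;;
           esac ;;
    *)     echo 0 ;;                  # not a GPU request
  esac
}

# gres_fits_node GRES MAX_GPUS_PER_NODE
gres_fits_node() {
  want=$(gres_gpu_count "$1")
  if [ "$want" -le "$2" ]; then
    echo "ok: $want GPU(s) per node fits"
  else
    echo "reject: $want GPU(s) per node exceeds largest node ($2)"
    return 1
  fi
}

# Example: the impossible request from the triage table.
#   gres_fits_node gpu:16 8
```

Because --gres counts are per node, the comparison is against the largest single node, not the cluster total.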
Drill in this lab
Lesson 2 make phase3-demo reproduces
three of these on purpose: scenario 2 (rejected at submit), scenario 3
(QOSMaxGRESPerUser), scenario 4 (Resources under queue pressure). Walk this runbook
against that queue.