Skip to content

Runbook - Slurm Job Pending with a GRES/QOS Reason

Severity: Medium - GPU work isn't running; could be a healthy queue wait or an unsatisfiable request that will never run. The first triage step is telling them apart. Applies to: Slurm GPU clusters. Exercised in the Lesson 2 simulation (fake-GRES Slurm lab); the scheduling reasons are identical on real hardware.

Symptom

  • A GPU job sits PD (Pending) in squeue, or is rejected at submit.
  • The REASON column (or scontrol show job) names a GRES/QOS-related cause.

Triage order

Slurm prints the cause directly - read it first, then confirm against the supply side.

1. Read the reason

squeue -l                          # STATE + REASON for every job
squeue -j <JOBID> -o '%i %T %r'    # one job: id, state, reason
scontrol show job <JOBID>          # full request: ReqTRES, gres, QOS, partition

Get <JOBID> from the first column of squeue. Map the reason:

Reason Meaning Will it ever run?
Resources nothing free satisfies it right now ✅ yes, when capacity frees
Priority resources exist but higher-priority jobs go first ✅ yes, eventually
QOSMaxGRESPerUser / AssocGrpGRES… a QoS/association GPU (TRES) limit is hit ⚠️ only if the limit/usage changes
ReqNodeNotAvail a required node is down/drained/reserved ⚠️ when the node returns - see node-drained runbook
Rejected at submit: Requested node configuration is not available the request can't fit any node (e.g. --gres=gpu:16 on 8-GPU nodes) ❌ never - fix the request

The submit-time rejection is the Slurm-vs-Kubernetes contrast: Slurm refuses an impossible request up front; K8s would accept it and leave the pod Pending forever.

2. Check the supply side (for Resources / Priority)

sinfo -N -l                                                          # nodes: state, idle vs allocated
scontrol show node <node> | grep -E 'Gres|CfgTRES|AllocTRES|State'  # GPUs configured vs allocated

If every GPU node is fully allocated (gres/gpu exhausted), the job is a legitimate queue wait, not a misconfiguration.

3. Check the limit (for a QOS…/Assoc… reason)

sacctmgr show qos format=Name,MaxTRESPU%20,GrpTRES%20             # per-QoS GPU caps
sacctmgr show assoc user=<user> format=User,Account,QOS,GrpTRES   # the user's limits

A QOSMaxGRESPerUser means quota enforcement is working as designed (the Lesson 1B analogue) - the fix is a smaller request, a different QoS, or a raised limit, not a bug.

Resolution verification

squeue -j <JOBID> -o '%i %T %r'   # → RUNNING, or a reason you've explained/accepted

A "fixed" job either transitions to R, or you can state precisely why it waits (capacity / priority / an intentional quota).

Prevention

  • Right-size --gres=gpu:N to the largest node; reject impossible requests in review.
  • Document QoS/association GPU limits so QOSMax… reasons are expected, not surprises.
  • Alert on jobs Pending longer than a threshold with a terminal reason (ReqNodeNotAvail, a Dependency that can't clear).

Drill in this lab

Lesson 2 make phase3-demo reproduces three of these on purpose: scenario 2 (rejected at submit), scenario 3 (QOSMaxGRESPerUser), scenario 4 (Resources under queue pressure). Walk this runbook against that queue.