Lesson 2 - Slurm GPU Workload Management
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder: `git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/02-slurm-gpu-platform`.
STATUS: RUNNABLE (Phase 3). This lesson stands up a real Slurm cluster in Docker with fake GRES and runs the four scheduling scenarios - validated, with captured output in
../06-validation-reports/slurm-gres-validation.md. The config files here are the actual ones the cluster uses. The Slurm reference for any directive is https://slurm.schedmd.com/.
Kubernetes isn't the only scheduler in AI/HPC. Slurm runs a large share of the world's HPC and GPU training clusters. This lesson is the Slurm counterpart to Lesson 1: same goal (schedule GPU work, diagnose why it's stuck), different scheduler. The same cheap-learning trick applies, because fake GRES is to Slurm what KWOK fake nodes are to Kubernetes.
Learning objectives - after this lesson you'll be able to:
- Stand up a Slurm-in-Docker cluster and schedule GPU jobs with fake GRES (no GPU needed), keeping fake vs real `--gres=gpu` strictly separated.
- Read and reason about `slurm.conf`, `gres.conf`, `cgroup.conf`, and `slurmdbd.conf`.
- Write GPU job scripts (small, large, array, cuda-check) and submit them.
- Apply QoS limits and fair-share, and read accounting data (`sacct` / `sreport`).
- Run drain/resume drills and triage pending reasons (the Slurm analogue of Lesson 1's Pending-pod triage).
Mode: Simulation (fake GRES, no GPU) for scheduling logic; optional real
`--gres=gpu` validation on the Lesson 6 hardware.
Why fake GRES is legitimate (same idea as Lesson 1): slurmctld scheduling
does not require the device to exist - GRES scheduling is control-plane logic. So
fake GRES proves Slurm's scheduling behaviour, and nothing about CUDA. The same
sim-vs-real boundary you learned in Lesson 1 applies here.
Running it, step by step
Run these one at a time - not as a single block. The cluster stays up between
steps, so you can inspect it by hand (step 3). Save the last one, phase3-down, for when
you're completely finished.
1. Start the cluster - builds and starts slurmctld / slurmdbd / MariaDB / 2× slurmd / login, and bootstraps accounting:
2. Submit the four scenarios and print the queue and pending reasons: `make phase3-demo`
3. Inspect what the demo did, by hand. Open a shell on the login node (the cluster is still up):
From inside that shell, run these one at a time - first the fleet (`sinfo`), then the queue (`squeue`):
To inspect a single job you need its job id - the JOBID (first) column of
squeue. Copy one from there and pass it to scontrol:
scontrol show job <JOBID> # e.g. `scontrol show job 12` - requested gres, assigned node, pending reason
Type exit to leave the login shell. (You can re-open it the same way after step 4 too.)
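`scontrol show job` prints whitespace-separated `Key=Value` pairs, so the few fields triage needs can be pulled out with a small filter. The following is a minimal sketch under that assumption; `job_field` is a hypothetical helper name, not part of Slurm or of this repository, and it will not handle values that contain spaces.

```shell
# job_field KEY - read `scontrol show job` output on stdin and print KEY's value.
# Hypothetical helper; assumes whitespace-separated Key=Value tokens.
job_field() {
    tr -s '[:space:]' '\n' |
        awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Usage on the login node (replace <JOBID> with an id copied from squeue):
#   scontrol show job <JOBID> | job_field JobState
#   scontrol show job <JOBID> | job_field Reason
```

Reading `JobState` and `Reason` together is usually enough to decide whether to look at the request or at the nodes next.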
4. Drain/resume drill - drain a node, watch work route around it, then resume it:
5. Capture evidence - sinfo / squeue / sacct / qos into 06-validation-reports/: `make phase3-evidence`
6. Tear it all down - only when you're done (removes containers + volumes): `make phase3-down`
Checkpoint - the four scenarios. After `make phase3-demo` you should see the following in `squeue` (and, for the rejected job, in the submit output):
- `gpu-small` → RUNNING (1 GPU fits).
- `gpu-toobig` → REJECTED at submit ("Requested node configuration is not available") - the impossible 16-GPU request. This is the Slurm-vs-Kubernetes contrast: K8s would accept it and leave it Pending forever; Slurm refuses up front because no node could ever satisfy it.
- `gpu-qoscap` → PENDING `QOSMaxGRESPerUser` - the QoS cap (4 GPUs/user) in action; quota enforcement as an accounting decision, exactly like Lesson 1B.
- qparray → 16 RUNNING, the rest PENDING `Resources` - both nodes fully allocated (`gres/gpu=8` each), GPUs the binding constraint.
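To compare a run against the checkpoint without counting rows by eye, the queue can be collapsed into one line per state/reason pair. This is a sketch only: `summarize_queue` is a hypothetical helper, and it assumes input produced by `squeue -h -r -o '%T %r'` (no header, one array element per line, state followed by reason).

```shell
# summarize_queue - count jobs per "STATE REASON" line read from stdin.
# Hypothetical helper; feed it the output of: squeue -h -r -o '%T %r'
summarize_queue() {
    # sort groups identical lines, uniq -c counts them,
    # sed strips the padding uniq puts in front of the count
    sort | uniq -c | sed 's/^ *//'
}

# Usage on the login node:
#   squeue -h -r -o '%T %r' | summarize_queue
```

Each output line is a count followed by the state and reason, which maps directly onto the running and pending counts listed in the checkpoint.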
A lesson is only "done" when your own run's output is captured. make phase3-evidence
writes it; reference it from the validation report.
VERSION NOTE: the cluster uses the Slurm 21.08 that Ubuntu 22.04 packages. The fake GPUs are 8 empty char-device nodes per compute node (`mknod`, major 195) that `gres.conf` points `File=` at - that's what makes slurmd register `gpu:8` with no driver behind it. `cgroup.conf` is shipped to read but intentionally not loaded (21.08 predates its modern `CgroupPlugin` syntax, and the lab uses `task/none`).
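The device-node trick from the version note can be sketched as a dry run that only prints the `mknod` commands. The device paths (`/dev/nvidia0` and up), the minor numbers and the `666` mode are assumptions made for illustration; the lab's real logic lives in `docker/entrypoint.sh` and may differ. `fake_gpu_cmds` is a hypothetical name.

```shell
# fake_gpu_cmds COUNT - print (do not run) mknod commands for COUNT fake GPUs.
# Assumed layout: /dev/nvidia0 .. /dev/nvidia<COUNT-1>, minor number = index.
fake_gpu_cmds() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        # c = character device, 195 = the major number named in the note above
        echo "mknod -m 666 /dev/nvidia$i c 195 $i"
        i=$((i + 1))
    done
}

# Review the output first, then (as root, inside a compute container):
#   fake_gpu_cmds 8 | sh
# A matching gres.conf line would then look roughly like (assumed, not copied
# from the repository):
#   NodeName=node[1-2] Name=gpu File=/dev/nvidia[0-7]
```

Printing instead of executing keeps the sketch harmless on a workstation, where creating device nodes needs root and is rarely what you want.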
Concept 1 - The moving parts
            ┌────────────┐   accounting   ┌────────────┐      ┌─────────┐
sbatch  ───▶│ slurmctld  │───────────────▶│ slurmdbd   │─────▶│ MySQL/  │
squeue      │ (controller│                │ (accounting│      │ MariaDB │
scontrol    │  = the     │                │  daemon)   │      └─────────┘
            │  scheduler)│                └────────────┘
            └─────┬──────┘
                  │ dispatches job steps
        ┌─────────┼─────────┐
        ▼         ▼         ▼
    ┌────────┐┌────────┐┌────────┐
    │ slurmd ││ slurmd ││ slurmd │  one per compute node - launches and
    │ node 1 ││ node 2 ││ node 3 │  supervises the actual processes
    └────────┘└────────┘└────────┘
The mapping to what you already know from Lesson 1:
| Kubernetes concept | Slurm counterpart | Note |
|---|---|---|
| kube-scheduler + API server | `slurmctld` | One brain, not separate components |
| kubelet | `slurmd` | Per-node agent |
| Pod / Job | Job (with job steps inside) | Slurm jobs are batch-first |
| `nvidia.com/gpu` resource | GRES `gpu` (`--gres=gpu:2`) | Both are scheduler-side counts |
| ResourceQuota / KAI queue quota | QoS + associations + TRES limits | Richer, account-hierarchy-based |
| KAI fair-share between queues | Fair-share (usage-decay priority) | Built into Slurm's priority plugin |
| Gang scheduling (KAI) | Native - a job's allocation is all-or-nothing | Slurm allocates the whole job atomically |
| Pod Pending + Events | Job `PD` + Reason column | Same triage skill, different spelling |
| `kubectl describe pod` | `scontrol show job <id>` | Your main triage tool |
| Prometheus metrics | `sacct` / `sreport` accounting | Slurm's history lives in slurmdbd |
Notice what Slurm gives you for free that Lesson 1B needed KAI for: atomic whole-job allocation (gang), fair-share, and per-account quotas. This is why training shops love Slurm - and why learning both schedulers makes each one's design choices legible.
Concept 2 - GRES vs TRES (the two acronyms that matter)
- GRES (Generic RESource): a per-node consumable device - `gpu` is the canonical one. Declared on nodes, requested by jobs (`--gres=gpu:2`, or per-task forms like `--gpus-per-task`).
- TRES (Trackable RESource): the accounting and limits view - CPU, memory, nodes, and GRES all become TRES so that QoS limits and fair-share can say things like "this account may hold at most 16 GPUs at once."
The pair is the Slurm version of Lesson 1's request/allocatable arithmetic plus Lesson 1B's quota policy, unified in one system.
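TRES shows up in both directions: as the unit a limit is written in, and as the unit usage is reported in. The sketch below illustrates each side. Both function names are hypothetical, the QoS name `normal` in the usage line is an assumption (the lab's QoS names are set by its own setup script), and limits only bite when accounting enforcement is switched on in `slurm.conf`.

```shell
# gpu_cap_cmd QOS N - print the sacctmgr command that caps each user at N GPUs
# under QOS. Printed rather than executed so it can be reviewed first.
gpu_cap_cmd() {
    echo "sacctmgr -i modify qos $1 set MaxTRESPerUser=gres/gpu=$2"
}

# gpus_from_tres - read a TRES string such as "cpu=4,gres/gpu=2,mem=8G" from
# stdin and print its GPU count (0 when there is no gres/gpu entry).
gpus_from_tres() {
    tr ',' '\n' | awk -F= '$1 == "gres/gpu" { n = $2 } END { print n + 0 }'
}

# Usage on the login node:
#   gpu_cap_cmd normal 4        # inspect, then run the printed command
#   sacct -X -n -P -j <JOBID> --format=AllocTRES | gpus_from_tres
```

The limit side is what produces a `QOS...` pending reason; the reporting side is what `sacct` and `sreport` aggregate afterwards.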
The fake-GRES idea, in config shape:
# ILLUSTRATIVE slurm.conf fragment - the directives the cluster actually uses are in config/slurm.conf.
GresTypes=gpu
NodeName=node[1-2] Gres=gpu:8 CPUs=8 RealMemory=16000 # 8 "GPUs" per fake node
The controller schedules against that declaration. Whether /dev/nvidia0 exists
only matters to enforcement on the compute node (cgroup device constraint), which is
exactly the part fake GRES does not prove - same decision/enforcement split as
Lesson 1C.
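The supply side the controller schedules against can be read back from `sinfo` and totalled. This is a minimal sketch: `total_gpus` is a hypothetical helper, and it assumes `gpu` is the only GRES on each node and that every node appears once (a single partition).

```shell
# total_gpus - sum GPU counts from per-node GRES strings on stdin,
# e.g. from: sinfo -h -N -o '%G'   (one line such as "gpu:8" per node)
total_gpus() {
    awk '{
        sub(/\(.*/, "")            # drop a trailing "(S:0-1)" style suffix
        n = split($0, part, ":")   # "gpu:8" or "gpu:type:8": count is last
        if (part[1] == "gpu") total += part[n]
    } END { print total + 0 }'
}

# Usage on the login node:
#   sinfo -h -N -o '%G' | total_gpus
```

With two nodes declaring `gpu:8` each, the total is 16, which is the ceiling the array scenario runs into in the checkpoint.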
Concept 3 - The job script you'll write
#!/bin/bash
# ILLUSTRATIVE job script shape (the shebang must be the first line, or sbatch rejects the script)
#SBATCH --job-name=train-small
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=00:10:00
echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-<none>}"
srun ./work.sh
On a real GPU node, Slurm sets CUDA_VISIBLE_DEVICES to the allocated devices -
the same environment-variable contract the
cuda-visible-devices runbook
debugs. On the fake-GRES cluster the variable is the tell-tale: scheduling succeeded,
but there's no real device behind it. That contrast will be captured explicitly in
the validation report.
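Submitting a script like the one above and going straight to inspection is easier when the job id is captured at submit time. A small sketch, assuming only that `sbatch --parsable` prints the id, optionally followed by `;cluster`; `submit_job` and the file name `train-small.sh` are hypothetical.

```shell
# submit_job SCRIPT - submit with sbatch and print only the numeric job id.
# `sbatch --parsable` prints "jobid" or "jobid;cluster"; cut keeps the id.
submit_job() {
    sbatch --parsable "$1" | cut -d';' -f1
}

# Usage on the login node:
#   jobid=$(submit_job train-small.sh)
#   scontrol show job "$jobid"
```

If the request can never be satisfied, `sbatch` fails at this point and no id is printed, which is the submit-time rejection described for the too-big scenario.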
Concept 4 - Pending-reason triage (the skill, previewed)
Lesson 1 taught you that "Pending" has distinguishable root causes. Slurm prints the
cause directly in the queue - the REASON column of squeue / scontrol show job:
| Reason (examples) | Meaning | Lesson 1 analogue |
|---|---|---|
| `Resources` | Nothing free that satisfies the request right now | Queue pressure (scenario 4) |
| `Priority` | Resources exist but higher-priority jobs go first | KAI priority ordering (1B) |
| `ReqNodeNotAvail` | A required node is down/drained/reserved | Node NotReady / cordoned |
| `Dependency` | Waits on another job (`--dependency`) | initContainer-ish ordering |
| `QOS…Limit` family | A QoS/association TRES limit (e.g. max GPUs per user) is hit | KAI quota enforcement (1B) |
| `JobHeldUser` / `JobHeldAdmin` | Explicitly held | `kubectl cordon`-style human action |
The triage loop you'll drill: `squeue` → read the reason → `scontrol show job <id>`
for the request → `sinfo` / `scontrol show node` for the supply side → fix or
escalate. Identical shape to the kubectl loop in Lesson 1, Step 3.
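The "read the reason, then look at the right side" step of the loop can be written down as a lookup. This is one possible mapping, not the runbook's official one: `next_check` is a hypothetical helper, and the suggested commands are standard Slurm CLI calls chosen for illustration.

```shell
# next_check REASON - print a command worth running next for a pending reason.
# Hypothetical helper; the mapping is a suggestion, not a Slurm feature.
next_check() {
    case "$1" in
        Resources)        echo "sinfo -N -o '%N %T %G'" ;;   # what is free?
        Priority)         echo "sprio -l" ;;                 # who is ahead?
        ReqNodeNotAvail*) echo "sinfo -R" ;;                 # why is a node out?
        Dependency)       echo "scontrol show job <id>" ;;   # see Dependency=
        QOS*)             echo "sacctmgr show qos" ;;        # which limit?
        JobHeld*)         echo "scontrol release <id>" ;;    # lift the hold
        *)                echo "scontrol show job <id>" ;;   # fall back
    esac
}

# Usage: next_check QOSMaxGRESPerUser
```

Keeping the mapping in one `case` statement makes it easy to extend as new reasons turn up during drills.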
Concept 5 - Drain/resume, the operational drill
scontrol update nodename=<n> state=drain reason="lab drill" takes a node out of
scheduling without killing running work; state=resume brings it back. The drill -
drain, observe jobs route around it, resume, observe backfill - is the Slurm
counterpart of cordon/uncordon, and feeds the
slurm-node-drained runbook.
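The drill can be wrapped in three small functions so each phase is a single call. This is a sketch built on the `scontrol update` commands quoted above; the function names are hypothetical, the node name `node1` in the usage lines is assumed from the illustrative `node[1-2]` fragment, and the repository's own `drain-drill` script may do this differently.

```shell
# drain_node NODE [REASON] - stop scheduling new work on NODE; running jobs continue.
drain_node()  { scontrol update nodename="$1" state=drain reason="${2:-lab drill}"; }

# resume_node NODE - return NODE to service.
resume_node() { scontrol update nodename="$1" state=resume; }

# node_state NODE - print NODE's current state (for example draining, drained or idle).
node_state()  { sinfo -h -n "$1" -o '%T'; }

# Drill, run on the login node one line at a time:
#   drain_node node1 "lab drill"
#   node_state node1        # then check squeue: new work should avoid node1
#   resume_node node1
#   node_state node1
```

Checking `node_state` between steps makes the transition visible: a node that still has running jobs reports that it is draining before it reports drained.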
What's in this directory
- `docker/` - the Slurm-in-Docker definition: one `Dockerfile` (Ubuntu + `slurm-wlm`), `docker-compose.yml` (controller, slurmdbd, MariaDB, 2× compute, login), and `entrypoint.sh` (role dispatch, munge, fake GPU device nodes).
- `config/` - the four config files, annotated line-by-line: `slurm.conf`, `gres.conf` (the fake/real boundary), `cgroup.conf` (read-only artifact), `slurmdbd.conf`.
- `jobs/` - the four sbatch scenarios.
- `scripts/` - `up` / `demo` / `setup-qos` / `drain-drill` / `down`.
Optional real-GPU extension: run a real `--gres=gpu:1` + `nvidia-smi` job on the
Lesson 6 machine to close the loop the way Lesson 6 did for Kubernetes. Keep its
evidence strictly separate from the fake-GRES section of the report.
Related runbooks: slurm-job-pending-reason-gres.md, slurm-node-drained.md.
Evidence: lands in ../06-validation-reports/slurm-gres-validation.md.
This lesson is only "Complete" once that report holds captured output.
What the sim will and won't prove: fake GRES proves slurmctld's scheduling,
QoS/TRES limit enforcement (an accounting decision), fair-share ordering, and the
triage workflow. It proves nothing about device binding, cgroup device isolation,
CUDA_VISIBLE_DEVICES correctness against real devices, or CUDA execution - those
require the optional real-GPU extension. Ledger: fake-vs-real-limitations.md.
Next: Lesson 3 - Observability.