Lesson 1B - Queue-Based GPU Scheduling with KAI Scheduler¶
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder: git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/01-k8s-gpu-platform/kai-scheduler. Browse this lesson on GitHub
Course home: AI Factory Operations Lab · Previous: Lesson 1 - Kubernetes GPU Scheduling · Next: Lesson 1C - GPU sharing with HAMi
Do Lesson 1, Step 3 scenario 4 (queue pressure) first - this lesson picks up exactly where that wall is.
What this lesson teaches¶
Queue policy is the most valuable, most expensive-to-learn GPU-platform skill: which team's pod binds, in what order, and whether a group binds at all. KAI Scheduler is NVIDIA's open-source scheduler for exactly that (hierarchical queues, quota, over-quota borrowing, reclaim, gang scheduling, fair-share).
You can run KAI with no real GPU, but there is an important catch you must know up front.
The cluster requirement (read before you run)¶
KAI does not schedule against raw nvidia.com/gpu in a node's allocatable; its GPU accounting comes from a GPU-operator topology. That is exactly why Lesson 1's fleet is KWOK + the fake-gpu-operator (the operator advertises GPUs and provides the topology). KAI simply reuses that shared Lesson 1 fleet - make up here builds the same fleet as make phase1-up and then installs KAI on top. No separate fleet, no duplicated setup.
Two things that would otherwise waste your time (both handled by the Lesson 1 scripts this reuses):
- Chart source. The fleet uses the run.ai JFrog chart (https://runai.jfrog.io/artifactory/api/helm/fake-gpu-operator-charts-prod). The ghcr.io/run-ai/... OCI build is DRA-oriented and never populates nvidia.com/gpu.
- scheduler.kubeScheduler.imageTag is not used by KAI's chart; KAI versions itself. (That knob is HAMi's, Lesson 1C.)
🎯 Learning objectives - after this lesson you can:
- Explain what the default kube-scheduler cannot do for a multi-team GPU cluster.
- Install KAI Scheduler on the shared Lesson 1 fleet and point workloads at a queue.
- Design a hierarchical queue + quota model and demonstrate enforcement (the validated, runnable exercise here).
- Explain over-quota borrowing, reclaim, and gang scheduling, and read the precise limits of demonstrating them on a fake-GPU simulation.
🧭 Mode: Simulation (no GPU), via the fake-gpu-operator. Real runtime behavior (CUDA, memory isolation, MIG) is out of scope; see Lesson 6.
KAI can also slice GPUs - this lesson just doesn't. Every exercise here asks for whole GPUs (nvidia.com/gpu: 1) to keep the focus on queue policy - quota, borrowing, reclaim, gang. KAI does fractions too, so read "whole-GPU" as a scoping choice, not a KAI limit. The hands-on sharing lab is Lesson 1C (HAMi).
What KAI can do as of June 2026. A pod can ask for a slice two ways - a fixed amount of GPU memory, or a gpu-fraction (e.g. 0.5) that KAI converts to a memory limit once it picks the node. KAI owns the scheduling: which pods share which GPU. For the cap inside the container it sets a CUDA_DEVICE_MEMORY_LIMIT and leaves enforcement to HAMi-core, run on each GPU node (NVIDIA/KAI-Scheduler #60, merged 2026-06-09).
So KAI and HAMi aren't rivals here - they're two layers of one stack. KAI brings the queue and the scheduling; HAMi-core brings the hard memory isolation, which is exactly what Lesson 1C teaches.
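Why a fraction can only become a concrete limit after node selection is easiest to see with numbers. The sketch below is a toy calculation, not KAI's code: it assumes the limit is simply the fraction multiplied by the chosen GPU's memory, and the node names and memory sizes are hypothetical.

```python
# Toy illustration only. ASSUMPTION: limit = fraction * device memory.
# Node names and memory sizes are hypothetical, not taken from the lesson fleet.
GPU_MEMORY_MIB = {"node-a": 81920, "node-b": 24576}


def memory_limit_mib(gpu_fraction: float, node: str) -> int:
    """Return the per-container memory cap implied by a fraction on a given node."""
    if not 0 < gpu_fraction <= 1:
        raise ValueError("gpu_fraction must be in (0, 1]")
    return int(gpu_fraction * GPU_MEMORY_MIB[node])


if __name__ == "__main__":
    for node in GPU_MEMORY_MIB:
        print(node, memory_limit_mib(0.5, node))
```

The same 0.5 request yields a different cap on each node, which is why the conversion has to follow the placement decision rather than precede it.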
Prerequisites: docker, kind, kubectl, helm, jq. The lesson's make up builds
the shared Lesson 1 fleet (kind + KWOK + fake-gpu-operator, 32 GPUs) if it is not
already up, then installs KAI.
Set up once¶
cd portfolio-lab/01-k8s-gpu-platform/kai-scheduler
make up # shared Lesson 1 fleet (kind+KWOK+fake-gpu-operator, 32 GPUs) + install KAI
make queues # create the namespace + the queue hierarchy (see manifests/queues.yaml)
make queues is a separate step on purpose: open
manifests/queues.yaml and read it first. It defines the
hierarchy the exercises use - a parent queue (ai-factory) with two leaf queues
(team-research, team-prod, quota 8 each) plus a separate gang-demo queue. The
queue is the core KAI primitive, so it is worth seeing the YAML before running anything.
✅ Checkpoint: KAI's pods are Running and the queues exist.
Run make up and make queues once. Each exercise below applies its own manifest
and re-prepares its workloads (it clears the kai-demo namespace first), so you can run
them back to back. Capture evidence and uninstall at the very end, not between
exercises.
The whole loop¶
Each make step prints its own result and ends with a Verify: line - the kubectl
command to re-check that state by hand. (The exercise sections below explain each result.)
cd portfolio-lab/01-k8s-gpu-platform/kai-scheduler
make up # shared Lesson 1 fleet (kind+KWOK+fake-gpu-operator) + KAI
make queues # create the namespace + the queue hierarchy
make demo-quota # A: two teams each fill their 8-GPU quota
make demo-borrow # B: prod submits 16 (double its quota)
make demo-reclaim # C: research returns (run right after demo-borrow)
make demo-gang # D: a 10-GPU gang binds all-or-none
make evidence # snapshot queues/pods/events into evidence/<timestamp>/
make uninstall # delete the whole kind cluster (KAI + fleet + workloads)
The exercises (run in order)¶
Exercise A - quota enforcement (validated)¶
Applies manifests/exercise-a-quota.yaml: 8
single-GPU pods to team-research and 8 to team-prod (quota 8 each). Open the file
to see how a pod joins a queue (schedulerName: kai-scheduler + the
kai.scheduler/queue label).
Expect: both queues reach 8 Running, 0 Pending - neither exceeds its quota while the other is using its share. Confirm with the Verify: command that make demo-quota prints at the end of its output.
This is the validated result end to end: the operator advertises GPUs, KAI's webhook routes the pods, and quota is enforced per queue.
Exercise B - borrowing idle capacity¶
Applies manifests/exercise-b-borrow.yaml: leaves
team-research idle and submits 16 pods to team-prod (double its quota).
- Concept: team-prod should borrow the idle GPUs and run beyond its quota.
- Observed on this fake fleet: it stays at ~8. KAI did not lend a sibling's idle-but-guaranteed capacity even with limit > quota. The script prints this. Treat it as a concept demo; the over-quota behavior needs a real multi-tenant cluster (or deeper KAI tuning) to reproduce.
Exercise C - reclaim (run immediately after B)¶
Applies manifests/exercise-c-reclaim.yaml.
Run this right after make demo-borrow, with nothing in between - it adds
team-research's pods on top of the still-running borrow workload (and errors if
prod-borrow is not present).
- Concept: when the owner returns, KAI evicts borrowed pods so team-research gets its guaranteed share.
- Observed: because borrowing did not occur in B, there is nothing to reclaim; team-research simply takes free GPUs. Inspect with kubectl get events -n kai-demo --sort-by=.lastTimestamp.
Exercise D - gang scheduling (anti-deadlock)¶
Applies manifests/exercise-d-filler.yaml to fill
the fleet down to ~8 free, then manifests/exercise-d-gang.yaml
(note the kai.scheduler/batch-min-member: "10" annotation - the all-or-none knob).
- Concept: all-or-none - the gang should stay entirely Pending until 10 GPUs are free, rather than grabbing 8 and deadlocking.
- Observed: a plain Deployment schedules as independent per-pod groups (so ~8 run), and a batch Job is gang-grouped but KWOK marks its pods Completed instantly, so the held state isn't observable here. The script explains this as it runs. Gang is best validated on a real cluster.
Exercise E (priority & starvation) has no make target; follow Part 7 manually if you want to explore it.
Capture evidence, then tear down (at the end)¶
make evidence # snapshot queues, pods, and events into evidence/<timestamp>/
make uninstall # delete the whole kind cluster (KAI + the shared fleet + all workloads)
make evidence captures whatever is deployed at that moment, so run it right after the
exercise you want to document (for a clean record, capture Exercise A, the
validated one).
Run make uninstall last - after you have worked through all the exercises and read the Parts below. It deletes the entire kind cluster (KAI and the shared Lesson 1 fleet), so anything not captured with make evidence is gone. There is no need to tear down between exercises.
The Parts below explain the concepts and the manual manifests behind each exercise.
Part 1 - The gap you're filling¶
Re-run the baseline so the problem is fresh:
kubectl get pods -n gpu-demo -l app=queue-pressure -o wide
kubectl get pods -n gpu-demo -l app=queue-pressure --field-selector status.phase=Pending | wc -l
With the default kube-scheduler you get roughly 31 Running and 9 Pending pods
(the fleet has 32 GPUs, and scenario 1's cuda-batch-small already holds one) - in
arrival order, and that's all it can do. The default scheduler has no concept of:
| Missing capability | The question it can't answer | What it costs you |
|---|---|---|
| Queues / quota | Which team owns these GPUs when demand exceeds supply? | First-come monopolises the fleet; other teams blocked |
| Borrowing | Can team B use team A's idle GPUs right now? | Idle GPUs sit dark while jobs wait - burning money |
| Reclaim | When team A comes back, can it take its GPUs back? | Either A is starved, or B never yields - pick your pain |
| Fair-share | Who's been under-served lately and should go next? | Loud/frequent submitters crowd out everyone else |
| Gang scheduling | Can this 8-pod job get all 8 GPUs or none? | Partial allocation: 5 pods hold GPUs waiting for 3 that never come - deadlock that wastes GPUs |
| Priority | Is this a production job that should jump the queue? | Batch experiments delay revenue-serving work |
💡 The financial framing is what matters in practice: an idle GPU is money on fire, and a partially gang-allocated distributed job is worse - it holds GPUs hostage while making no progress. Queue schedulers exist to turn the Pending pile above into policy.
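The arrival-order behaviour of the baseline can be reproduced with a few lines of arithmetic. This is a toy model of first-come-first-served binding, not the kube-scheduler's actual algorithm. The 40-pod burst is inferred from the 31 Running + 9 Pending figures in this Part, and the team names are hypothetical.

```python
from collections import Counter


def first_come_first_served(free_gpus, arrivals):
    """Bind single-GPU pods strictly in arrival order until GPUs run out."""
    running, pending = Counter(), Counter()
    for team in arrivals:
        if free_gpus > 0:
            running[team] += 1
            free_gpus -= 1
        else:
            pending[team] += 1
    return running, pending


if __name__ == "__main__":
    # 32-GPU fleet with one GPU already held, as in the baseline.
    # ASSUMPTION: a 40-pod burst from one team, then 8 pods from a second team.
    arrivals = ["team-a"] * 40 + ["team-b"] * 8
    running, pending = first_come_first_served(32 - 1, arrivals)
    print("running:", dict(running))
    print("pending:", dict(pending))
```

The first submitter takes all 31 free GPUs and the later team gets none, which is the "first-come monopolises the fleet" row of the table in miniature.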
Part 2 - KAI Scheduler, and what runs where¶
KAI Scheduler is NVIDIA's open-source Kubernetes scheduler (Apache-2.0, derived from Run:ai's scheduling engine) built for exactly these problems: hierarchical queues, quota with over-quota borrowing and reclaim, gang scheduling, bin-packing/spread strategies, priorities, and GPU sharing concepts.
How it coexists with the fake fleet:
KAI components (real pods)              Workload pods (target fake nodes)
─────────────────────────              ─────────────────────────────────
scheduler / podgrouper / etc.           cuda-* pods with schedulerName: kai
run on the REAL kind worker   ──────▶   bound by KAI onto kwok-gpu-* fake nodes
(they're just controllers)              (KWOK simulates them reaching Running)
💡 Why this is sound: KAI's own pods are ordinary controllers - they run on the
real worker node and make decisions. The workloads they schedule carry
schedulerName: <kai> and a queue label, and get bound onto the fake GPU nodes. The
binding, the gang check, the quota math, the reclaim/eviction - all happen at the API
level, which is exactly what KWOK serves. The containers never execute, but the
scheduling decisions are real.
⚠️ Don't copy the manifests below blindly. KAI Scheduler is actively evolving. The exact Helm chart name/values, the Queue CRD apiVersion/fields, and the queue/priority label keys change between releases. Every manifest in this lesson is marked ILLUSTRATIVE and shows the shape of the concept, not a guaranteed copy-paste. Confirm exact names against the official repo and docs at the time you run this: https://github.com/NVIDIA/KAI-Scheduler
When you run it for real, the manifests you actually used and the output you captured go into ../../06-validation-reports/.
Install (pattern)¶
# ILLUSTRATIVE - get the current chart name, repo URL, and version from the official
# KAI Scheduler install docs. Do not assume these values are current.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia # verify exact repo/chart in docs
helm repo update
helm install kai-scheduler <official-chart> -n kai-scheduler --create-namespace
✅ Checkpoint: KAI's controller pods are Running in their namespace:
kubectl get pods -n kai-scheduler. You haven't scheduled any workload with it yet -
that comes in the next Parts.
Part 3 - Exercise A: two queues with quota (enforcement)¶
Goal: prove that quota is enforced - a queue cannot exceed its deserved share while another queue wants its own.
- Create two queues with quotas that sum to the fleet (32 fake GPUs). For example team-research = 16, team-prod = 16.
# ILLUSTRATIVE Queue shape - confirm apiVersion/kind/field names in KAI docs.
# The idea: a queue with a "deserved"/guaranteed GPU quota. Real field names vary.
apiVersion: <kai-queue-apiVersion>
kind: Queue
metadata:
  name: team-research
spec:
  resources:
    gpu:
      quota: 16            # deserved/guaranteed share (confirm field name)
      overQuotaWeight: 1   # used when borrowing idle capacity (confirm field name)
- Submit ~16 single-GPU pods to each queue. Point each at KAI and tag the queue:
# ILLUSTRATIVE pod spec fragment - confirm the queue label KEY in KAI docs.
spec:
  schedulerName: <kai-scheduler-name>
# KAI associates a pod to a queue via a label; the exact key changes by release.
# e.g. metadata.labels["<kai-queue-label-key>"]: team-research
You can adapt ../workloads/gpu-deployment-queue-pressure.yaml
by duplicating it per team, setting replicas: 16, adding schedulerName and the
queue label, and keeping the existing KWOK toleration.
✅ Checkpoint: both queues run ~16 pods; neither exceeds its quota while the other
is full. Capture kubectl get pods -n gpu-demo -o wide and the queue status objects.
🔬 Proved on fake GPUs: quota enforcement is a control-plane decision - fully valid here. Not proved: that 16 real GPUs would physically serve the work.
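The bookkeeping behind this checkpoint fits in a few lines. The sketch below is a toy model, not KAI's implementation: it assumes every pod asks for one GPU and that every queue wants at least its own quota, so there is no idle capacity to lend.

```python
def admit_within_quota(quota, requests):
    """Split each queue's single-GPU requests into (running, pending) at its quota.

    ASSUMPTION: all queues are at or above quota, so nothing is idle to borrow.
    """
    result = {}
    for queue, wanted in requests.items():
        running = min(wanted, quota[queue])
        result[queue] = (running, wanted - running)
    return result


if __name__ == "__main__":
    quota = {"team-research": 16, "team-prod": 16}
    # Both teams submit exactly their share: everything runs.
    print(admit_within_quota(quota, {"team-research": 16, "team-prod": 16}))
    # One team over-submits while the other is full: the excess stays Pending.
    print(admit_within_quota(quota, {"team-research": 20, "team-prod": 16}))
```

In the second call team-research is held at 16 Running with 4 Pending, which is the "cannot exceed its deserved share while another queue wants its own" condition stated in the goal.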
Part 4 - Exercise B: borrowing (utilisation)¶
Goal: show idle capacity being lent out, instead of sitting dark.
- Leave team-research idle (submit nothing).
- Submit ~32 single-GPU pods to team-prod (double its 16 quota).
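✅ Checkpoint: team-prod runs more than its 16-GPU quota - it borrows
team-research's idle 16 and approaches 32 Running. Capture the pod list showing
team-prod over quota. (In the scripted Exercise B on this fake fleet, team-prod
stayed at about its quota; if you see the same, record that instead.)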
💡 Why this is the money-saver: without borrowing, half the fleet would idle while
team-prod jobs waited. Borrowing is how queue schedulers keep expensive GPUs busy
and preserve ownership.
🔬 Proved on fake GPUs: the borrowing decision and the resulting placement. The scheduler genuinely allocates beyond quota into idle capacity.
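The borrowing idea can be sketched as a two-pass allocation: guaranteed quota first, then idle GPUs lent to whoever still has demand. This is a deliberately simplified model under stated assumptions (single-GPU pods, no over-quota weights, queues served in dictionary order); KAI's real fair-share logic is more involved.

```python
def allocate_with_borrowing(fleet, quota, demand):
    """Toy two-pass allocation: guaranteed quota first, then lend idle GPUs.

    ASSUMPTIONS: single-GPU pods, quotas sum to at most the fleet size,
    and idle GPUs are handed out in queue order with no weighting.
    """
    # Pass 1: every queue receives min(demand, quota).
    alloc = {q: min(demand[q], quota[q]) for q in quota}
    idle = fleet - sum(alloc.values())
    # Pass 2: lend whatever is idle to queues that still want more.
    for q in quota:
        extra = min(idle, demand[q] - alloc[q])
        alloc[q] += extra
        idle -= extra
    return alloc


if __name__ == "__main__":
    quota = {"team-research": 16, "team-prod": 16}
    # team-research idle, team-prod asks for double its quota.
    print(allocate_with_borrowing(32, quota, {"team-research": 0, "team-prod": 32}))
```

With team-research idle, team-prod ends up with all 32 GPUs; stop after pass 1 and it would be capped at 16, with half the fleet dark.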
Part 5 - Exercise C: reclaim (the hard part)¶
Goal: the owner returns and takes its GPUs back - this is where naive systems fail.
- With team-prod still over quota (borrowing), now submit ~16 pods to team-research.
✅ Checkpoint: KAI reclaims the borrowed GPUs - some team-prod pods are
evicted back to Pending so team-research can reach its guaranteed 16. Capture
the before (prod over quota) and after (prod reclaimed down to ~16, research at ~16)
states, plus the eviction events:
kubectl get events -n gpu-demo --sort-by=.lastTimestamp.
(If borrowing never happened, as in the scripted Exercise C, there is nothing to
reclaim and team-research simply takes free GPUs.)
💡 Why reclaim is the whole point: borrowing without reclaim is just over-commit - the owner gets starved. Reclaim is what makes "borrow idle GPUs" safe: you can lend freely because you can take it back. This is the single most important dynamic in multi-team GPU scheduling, and you just reproduced it with zero GPUs.
🔬 Proved on fake GPUs: the reclaim/preemption decision and eviction. Not proved: graceful checkpoint/restart of a real training job mid-eviction (that's a workload-runtime concern, not a scheduler one).
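The reclaim arithmetic can be sketched as follows. This is a toy model, not KAI's eviction logic: it assumes single-GPU pods, evicts only from queues running above their quota, and guarantees the returning owner nothing beyond its own quota.

```python
def reclaim_for_owner(fleet, quota, running, owner, submitted):
    """Return (new running counts, evictions per queue) after the owner submits pods.

    ASSUMPTIONS: single-GPU pods; only over-quota queues are evicted from;
    the owner is guaranteed GPUs only up to its own quota.
    """
    running = dict(running)
    evicted = {}
    free = fleet - sum(running.values())
    entitled = min(submitted, max(quota[owner] - running[owner], 0))
    shortfall = max(entitled - free, 0)  # GPUs that must come from evictions
    for queue in running:
        over = running[queue] - quota[queue]
        if queue == owner or over <= 0 or shortfall == 0:
            continue
        take = min(over, shortfall)
        running[queue] -= take
        evicted[queue] = take
        shortfall -= take
    running[owner] += entitled - shortfall
    return running, evicted


if __name__ == "__main__":
    # Borrowing happened: prod holds all 32, research returns with 16 pods.
    print(reclaim_for_owner(32, {"team-research": 16, "team-prod": 16},
                            {"team-research": 0, "team-prod": 32},
                            "team-research", 16))
    # Borrowing never happened: research just takes free GPUs, nothing is evicted.
    print(reclaim_for_owner(32, {"team-research": 8, "team-prod": 8},
                            {"team-research": 0, "team-prod": 8},
                            "team-research", 8))
```

The second call mirrors the case where no borrowing took place: with free GPUs available the eviction list is empty, so the absence of eviction events there is expected rather than a failure.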
Part 6 - Exercise D: gang scheduling (anti-deadlock)¶
Goal: prove all-or-nothing placement, the feature that keeps distributed training from deadlocking a cluster.
Set up the trap first (default scheduler): submit a single "job" of 10 pods × 1 GPU into a fleet that has only, say, 8 free GPUs, using the default scheduler. You get 8 pods Running holding GPUs, 2 Pending forever - the job makes zero progress yet occupies 8 GPUs. That's the deadlock.
Now with KAI gang scheduling: submit the same 10-pod job as a single gang (KAI groups the pods of a workload and requires a minimum-member count to bind together).
# ILLUSTRATIVE - KAI groups pods into a "pod group" with a minimum member count.
# The mechanism (a PodGroup-like object, or annotations the pod-grouper reads) and
# its exact fields change by release. Confirm against KAI docs.
# Concept: minMember: 10 → bind all 10 or none.
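✅ Checkpoint: the 10-pod gang stays entirely Pending while only 8 GPUs are free - it does not grab the 8 and block. Free up capacity (delete other pods) and the whole gang schedules together. Capture both states. (On the KWOK fleet the held state may not be observable, as Exercise D notes: Deployment pods are grouped per pod, and Job pods are marked Completed instantly.)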
💡 Why this is gold to learn for free: gang scheduling bugs in the real world cost enormous GPU hours - a 64-GPU job half-allocated wastes 32 GPUs indefinitely. Here you see the all-or-nothing logic directly, on fake nodes, in seconds.
🔬 Proved on fake GPUs: the gang admission decision (bind-all-or-none) and the anti-deadlock behaviour - fully control-plane. Not proved: the actual NCCL all-reduce the gang would run once placed (that needs real GPUs + NVLink/network - Lesson 6 territory, and even there, single-node).
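The difference between per-pod binding and gang admission reduces to a single comparison. The function below is a minimal sketch of that decision for single-GPU pods; the name admit and its parameters are hypothetical, and it ignores everything else a real scheduler weighs (queues, priorities, topology).

```python
def admit(free_gpus, group_size, min_member=None):
    """Return how many pods of a single-GPU group bind.

    min_member=None models independent per-pod scheduling (the trap).
    Otherwise the group binds only if min_member pods fit at once.
    """
    if min_member is None:
        return min(free_gpus, group_size)
    if free_gpus < min_member:
        return 0  # hold the whole gang Pending; no GPU is taken hostage
    return min(free_gpus, group_size)


if __name__ == "__main__":
    print(admit(8, 10))                  # 8: partial allocation, job cannot progress
    print(admit(8, 10, min_member=10))   # 0: gang waits, the 8 GPUs stay usable
    print(admit(10, 10, min_member=10))  # 10: capacity freed, gang binds together
```

The middle case is the one that matters: returning 0 instead of 8 is what keeps the eight free GPUs available to other work while the gang waits.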
Part 7 - Exercise E: priority & starvation¶
Goal: reproduce a low-priority queue being starved, then fix it.
- Flood the fleet with high-priority team-prod work so team-research (lower priority, no guaranteed quota in this variant) gets nothing for a while.
- Observe team-research pods Pending indefinitely - starvation.
- Resolve it: give team-research a guaranteed quota (so reclaim protects it) or adjust fair-share/priority so it eventually gets a turn.
✅ Checkpoint: you can cause starvation on demand and then eliminate it with a quota/priority change, capturing both. That's the exact muscle the queue-starvation runbook exercises.
🔬 Proved on fake GPUs: starvation detection and the fair-share/priority response - all scheduler logic.
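The starve-then-fix sequence above can be modelled with a small allocation function. It is a sketch under simplifying assumptions (single-GPU pods, guaranteed quota honoured first, leftover GPUs handed out strictly by priority), not KAI's fair-share algorithm. The priority values and the 40-pod flood are hypothetical.

```python
def schedule_by_priority(fleet, queues):
    """Allocate guaranteed quota first, then leftover GPUs in priority order.

    ASSUMPTIONS: single-GPU pods; quotas sum to at most the fleet size;
    higher 'priority' value means served earlier.
    """
    alloc = {q["name"]: min(q["demand"], q["quota"]) for q in queues}
    free = fleet - sum(alloc.values())
    for q in sorted(queues, key=lambda q: -q["priority"]):
        extra = min(free, q["demand"] - alloc[q["name"]])
        alloc[q["name"]] += extra
        free -= extra
    return alloc


if __name__ == "__main__":
    prod = {"name": "team-prod", "priority": 10, "quota": 0, "demand": 40}
    # Variant 1: research has no guaranteed quota and is starved.
    starved = {"name": "team-research", "priority": 1, "quota": 0, "demand": 8}
    print(schedule_by_priority(32, [prod, starved]))
    # Variant 2: the fix, a guaranteed quota of 8 for research.
    protected = dict(starved, quota=8)
    print(schedule_by_priority(32, [prod, protected]))
```

In the first variant team-research receives nothing however long it waits; giving it a quota of 8 moves those GPUs out of the priority contest entirely, which is the first of the two fixes listed above.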
What you can and cannot learn here - the precise line¶
| Capability | Learnable on the fake fleet? | Why |
|---|---|---|
| Hierarchical queues & quota enforcement | ✅ Yes | Pure control-plane bookkeeping |
| Over-quota borrowing | ✅ Yes | Placement decision over integers |
| Reclaim / preemption | ✅ Yes | Eviction is an API operation |
| Fair-share ordering | ✅ Yes | Scheduler-internal accounting |
| Gang scheduling (all-or-none) | ✅ Yes | Admission decision before binding |
| Priority & starvation dynamics | ✅ Yes | Ordering logic |
| Bin-pack vs spread placement | ✅ Yes | Node-selection strategy |
| GPU sharing / fractional GPUs (scheduling view) | ⚠️ Partly | KAI's bookkeeping of fractions is visible; runtime memory isolation is NOT - Lesson 1C (HAMi) is where you prove enforcement on real hardware |
| MIG partitioning | ❌ No | Requires real GPU + driver |
| Actual CUDA / NCCL the gang would run | ❌ No | Requires real GPUs (and, for scale, real network) |
💡 The pattern is consistent with the whole course: decisions are learnable on fakes; execution and isolation need real hardware. KAI happens to be almost entirely about decisions - which is why it's such a high-value, low-cost thing to study here.
Evidence to capture (lab notebook)¶
For each exercise, snapshot into your lab notebook:
- kubectl get pods -n gpu-demo -o wide (before/after states)
- the Queue status objects (kubectl get queue -o yaml or the equivalent KAI lists)
- kubectl get events -n gpu-demo --sort-by=.lastTimestamp (borrow/reclaim/gang/evict)
- the exact manifests you applied (since the illustrative ones above are not authoritative)
A claim like "I demonstrated quota borrowing and reclaim between two queues" is only
backed once those captures exist. See
fake-vs-real-limitations.md.
Operational takeaways¶
- Default kube-scheduler vs queue-based AI schedulers: the six gaps in Part 1 are the reason platforms like KAI exist; each maps to real money.
- Quota-with-borrowing vs hard quota: the utilisation-vs-predictability trade-off. Borrowing maximises utilisation; reclaim is the safety valve that makes it acceptable.
- Gang scheduling: non-negotiable for distributed training - without it, large jobs deadlock clusters.
Related runbook: kai-scheduler-queue-starvation.md.
➡️ Back to: Lesson 1 · Next: Lesson 1C - GPU sharing & fractional GPUs with HAMi (the sharing concepts are free; its hands-on isolation half runs in the Lesson 6 rental), and eventually Lesson 6 - Real GPU, where you finally run something below the kubelet on real hardware.