Lesson 1 - Kubernetes GPU Scheduling
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder:
git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/01-k8s-gpu-platform
In this lesson you build a heterogeneous GPU fleet that has no GPUs in it, run
real workloads against it, and learn to diagnose why GPU pods get stuck Pending -
using the exact same kubectl workflow you'd use on a production cluster.
🎯 Learning objectives - after this lesson you can:
- Explain why nvidia.com/gpu is just an integer to the default Kubernetes scheduler, and why that makes fake GPU nodes a legitimate way to study scheduling.
- Model a heterogeneous fleet (A100/H100/L40S pools) with node labels, taints, and GFD-style product labels.
- Deploy a schedulable GPU workload and watch it land on the right pool.
- Diagnose two different Pending causes - capacity mismatch vs fleet mismatch - from kubectl describe and Events alone.
- Reproduce queue pressure (more demand than GPUs) and explain why the default scheduler can't solve it - then go on to Lesson 1B and actually solve it with queues, quota, borrowing, reclaim, and gang scheduling, all on the fake fleet.
🧭 Mode: 📦 Simulation (no GPU). The real-hardware half of this module is Lesson 6.
Prerequisites: Lesson 0 complete (make check passes).
This module has two halves, and the line between them is the whole point:
| Half | Mode | GPU required | Lesson |
|---|---|---|---|
| kind/, kwok/, workloads/, kai-scheduler/, fake-gpu-operator/ | 📦 Control-plane simulation | No | This lesson + 1B |
| hami/ | 📦+🔥 Split - sharing concepts free, isolation needs the Lesson 6 GPU | Optional | Lesson 1C |
| gpu-operator-real/ | 🔥 Real GPU runtime validation | Yes (one NVIDIA GPU) | Lesson 6 |
The big idea (read before you run anything)
💡 The default Kubernetes scheduler never talks to a GPU. It compares integer resource
requests against integer node allocatable values. A node advertising
nvidia.com/gpu: 8 exercises the identical scheduling code path whether those 8
GPUs are real silicon or a number advertised onto a fake node. That's why this
entire lesson works on a laptop - and exactly why it can't tell you anything about
CUDA, NVLink, or GPU memory. Hold onto that distinction; it comes back in every
"what you proved" box.
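The comparison described above can be sketched in a few lines of shell. This is a toy model under stated assumptions, not the scheduler's actual code: the function name `gpu_fits` and its three arguments are hypothetical, chosen only to show that the decision is integer arithmetic.

```shell
# Toy model of the resource-fit check for an extended resource such as
# nvidia.com/gpu. Hypothetical helper, not scheduler code: it shows that
# the decision is an integer comparison and never touches a device.
gpu_fits() {
  # $1 = GPUs requested, $2 = node allocatable, $3 = GPUs already claimed
  [ "$1" -le "$(( $2 - $3 ))" ]
}

# An idle 8-GPU node, real or fake, accepts a 1-GPU pod and rejects a 16-GPU pod.
if gpu_fits 1 8 0; then small="fits"; else small="does not fit"; fi
if gpu_fits 16 8 0; then large="fits"; else large="does not fit"; fi
echo "1 GPU on an idle 8-GPU node: $small"
echo "16 GPUs on an idle 8-GPU node: $large"
```

Nothing in the function knows whether the 8 came from a device plugin on real hardware or from a fake node, which is the point of the lesson.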
Two complementary pieces build the fake fleet:
- KWOK creates the nodes - pure API objects with no kubelet, so you can stamp out a heterogeneous fleet (or thousands of nodes) for free.
- run.ai's fake-gpu-operator provides the GPU layer on those KWOK nodes: it advertises nvidia.com/gpu from a per-pool topology via a device plugin (operator-shaped, like production), and stands up a per-node DCGM exporter emitting synthetic DCGM_FI_* metrics with per-pod attribution.
They are complementary, not alternatives: KWOK = nodes, fake-gpu-operator = GPUs on those nodes. The same fake-GPU mechanism carries through Lessons 1B, 1C, and 3.
Everything is still synthetic: no kubelet, driver, or CUDA, and the DCGM metrics are fabricated. It proves the control plane and the observability pipeline shape.
Deep-dive pages: kwok/README.md (why fake nodes are legitimate) and fake-gpu-operator/README.md (the GPU layer).
Step 1 - Stand up the simulated fleet
Running make phase1-up executes four scripts in order: create the kind cluster, install KWOK, install the fake-gpu-operator (3 pools), then stamp out the fake GPU node pools.
💡 Why: setup-kind.sh gives you a real
Kubernetes control plane (so the default scheduler logic is real); install-kwok.sh
adds KWOK, which lets fake nodes join and have their pod lifecycle simulated;
install-fake-gpu-operator.sh installs the
GPU layer with the a100/h100/l40s topology; and
create-fake-gpu-nodes.sh stamps out the nodes
(labeled into pools), which the operator then advertises GPUs onto. See
kind/README.md, kwok/README.md, and
fake-gpu-operator/README.md for the details of each.
The simulated fleet you just built:
| Pool | Nodes | GPUs/node | Product label (GFD-style) |
|---|---|---|---|
| a100 | 2 | 8 | NVIDIA-A100-SXM4-80GB |
| h100 | 1 | 8 | NVIDIA-H100-80GB-HBM3 |
| l40s | 2 | 4 | NVIDIA-L40S |
Total simulated fleet: 5 nodes, 32 "GPUs".
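The totals follow directly from the table. As a quick arithmetic check, here is a small shell sketch with the pool shapes typed in by hand from the table above (nothing here queries the cluster, and the variable names are illustrative):

```shell
# Sum nodes and GPUs across pools. Each input line is "pool nodes gpus_per_node",
# copied by hand from the fleet table.
total_nodes=0
total_gpus=0
while read -r pool nodes gpus_per_node; do
  total_nodes=$(( total_nodes + nodes ))
  total_gpus=$(( total_gpus + nodes * gpus_per_node ))
done <<EOF
a100 2 8
h100 1 8
l40s 2 4
EOF

echo "nodes=$total_nodes gpus=$total_gpus"   # prints: nodes=5 gpus=32
```

The 32 is the number to keep in mind for the queue-pressure scenario later: it is the ceiling on how many 1-GPU pods can run at once.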
✅ Checkpoint: you can see the fleet, with product labels and GPU counts.
You should see five kwok-gpu-* nodes across the three pools, each reporting its
product label. If they're missing, re-run make phase1-up (it's idempotent).
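One way to see pool, product label, and GPU count side by side is a custom-columns query. This is a sketch, not the lab's own command: the helper name `show_fleet` is hypothetical, and it assumes the fake nodes carry the gpu-pool and nvidia.com/gpu.product labels that Step 3 queries with -L.

```shell
# Hypothetical helper; paste it into your shell, then run: show_fleet
# Assumes the fake nodes are labeled with gpu-pool and nvidia.com/gpu.product.
show_fleet() {
  # -l gpu-pool keeps only nodes that have the gpu-pool label at all.
  # Dots inside label and resource names are escaped so they are not
  # read as path separators.
  kubectl get nodes -l gpu-pool \
    -o custom-columns='NAME:.metadata.name,POOL:.metadata.labels.gpu-pool,PRODUCT:.metadata.labels.nvidia\.com/gpu\.product,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

If the GPUS column shows `<none>` for a node, the node exists but nothing has advertised nvidia.com/gpu onto it yet.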
💡 Optional - peek at the synthetic GPU metrics. The operator runs a DCGM exporter per node. These are fabricated values, but the metric names and labels are real (the foundation Lesson 3 builds dashboards on):
kubectl -n gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_UTIL' | head
Step 2 - Deploy the four scenarios
This step applies four deliberately chosen workloads into the gpu-demo namespace.
Each is designed to teach you something specific:
| Scenario | Workload | Designed outcome |
|---|---|---|
| 1 - Schedulable | cuda-batch-small | Running on an A100 node (1 GPU fits) |
| 2 - Capacity mismatch | cuda-train-16gpu | Pending - asks for 16 GPUs; no node has more than 8 |
| 3 - Fleet mismatch | cuda-needs-b200 | Pending - nodeSelector targets a b200 pool that doesn't exist |
| 4 - Queue pressure | queue-pressure | 40 replicas × 1 GPU vs 32 GPUs → ~32 Running, rest Pending |
💡 Why these four: scenario 1 proves placement works; scenarios 2 and 3 are the two most common real-world Pending causes (asked for more than exists vs asked for a pool that doesn't exist), and they produce different events; scenario 4 is the raw material for the queueing discussion in Lesson 1B (KAI Scheduler).
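To make the fleet-mismatch case concrete, here is a sketch of what a pod like scenario 3 might look like. It is a hypothetical manifest written for illustration: the real one lives in workloads/ and may differ (for example, it may carry tolerations if the fake nodes are tainted). The container name and image are placeholders.

```shell
# Hypothetical manifest for illustration only; see workloads/ for the real one.
manifest=$(cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-needs-b200
  namespace: gpu-demo
spec:
  nodeSelector:
    gpu-pool: b200            # no node carries this label, so every node is filtered out
  containers:
    - name: main              # placeholder name
      image: registry.k8s.io/pause:3.9   # placeholder; containers on KWOK nodes never run
      resources:
        limits:
          nvidia.com/gpu: 1   # an integer request, nothing more
EOF
)
printf '%s\n' "$manifest"
```

Scenario 2 would differ in only two places under the same assumptions: no b200 selector, and a request of 16 instead of 1. To check that a manifest like this parses without creating anything, pipe it to `kubectl apply --dry-run=client -f -`.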
✅ Checkpoint: one pod Running, the two deliberately unschedulable pods Pending, and the queue-pressure deployment partially scheduled (kubectl get pods -n gpu-demo -o wide shows all three states).
Step 3 - Triage like it's a real cluster
This is the skill the lesson exists for. Inspect the fleet and the stuck pods with the same commands you'd use on production:
kubectl get nodes -L nvidia.com/gpu.product -L gpu-pool # fleet at a glance
kubectl describe node kwok-gpu-a100-0 # one node in detail
kubectl get pods -n gpu-demo -o wide # who's Running vs Pending
kubectl get events -n gpu-demo --sort-by=.lastTimestamp # the default scheduler's reasoning
Now diagnose each Pending pod yourself:
kubectl describe pod -n gpu-demo cuda-train-16gpu # scenario 2
kubectl describe pod -n gpu-demo cuda-needs-b200 # scenario 3
💡 Why the events differ: read the Events: section at the bottom of each
describe. Scenario 2 fails on Insufficient nvidia.com/gpu - the default scheduler
found candidate nodes but none had enough GPUs. Scenario 3 fails on
node(s) didn't match Pod's node affinity/selector - the default scheduler rejected
every node before even checking GPU counts, because the gpu-pool: b200 selector
matched nothing. Same symptom (Pending), completely different root cause and fix.
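The distinction between the two messages can be written down as a tiny classifier. This is a sketch, assuming the two reason strings quoted above; the function name `classify_pending` and the three category labels are hypothetical.

```shell
# Hypothetical helper: map a FailedScheduling message to a root cause.
classify_pending() {
  case "$1" in
    *"Insufficient nvidia.com/gpu"*)
      # At least one node got as far as the resource check and lacked GPUs.
      echo "capacity mismatch" ;;
    *"didn't match Pod's node affinity/selector"*)
      # Nodes were rejected on labels before GPU counts were considered.
      echo "fleet mismatch" ;;
    *)
      echo "other" ;;
  esac
}

classify_pending "Insufficient nvidia.com/gpu"
classify_pending "node(s) didn't match Pod's node affinity/selector"
```

The capacity pattern is checked first on purpose: a message that mentions both reasons means some nodes did match the selector and then ran out of GPUs, so the fix is capacity, not labels. To feed it a real message, copy the FailedScheduling line from the Events: section of kubectl describe pod.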
✅ Checkpoint - predict, then verify. Before reading each describe, write down
which of the two failure reasons you expect. You understand the lesson when your
prediction matches the event every time. Specifically you should observe:
- cuda-batch-small → Running on kwok-gpu-a100-0 or -1.
- cuda-train-16gpu → Pending, Insufficient nvidia.com/gpu (deliberate capacity mismatch - no single node exposes 16 GPUs).
- cuda-needs-b200 → Pending, selector/affinity mismatch (deliberate fleet mismatch - the b200 pool was never created).
- queue-pressure → some replicas Running, the rest Pending. Count them: kubectl get pods -n gpu-demo -l app=queue-pressure | grep -c Running should be about 32 (the fleet's total GPU count), the rest Pending.
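The expected split for the queue-pressure deployment is simple arithmetic, sketched below with the numbers from this lesson. The helper `count_phase` is hypothetical and assumes the app=queue-pressure label used in the count command above. Other GPU pods, such as cuda-batch-small, may hold a GPU of their own, which is one reason to expect "about" 32 rather than exactly 32.

```shell
replicas=40      # queue-pressure replicas, 1 GPU each
fleet_gpus=32    # total from the fleet table

# Running pods are capped by the smaller of demand and supply.
if [ "$replicas" -lt "$fleet_gpus" ]; then
  expected_running=$replicas
else
  expected_running=$fleet_gpus
fi
expected_pending=$(( replicas - expected_running ))
echo "expect about $expected_running Running and $expected_pending Pending"

# Hypothetical helper: count queue-pressure pods in a given phase.
# Usage: count_phase Running    or    count_phase Pending
count_phase() {
  kubectl get pods -n gpu-demo -l app=queue-pressure \
    --field-selector="status.phase=$1" --no-headers | wc -l
}
```

Filtering on status.phase avoids matching the word Running anywhere else in the output, which a plain grep would do.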
Step 4 - Capture evidence
💡 Why: collect-k8s-evidence.sh
snapshots node, pod, and event state into a timestamped directory under
../06-validation-reports/evidence/. In ops work, "I
saw it happen" doesn't count - the captured artifact does. This is also the
rule that keeps "Complete" meaningful: a lesson counts as done only once its report
holds real output.
✅ Checkpoint: fill in
../06-validation-reports/local-simulation-report.md
with your environment details and a reference to the evidence directory you just
produced.
Step 5 - Tear down
✅ Checkpoint: kind get clusters no longer lists ai-factory-lab.
🔬 What this lesson proved - and did NOT
Proved (simulation):
- GPU-aware scheduling and placement across heterogeneous pools
- The two canonical Pending root causes and how to tell them apart
- Capacity contention / queue-pressure behaviour (more requests than GPUs)
- Fleet modelling: labels, taints, pool design for A100/H100/L40S-class nodes
- The Pending-pod triage workflow - identical to the one used on real clusters
- Operator-shaped GPU advertisement (a device plugin, not a hand-written integer)
- A synthetic DCGM metrics stream with per-pod attribution (the Lesson 3 bridge)
Did NOT prove: no CUDA execution, no NCCL, no NVLink/NVSwitch, no MIG, no
GPUDirect RDMA, no real GPU memory behaviour. The DCGM metrics here are fabricated
by the operator (useful for dashboard/alert design, not real telemetry), and the
containers on KWOK nodes never actually run. Real telemetry and the runtime path
belong to Lesson 6 and only count once captured in
../06-validation-reports/real-gpu-validation-report.md.
The full ledger: fake-vs-real-limitations.md.
⏭ Continue to Lesson 1B - solve the queue-pressure mess
Step 3's scenario 4 leaves you with a pile of Pending pods and a default scheduler that has no answer. Lesson 1B - Queue-Based GPU Scheduling with KAI Scheduler is where you fix it: hierarchical queues, quota, over-quota borrowing, reclaim, gang scheduling, and starvation control - and the headline is that all of it is learnable on the fake fleet, because queue policy and gang scheduling are pure control-plane decisions. It's the highest-value, lowest-cost thing in the whole course. Do it before moving to Lesson 6.
Go deeper (optional sub-pages)
These expand on parts of the lesson. Read them when the corresponding step makes you curious:
- kind/ - the local cluster, and kind vs k3d.
- kwok/ - how fake GPU nodes are built and why it's legitimate.
- fake-gpu-operator/ - the GPU layer on the KWOK nodes (installed by phase1-up): advertises nvidia.com/gpu and emits synthetic DCGM metrics (the Lesson 3 bridge).
- hami/ - Lesson 1C: GPU sharing and fractional GPUs (time-slicing vs MPS vs MIG vs HAMi), with a real-hardware part that splits one GPU between pods.
- gpu-operator-real/ - Lesson 6: prove the real GPU path on actual hardware.
Directory guide
- kind/ - kind cluster config (control plane + one real worker for system pods)
- kwok/ - KWOK installation notes and fake GPU node manifests/templates
- fake-gpu-operator/ - the GPU layer (advertises GPUs + DCGM metrics on KWOK nodes)
- kai-scheduler/ - Lesson 1B: queue/quota scheduling concepts and KAI Scheduler notes
- hami/ - Lesson 1C: GPU sharing / fractional GPUs with HAMi
- workloads/ - the four demo workloads (schedulable, two Pending, queue pressure)
- gpu-operator-real/ - Lesson 6: real GPU validation guide
- scripts/ - setup and demo automation
➡️ Next: Lesson 1B - Queue-Based GPU Scheduling with KAI Scheduler, where you turn the queue-pressure pile into policy - quota, borrowing, reclaim, and gang scheduling - all on this same fake fleet. Then Lesson 1C - GPU sharing with HAMi (concepts free; its hands-on part piggybacks on the Lesson 6 rental), and Lesson 6 runs the manifests on real hardware.