AI Factory Operations Lab
A hands-on course in AI/HPC GPU infrastructure operations. You do not read this course; you run it: stand things up, break them on purpose, diagnose them the way you would on a real cluster, and capture the evidence. Most of it needs no GPU at all; one optional session uses a single cheap rented GPU and is clearly marked.
The scope boundary
Every lesson declares one of two modes and states exactly what it proves and what it does not:
- Simulation (no GPU). kind + KWOK fake nodes, the fake-gpu-operator, Slurm with fake GRES. Proves control-plane behaviour: scheduling, queueing, sharing decisions, triage. Nothing below the kubelet.
- Real GPU (one cheap NVIDIA GPU). Real driver, container toolkit, CUDA pod, DCGM telemetry, enforced GPU sharing. Proves the runtime path, single-node.
Knowing exactly where that line sits is itself one of the skills this course teaches.
What it costs
| Tier | Lessons | You pay | You get |
|---|---|---|---|
| $0 simulation | 0, 1, 1B, 1C, 2, 3, 4, 5 | Nothing, a laptop runs it | Scheduling, queueing, GPU-sharing decisions, triage, observability design, lifecycle - most of the course |
| $5-10 one GPU session | 6 (the real-GPU capstone) | A few hours on one entry-level GPU VM | The real runtime path, enforced sharing, real telemetry and benchmarks |
The lessons
- 1 - Kubernetes GPU scheduling. Build a fake GPU fleet with kind + KWOK and diagnose why GPU pods stay Pending.
- 1B - Queue scheduling (KAI). Install NVIDIA's KAI Scheduler on a fake fleet and enforce per-team queue quota.
- 1C - GPU sharing (HAMi). Fractional GPUs: schedule slices on fakes, then prove memory isolation on one real GPU.
- 2 - Real GPU validation. Prove the full driver to toolkit to device-plugin to pod path on real hardware.
- 3 - Slurm workload management. A Slurm-in-Docker cluster with fake GRES: GPU jobs, QoS caps, queue pressure, drain/resume.
- 4 - GPU observability. Prometheus/Grafana over synthetic DCGM metrics; build dashboards and trip alerts on purpose.
- 5 - Inference serving. A load harness for TTFT, p95/p99, tokens-per-sec; $0 CPU tier, real numbers on a GPU.
- 6 - Cluster lifecycle. A runnable provision to health-gate to patch to retire node-lifecycle drill, mapped to BCM.
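To give a flavour of the triage in lesson 1, here is a minimal sketch of how a scheduler event message might be mapped to a likely cause of a Pending GPU pod. The function name `classify_pending` is hypothetical and not part of the course repository, and the resource name `nvidia.com/gpu` is an assumption about how the fake fleet advertises GPUs. The matched fragments are the phrases kube-scheduler normally uses in `FailedScheduling` events.

```shell
# Hypothetical helper (not from the course repo): map one kube-scheduler
# FailedScheduling message to a likely cause. Only the first matching
# pattern is reported, even if the message lists several reasons.
classify_pending() {
  case "$1" in
    *"Insufficient nvidia.com/gpu"*)
      echo "capacity: no node has enough free GPUs" ;;
    *"untolerated taint"*)
      echo "taint: the pod lacks a matching toleration" ;;
    *"node affinity/selector"*)
      echo "selector: no node matches the pod's nodeSelector or affinity" ;;
    *)
      echo "unknown: read the full event text" ;;
  esac
}

# Example against a live cluster (requires kubectl and a current context):
#   kubectl get events --field-selector reason=FailedScheduling \
#     -o custom-columns=MSG:.message --no-headers |
#   while IFS= read -r msg; do classify_pending "$msg"; done
```

The classification is deliberately coarse: it tells you which question to ask next (capacity, taints, or placement rules), not the final answer.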
Run the first loop
git clone https://github.com/ld-singh/ai-factory-ops-lab
cd ai-factory-ops-lab
make check # verify docker, kind, kubectl, helm, jq
make phase1-up # kind cluster + KWOK + fake GPU node pools
make phase1-demo # schedulable + intentionally-Pending GPU workloads
make phase1-down # tear it down
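If `make check` fails, it helps to know which prerequisite is missing. The sketch below is one way such a check could be written in POSIX shell; it is an assumption about the kind of test involved, not the repository's actual Makefile target, and the function names `missing_tools` and `preflight` are hypothetical.

```shell
# Hypothetical preflight sketch (not the repo's actual `make check`).

# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Report missing tools on stderr and return 1, or confirm and return 0.
preflight() {
  missing="$(missing_tools "$@")"
  if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing tools:" $missing >&2
    return 1
  fi
  echo "all prerequisites found"
}

# Usage, with the tool list taken from the `make check` comment:
#   preflight docker kind kubectl helm jq
```

This only confirms that the binaries are on `PATH`; it does not check versions or whether the Docker daemon is running.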
Built by Lovedeep Singh, Cloud Infrastructure Architect (AWS, Azure, Kubernetes & DevSecOps), building secure, governed cloud platforms.