AI Factory Operations Lab

A hands-on course in AI/HPC GPU infrastructure operations. You do not read this course; you run it: stand things up, break them on purpose, diagnose them the way you would on a real cluster, and capture the evidence. Most of it needs no GPU at all; one optional session uses a single cheap rented GPU and is clearly marked.


The scope boundary

Every lesson declares one of two modes and states exactly what it proves and what it does not:

Simulation (no GPU). kind + KWOK fake nodes, the fake-gpu-operator, Slurm with fake GRES. Proves control-plane behaviour: scheduling, queueing, sharing decisions, triage. Nothing below the kubelet.
Real GPU (one cheap NVIDIA GPU). Real driver, container toolkit, CUDA pod, DCGM telemetry, enforced GPU sharing. Proves the runtime path, single-node.

Knowing exactly where that line sits is itself one of the skills this course teaches.
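To make the simulation side of that line concrete, here is a minimal sketch of what a KWOK-managed fake GPU node can look like. It is not the course's actual manifest: `fake_gpu_node` is a hypothetical helper, and the CPU, memory, and pod figures are illustrative assumptions. The `kwok.x-k8s.io/node: fake` annotation and the `nvidia.com/gpu` extended resource are the parts that matter. The scheduler sees GPU capacity, but no kubelet, driver, or device ever exists behind it.

```shell
# Sketch only: emit a KWOK-style fake Node that advertises GPUs.
# fake_gpu_node is a hypothetical helper, not part of the course repo.
# Arguments: $1 = node name, $2 = number of GPUs to advertise.
fake_gpu_node() {
  cat <<EOF
apiVersion: v1
kind: Node
metadata:
  name: $1
  annotations:
    kwok.x-k8s.io/node: fake   # marks the node as simulated by KWOK
  labels:
    type: kwok
spec:
  taints:
    - key: kwok.x-k8s.io/node
      value: fake
      effect: NoSchedule       # only pods that tolerate this land here
status:
  capacity:
    cpu: "64"                  # illustrative value (assumption)
    memory: 512Gi              # illustrative value (assumption)
    pods: "110"
    nvidia.com/gpu: "$2"
  allocatable:
    cpu: "64"
    memory: 512Gi
    pods: "110"
    nvidia.com/gpu: "$2"
EOF
}

# Usage, assuming a cluster with KWOK already running:
#   fake_gpu_node kwok-gpu-0 8 | kubectl apply -f -
```

Because the GPU count is just a number in `status`, a node like this can prove scheduling and quota behaviour but says nothing about whether a CUDA workload would actually run.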

What it costs

Tier	Lessons	You pay	You get
$0 simulation	0, 1, 1B, 1C, 2, 3, 4, 5	Nothing, a laptop runs it	Scheduling, queueing, GPU-sharing decisions, triage, observability design, lifecycle - most of the course
$5-10 one GPU session	6 (the real-GPU capstone)	A few hours on one entry-level GPU VM	The real runtime path, enforced sharing, real telemetry and benchmarks

The lessons

1 - Kubernetes GPU scheduling

Build a fake GPU fleet with kind + KWOK and diagnose why GPU pods stay Pending.

Start here
1B - Queue scheduling (KAI)

Install NVIDIA's KAI Scheduler on a fake fleet and enforce per-team queue quota.

1C - GPU sharing (HAMi)

Fractional GPUs: schedule slices on fakes, then prove memory isolation on one real GPU.

2 - Real GPU validation

Prove the full path from driver to container toolkit to device plugin to pod on real hardware.

3 - Slurm workload management

A Slurm-in-Docker cluster with fake GRES: GPU jobs, QoS caps, queue pressure, drain/resume.

4 - GPU observability

Prometheus/Grafana over synthetic DCGM metrics; build dashboards and trip alerts on purpose.

5 - Inference serving

A load harness for TTFT, p95/p99 latency, and tokens per second; runs on the $0 CPU tier, with real numbers on a GPU.

6 - Cluster lifecycle

A runnable node-lifecycle drill (provision, health-gate, patch, retire), mapped to BCM.

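The Pending-pod triage in lesson 1 largely comes down to reading the scheduler's `FailedScheduling` message. The sketch below shows one way to bucket those messages into a coarse cause. `classify_pending` is a hypothetical helper, not something the course ships, and the matched phrases follow the wording used by recent Kubernetes schedulers, which may differ on older versions.

```shell
# Sketch only: map a FailedScheduling message to a coarse cause.
# classify_pending is a hypothetical helper; the first matching pattern wins,
# so a message listing several reasons is reported under the earliest case.
classify_pending() {
  case "$1" in
    *"Insufficient nvidia.com/gpu"*)               echo "capacity" ;;
    *"untolerated taint"*)                         echo "taint" ;;
    *"didn't match Pod's node affinity/selector"*) echo "selector" ;;
    *)                                             echo "unknown" ;;
  esac
}

# Against a live cluster, the messages could be pulled from events, e.g.:
#   kubectl get events --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'

classify_pending "0/4 nodes are available: 4 Insufficient nvidia.com/gpu."
# prints: capacity
```

A bucket like this is only a first cut; `kubectl describe pod` on the Pending pod remains the place to confirm which nodes were rejected and why.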

Run the first loop

git clone https://github.com/ld-singh/ai-factory-ops-lab
cd ai-factory-ops-lab
make check          # verify docker, kind, kubectl, helm, jq
make phase1-up      # kind cluster + KWOK + fake GPU node pools
make phase1-demo    # schedulable + intentionally-Pending GPU workloads
make phase1-down    # tear it down
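If `make check` fails, it can help to see which prerequisite is missing. The repo's actual check may work differently; the following is a minimal sketch of the same idea, with `missing_tools` as a hypothetical helper that reports any of the listed tools not found on `PATH`.

```shell
# Sketch only: report which prerequisite tools are not on PATH.
# missing_tools is a hypothetical helper; the repo's `make check` may differ.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the tool cannot be found
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools docker kind kubectl helm jq)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line
  echo "missing prerequisites:" $missing
else
  echo "all prerequisites found"
fi
```

This only confirms the binaries exist; it does not verify versions or that the Docker daemon is running.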


Built by Lovedeep Singh — Cloud Infrastructure Architect (AWS, Azure, Kubernetes & DevSecOps), building secure, governed cloud platforms. See About for more.