Lesson 6 (Part A) - Real GPU runtime path + evidence
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder: git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/01-k8s-gpu-platform/gpu-operator-real
Part of Lesson 6 - Real GPU · Course home: AI Factory Operations Lab
Start at the Lesson 6 hub for the full session order (host setup → this runtime path → HAMi sharing → inference → Slurm GRES → teardown).
This is the real-hardware foundation of the course. In Lesson 1 you deliberately could not prove anything below the kubelet. Here you prove the whole thing - the complete GPU path to a running pod - on actual silicon, and capture the evidence.
✅ Validated on real hardware - the captured run is in real-gpu-validation-report.md. The steps below reproduce it on a fresh GPU VM.
🎯 After this part you can:
- Stand up the full GPU path link by link: driver → container toolkit → runtime class → device plugin (GPU Operator) → kubelet → scheduler → CUDA container.
- Confirm the node advertises real nvidia.com/gpu and discovered GFD labels, and contrast them with the script-written labels from Lesson 1.
- Run a CUDA pod that executes nvidia-smi on a real GPU - the single most important artifact in the course.
- Pull real DCGM telemetry, the foundation Lesson 3 builds on.
🧭 Mode: Real GPU. One machine with an NVIDIA card: a bare GPU VM you get root on (Hyperstack, Lambda, hyperscaler) or a local GPU box. An L4 (24 GB) or RTX A6000 (48 GB) is plenty; never an A100/H100. Marketplace containers (Vast.ai/RunPod pods) don't work - they can't install the toolkit + k3s.
Prerequisites: Lesson 1 done (you know what the simulation did and didn't prove), and a budget of $5-10 for the VM.
The iron rule: tear the VM down the moment evidence is captured. The evidence directory is the deliverable; the VM has no residual value. Delete the boot/storage volume too if it's billed separately.
What this validates (that simulation cannot)
The full GPU path to a pod, link by link, plus real telemetry:
NVIDIA driver → NVIDIA Container Toolkit → containerd runtime class
→ NVIDIA device plugin (GPU Operator) → kubelet → scheduler → CUDA container → DCGM
There are two ways to run it: the scripted path (what produced the validated evidence above) or the manual steps (do it by hand to understand each link). Both end at the same evidence.
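Before choosing a path, it can help to see which tools along the chain are already installed on the VM. The following is a minimal sketch, not part of the repository's scripts: `have_cmd` and `report_links` are hypothetical helper names, and finding a binary on PATH is only a hint, not proof that the link works.

```shell
#!/bin/sh
# Hypothetical helpers (not from the repo): report which tools along the
# GPU path are installed. Presence on PATH is only a hint; the Pass checks
# in the manual steps remain the real proof.

have_cmd() {
    # Succeeds if the named command can be found on PATH.
    command -v "$1" >/dev/null 2>&1
}

report_links() {
    # Prints one "tool: present" or "tool: missing" line per argument.
    for tool in "$@"; do
        if have_cmd "$tool"; then
            echo "$tool: present"
        else
            echo "$tool: missing"
        fi
    done
}

# Example (run on the GPU VM):
#   report_links nvidia-smi nvidia-ctk kubectl helm
```

A "missing" line tells you which step of the manual path to start from; a "present" line still needs its Pass check.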
Option 1 - Scripted path (fast)
ℹ️ Already did the Lesson 6 setup? Those scripts already do host setup → kubeconfig → install-gpu-operator.sh, so the cluster and GPU Operator are already up. Don't repeat the install block below - skip straight to Capture evidence.
If you haven't set the host up yet, the real-gpu-session/scripts/ directory automates it (read scripts/README.md first).
Run it all on the GPU VM (SSH in). Clone the repo, then run from the repo root:
# same as the Lesson 6 setup - SKIP if you already ran them
git clone https://github.com/ld-singh/ai-factory-ops-lab.git
cd ai-factory-ops-lab
sudo PUBLIC_IP=<vm-ip> bash portfolio-lab/real-gpu-session/scripts/host-setup.sh # NVIDIA toolkit + k3s + API cert
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
portfolio-lab/real-gpu-session/scripts/install-gpu-operator.sh # GPU Operator + DCGM + a CUDA smoke pod
Prefer to drive kubectl from your laptop instead? Open TCP 6443 to the VM and run portfolio-lab/real-gpu-session/scripts/fetch-kubeconfig.sh <user>@<vm-ip> --key <key>, then export KUBECONFIG=$PWD/kubeconfig-gpuvm. Running on the VM (above) avoids a flaky laptop-to-API link.
Capture evidence
Whether the operator went up just now or during setup, this is all Part A still needs. Run on the VM - it writes a tarball you scp back:
# on the VM, from the repo root:
portfolio-lab/real-gpu-session/scripts/capture-evidence.sh
# then from your laptop, pull the tarball back:
scp -i <key> <user>@<vm-ip>:~/gpu-evidence-*.tgz \
./portfolio-lab/06-validation-reports/evidence/
capture-evidence.sh snapshots host + in-pod nvidia-smi, nvidia.com/gpu allocatable,
GFD labels, the GPU Operator pods, and real DCGM metrics - the full Part A evidence set.
✅ Gate: nvidia-smi from inside a scheduled pod, and real DCGM_FI_* metrics whose values match nvidia-smi.
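Because the VM is torn down right after capture, it is worth confirming that the copied tarball is intact before deleting anything. A minimal sketch, assuming only that the evidence file is a gzip-compressed tar archive; `check_tarball` is a hypothetical helper and makes no assumption about the file names inside.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): sanity-check a copied evidence tarball.

check_tarball() {
    # $1: path to a .tgz file.
    # Succeeds only if tar can list the archive and it has at least one entry.
    entries=$(tar -tzf "$1" 2>/dev/null | wc -l | tr -d ' ')
    [ "$entries" -gt 0 ]
}

# Example (on the laptop, after scp):
#   for f in ./portfolio-lab/06-validation-reports/evidence/gpu-evidence-*.tgz; do
#       check_tarball "$f" && echo "ok: $f" || echo "BROKEN: $f"
#   done
```

Only tear the VM down once every tarball reports ok.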
Option 2 - Manual steps (understand each link)
Same result, by hand. Each step ends with a Pass criterion, which is its checkpoint.
ℹ️ Operator already up (from the Lesson 6 setup)? Then don't reinstall - reinstalling on top of a working operator only risks breaking it. Use the steps below purely to inspect each link (the kubectl / nvidia-smi checks), then capture evidence.
VERSION WARNING: package names, image tags, and Helm chart versions drift with driver/CUDA/Kubernetes releases. Treat each command as a validated pattern and cross-check the linked official docs before running.
Step 1 - Driver
Most "deep learning / GPU" images ship the driver. Otherwise install per the NVIDIA driver guide.
nvidia-smi # driver/CUDA version, GPU model
nvidia-smi -L # GPU inventory with UUIDs
nvidia-smi topo -m # topology (single GPU: trivial, capture anyway)
Pass: nvidia-smi lists the GPU without errors.
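If you follow the manual path, you may want to save each command's output as you go rather than rely on scrollback. A minimal sketch: `capture` is a hypothetical helper and the `evidence/step1` directory name is an assumption, not the layout that capture-evidence.sh uses.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): run a command and keep its output.

capture() {
    # capture <outdir> <name> <command...>
    # Runs the command and writes stdout and stderr to <outdir>/<name>.txt.
    # Returns the command's own exit status, so failures stay visible.
    outdir=$1
    name=$2
    shift 2
    mkdir -p "$outdir"
    "$@" >"$outdir/$name.txt" 2>&1
}

# Example on the GPU VM (directory name is an assumption):
#   capture evidence/step1 nvidia-smi       nvidia-smi
#   capture evidence/step1 nvidia-smi-L     nvidia-smi -L
#   capture evidence/step1 nvidia-smi-topo  nvidia-smi topo -m
```

Because the helper passes through the exit status, `capture ... nvidia-smi || echo "Step 1 failed"` doubles as the Pass check.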
Step 2 - NVIDIA Container Toolkit
Install and configure the runtime (nvidia-ctk runtime configure) per the toolkit install guide.
On a Docker host you can prove the runtime injection independent of Kubernetes:
Pass: in-container nvidia-smi matches the host. (k3s uses containerd, not Docker, so
this Docker check is optional - the CUDA pod in Step 5 proves the same path.)
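One way to run that optional Docker check is to compare the GPU inventory seen by the host with the one seen inside a container. This is a sketch under assumptions: Docker is installed and configured for the NVIDIA runtime, the image tag is borrowed from Step 5, and `same_gpu_inventory` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): compare two saved `nvidia-smi -L` outputs.

same_gpu_inventory() {
    # same_gpu_inventory <host-file> <container-file>
    # Succeeds when both files are byte-for-byte identical,
    # i.e. the container lists the same GPU model(s) and UUID(s) as the host.
    cmp -s "$1" "$2"
}

# Example on a Docker host (image tag assumed from Step 5):
#   nvidia-smi -L > host.txt
#   docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L > container.txt
#   same_gpu_inventory host.txt container.txt && echo "Pass: container sees the same GPU(s)"
```

Comparing `nvidia-smi -L` rather than the full `nvidia-smi` table avoids false mismatches from values that change between runs, such as temperature and power.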
Step 3 - Single-node Kubernetes
k3s is fastest (it bundles containerd; the GPU Operator
supports it). host-setup.sh uses k3s with --tls-san <public-ip> so you can drive it
from your laptop.
Pass: kubectl get nodes is Ready.
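The Ready check can be made scriptable by parsing the node list instead of reading it by eye. A minimal sketch: `all_nodes_ready` is a hypothetical helper that expects the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): turn the node list into a pass/fail.

all_nodes_ready() {
    # Reads `kubectl get nodes --no-headers` output on stdin.
    # Succeeds only if there is at least one node and every
    # STATUS column (second field) is exactly "Ready".
    awk '
        NF { n++; if ($2 != "Ready") bad++ }
        END { if (n > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Example:
#   kubectl get nodes --no-headers | all_nodes_ready && echo "Pass: node is Ready"
```

A status such as "NotReady" or "Ready,SchedulingDisabled" fails the check, which is the conservative choice before installing the GPU Operator.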
Step 4 - NVIDIA GPU Operator
Use the official getting-started docs for the current chart version. Pattern (driver pre-installed on the host):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
-n gpu-operator --create-namespace \
--set driver.enabled=false # host driver already present
kubectl get pods -n gpu-operator # all Running/Completed
kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'; echo
kubectl describe node | grep nvidia.com/gpu # GFD labels, now REAL
Pass: node advertises nvidia.com/gpu ≥ 1 and GFD labels match the real GPU - same label names as the simulation, now with real provenance.
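To keep the discovered labels as evidence and make the allocatable check explicit, the two Pass conditions can be wrapped as small filters. A sketch under assumptions: `gpu_labels` and `gpu_allocatable_ok` are hypothetical helpers, and the example relies on `kubectl label --list` printing one key=value pair per line.

```shell
#!/bin/sh
# Hypothetical helpers (not from the repo) for the Step 4 Pass conditions.

gpu_labels() {
    # Reads key=value label lines on stdin and keeps only the
    # labels under the nvidia.com/ prefix, sorted for stable diffs.
    grep '^nvidia\.com/' | sort
}

gpu_allocatable_ok() {
    # Succeeds when $1 is a non-negative integer that is at least 1.
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ]
}

# Example:
#   node=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
#   kubectl label --list node "$node" | gpu_labels > gfd-labels.txt
#   gpu_allocatable_ok "$(kubectl get node "$node" \
#       -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')" && echo "Pass"
```

The sorted label file is convenient for a side-by-side diff against the script-written labels from Lesson 1.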
Step 5 - CUDA test pod
kubectl run cuda-test --rm -it --restart=Never \
--image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
--overrides='{"spec":{"runtimeClassName":"nvidia"}}' \
--limits=nvidia.com/gpu=1 -- nvidia-smi
Pass: nvidia-smi from inside a scheduled pod. The complete path, proven.
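Recent kubectl releases appear to have removed the --limits flag from kubectl run, so the one-liner above may fail with an unknown-flag error depending on your client version. If it does, the same pod can be expressed as a manifest. This is a sketch, not the repository's smoke pod: `cuda_test_manifest` is a hypothetical helper, and the image tag and runtime class name are copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): print a Pod manifest that mirrors
# the `kubectl run cuda-test` command from this step.

cuda_test_manifest() {
    # $1 (optional): CUDA image; defaults to the tag used in this step.
    image=${1:-nvidia/cuda:12.4.1-base-ubuntu22.04}
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-test
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Example:
#   cuda_test_manifest | kubectl apply -f -
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-test --timeout=180s
#   kubectl logs cuda-test      # the in-pod nvidia-smi output
#   kubectl delete pod cuda-test
```

The manifest form also leaves a file-shaped artifact you can store next to the captured nvidia-smi output.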
Step 6 - DCGM Exporter
The GPU Operator deploys it. The exporter container usually has no curl, so scrape via port-forward from the host:
kubectl -n gpu-operator port-forward <dcgm-exporter-pod> 9400:9400 &
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP|POWER_USAGE)'
Pass: real DCGM_FI_* values that match nvidia-smi (temp, power, FB used).
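To compare the exporter with nvidia-smi value by value, it helps to pull single numbers out of the scrape. A minimal sketch: `dcgm_value` is a hypothetical helper that assumes the Prometheus text format with the sample value as the last field on the line (no trailing timestamp), and it returns only the first sample, which is enough on a single-GPU node.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): extract one metric value from a scrape.

dcgm_value() {
    # dcgm_value <METRIC>
    # Reads Prometheus text on stdin and prints the value of the first sample
    # whose line starts with the metric name. "# HELP" and "# TYPE" lines are
    # skipped because they start with "#", not with the metric name.
    awk -v m="$1" '
        index($0, m "{") == 1 || index($0, m " ") == 1 { print $NF; exit }
    '
}

# Example (with the port-forward from this step still running):
#   curl -s localhost:9400/metrics > metrics.txt
#   dcgm_value DCGM_FI_DEV_GPU_TEMP    < metrics.txt   # degrees C
#   dcgm_value DCGM_FI_DEV_POWER_USAGE < metrics.txt   # watts
#   dcgm_value DCGM_FI_DEV_FB_USED     < metrics.txt   # MiB
#   nvidia-smi --query-gpu=temperature.gpu,power.draw,memory.used \
#       --format=csv,noheader,nounits
```

The two sources sample at different moments, so expect temperature and power to agree closely rather than exactly; framebuffer usage on an idle GPU should line up.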
📸 Then capture everything with capture-evidence.sh and record the versions in real-gpu-validation-report.md.
Next parts (same rental)
Stay on this GPU and continue the Lesson 6 session:
- Part B - HAMi GPU sharing: turn this one card into enforced slices and prove multi-pod co-residency - the highest-value add-on.
- Part C - inference benchmark (vLLM): real TTFT / latency / tokens-per-second on one mid-range GPU.
- Part D - Slurm real --gres=gpu: the real counterpart to the fake-GRES Slurm lesson.
🔬 What this part proves - and does NOT
Proves: the real, end-to-end GPU runtime path on one node, plus real DCGM telemetry - exactly what Lesson 1's simulation could not.
Does NOT prove: single-node by design - no NCCL collective performance, no
NVLink/NVSwitch topology, no GPUDirect RDMA, no multi-node training, nothing about
production-scale fleet ops. It proves the path and telemetry, not scale. Full ledger:
fake-vs-real-limitations.md.
➡️ Next: back to the Lesson 6 hub for Part B - HAMi sharing on this same GPU.