Lesson 6 (Part A) - Real GPU runtime path + evidence

Get the code to run this lab

The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder:

git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/01-k8s-gpu-platform/gpu-operator-real

Start at the Lesson 6 hub for the full session order (host setup → this runtime path → HAMi sharing → inference → Slurm GRES → teardown).

This is the real-hardware foundation of the course. In Lesson 1 you deliberately could not prove anything below the kubelet. Here you prove the whole thing - the complete GPU path to a running pod - on actual silicon, and capture the evidence.

✅ Validated on real hardware - the captured run is in real-gpu-validation-report.md. The steps below reproduce it on a fresh GPU VM.

🎯 After this part you can:

Stand up the full GPU path link by link: driver → container toolkit → runtime class → device plugin (GPU Operator) → kubelet → scheduler → CUDA container.
Confirm the node advertises real nvidia.com/gpu and discovered GFD labels, and contrast them with the script-written labels from Lesson 1.
Run a CUDA pod that executes nvidia-smi on a real GPU - the single most important artifact in the course.
Pull real DCGM telemetry, the foundation Lesson 3 builds on.

🧭 Mode: 🟥 Real GPU. One machine with an NVIDIA card - a bare GPU VM you get root on (Hyperstack, Lambda, hyperscaler) or a local GPU box. An L4 (24 GB) or RTX A6000 (48 GB) is plenty; you never need an A100/H100 for this. Marketplace containers (Vast.ai/RunPod pods) don't work because you can't install the toolkit + k3s inside them.

📋 Prerequisites: Lesson 1 done (you know what the simulation did and didn't prove), and a budget of $5-10 for the VM.

The iron rule: tear the VM down the moment evidence is captured. The evidence directory is the deliverable; the VM has no residual value. Delete the boot/storage volume too if it's billed separately.

What this validates (that simulation cannot)

The full GPU path to a pod, link by link, plus real telemetry:

NVIDIA driver → NVIDIA Container Toolkit → containerd runtime class
→ NVIDIA device plugin (GPU Operator) → kubelet → scheduler → CUDA container → DCGM

There are two ways to run it: the scripted path (what produced the validated evidence above) or the manual steps (do it by hand to understand each link). Both end at the same evidence.

Option 1 - Scripted path (fast)

ℹ️ Already did the Lesson 6 setup? Those scripts already do host setup → kubeconfig → install-gpu-operator.sh, so the cluster and GPU Operator are already up. Don't repeat the install block below - skip straight to Capture evidence.

If you haven't set the host up yet, the real-gpu-session/scripts/ directory automates it (read scripts/README.md first):

Run it all on the GPU VM (SSH in). Clone the repo, then run from the repo root:

# same as the Lesson 6 setup - SKIP if you already ran them
git clone https://github.com/ld-singh/ai-factory-ops-lab.git
cd ai-factory-ops-lab
sudo PUBLIC_IP=<vm-ip> bash portfolio-lab/real-gpu-session/scripts/host-setup.sh   # NVIDIA toolkit + k3s + API cert
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
portfolio-lab/real-gpu-session/scripts/install-gpu-operator.sh                     # GPU Operator + DCGM + a CUDA smoke pod

Prefer to drive kubectl from your laptop instead? Open TCP 6443 to the VM and run portfolio-lab/real-gpu-session/scripts/fetch-kubeconfig.sh <user>@<vm-ip> --key <key>, then export KUBECONFIG=$PWD/kubeconfig-gpuvm. Running on the VM (above) avoids a flaky laptop↔API link.

Capture evidence

Whether the operator went up just now or during setup, this is all Part A still needs. Run on the VM - it writes a tarball you scp back:

# on the VM, from the repo root:
portfolio-lab/real-gpu-session/scripts/capture-evidence.sh
# then from your laptop, pull the tarball back:
scp -i <key> <user>@<vm-ip>:~/gpu-evidence-*.tgz \
  ./portfolio-lab/06-validation-reports/evidence/

capture-evidence.sh snapshots host + in-pod nvidia-smi, nvidia.com/gpu allocatable, GFD labels, the GPU Operator pods, and real DCGM metrics - the full Part A evidence set.
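
Because the VM is torn down as soon as evidence is captured, it can be worth confirming that the copied tarball is readable before deleting anything. A minimal sketch: `tarball_ok` is a hypothetical helper, not part of the repository's scripts, and it only checks that the archive opens and is non-empty; it knows nothing about the evidence layout inside.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): succeed only if the file is a
# readable gzip tarball with at least one entry.
tarball_ok() {
  [ -s "$1" ] || return 1                                   # missing or zero bytes
  entries=$(tar -tzf "$1" 2>/dev/null | wc -l | tr -d ' ')  # count listed entries
  [ "$entries" -gt 0 ]
}

# Example use on the laptop, after the scp above:
#   for f in ./portfolio-lab/06-validation-reports/evidence/gpu-evidence-*.tgz; do
#     tarball_ok "$f" && echo "ok:  $f" || echo "BAD: $f"
#   done
```

Only tear the VM down once every tarball reports ok.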

✅ Gate: nvidia-smi from inside a scheduled pod, and real DCGM_FI_* metrics whose values match nvidia-smi.

Option 2 - Manual steps (understand each link)

Same result, by hand. Each step has a Pass criterion - its checkpoint.

ℹ️ Operator already up (from the Lesson 6 setup)? Then don't reinstall - reinstalling on top of a working operator only risks breaking it. Use the steps below purely to inspect each link (the kubectl / nvidia-smi checks), then capture evidence.

VERSION WARNING: package names, image tags, and Helm chart versions drift with driver/CUDA/Kubernetes releases. Treat each command as a validated pattern and cross-check the linked official docs before running.
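
The per-step Pass checkpoints in the steps that follow can be wrapped in a small runner so the first broken link is obvious. This is a sketch under assumptions: `check` and `run_checks` are hypothetical helpers, not part of the repository's scripts, and the commands they wrap are ones this page already uses. A command that merely succeeds is a weaker test than the full Pass criterion, so treat this as a quick triage, not a replacement for the checks in each step.

```shell
#!/bin/sh
# Hypothetical helper (not from the repo): run one checkpoint command
# quietly and print PASS/FAIL next to a label. Returns 0 on pass, 1 on fail.
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
    return 0
  fi
  echo "FAIL  $label"
  return 1
}

# Walk the links in path order and stop at the first one that fails.
run_checks() {
  check "driver: nvidia-smi lists the GPU" nvidia-smi -L &&
  check "kubernetes: API answers"          kubectl get nodes &&
  check "operator: pods are listable"      kubectl get pods -n gpu-operator
}
```

Calling `run_checks` on the VM prints one line per link and returns non-zero at the first FAIL, which is the link to fix before moving on.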

Step 1 - Driver

Most "deep learning / GPU" images ship the driver. Otherwise install per the NVIDIA driver guide.

nvidia-smi              # driver/CUDA version, GPU model
nvidia-smi -L           # GPU inventory with UUIDs
nvidia-smi topo -m      # topology (single GPU: trivial, capture anyway)

Pass: nvidia-smi lists the GPU without errors.
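
The Pass check can be made mechanical by counting the GPU rows in the `nvidia-smi -L` inventory. A minimal sketch: `count_gpus` is a hypothetical helper name, and it assumes each GPU appears as one line starting with `GPU <index>:`, which is how `nvidia-smi -L` prints its list.

```shell
#!/bin/sh
# Count GPUs in `nvidia-smi -L` output read from stdin.
# Each GPU is one line of the form "GPU <index>: <name> (UUID: GPU-...)".
# Always prints a number; "|| true" hides grep's non-zero exit on zero matches.
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Only runs where the driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  n=$(nvidia-smi -L | count_gpus)
  if [ "$n" -ge 1 ]; then
    echo "Pass: $n GPU(s) listed"
  else
    echo "Fail: no GPU listed"
  fi
fi
```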

Step 2 - NVIDIA Container Toolkit

Install + configure the runtime (nvidia-ctk runtime configure) per the toolkit install guide. On a Docker host you can prove the runtime injection independently of Kubernetes:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Pass: in-container nvidia-smi matches the host. (k3s uses containerd, not Docker, so this Docker check is optional - the CUDA pod in Step 5 proves the same path.)

Step 3 - Single-node Kubernetes

k3s is the fastest route to a single-node cluster (it bundles containerd; the GPU Operator supports it). host-setup.sh uses k3s with --tls-san <public-ip> so you can drive it from your laptop.

Pass: kubectl get nodes is Ready.
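
To turn the Ready check into an exit code, the STATUS column of `kubectl get nodes --no-headers` can be tested directly. A sketch only: `all_nodes_ready` is a hypothetical helper, and it assumes STATUS is the second whitespace-separated column, as in kubectl's default table output.

```shell
#!/bin/sh
# Succeed only if stdin (rows from `kubectl get nodes --no-headers`) has at
# least one node and every node's STATUS column is exactly "Ready".
# "NotReady" and "Ready,SchedulingDisabled" both count as failures.
all_nodes_ready() {
  awk 'NF { n++; if ($2 != "Ready") bad++ }
       END { exit !(n > 0 && bad == 0) }'
}

# Only runs where kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  if kubectl get nodes --no-headers 2>/dev/null | all_nodes_ready; then
    echo "Pass: all nodes Ready"
  else
    echo "Fail: a node is not Ready (or the API is unreachable)"
  fi
fi
```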

Step 4 - NVIDIA GPU Operator

Use the official getting-started docs for the current chart version. Pattern (driver pre-installed on the host):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false        # host driver already present

kubectl get pods -n gpu-operator                                   # all Running/Completed
kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'; echo
kubectl describe node | grep nvidia.com/gpu                        # GFD labels, now REAL

Pass: node advertises nvidia.com/gpu ≥ 1 and GFD labels match the real GPU - same label names as the simulation, now with real provenance.
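
The `grep nvidia.com/gpu` above also matches the capacity and allocatable rows. To isolate just the label entries, for example to save them and compare against the script-written labels from Lesson 1, a narrower filter helps. A minimal sketch: `gpu_labels` is a hypothetical helper, and it matches every `nvidia.com/gpu.<field>=<value>` label, which on a GPU Operator cluster includes the discovered ones (such as `nvidia.com/gpu.product`) alongside the operator's own bookkeeping labels.

```shell
#!/bin/sh
# Print every nvidia.com/gpu.<field>=<value> label found on stdin
# (for example from `kubectl describe node`), one per line.
# Resource rows such as "nvidia.com/gpu:  1" do not match because they
# have no ".<field>=" part.
gpu_labels() {
  grep -o 'nvidia\.com/gpu\.[A-Za-z0-9._-]*=[^[:space:]]*' || true
}

# Only runs where kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  kubectl describe node 2>/dev/null | gpu_labels
fi
```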

Step 5 - CUDA test pod

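# Newer kubectl releases no longer accept `kubectl run --limits`, so the GPU limit goes in --overrides.
# The override replaces the generated container list, so name, image, and args are repeated there.
kubectl run cuda-test --rm --attach --restart=Never \
  --image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"runtimeClassName":"nvidia","containers":[{"name":"cuda-test","image":"nvidia/cuda:12.4.1-base-ubuntu22.04","args":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'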

Pass: nvidia-smi from inside a scheduled pod. The complete path, proven.
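
The same test pod can be written as a manifest, which is easier to keep alongside the evidence than a one-line `kubectl run`. A sketch under assumptions: `write_cuda_pod` and the file name are hypothetical, the image tag and the `nvidia` runtime class are the ones used on this page, and whether `runtimeClassName` is strictly required depends on how the container runtime was configured.

```shell
#!/bin/sh
# Hypothetical helper: write a Pod manifest for the CUDA smoke test to the
# path given as $1. It only writes the file; applying it is a separate step.
write_cuda_pod() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      args: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Example use on the VM:
#   write_cuda_pod cuda-test.yaml
#   kubectl apply -f cuda-test.yaml
#   kubectl logs -f cuda-test        # should print the nvidia-smi table
#   kubectl delete pod cuda-test
```

Requesting `nvidia.com/gpu: 1` as a limit is what makes the scheduler and device plugin part of the proof; a pod without it would not exercise that half of the path.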

Step 6 - DCGM Exporter

The GPU Operator deploys it. The exporter container usually has no curl, so scrape via port-forward from the host:

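kubectl -n gpu-operator port-forward <dcgm-exporter-pod> 9400:9400 &
sleep 2      # give the port-forward a moment to bind before scraping
curl -s localhost:9400/metrics | grep -E 'DCGM_FI_DEV_(GPU_UTIL|FB_USED|GPU_TEMP|POWER_USAGE)'
kill $!      # stop the background port-forward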

Pass: real DCGM_FI_* values that match nvidia-smi (temp, power, FB used).
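
"Match" here is approximate, since the exporter and nvidia-smi sample at slightly different moments. One way to make the comparison explicit is to pull a single value out of the scraped text and compare it within a tolerance. A minimal sketch: `metric_value` and `close_enough` are hypothetical helpers, the tolerance of 3 is an arbitrary assumption, and the parser assumes sample lines end with the value (no trailing timestamp).

```shell
#!/bin/sh
# Print the value of one metric from Prometheus text output on stdin.
# Uses the first sample whose name matches exactly (fine for a single GPU).
metric_value() {
  awk -v m="$1" '
    /^#/ { next }                    # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)         # drop the {label="..."} part
      if (name == m) { print $NF; exit }
    }'
}

# Succeed if |a - b| <= tol.
close_enough() {
  awk -v a="$1" -v b="$2" -v tol="$3" \
    'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d <= tol) }'
}

# Example use, with the port-forward from this step still running:
#   dcgm=$(curl -s localhost:9400/metrics | metric_value DCGM_FI_DEV_GPU_TEMP)
#   smi=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
#   close_enough "$dcgm" "$smi" 3 && echo "Pass: temperature agrees"
```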

📸 Then capture everything with capture-evidence.sh and record the versions into real-gpu-validation-report.md.
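
For the version record, a small collector keeps the report consistent between runs. This is a sketch under assumptions: `record_version` and `collect_versions` are hypothetical helpers rather than part of capture-evidence.sh, and the list of tools is illustrative. Anything not installed is recorded as unavailable instead of aborting.

```shell
#!/bin/sh
# Hypothetical helper: print "label: <first line of command output>", or
# "label: unavailable" when the command is missing, fails, or prints nothing.
record_version() {
  label=$1
  shift
  if out=$("$@" 2>/dev/null) && [ -n "$out" ]; then
    printf '%s: %s\n' "$label" "$(printf '%s\n' "$out" | head -n 1)"
  else
    printf '%s: unavailable\n' "$label"
  fi
}

# Illustrative set of versions worth pasting into the validation report.
collect_versions() {
  record_version "driver"  nvidia-smi --query-gpu=driver_version --format=csv,noheader
  record_version "gpu"     nvidia-smi --query-gpu=name --format=csv,noheader
  record_version "toolkit" nvidia-ctk --version
  record_version "k3s"     k3s --version
  record_version "helm"    helm version --short
}
```

Running `collect_versions > versions.txt` on the VM gives a short block to copy into real-gpu-validation-report.md.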

Next parts (same rental)

Stay on this GPU and continue the Lesson 6 session:

Part B - HAMi GPU sharing: turn this one card into enforced slices and prove multi-pod co-residency - the highest-value add-on.
Part C - inference benchmark (vLLM): real TTFT / latency / tokens-per-second on one mid-range GPU.
Part D - Slurm real --gres=gpu: the real counterpart to the fake-GRES Slurm lesson.

🔬 What this part proves - and does NOT

Proves: the real, end-to-end GPU runtime path on one node, plus real DCGM telemetry - exactly what Lesson 1's simulation could not.

Does NOT prove: anything beyond a single node. This part is single-node by design, so it says nothing about NCCL collective performance, NVLink/NVSwitch topology, GPUDirect RDMA, multi-node training, or production-scale fleet ops. It proves the path and telemetry, not scale. Full ledger: fake-vs-real-limitations.md.

➡️ Next: back to the Lesson 6 hub for Part B - HAMi sharing on this same GPU.