Real GPU Validation Report - Module 01 (Lesson 2 / Phase 2)¶
STATUS: ✅ VALIDATED - captured on a real NVIDIA RTX A6000 (Hyperstack), 2026-06-22. Evidence: real command output captured during the run (artifact files cited in the table below).
Environment¶
| Item | Value |
|---|---|
| Date | 2026-06-22 |
| Machine | Hyperstack GPU VM (OpenStack-Nova), Ubuntu 22.04.5 LTS, kernel 6.8.0-40-generic |
| GPU model | NVIDIA RTX A6000, 48 GB (49140 MiB), UUID GPU-f4591c93-8109-e804-f47f-35acdb86acae |
| GPU architecture | Ampere (nvidia.com/gpu.family=ampere, compute capability 8.6); MIG: N/A (not MIG-capable - software sharing is the only option, the Lesson 1C premise) |
| Driver version | 535.183.06 |
| CUDA version (driver-reported) | 12.2 (host nvidia-smi); a CUDA 12.4 base image ran fine in-pod |
| Kubernetes distro/version | k3s v1.35.5+k3s1, runtime containerd://2.2.3-k3s1 (single control-plane node, schedulable) |
| GPU Operator | full stack deployed via the nvidia/gpu-operator Helm chart (chart version not captured - helm was absent on the VM at evidence time; operand pods enumerated below) |
Validation checklist¶
| Step | Pass criteria | Result | Evidence file |
|---|---|---|---|
| Driver | nvidia-smi lists the GPU |
✅ RTX A6000, driver 535.183.06 | nvidia-smi.txt, nvidia-smi-L.txt |
| Container runtime path | in-container nvidia-smi matches host |
✅ via a CUDA pod (k3s uses containerd, not Docker, so no docker-cuda-smi) |
cuda-pod-nvidia-smi.txt |
| GPU Operator | all components Running/Completed | ✅ device-plugin, GFD, dcgm-exporter, NFD, operator-validator Running; nvidia-cuda-validator Completed |
gpu-operator-pods.txt |
| Device plugin | node nvidia.com/gpu Capacity & Allocatable >= 1 |
✅ both = 1 | node-describe.txt, gpu-allocatable.txt |
| GFD labels | real discovered labels present | ✅ gpu.product=NVIDIA-RTX-A6000, gpu.family=ampere, gpu.memory=49140, gpu.count=1, cuda.driver-version.full=535.183.06, gpu.compute.major/minor=8/6 |
node-describe.txt |
| CUDA test pod | nvidia-smi from inside a scheduled pod |
✅ pod on runtimeClassName: nvidia, nvidia.com/gpu: 1, ran nvidia-smi on the A6000 |
cuda-pod-nvidia-smi.txt |
| DCGM Exporter | real DCGM_FI_* metrics via curl |
✅ scraped real metrics, tagged with the GPU UUID + modelName="NVIDIA RTX A6000": GPU_TEMP=41, POWER_USAGE=29.6W, FB_USED=1MiB, FB_FREE=48675, SM_CLOCK=210 - all matching nvidia-smi |
dcgm-metrics.txt |
Comparison against the simulation (Lesson 1)¶
- GFD labels now have real provenance. Lesson 1's fake fleet had script-written labels; here the identical label names (
nvidia.com/gpu.product,.family,.memory,cuda.driver-version.*) are discovered by GFD from the actual A6000 - same shape, real source. - The full operator stack is present (driver-validation, container-toolkit, device-plugin, GFD, DCGM, validators) and the operator's own
nvidia-cuda-validatorran to Completed - none of which exists on the fake fleet. - Allocation is real:
nvidia.com/gpuCapacity = Allocatable = 1, and a pod actually executednvidia-smion the device. - Telemetry is real: the DCGM exporter's
DCGM_FI_DEV_*values (temp 41 °C, power 29.6 W, FB used 1 MiB) matchnvidia-smiand carry the real GPU UUID - the genuine version of the fabricatedDCGM_FI_*stream the fake-DCGM exporter produces in Lessons 3/4.
What this run proves¶
The complete GPU path on real silicon: driver → NVIDIA Container Toolkit → k3s containerd
(nvidia RuntimeClass) → GPU Operator device plugin → kubelet → scheduler → CUDA
container, with discovered GFD labels and real DCGM telemetry (values matching
nvidia-smi, tagged with the GPU UUID). This is exactly what Lesson 1's simulation
explicitly could not prove.
Scope limits¶
Single node by design. Proves the runtime path, not scale: no NCCL/collective
performance, no NVLink/NVSwitch topology, no GPUDirect RDMA, no multi-node training. It
also does not prove GPU sharing or memory-cap isolation - that's Lesson 1C Part B
(../01-k8s-gpu-platform/hami/hami-isolation-realgpu/),
run next on this same A6000. Full ledger:
fake-vs-real-limitations.md.