The GPU Path to a Kubernetes Pod¶
The single most useful mental model in this lab. Every link is a distinct failure domain with its own runbook.
flowchart TD
HW[NVIDIA GPU hardware] --> DRV[NVIDIA kernel driver\nnvidia-smi works on host]
DRV --> CTK[NVIDIA Container Toolkit\nruntime injects GPU into containers]
CTK --> RT[containerd / CRI runtime\nnvidia runtime class configured]
RT --> DP[NVIDIA device plugin\ndeployed by GPU Operator]
DP --> KUBELET[kubelet\nadvertises nvidia.com/gpu allocatable]
KUBELET --> API[Kubernetes API\nnode.status.allocatable]
API --> SCHED[kube-scheduler\nmatches pod requests to allocatable]
SCHED --> POD[Pod with nvidia.com/gpu limit\nCUDA visible inside container]
style HW fill:#76b900,color:#000
style POD fill:#76b900,color:#000
Simulation boundary¶
flowchart LR
subgraph REAL_ONLY["Real GPU mode only (Lesson 6)"]
HW2[GPU] --> DRV2[Driver] --> CTK2[Container Toolkit] --> DP2[Device plugin]
end
subgraph SIMULATED["Simulated by KWOK fake nodes (Phase 1)"]
ALLOC[allocatable: nvidia.com/gpu] --> SCHED2[Scheduler decision] --> PLACE[Pod placement]
end
DP2 --> ALLOC
KWOK injects allocatable directly via the Node object, so everything to the
right of the device plugin is exercised faithfully; everything to the left is
not exercised at all.
Failure-domain to runbook mapping¶
| Broken link | Symptom | Runbook |
|---|---|---|
| Driver | nvidia-smi fails on host |
runbooks/gpu-node-not-ready.md |
| Container Toolkit | docker run --gpus all fails |
runbooks/gpu-operator-driver-pod-failing.md |
| Device plugin | node shows 0 nvidia.com/gpu |
runbooks/device-plugin-not-advertising-gpus.md |
| Scheduler fit | pod Pending, Insufficient nvidia.com/gpu | runbooks/gpu-capacity-planning.md |
| DCGM | no GPU metrics | runbooks/dcgm-exporter-no-metrics.md |