The GPU Path to a Kubernetes Pod

This chain is the single most useful mental model in this lab: every link is a distinct failure domain with its own runbook.

flowchart TD
    HW[NVIDIA GPU hardware] --> DRV[NVIDIA kernel driver\nnvidia-smi works on host]
    DRV --> CTK[NVIDIA Container Toolkit\nruntime injects GPU into containers]
    CTK --> RT[containerd / CRI runtime\nnvidia runtime class configured]
    RT --> DP[NVIDIA device plugin\ndeployed by GPU Operator]
    DP --> KUBELET[kubelet\nadvertises nvidia.com/gpu allocatable]
    KUBELET --> API[Kubernetes API\nnode.status.allocatable]
    API --> SCHED[kube-scheduler\nmatches pod requests to allocatable]
    SCHED --> POD[Pod with nvidia.com/gpu limit\nCUDA visible inside container]

    style HW fill:#76b900,color:#000
    style POD fill:#76b900,color:#000
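
The last three links of the chain (allocatable, scheduler match, pod limit) can be sketched in a few lines. This is a deliberately simplified illustration, not the kube-scheduler's actual algorithm: it assumes the node and pod are plain dicts shaped like the JSON from `kubectl get ... -o json`, ignores init containers, taints, and every other resource, and the helper names `allocatable_gpus`, `requested_gpus`, and `fits` are hypothetical.

```python
def allocatable_gpus(node: dict) -> int:
    """Read nvidia.com/gpu from node.status.allocatable (0 if absent)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get("nvidia.com/gpu", "0"))


def requested_gpus(pod: dict) -> int:
    """Sum the nvidia.com/gpu limits over the pod's regular containers."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get("nvidia.com/gpu", "0"))
    return total


def fits(pod: dict, node: dict, already_requested: int = 0) -> bool:
    """Simplified fit check: does the GPU request fit what is still free?"""
    return requested_gpus(pod) + already_requested <= allocatable_gpus(node)


# Example objects (made-up values for illustration only).
node = {"status": {"allocatable": {"nvidia.com/gpu": "4"}}}
pod = {
    "spec": {
        "containers": [
            {"name": "cuda", "resources": {"limits": {"nvidia.com/gpu": "2"}}}
        ]
    }
}
```

With these example objects, `fits(pod, node)` is true, while `fits(pod, node, already_requested=3)` is false, which corresponds to the pod staying Pending with an "Insufficient nvidia.com/gpu" event.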

Simulation boundary

flowchart LR
    subgraph REAL_ONLY["Real GPU mode only (Lesson 6)"]
        HW2[GPU] --> DRV2[Driver] --> CTK2[Container Toolkit] --> DP2[Device plugin]
    end
    subgraph SIMULATED["Simulated by KWOK fake nodes (Phase 1)"]
        ALLOC[allocatable: nvidia.com/gpu] --> SCHED2[Scheduler decision] --> PLACE[Pod placement]
    end
    DP2 --> ALLOC

KWOK injects allocatable directly via the Node object, so everything downstream of the device plugin (allocatable, scheduler decision, pod placement) is exercised faithfully; the device plugin and everything upstream of it is not exercised at all.
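
As a rough sketch of what "injecting allocatable via the Node object" can look like, the snippet below builds a fake Node manifest with `nvidia.com/gpu` in both capacity and allocatable. The `kwok.x-k8s.io/node: fake` annotation and `type: kwok` label follow the example in the KWOK documentation; whether KWOK picks the node up depends on how it is configured in this lab. The function name `fake_gpu_node` and the CPU, memory, and pod values are assumptions chosen for illustration.

```python
import json


def fake_gpu_node(name: str, gpus: int) -> dict:
    """Build a KWOK-style fake Node that advertises `gpus` NVIDIA GPUs."""
    # Illustrative sizes; only nvidia.com/gpu matters for GPU scheduling.
    resources = {
        "cpu": "32",
        "memory": "256Gi",
        "pods": "110",
        "nvidia.com/gpu": str(gpus),
    }
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": name,
            # Annotation and label as used in the KWOK docs example (assumed
            # to match this lab's KWOK configuration).
            "annotations": {"kwok.x-k8s.io/node": "fake"},
            "labels": {"type": "kwok"},
        },
        "status": {
            "capacity": dict(resources),
            "allocatable": dict(resources),
        },
    }


manifest = fake_gpu_node("kwok-gpu-node-0", gpus=8)
manifest_json = json.dumps(manifest, indent=2)
```

Because `kubectl apply -f -` accepts JSON, printing `manifest_json` and piping it to kubectl is one way to create the node. No driver, toolkit, or device plugin is involved at any point, which is exactly the simulation boundary drawn above.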

Failure-domain to runbook mapping

| Broken link | Symptom | Runbook |
| --- | --- | --- |
| Driver | `nvidia-smi` fails on host | `runbooks/gpu-node-not-ready.md` |
| Container Toolkit | `docker run --gpus all` fails | `runbooks/gpu-operator-driver-pod-failing.md` |
| Device plugin | node shows 0 `nvidia.com/gpu` | `runbooks/device-plugin-not-advertising-gpus.md` |
| Scheduler fit | pod Pending, Insufficient `nvidia.com/gpu` | `runbooks/gpu-capacity-planning.md` |
| DCGM | no GPU metrics | `runbooks/dcgm-exporter-no-metrics.md` |
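
One way to use the table above during triage is to walk the links in chain order and open the runbook for the first one that is not known to be healthy, since a break upstream usually explains the symptoms downstream. The sketch below assumes the health of each link has already been determined by hand or by separate checks; `first_broken_link` is a hypothetical helper, and placing DCGM last is an assumption (it is a monitoring side path rather than a link in the scheduling chain).

```python
from typing import Dict, Optional, Tuple

# Links and runbook paths copied from the table above, in chain order.
RUNBOOKS = [
    ("Driver", "runbooks/gpu-node-not-ready.md"),
    ("Container Toolkit", "runbooks/gpu-operator-driver-pod-failing.md"),
    ("Device plugin", "runbooks/device-plugin-not-advertising-gpus.md"),
    ("Scheduler fit", "runbooks/gpu-capacity-planning.md"),
    ("DCGM", "runbooks/dcgm-exporter-no-metrics.md"),
]


def first_broken_link(healthy: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Return (link, runbook) for the first link not marked healthy.

    Links missing from `healthy` count as unverified and are reported,
    so an incomplete check never reads as "all clear".
    """
    for link, runbook in RUNBOOKS:
        if not healthy.get(link, False):
            return link, runbook
    return None


# Example: the driver and toolkit work, but the node shows 0 nvidia.com/gpu.
observed = {
    "Driver": True,
    "Container Toolkit": True,
    "Device plugin": False,
    "Scheduler fit": False,
    "DCGM": True,
}
result = first_broken_link(observed)
```

Here `result` points at the device plugin runbook rather than the capacity-planning one: the Pending pod is a downstream consequence, not the root cause.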