Runbook - Node Not Advertising nvidia.com/gpu
Severity: High - GPU capacity silently missing from the fleet; workloads Pending.
Applies to: real GPU clusters with NVIDIA GPU Operator. (In the simulation,
the analogue is a fake node missing its status.allocatable entry.)
🧩 Running HAMi instead of the standard device plugin? Don't chase a non-bug: HAMi advertises only
nvidia.com/gpu in status.allocatable (= physical GPUs × deviceSplitCount, default 10). It does not put nvidia.com/gpumem or nvidia.com/gpucores there - those are accounted by the HAMi scheduler from the hami.io/node-nvidia-register node annotation and enforced per-pod by HAMi-core. So the health check is nvidia.com/gpu present and the annotation populated:
kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'; echo
kubectl get node -o jsonpath='{.items[0].metadata.annotations.hami\.io/node-nvidia-register}'; echo
Empty annotation → check the hami-device-plugin pod logs and that the node is labelled gpu=on. Also note HAMi requires nvidia to be the default containerd runtime - see k3s default runtime / containerd config.
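The two HAMi checks can be folded into a single verdict. The following is a minimal sketch, not part of HAMi: the function names hami_verdict and hami_check are hypothetical, and the wrapper assumes you pass the node name explicitly rather than relying on the first node returned.

```shell
#!/bin/sh
# Classify a HAMi node from the two values this runbook inspects.
# $1 = status.allocatable["nvidia.com/gpu"]
# $2 = value of the hami.io/node-nvidia-register annotation
hami_verdict() {
  gpu="$1"; ann="$2"
  if [ -z "$gpu" ] || [ "$gpu" = "0" ]; then
    echo "FAIL: nvidia.com/gpu missing or zero"
  elif [ -z "$ann" ]; then
    echo "FAIL: register annotation empty (check hami-device-plugin logs and the gpu=on label)"
  else
    echo "OK: nvidia.com/gpu=$gpu, annotation populated"
  fi
}

# Hypothetical wrapper: fetch both values for node $1 and classify them.
hami_check() {
  gpu=$(kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
  ann=$(kubectl get node "$1" -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}')
  hami_verdict "$gpu" "$ann"
}
```

Usage: hami_check <node>. Keeping the classification separate from the kubectl calls lets you dry-run the logic against values pasted from an incident ticket.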
Symptom
- GPU pods Pending with Insufficient nvidia.com/gpu despite a GPU node existing
- kubectl describe node <node> shows nvidia.com/gpu: 0 or no entry at all
Triage order (walk the GPU path bottom-up)
The fastest diagnosis follows diagrams/gpu-path-to-pod.md from hardware up.
Each step isolates one failure domain.
1. Driver layer (on the node)
Run nvidia-smi directly on the host, outside any container.
- Fails / no devices → driver problem. Check dmesg | grep -i nvidia, driver package state, secure boot / kernel module signing, and recent kernel upgrades that orphaned the DKMS module. Stop here and fix the driver first.
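The driver-layer checks named in this step can be gathered in one pass on the node. This is a rough sketch under stated assumptions: the function names are hypothetical, dkms and mokutil are only probed if installed, and the output is meant for a human to read, not as a definitive health signal.

```shell
#!/bin/sh
# Succeeds if lsmod-style text on stdin lists the core "nvidia" kernel module.
# Matches the module name exactly, so nvidia_uvm or nouveau alone do not count.
nvidia_module_loaded() {
  awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Run on the GPU node itself. Optional tools are skipped when absent.
driver_layer_report() {
  echo "kernel: $(uname -r)"
  if lsmod | nvidia_module_loaded; then
    echo "nvidia module: loaded"
  else
    echo "nvidia module: NOT loaded"
  fi
  command -v dkms >/dev/null 2>&1 && dkms status            # DKMS build state per kernel
  command -v mokutil >/dev/null 2>&1 && mokutil --sb-state  # Secure Boot on or off
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L    # GPUs the driver can see
  return 0
}
```

A kernel version in the first line that has no matching entry in the dkms status output is the usual sign of a kernel upgrade that orphaned the module.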
2. Container runtime layer (on the node)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# or for containerd-only nodes, check the runtime config:
grep -A3 nvidia /etc/containerd/config.toml
- Fails while host nvidia-smi works → Container Toolkit / runtime config problem. Re-run nvidia-ctk runtime configure per official docs and restart the runtime.
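To make the containerd check above less eyeball-dependent, the config can be classified mechanically. This is a minimal sketch: containerd_nvidia_state is a hypothetical helper, and it assumes the nvidia runtime appears as a runtimes.nvidia table and the default is set through default_runtime_name, as nvidia-ctk normally writes them. Config paths and layouts differ between distributions, so treat the result as a hint.

```shell
#!/bin/sh
# Read a containerd config.toml on stdin and print one of:
#   missing     no nvidia runtime table found
#   registered  nvidia runtime present but not the default
#   default     nvidia runtime present and set as default
containerd_nvidia_state() {
  awk '
    /runtimes\.nvidia/ { registered = 1 }
    /default_runtime_name[ \t]*=[ \t]*"nvidia"/ { dflt = 1 }
    END {
      if (!registered)  print "missing"
      else if (dflt)    print "default"
      else              print "registered"
    }'
}

# Example (run on the node):
#   containerd_nvidia_state < /etc/containerd/config.toml
# If the result is "missing", re-run the toolkit configuration and restart:
#   sudo nvidia-ctk runtime configure --runtime=containerd
#   sudo systemctl restart containerd
```

For HAMi nodes the expected result is "default"; for the standard GPU Operator path "registered" is normally enough.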
3. GPU Operator components
kubectl get pods -n gpu-operator -o wide | grep -E 'device-plugin|driver|toolkit|validator'
kubectl logs -n gpu-operator <device-plugin-pod>
kubectl describe pod -n gpu-operator <failing-pod>
Common findings:
- Driver daemonset pod CrashLoopBackOff → see gpu-operator-driver-pod-failing.md
- Device plugin running but logging "no devices found" → step 1 or 2 actually
failed; re-verify
- Validator pods failing → read their logs; they name the broken layer
4. Node labels / selectors
- GPU Operator daemonsets target nodes via feature labels; a node that lost its labels (e.g. after re-provisioning) gets no device plugin pod at all.
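A quick way to confirm whether a node still carries its GPU feature labels is sketched below. It assumes the labels commonly applied by Node Feature Discovery and the GPU Operator (nvidia.com/gpu.present=true and feature.node.kubernetes.io/pci-10de.present=true); verify the exact label set against your operator version. The function names are hypothetical.

```shell
#!/bin/sh
# Succeeds if the comma-separated label list on stdin (the LABELS column of
# "kubectl get node --show-labels") contains nvidia.com/gpu.present=true.
has_gpu_present_label() {
  tr ',' '\n' | grep -qxF 'nvidia.com/gpu.present=true'
}

# Print only the NVIDIA-related labels of node $1, one per line.
node_gpu_labels() {
  kubectl get node "$1" --show-labels --no-headers \
    | awk '{ print $NF }' | tr ',' '\n' \
    | grep -E 'nvidia\.com/|pci-10de'
}

# List every node the operator would treat as a GPU node.
gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true
}
```

If node_gpu_labels <node> prints nothing for a node that physically has a GPU, look at the Node Feature Discovery pods before the device plugin.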
5. kubelet registration
kubectl get node <node> -o jsonpath='{.status.allocatable}' | jq .
journalctl -u kubelet --since "1 hour ago" | grep -i 'device plugin'
- Device plugin healthy but allocatable still 0 → kubelet may need a restart to
re-register the plugin socket; check for stale sockets under
/var/lib/kubelet/device-plugins/.
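The socket check can be scripted as follows. This is a sketch under assumptions: plugin_sockets is a hypothetical helper, and the expectation that a healthy node shows kubelet.sock plus an NVIDIA plugin socket (often nvidia-gpu.sock) should be confirmed against your device plugin version.

```shell
#!/bin/sh
# Print the names of the *.sock entries in the kubelet device-plugin directory.
# $1 overrides the directory (handy for dry runs); the default is the kubelet path.
plugin_sockets() {
  dir="${1:-/var/lib/kubelet/device-plugins}"
  for s in "$dir"/*.sock; do
    [ -e "$s" ] && basename "$s"
  done
  return 0
}

# Typical follow-up on the node if the NVIDIA socket is missing or stale:
#   kubectl delete pod -n gpu-operator <device-plugin-pod>   # plugin re-registers on restart
#   sudo systemctl restart kubelet                           # only if that is not enough
```

Compare the socket timestamps (ls -l on the same directory) with the device plugin pod start time; a socket older than the pod suggests the plugin never re-registered.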
Resolution verification
kubectl get node <node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
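# kubectl run no longer accepts --limits (removed in v1.24); set the GPU limit via --overrides.
kubectl run cuda-verify --rm -it --restart=Never \
--image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
--override-type=strategic \
--overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"cuda-verify","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
-- nvidia-smi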
Prevention
- Alert on sum(node allocatable nvidia.com/gpu) dropping below expected fleet size (Phase 4 alert rule)
- Pin GPU Operator and driver versions; treat upgrades as change-managed events
- Run the validator checks after every node provision/reboot
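The fleet-size comparison behind the alert can be spot-checked from a shell. This is an illustrative sketch only, not the Phase 4 alert rule itself; the function names and the expected-count argument are assumptions.

```shell
#!/bin/sh
# Sum one GPU count per line on stdin; blank or non-numeric lines count as 0.
sum_gpus() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# Total allocatable nvidia.com/gpu across all nodes (nodes without the key print a blank line).
fleet_gpu_total() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
    | sum_gpus
}

# Compare actual ($2) against expected ($1); non-zero exit when capacity is short.
check_fleet() {
  expected="$1"; actual="$2"
  if [ "$actual" -lt "$expected" ]; then
    echo "ALERT: $actual of $expected GPUs allocatable"
    return 1
  fi
  echo "OK: $actual of $expected GPUs allocatable"
}
```

Usage: check_fleet 16 "$(fleet_gpu_total)". On HAMi clusters remember that the advertised count is physical GPUs × deviceSplitCount, so the expected value must be scaled accordingly.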
Drill in this lab
Simulation: delete the nvidia.com/gpu allocatable from one fake node and
watch scheduling behaviour change; practice the triage narrative.
Real mode: after Phase 2 setup, stop the device plugin daemonset and walk this
runbook end to end, capturing output for the validation report.