Runbook - Node Not Advertising nvidia.com/gpu

Severity: High - GPU capacity is silently missing from the fleet and workloads stay Pending.

Applies to: real GPU clusters running the NVIDIA GPU Operator. (In the simulation, the analogue is a fake node missing its status.allocatable entry.)

🧩 Running HAMi instead of the standard device plugin? Don't chase a non-bug: HAMi advertises only nvidia.com/gpu in status.allocatable (= physical GPUs × deviceSplitCount, default 10). It does not put nvidia.com/gpumem or nvidia.com/gpucores there; the HAMi scheduler accounts for those from the hami.io/node-nvidia-register node annotation, and HAMi-core enforces them per pod. So the health check is that nvidia.com/gpu is present and the annotation is populated (the commands below inspect only the first node returned):
kubectl get node -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'; echo
kubectl get node -o jsonpath='{.items[0].metadata.annotations.hami\.io/node-nvidia-register}'; echo
Empty annotation → check the hami-device-plugin pod logs and that the node is labelled gpu=on. Also note HAMi requires nvidia to be the default containerd runtime - see k3s default runtime / containerd config.
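The HAMi arithmetic above can be turned into a quick sanity check. This is a minimal sketch, assuming you know the physical GPU count of the node; hami_expected_gpu and hami_gpu_ok are hypothetical helper names, not part of HAMi.

```shell
# Expected nvidia.com/gpu under HAMi = physical GPUs x deviceSplitCount.
# hami_expected_gpu and hami_gpu_ok are hypothetical helper names.
hami_expected_gpu() {
  # $1 = physical GPU count, $2 = deviceSplitCount (defaults to 10)
  echo $(( $1 * ${2:-10} ))
}

hami_gpu_ok() {
  # $1 = advertised nvidia.com/gpu, $2 = physical GPUs, $3 = deviceSplitCount
  [ "${1:-0}" = "$(hami_expected_gpu "$2" "${3:-10}")" ]
}

# Example against a live node (replace <node> and the GPU count):
#   adv=$(kubectl get node <node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
#   hami_gpu_ok "$adv" 4 || echo "unexpected nvidia.com/gpu: $adv"
```

A node with 4 physical GPUs and the default split should therefore advertise 40, not 4.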

Symptom

GPU pods Pending with Insufficient nvidia.com/gpu despite a GPU node existing
kubectl describe node <node> shows nvidia.com/gpu: 0 or no entry at all
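To spot the second symptom across the whole fleet rather than one node at a time, a small filter over a custom-columns listing can help. This is a sketch only; nodes_missing_gpu is a hypothetical helper, and it will also list CPU-only nodes, which legitimately advertise nothing.

```shell
# Reads "NODE GPU" rows (header included) on stdin and prints the nodes
# whose nvidia.com/gpu allocatable is absent (<none>) or 0.
nodes_missing_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Example:
#   kubectl get nodes \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_missing_gpu
```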

Triage order (walk the GPU path bottom-up)

The fastest diagnosis follows diagrams/gpu-path-to-pod.md from hardware up. Each step isolates one failure domain.

1. Driver layer (on the node)

nvidia-smi

Fails / no devices → driver problem. Check dmesg | grep -i nvidia, driver package state, secure boot / kernel module signing, recent kernel upgrades that orphaned the DKMS module. Stop here and fix the driver first.
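For the orphaned-DKMS case, one way to confirm it is to check whether dkms reports an NVIDIA module built for the running kernel. The sketch below assumes the driver was installed through DKMS and that `dkms status` prints lines of the form `nvidia/<version>, <kernel>, <arch>: installed`; the exact format varies between dkms versions, so treat the pattern as an assumption. nvidia_dkms_built_for is a hypothetical helper name.

```shell
# Succeeds if the `dkms status` text on stdin shows an nvidia module
# installed for the kernel release given as $1.
nvidia_dkms_built_for() {
  grep -i '^nvidia' | grep -F -- ", $1," | grep -q 'installed'
}

# Example (on the node):
#   dkms status | nvidia_dkms_built_for "$(uname -r)" \
#     || echo "no NVIDIA DKMS build for the running kernel $(uname -r)"
```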

2. Container runtime layer (on the node)

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
# or for containerd-only nodes, check the runtime config:
grep -A3 nvidia /etc/containerd/config.toml

Fails while host nvidia-smi works → Container Toolkit / runtime config problem. Re-run nvidia-ctk runtime configure (with --runtime=docker or --runtime=containerd, per the official docs) and restart that runtime.
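The grep above shows the raw config; a slightly stricter check distinguishes "nvidia runtime missing" from "defined but not the default". This is a sketch under the assumption that the config uses the usual `...containerd.runtimes.nvidia]` section header and a `default_runtime_name` key; containerd_nvidia_state is a hypothetical helper name.

```shell
# Reads a containerd config.toml on stdin and prints one of:
#   missing  - no nvidia runtime section found
#   defined  - nvidia runtime present but not the default
#   default  - nvidia runtime present and set as default_runtime_name
containerd_nvidia_state() {
  awk '
    /runtimes\.nvidia\]/ { has_rt = 1 }
    /^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"/ { is_def = 1 }
    END {
      if (!has_rt) print "missing"
      else if (is_def) print "default"
      else print "defined"
    }'
}

# Example (on the node):
#   containerd_nvidia_state < /etc/containerd/config.toml
```

For the HAMi case described at the top of this runbook, the expected answer is "default".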

3. GPU Operator components

kubectl get pods -n gpu-operator -o wide | grep -E 'device-plugin|driver|toolkit|validator'
kubectl logs -n gpu-operator <device-plugin-pod>
kubectl describe pod -n gpu-operator <failing-pod>

Common findings:
- Driver daemonset pod CrashLoopBackOff → see gpu-operator-driver-pod-failing.md
- Device plugin running but logging "no devices found" → step 1 or 2 actually failed; re-verify
- Validator pods failing → read their logs; they name the broken layer
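When the namespace has many pods, filtering the listing down to the ones that are not healthy makes these findings easier to see. A minimal sketch, assuming the default `kubectl get pods --no-headers` column order (NAME READY STATUS ...); unhealthy_pods is a hypothetical helper name.

```shell
# Reads `kubectl get pods --no-headers` rows on stdin and prints pods that are
# neither Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '
    { split($2, r, "/") }                 # READY column, e.g. 0/1
    $3 == "Completed" { next }            # finished validator pods are fine
    $3 != "Running" || r[1] != r[2] { print $1 ": " $3 " (" $2 ")" }'
}

# Example:
#   kubectl get pods -n gpu-operator --no-headers | unhealthy_pods
```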

4. Node labels / selectors

kubectl get node <node> --show-labels | tr ',' '\n' | grep -i nvidia

GPU Operator daemonsets target nodes via feature labels; a node that lost its labels (e.g. after re-provisioning) gets no device plugin pod at all.
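A label check can be made explicit by listing which expected labels are absent. This is a sketch; missing_gpu_labels is a hypothetical helper, and the label names in the example (nvidia.com/gpu.present=true and the Node Feature Discovery PCI label for vendor 10de) are assumptions to confirm against the nodeSelector of your own daemonsets.

```shell
# stdin: comma-separated labels (the LABELS column of --show-labels)
# args:  labels that must be present, as key=value
# Prints each expected label that is missing.
missing_gpu_labels() {
  labels=$(tr ',' '\n')
  for want in "$@"; do
    printf '%s\n' "$labels" | grep -qxF -- "$want" || echo "$want"
  done
}

# Example (label names are assumptions, see above):
#   kubectl get node <node> --show-labels --no-headers | awk '{ print $NF }' \
#     | missing_gpu_labels nvidia.com/gpu.present=true \
#         feature.node.kubernetes.io/pci-10de.present=true
```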

5. kubelet registration

kubectl get node <node> -o jsonpath='{.status.allocatable}' | jq .
journalctl -u kubelet --since "1 hour ago" | grep -i 'device plugin'

Device plugin healthy but allocatable still 0 → kubelet may need a restart to re-register the plugin socket; check for stale sockets under /var/lib/kubelet/device-plugins/.
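To look for stale sockets, list only the socket files in that directory with their timestamps and compare them with the device plugin pod's start time. The helper below is a sketch; list_plugin_sockets is a hypothetical name, and deciding that a socket is stale remains a manual judgment (kubelet.sock is the kubelet's own registration socket and is expected to be there).

```shell
# Lists unix sockets (with timestamps) in the kubelet device plugin directory.
# $1 = directory, defaulting to the path named in this step.
list_plugin_sockets() {
  dir=${1:-/var/lib/kubelet/device-plugins}
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  find "$dir" -maxdepth 1 -type s -exec ls -l {} +
}

# Example (on the node, typically as root):
#   list_plugin_sockets
```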

Resolution verification

kubectl get node <node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
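# kubectl 1.24+ no longer accepts --limits on `kubectl run`; set the GPU limit via --overrides
kubectl run cuda-verify --rm --attach --restart=Never \
  --image=nvidia/cuda:12.4.1-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"cuda-verify","image":"nvidia/cuda:12.4.1-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'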
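After a fix, the allocatable value can take a short while to reappear, so polling is often more convenient than re-running the first command by hand. A minimal sketch; wait_for_gpu is a hypothetical helper and the default of 30 attempts at 10-second intervals is an arbitrary assumption.

```shell
# Polls a node until it advertises a non-zero nvidia.com/gpu allocatable.
# $1 = node name, $2 = max attempts (default 30), $3 = seconds between attempts (default 10)
wait_for_gpu() {
  node=$1; tries=${2:-30}; pause=${3:-10}
  while [ "$tries" -gt 0 ]; do
    count=$(kubectl get node "$node" \
      -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null)
    case "$count" in
      ''|0|*[!0-9]*) ;;   # missing, zero, or not a number: keep waiting
      *) echo "$node advertises $count GPU(s)"; return 0 ;;
    esac
    tries=$((tries - 1))
    sleep "$pause"
  done
  echo "$node still advertises no nvidia.com/gpu" >&2
  return 1
}

# Example:
#   wait_for_gpu <node> && echo "safe to run the CUDA verification pod"
```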

Prevention

Alert on sum(node allocatable nvidia.com/gpu) dropping below expected fleet size (Phase 4 alert rule)
Pin GPU Operator and driver versions; treat upgrades as change-managed events
Run the validator checks after every node provision/reboot
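Until the Phase 4 alert rule exists, the fleet-size check in the first item can be approximated from the command line or a cron job. This is a shell stand-in, not the alert rule itself; fleet_gpu_total and check_fleet_gpus are hypothetical helpers and the expected fleet size is whatever you pass in.

```shell
# stdin: one allocatable nvidia.com/gpu value per line; non-numeric rows
# (for example <none> on CPU-only nodes) are ignored.
fleet_gpu_total() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# $1 = expected fleet size; stdin as for fleet_gpu_total.
check_fleet_gpus() {
  total=$(fleet_gpu_total)
  if [ "$total" -lt "$1" ]; then
    echo "ALERT: fleet advertises $total GPUs, expected at least $1"
    return 1
  fi
  echo "OK: fleet advertises $total GPUs"
}

# Example (16 is a placeholder fleet size):
#   kubectl get nodes --no-headers \
#     -o custom-columns='GPU:.status.allocatable.nvidia\.com/gpu' \
#     | check_fleet_gpus 16
```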

Drill in this lab

Simulation: delete the nvidia.com/gpu allocatable from one fake node and watch scheduling behaviour change; practice the triage narrative.

Real mode: after Phase 2 setup, stop the device plugin daemonset and walk this runbook end to end, capturing output for the validation report.
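For capturing output during the drill, a small wrapper that records each command next to its output keeps the report readable. A sketch only; capture is a hypothetical helper, and REPORT (defaulting to validation-report.txt in the current directory) is an assumed variable name, not something this lab defines.

```shell
# Runs a command, showing its output and appending both the command line
# and the output (stdout and stderr) to the report file.
capture() {
  report=${REPORT:-validation-report.txt}
  {
    printf '\n$ %s\n' "$*"
    "$@" 2>&1
  } | tee -a "$report"
}

# Example:
#   capture kubectl get pods -n gpu-operator -o wide
#   capture kubectl describe node <node>
```

Because the wrapper pipes through tee, its exit status is tee's rather than the command's, so read the captured output rather than relying on the return code.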