Runbook - k3s Fails to Start After a containerd Config Change (and setting the nvidia default runtime)
Severity: High - k3s (the whole node's control plane + kubelet) is down, or GPU pods
run under runc and never see the GPU.
Applies to: k3s nodes where you need nvidia as the default containerd runtime
(e.g. for HAMi, whose pods don't set runtimeClassName).
Symptom
One of:
- After writing …/containerd/config.toml.tmpl and restarting, k3s won't come up.
- GPU pods schedule but the GPU isn't visible inside them (they ran under the default
  runc, not nvidia).
Root cause
k3s generates its containerd config (/var/lib/rancher/k3s/agent/etc/containerd/config.toml)
on every start. Hand-editing that file doesn't stick, and a bad config.toml.tmpl produces
invalid TOML that crashes the embedded containerd. The two traps:
- Redeclaring an existing table. {{ template "base" . }} already emits the CRI containerd
  table; appending another [plugins.…containerd] to set default_runtime_name is a duplicate
  table - invalid TOML - and k3s won't start.
- Wrong schema version. Newer k3s ships containerd config version = 3, where the runtime
  table is [plugins.'io.containerd.cri.v1.runtime'] and the template file must be
  config-v3.toml.tmpl. A v2-style snippet ([plugins."io.containerd.grpc.v1.cri".containerd])
  in a v3 config is wrong.
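The first trap can be spotted mechanically. The sketch below is a text-level check, not a TOML parser: it lists table headers that occur more than once in a rendered config. The function name find_duplicate_tables is hypothetical, and it assumes headers sit on their own line with no trailing comment and use the same quoting style each time.

```shell
#!/usr/bin/env bash
# Sketch: print every TOML table header that appears more than once in a file.
# Array-of-tables headers ([[...]]) may legitimately repeat, so they are skipped.
find_duplicate_tables() {
  local file="$1"
  # Keep lines that start with a single "[", trim surrounding whitespace,
  # then report the headers that occur at least twice.
  grep -E '^[[:space:]]*\[[^[]' "$file" \
    | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//' \
    | sort | uniq -d
}

# Example (path is whatever rendered config you want to inspect):
# find_duplicate_tables /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```

Empty output means no header was repeated verbatim; any line printed is a table that was declared twice.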
Resolution
1. Recover a crash-looping k3s
sudo rm -f /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
sudo rm -f /var/lib/rancher/k3s/agent/etc/containerd/config-v3.toml.tmpl
sudo systemctl restart k3s && sleep 5 && systemctl is-active k3s # → active
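A fixed sleep 5 can report a false failure on a slow node. As a hedged alternative, the sketch below polls until a command succeeds or a timeout elapses. wait_until is a hypothetical helper, not part of k3s or systemd; the only systemd feature assumed is systemctl is-active --quiet, which returns success once the unit is active.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or TIMEOUT_SECONDS pass.
wait_until() {
  local timeout="$1"; shift
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # gave up: COMMAND never succeeded within the timeout
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example (commented out so the sketch can be sourced safely):
# sudo systemctl restart k3s && wait_until 60 systemctl is-active --quiet k3s && echo active
```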
2. Set the default runtime the clean way (no TOML editing)
k3s has a first-class flag, --default-runtime - set it via /etc/rancher/k3s/config.yaml:
k3s server --help | grep -i default-runtime # confirm the flag exists
# set the key idempotently (replace if present, else append) - appending blindly with
# `tee -a` can create a duplicate YAML key, which is parser-dependent and may be ignored:
sudo touch /etc/rancher/k3s/config.yaml
if grep -qE '^[[:space:]]*default-runtime[[:space:]]*:' /etc/rancher/k3s/config.yaml; then
sudo sed -i 's/^[[:space:]]*default-runtime[[:space:]]*:.*/default-runtime: nvidia/' /etc/rancher/k3s/config.yaml
else
echo 'default-runtime: nvidia' | sudo tee -a /etc/rancher/k3s/config.yaml
fi
sudo systemctl restart k3s && sleep 5 && systemctl is-active k3s
(The nvidia runtime must already be defined. k3s defines it automatically when it finds the
NVIDIA Container Toolkit at startup, so install the toolkit before k3s, or restart k3s after
installing it.)
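The precondition above can be checked before flipping the default. This is a minimal sketch, assuming the runtime shows up in the generated config as a table whose header ends in runtimes.nvidia (quoted with either " or '); has_runtime is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# has_runtime NAME FILE
# Succeeds if FILE declares a containerd runtime table for NAME, e.g.
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]   (v2 style)
#   [plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'] (v3 style)
# Sub-tables such as ...runtimes."nvidia".options do not count on their own.
has_runtime() {
  local name="$1" file="$2"
  grep -Eq "^[[:space:]]*\[plugins\..*\.runtimes\.[\"']?${name}[\"']?\]" "$file"
}

# Example:
# has_runtime nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
#   || echo "nvidia runtime not defined yet" >&2
```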
3. Verify
sudo grep default_runtime_name /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# expect: default_runtime_name = "nvidia"
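For scripted checks, the same verification can return an exit status instead of relying on someone reading the grep output. A sketch under the assumption that default_runtime_name is written as a quoted string on its own line; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# default_runtime_of FILE
# Prints the value of the first default_runtime_name assignment in FILE, without quotes.
default_runtime_of() {
  local file="$1"
  sed -nE 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*["'\'']([^"'\'']*)["'\''].*$/\1/p' "$file" \
    | head -n 1
}

# assert_default_runtime WANT FILE
# Succeeds only if FILE sets default_runtime_name to WANT.
assert_default_runtime() {
  local want="$1" file="$2" got
  got="$(default_runtime_of "$file")"
  if [ "$got" = "$want" ]; then
    echo "ok: default_runtime_name = \"$got\""
  else
    echo "mismatch: want \"$want\", got \"${got:-<unset>}\"" >&2
    return 1
  fi
}

# Example (sudo may be needed to read the file):
# assert_default_runtime nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```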
If your k3s build lacks --default-runtime, the supported fallback is a template that matches
your config schema - config-v3.toml.tmpl for version = 3 - and you set the key inside the
existing runtime table rather than redeclaring it. See the k3s advanced docs.
Prevention
- Never hand-edit the generated config.toml - k3s overwrites it on restart.
- Prefer the --default-runtime flag over a template; it's schema-agnostic.
- If you must template, first check the schema: head -1 …/containerd/config.toml
  (version = 2 vs 3) and use the matching config*.toml.tmpl; never redeclare an existing table.
- After any change, gate on systemctl is-active k3s and a GPU smoke pod before moving on.
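The schema check in the list above can be turned into a small helper that names the template file to use. This is a sketch, not a k3s tool: template_for is a hypothetical name, and it reads the first version = N line rather than line 1, on the assumption that a comment line may precede it in some generated configs.

```shell
#!/usr/bin/env bash
# template_for FILE
# Prints the template filename that matches the schema version of a generated config.
template_for() {
  local file="$1" version
  # Take the first "version = N" assignment, wherever it sits in the file.
  version="$(sed -nE 's/^[[:space:]]*version[[:space:]]*=[[:space:]]*([0-9]+).*$/\1/p' "$file" | head -n 1)"
  case "$version" in
    3) echo "config-v3.toml.tmpl" ;;
    2) echo "config.toml.tmpl" ;;
    *) echo "unknown schema version: '${version}'" >&2; return 1 ;;
  esac
}

# Example:
# template_for /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```

Failing on an unrecognized version is deliberate: guessing a template name for an unknown schema is how the crash loop starts.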
Drill in this lab
Lesson 6 Part B - HAMi isolation
needs nvidia as the default runtime; set-default-runtime.sh
applies step 2 idempotently and verifies step 3.