Lesson 6 setup scripts - bare GPU VM → k3s → kubeconfig¶
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder: git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/real-gpu-session/scripts. Browse this lesson on GitHub
Part of Lesson 6 - Real GPU. These automate Part 0 (host setup) and the Part A install on any bare GPU VM you get root on - e.g. Hyperstack, Lambda, or a hyperscaler GPU VM. Ubuntu/Debian assumed. They do not work on a marketplace container (Vast.ai / RunPod pods, managed notebooks): those don't let you install the container toolkit + k3s. The lab itself (Parts A–D) is driven by the Lesson 6 pages.
⚠️ Read each script before running it. They use the documented NVIDIA / k3s commands as of this writing, but package URLs, chart names, and flags drift - the scripts flag the version-sensitive bits and link the official docs. Never pipe a setup script you haven't read onto a host you're paying for.
Get the lab onto the VM¶
Clone the repo on the VM and run from the repo root:
The setup scripts live in portfolio-lab/real-gpu-session/scripts/ and are self-contained;
run them by path from the repo root (e.g.
sudo PUBLIC_IP=<vm-ip> bash portfolio-lab/real-gpu-session/scripts/host-setup.sh).
(Evidence capture is capture-evidence.sh here - self-contained, writes a tarball - so you
don't need the repo-root scripts/collect-gpu-evidence.sh.)
The flow¶
┌─ on the GPU VM (SSH in, as root) ─────────────────────────────────────────┐
│ sudo PUBLIC_IP=<vm-ip> bash host-setup.sh │
│ → driver check → NVIDIA Container Toolkit → k3s (--tls-san <vm-ip>) │
│ → labels node gpu=on. Cluster is up; API on :6443. │
└───────────────────────────────────────────────────────────────────────────┘
│ (open TCP 6443 to your laptop)
┌─ on your LAPTOP ──────────────────────────────────────────────────────────┐
│ ./fetch-kubeconfig.sh <ssh-user>@<vm-ip> │
│ → writes ./kubeconfig-gpuvm with server → https://<vm-ip>:6443 │
│ → verifies `kubectl get nodes` │
│ │
│ export KUBECONFIG=$PWD/kubeconfig-gpuvm │
│ ./install-gpu-operator.sh # GPU Operator (DCGM incl.) + CUDA smoke │
│ # or: MODE=device-plugin ./install-gpu-operator.sh (lighter on k3s) │
└───────────────────────────────────────────────────────────────────────────┘
Each script verifies itself, but you can re-check by hand. Three steps:
1. On the VM - set up the host, then confirm the node and runtime:
sudo PUBLIC_IP=<vm-ip> bash host-setup.sh
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get nodes -o wide # node Ready
kubectl get runtimeclass nvidia # the nvidia container runtime exists
2. On your laptop - fetch the kubeconfig and confirm you can reach the cluster (open TCP 6443 to the VM first):
./fetch-kubeconfig.sh <ssh-user>@<vm-ip> --key <ssh-key> # --port N too if non-22
export KUBECONFIG=$PWD/kubeconfig-gpuvm
kubectl get nodes -o wide # reachable from your laptop, node Ready
3. GPU layer + smoke test - install the operator, then confirm the GPU is allocatable and runs CUDA:
./install-gpu-operator.sh
kubectl get nodes -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'; echo # >= 1
kubectl logs cuda-smoke # nvidia-smi ran on the real GPU
After that you have a real GPU cluster you drive from your laptop. Work the Lesson 6 parts: Part A evidence is already produced by the smoke test; then Part B - HAMi, Part C - inference benchmark, and Part D - Slurm GRES. Tear the VM down the moment your evidence is captured.
The scripts¶
| Script | Runs on | Does |
|---|---|---|
host-setup.sh |
the VM (root) | NVIDIA Container Toolkit + k3s with the public IP in the API cert; labels the node gpu=on |
fetch-kubeconfig.sh |
your laptop | reads k3s.yaml over SSH, rewrites the server to the VM's public IP, verifies kubectl |
install-gpu-operator.sh |
laptop or VM | installs the GPU layer (Operator, or MODE=device-plugin) and runs a CUDA smoke-test pod |
capture-evidence.sh |
the VM | snapshots Part A evidence (host + in-pod nvidia-smi, allocatable, DCGM) into a tarball to scp back |
GPU VM requirements (any provider) and gotchas¶
- A VM you fully control (root + your own runtime), not a fixed container. A bare GPU
VM fits (Hyperstack, Lambda, hyperscaler); a marketplace container (Vast.ai/RunPod
pods, managed notebooks) does not - it can't install the toolkit + k3s. Pick an
Ubuntu image, ideally with the NVIDIA driver pre-installed (
nvidia-smialready works), to skip the slowest step. An L4 (24 GB) or RTX A6000 (48 GB) on Ubuntu 22.04 is plenty - and neither supports MIG, which is exactly HAMi's use case. - Open TCP 6443. k3s serves its API on 6443;
fetch-kubeconfig.shand your laptopkubectlneed it reachable - open it (and your SSH port) in the provider's networking/firewall settings. If you'd rather not expose the API, runinstall-gpu-operator.shon the VM instead (export KUBECONFIG=/etc/rancher/k3s/k3s.yaml). - SSH details vary. Pass a non-default port/key to
fetch-kubeconfig.shwith--port/--key. It reads the kubeconfig viasudo cat, so a sudo-capable user is enough. - The driver is the host's. These scripts never install/manage the GPU driver - that's
the provider image's. If
nvidia-smifails on the VM, fix that first (or pick a different image); nothing downstream works without it. - Cost discipline. The VM has no value once your evidence is on your laptop. Destroy it (and any separately-billed storage) in the provider's dashboard as soon as you're done.
What these prove (and don't)¶
They are setup tooling, not a validation. A green smoke test is real evidence for
Lesson 6 Part A (the runtime path) once you capture it; everything else still has to be
run and captured per its lesson. Same fake-vs-real boundary as the rest of the course:
see fake-vs-real-limitations.md.