Lesson 7 - Security for GPU/AI Infrastructure

Get the code to run this lab

The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder:

  git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/07-security

Browse this lesson on GitHub

Course home: AI Factory Operations Lab · Tracked as issue #10 on the roadmap

🚧 STATUS: PLANNED - coming in a future update. This page is the outline. The simulation-first lesson is on the roadmap; the shape below is what it will cover.

Every other lesson gets a GPU platform working. This one asks the next question a real operator has to answer: is it safe to run more than one tenant on it? GPU platforms are expensive, so they get shared, and sharing is where the security work lives.

🧭 Mode: 🟦 Simulation-first (on the same fake GPU fleet as Lessons 1-3). The controls here are control-plane controls, so most of it is provable without hardware.

What it will cover

1. Multi-tenant isolation

  • Namespace-per-tenant boundaries, and what a namespace does and does not isolate on a shared GPU node.
  • The gap between scheduling isolation (Lesson 1) and runtime isolation (HAMi, Lesson 1C/6B), seen through a security lens: a memory cap is also a blast-radius control.

2. RBAC and quotas as security controls

  • Least-privilege RBAC for who can request GPUs, edit workloads, and read telemetry.
  • ResourceQuota / LimitRange on nvidia.com/gpu as a denial-of-service guardrail, not just a capacity one.
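As a rough sketch of the two controls above, the snippet below builds a per-tenant ResourceQuota and a read-only Role as plain Python dicts and prints them as JSON, which kubectl can apply. It is not taken from the lab repository: the names tenant-a, gpu-quota and workload-viewer and the cap of 2 GPUs are hypothetical. The one Kubernetes detail it relies on is that quota on an extended resource such as nvidia.com/gpu is expressed with the requests. prefix.

```python
import json


def gpu_quota(namespace: str, max_gpus: int) -> dict:
    """ResourceQuota capping the total GPUs a namespace may request."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # "gpu-quota" is a hypothetical name chosen for this sketch.
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Extended resources are quota'd through the "requests." prefix.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


def viewer_role(namespace: str) -> dict:
    """Namespaced Role that can read workloads and logs but change nothing."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        # "workload-viewer" is a hypothetical name chosen for this sketch.
        "metadata": {"name": "workload-viewer", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # core API group
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],  # read-only, no wildcards
            }
        ],
    }


if __name__ == "__main__":
    bundle = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [gpu_quota("tenant-a", 2), viewer_role("tenant-a")],
    }
    print(json.dumps(bundle, indent=2))
```

Saved under a hypothetical name such as quota_sketch.py, the output could be piped into kubectl apply -f - on a test cluster. With the quota in place, a tenant whose pods already request 2 GPUs would have further GPU-requesting pods rejected at admission, which is what makes it a denial-of-service guardrail as well as a capacity one.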

3. Network boundaries

  • NetworkPolicy around inference endpoints (who can call the model server).
  • Locking down the metrics surface: DCGM exporter and Prometheus expose fleet detail that shouldn't be world-readable.
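One way to picture both bullets is a single helper that emits a NetworkPolicy allowing ingress to a set of pods only from one named namespace. The sketch below is illustrative, not the lesson's implementation: every namespace, label and port in it is an assumption (including 8000 for the model server and 9400 for the DCGM exporter), and a NetworkPolicy only has an effect on clusters whose network plugin enforces it.

```python
import json


def allow_from_namespace(name, namespace, pod_labels, client_namespace, port):
    """NetworkPolicy: pods matching pod_labels accept TCP traffic on `port`
    only from pods in client_namespace; other ingress to them is denied."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            "namespaceSelector": {
                                # Label Kubernetes sets on every namespace.
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": client_namespace
                                }
                            }
                        }
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# All names, labels and ports below are hypothetical.
inference_policy = allow_from_namespace(
    "model-server-ingress", "tenant-a", {"app": "model-server"}, "tenant-a-clients", 8000
)
metrics_policy = allow_from_namespace(
    "dcgm-exporter-ingress", "gpu-monitoring", {"app": "dcgm-exporter"}, "monitoring", 9400
)

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": [inference_policy, metrics_policy]}
    print(json.dumps(bundle, indent=2))
```

The same shape covers both cases because the question is identical: which namespace is allowed to reach this port. For the metrics surface, the caller would typically be the namespace where Prometheus runs.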

4. Secrets and supply chain

  • Handling model-registry and cloud credentials the workloads need.
  • Model-artifact provenance: where the weights came from, and why "pull any HF model" is a supply-chain decision.
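A minimal illustration of the provenance point, assuming the simplest possible control: pin a SHA-256 digest for each weights file and refuse to load anything that does not match. This is only one technique among several (signatures and attestations are others), and the function names and the throwaway demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large weights fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, pinned_digest: str) -> bool:
    """True only if the file's digest equals the digest pinned in advance."""
    return sha256_of(path) == pinned_digest.strip().lower()


if __name__ == "__main__":
    # Stand-in for a downloaded weights file; real artifacts are far larger.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"not real model weights")
        path = tmp.name
    try:
        pinned = hashlib.sha256(b"not real model weights").hexdigest()
        print("matches pin:", verify_artifact(path, pinned))
        print("matches wrong pin:", verify_artifact(path, "0" * 64))
    finally:
        os.remove(path)
```

A pinned digest proves the bytes are the ones that were reviewed, not that they are trustworthy; deciding which source to pin in the first place is the supply-chain decision.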

5. The break-it drill

  • In the spirit of the observability lesson: misconfigure one control on purpose (an over-broad RBAC role, a missing NetworkPolicy) and show the exposure, then close it.
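The drill can be rehearsed offline before touching a cluster. The sketch below, written under stated assumptions, checks manifests held as plain dicts: it flags RBAC rules that use a wildcard and reports whether any NetworkPolicy selects a given pod. The manifests, names and labels are hypothetical, and the selector check handles matchLabels only (not matchExpressions) and assumes the policies come from the pod's own namespace.

```python
def wildcard_rules(role: dict) -> list:
    """Rules in a Role/ClusterRole manifest that use '*' anywhere."""
    flagged = []
    for rule in role.get("rules", []):
        fields = ("apiGroups", "resources", "verbs")
        if any("*" in rule.get(field, []) for field in fields):
            flagged.append(rule)
    return flagged


def selects(policy: dict, pod_labels: dict) -> bool:
    """True if the policy's podSelector (matchLabels only) matches the pod.
    An empty selector matches every pod in the namespace."""
    wanted = policy["spec"].get("podSelector", {}).get("matchLabels", {})
    return all(pod_labels.get(key) == value for key, value in wanted.items())


def is_unprotected(pod_labels: dict, policies: list) -> bool:
    """True if no NetworkPolicy selects the pod, so all ingress is allowed."""
    return not any(selects(policy, pod_labels) for policy in policies)


# Step 1: break it on purpose (hypothetical manifests).
BROKEN_ROLE = {"kind": "Role", "rules": [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]}
POD_LABELS = {"app": "model-server"}
NO_POLICIES = []

# Step 2: close it.
FIXED_ROLE = {
    "kind": "Role",
    "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}],
}
POLICIES = [{"kind": "NetworkPolicy", "spec": {"podSelector": {"matchLabels": {"app": "model-server"}}}}]

if __name__ == "__main__":
    print("broken role, wildcard rules:", len(wildcard_rules(BROKEN_ROLE)))
    print("broken pod unprotected:", is_unprotected(POD_LABELS, NO_POLICIES))
    print("fixed role, wildcard rules:", len(wildcard_rules(FIXED_ROLE)))
    print("fixed pod unprotected:", is_unprotected(POD_LABELS, POLICIES))
```

A static check like this only shows that a manifest looks wrong. On a live cluster the exposure itself can be demonstrated with kubectl auth can-i --list run as the over-privileged subject, before and after the fix.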

What it will prove (and not)

Provable for free: the control-plane security posture - RBAC, quotas, network policy, namespace boundaries, and the drills that verify them. It will not prove hardware-level tenant isolation (side channels, firmware, confidential computing). That is out of scope for this course's single-node, entry-GPU focus, and the lesson will say so explicitly.

💡 Want to help shape or build this? Comment on issue #10.