Lesson 7 - Security for GPU/AI Infrastructure¶
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder: git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/07-security. Browse this lesson on GitHub
Course home: AI Factory Operations Lab ยท Tracked as issue #10 on the roadmap
๐ง STATUS: PLANNED - coming in a future update. This page is the outline. The simulation-first lesson is on the roadmap; the shape below is what it will cover.
Every other lesson gets a GPU platform working. This one asks the next question a real operator has to answer: is it safe to run more than one tenant on it? GPU platforms are expensive, so they get shared, and sharing is where the security work lives.
๐งญ Mode: ๐ฆ Simulation-first (on the same fake GPU fleet as Lessons 1-3). The controls here are control-plane controls, so most of it is provable without hardware.
What it will cover¶
1. Multi-tenant isolation¶
- Namespace-per-tenant boundaries, and what a namespace does and does not isolate on a shared GPU node.
- The gap between scheduling isolation (Lesson 1) and runtime isolation (HAMi, Lesson 1C/6B), seen through a security lens: a memory cap is also a blast-radius control.
2. RBAC and quotas as security controls¶
- Least-privilege RBAC for who can request GPUs, edit workloads, and read telemetry.
- ResourceQuota / LimitRange on
nvidia.com/gpuas a denial-of-service guardrail, not just a capacity one.
3. Network boundaries¶
- NetworkPolicy around inference endpoints (who can call the model server).
- Locking down the metrics surface: DCGM exporter and Prometheus expose fleet detail that shouldn't be world-readable.
4. Secrets and supply chain¶
- Handling model-registry and cloud credentials the workloads need.
- Model-artifact provenance: where the weights came from, and why "pull any HF model" is a supply-chain decision.
5. The break-it drill¶
- In the spirit of the observability lesson: misconfigure one control on purpose (an over-broad RBAC role, a missing NetworkPolicy) and show the exposure, then close it.
What it will prove (and not)¶
Provable for free: the control-plane security posture - RBAC, quotas, network policy, namespace boundaries, and the drills that verify them. It will not prove hardware-level tenant isolation (side channels, firmware, confidential computing), which is out of scope for this course's single-node, entry-GPU focus and is called out explicitly.
๐ก Want to help shape or build this? Comment on issue #10.