Lesson 5 - BCM-Style Cluster Lifecycle (Conceptual)
Get the code to run this lab
The commands on this page come from the repository, not the website. Clone it and enter this lesson's folder: `git clone https://github.com/ld-singh/ai-factory-ops-lab && cd ai-factory-ops-lab/portfolio-lab/05-bcm-style-cluster-lifecycle`.
🟡 STATUS: RUNNABLE DRILL (Phase 6), conceptual mapping.
SCOPE NOTE: this lesson does not use NVIDIA Base Command Manager and invents no BCM commands. Instead it ships a runnable drill that implements the generic node lifecycle BCM automates - provision → health-gate → in-service → patch → retire - as real Kubernetes state transitions on the KWOK fake fleet, then maps each stage back to its BCM concept. If a real BCM evaluation is ever performed, its evidence (and only then) upgrades the BCM-specific claims from conceptual to validated. BCM reference: https://docs.nvidia.com/base-command-manager/
The final lesson zooms out from "schedule and observe workloads" to "operate the cluster itself over its lifetime" - the layer a tool like NVIDIA Base Command Manager (BCM) manages. Everything in Lessons 1-4 assumed that nodes exist, run an OS, have drivers, and have joined a scheduler. This lesson is about the machinery that makes that true for hundreds of nodes at once, repeatably.
🎯 Learning objectives - this lesson teaches you to reason about, and map to tools you know:
- Head node / compute node architecture, and why the head node is the cluster's single source of truth.
- Software images and node categories - image-based fleet management vs per-node configuration drift.
- Provisioning, health checks, and workload-manager integration.
- Patching lifecycle and user/role management.
🧭 Mode: 🟨 Conceptual - the BCM concepts are documented and mapped to production equivalents, not run against BCM itself; the drill on this page exercises only the generic lifecycle on the KWOK fake fleet.
The concept map (the lesson's core artifact)
Each BCM-domain concept, mapped to the generic mechanism it implements and to where this course (or common production practice) already touches it:
| Lifecycle concern | The generic mechanism | Where you've seen the idea |
|---|---|---|
| Head node | A management plane holding cluster state and serving provisioning | The kind control-plane node in Lesson 1; slurmctld in Lesson 2 |
| Software image | A golden OS image nodes boot from; change the image, not the node | Cloud machine images / immutable infrastructure |
| Node category | A group of nodes sharing one image + config (e.g. "gpu-compute") | Lesson 1's node pools (gpu-pool=a100) - same idea, lower in the stack |
| Provisioning | Network boot → image → node-specific finalization | Cloud-init / PXE pipelines |
| Health checks | Scripted checks gating whether a node accepts work | K8s readiness + the drain drills of Lessons 1/2 |
| WLM integration | The lifecycle layer installs/configures Slurm or K8s on nodes | Lesson 2's config files - here, generated rather than handwritten |
| Patching lifecycle | Update the image, roll node-by-node, drain → reimage → resume | Rolling updates; cordon/drain (Lesson 1), scontrol drain (Lesson 2) |
| Users & roles | Central identity + per-team access to partitions/queues | Lesson 1B's queues and quotas, one layer down |
💡 The transferable insight: cluster managers are fleet-level immutability engines. The unit of change is the image plus its category, never the individual node - the same shift containers made for applications, applied to the OS/driver layer. If you can defend that sentence and walk this table, you understand what BCM is for without pretending hands-on experience.
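To make "the unit of change is the image plus its category" concrete, here is a minimal Python sketch. It is a toy model, not BCM and not part of the drill: the `Category` and `Node` classes, the `drifted` helper, and the node names are all hypothetical. It shows a single category-level image change marking every node as stale, and a node-by-node roll bringing the fleet back in line.

```python
from dataclasses import dataclass


@dataclass
class Category:
    """Hypothetical node category: a name plus the image its nodes should boot."""
    name: str
    image_version: str


@dataclass
class Node:
    """Hypothetical node: it records only which image it last booted."""
    name: str
    category: Category
    running_image: str = ""

    def reimage(self) -> None:
        # A node never chooses its own image; it boots what its category declares.
        self.running_image = self.category.image_version


def drifted(nodes):
    """Return names of nodes whose booted image differs from the category's image."""
    return [n.name for n in nodes if n.running_image != n.category.image_version]


gpu = Category("gpu-compute", "v1")
fleet = [Node(f"gpu-{i}", gpu) for i in range(3)]
for node in fleet:
    node.reimage()

# One change, made at the category level, not on any individual node.
gpu.image_version = "v2"

# Roll node by node and record how many nodes are still behind after each step.
remaining = [len(drifted(fleet))]
for node in fleet:
    node.reimage()
    remaining.append(len(drifted(fleet)))

print(remaining)  # prints [3, 2, 1, 0]
```

The design point is that no code path edits a node directly: drift is always measured against the category, so fixing it means reimaging, never hand-patching.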
Why conceptual is still useful
The course's whole discipline is not overclaiming. Rather than fake BCM output, this lesson connects BCM's lifecycle model to image pipelines, node pools, and lifecycle hooks you have operated - which is a transferable, defensible understanding without pretending to hands-on BCM experience. Mapping BCM's concepts onto systems you've actually run is a stronger position than reciting command names.
The drill (run this)
Needs the Phase 1 kind cluster up (`make phase1-up`). No GPU required. The drill steps through the full lifecycle interactively (press Enter between stages). It will:
- Provision a new KWOK node at `image-version=v1`, `lifecycle=provisioning`, health-gated with a `NoSchedule` taint so nothing lands prematurely.
- Health-gate it: run a scripted check (does it advertise GPUs?), and on pass flip `lifecycle=in-service` and remove the gate taint.
- In-service: a workload pod (selector `lifecycle=in-service`) schedules onto it.
- Patch: cordon + drain, recreate the node at `image-version=v2`, re-gate, re-check, reopen - the workload is evicted then reschedules (a rolling reimage).
- Retire: drain and delete the node from the cluster.
✅ Checkpoint: you watch a node move provisioning → in-service → (v1→v2) → retired as real kubectl state, with a workload correctly evicted and rescheduled across the patch. Each stage maps to a BCM concept in the table above.
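The stages above can also be read as a small state machine. The sketch below is a pure-Python model of that logic, not the drill's script, and it makes no kubectl or BCM calls. The `lifecycle` and image-version values mirror the drill, but the taint key `health-gate`, the node name `kwok-gpu-0`, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass, field

GATE_TAINT = "health-gate"  # hypothetical taint key; the drill's real key may differ


@dataclass
class Node:
    name: str
    image_version: str
    gpus: int
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)
    cordoned: bool = False


def provision(name, image_version, gpus):
    """New nodes start gated: labelled provisioning and carrying the gate taint."""
    return Node(name, image_version, gpus,
                labels={"lifecycle": "provisioning"}, taints={GATE_TAINT})


def health_gate(node):
    """Scripted check: open the gate only if the node advertises GPUs."""
    if node.gpus > 0:
        node.labels["lifecycle"] = "in-service"
        node.taints.discard(GATE_TAINT)
        return True
    return False


def schedulable(node):
    """Toy scheduler rule: selector matches, no taints, not cordoned."""
    return (node.labels.get("lifecycle") == "in-service"
            and not node.taints and not node.cordoned)


def patch(node, new_image):
    """Drain before reimage: cordon the old node, then recreate and re-gate it."""
    node.cordoned = True  # stands in for cordon + drain evicting the workload
    fresh = provision(node.name, new_image, node.gpus)
    health_gate(fresh)
    return fresh


history = []
node = provision("kwok-gpu-0", "v1", gpus=8)
history.append(schedulable(node))  # still gated
health_gate(node)
history.append(schedulable(node))  # in service at v1
node = patch(node, "v2")
history.append(schedulable(node))  # back in service at v2

print(history, node.image_version)  # prints [False, True, True] v2
```

A node that fails the check (for example one provisioned with `gpus=0`) keeps its taint and its `provisioning` label, so the toy scheduler never places work on it.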
💡 Why this is defensible: it never claims to be BCM. It demonstrates the mechanism BCM automates (image-based, health-gated, drain-before-reimage lifecycle) using tools you can actually run, which is a defensible "I understand what BCM does" - stronger than reciting commands you've never executed.
See also: bcm-style-cluster-lifecycle.md.
➡️ Next: Lesson 6 - Real GPU - the optional one-rental capstone that runs every real-hardware piece (GPU runtime path, HAMi sharing, Slurm GRES, inference benchmark) back to back. Then close the loop in your lab notebook, making sure every lesson you ran has captured evidence.