AI Factory Operations Lab

ld-singh/ai-factory-ops-lab

Slurm GPU Job Lifecycle

STATUS: PLACEHOLDER - lands with Phase 3 (Module 02).

Built by Lovedeep Singh. Third-party tools belong to their respective projects.
