AI Factory Operations Lab
ld-singh/ai-factory-ops-lab
Slurm GPU Job Lifecycle
STATUS: PLACEHOLDER - lands with Phase 3 (Module 02).