Inference Benchmark Report
✅ STATUS: VALIDATED - captured on a real NVIDIA RTX A6000 (48 GB), 2026-06-27. Served with vLLM and driven by the Lesson 4 harness (make phase5-bench and drills). Benchmark numbers are only meaningful when they come from a real GPU run; these do.
Environment
| GPU | NVIDIA RTX A6000 (48 GB) |
| Model served | Qwen/Qwen2.5-7B-Instruct via vLLM (served name local) |
| vLLM | vllm/vllm-openai:latest (v0.23.0) on k3s, runtimeClassName: nvidia, nvidia.com/gpu: 1 |
| Harness | portfolio-lab/04-inference-serving/harness/loadgen.py (stdlib) |
| Date | 2026-06-27 |
Baseline sweep (make phase5-bench)
conc reqs err gen ttft_p50 ttft_p95 tpot_p50 e2e_p95 e2e_p99 tok/s goodput%
1 12 0 53 0.047 1.121 0.0219 2.261 2.640 42.3 100.0
2 12 0 51 0.067 0.068 0.0220 1.378 1.423 84.0 100.0
4 12 0 51 0.067 0.068 0.0221 1.316 1.335 153.1 100.0
8 12 0 51 0.171 0.177 0.0223 1.357 1.381 240.4 100.0
tok/s scales cleanly (42 → 240) with goodput pinned at 100% - plenty of headroom at this load.
(The conc-1 ttft_p95 value is first-request warmup.)
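The scaling claim can be sanity-checked by comparing each level's tok/s with perfect linear scaling from the conc-1 row. The sketch below is illustrative only: the function name scaling_efficiency is hypothetical (it is not part of loadgen.py), and the numbers are copied from the table above.

```python
# (concurrency, tok/s) pairs copied from the baseline sweep table.
BASELINE = [(1, 42.3), (2, 84.0), (4, 153.1), (8, 240.4)]


def scaling_efficiency(rows):
    """Map each concurrency to measured tok/s divided by ideal linear tok/s.

    "Ideal" extrapolates the first row's per-stream throughput linearly,
    so the first row always scores 1.0.
    """
    base_conc, base_tps = rows[0]
    per_stream = base_tps / base_conc
    return {conc: tps / (per_stream * conc) for conc, tps in rows}


if __name__ == "__main__":
    for conc, eff in scaling_efficiency(BASELINE).items():
        print(f"conc={conc:>2}  efficiency={eff:.2f}")
```

On these rows the ratio stays near 1.0 at conc 2, is about 0.90 at conc 4, and about 0.71 at conc 8. With only 12 requests per level, treat these ratios as indicative rather than precise.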
Saturation - the knee (CONCURRENCY=64,128,256,512 REQUESTS_PER_LEVEL=512)
conc reqs err gen ttft_p50 ttft_p95 tpot_p50 e2e_p95 e2e_p99 tok/s goodput%
64 512 0 51 0.087 0.163 0.0275 1.678 1.792 2127.6 100.0
128 512 0 52 0.126 0.325 0.0360 2.291 2.421 3138.1 100.0
256 512 0 52 0.299 0.517 0.0531 3.535 3.784 4039.9 100.0
512 512 0 52 1.043 3.564 0.0564 5.828 5.946 3896.1 50.0
Operating point: conc 256 (~4040 tok/s at 100% goodput). Throughput peaks there and falls
at conc 512 (3896 tok/s) while ttft_p95 spikes 0.52s → 3.56s and goodput cliffs to 50%. Past
the knee, more load buys no throughput and halves goodput - the classic throughput-vs-latency
trade-off, on real hardware.
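The operating-point choice above can be expressed as a small selection rule: among the levels that still meet the goodput floor, take the one with the highest tok/s. This is a minimal sketch under that assumption. Level and pick_operating_point are hypothetical names, not part of the harness, and the rows are copied from the saturation table.

```python
from typing import NamedTuple


class Level(NamedTuple):
    conc: int
    tok_s: float
    goodput_pct: float


# Rows copied from the saturation sweep table.
SATURATION = [
    Level(64, 2127.6, 100.0),
    Level(128, 3138.1, 100.0),
    Level(256, 4039.9, 100.0),
    Level(512, 3896.1, 50.0),
]


def pick_operating_point(levels, min_goodput_pct=100.0):
    """Return the highest-throughput level whose goodput meets the floor."""
    eligible = [lv for lv in levels if lv.goodput_pct >= min_goodput_pct]
    if not eligible:
        raise ValueError("no level meets the goodput floor")
    return max(eligible, key=lambda lv: lv.tok_s)


if __name__ == "__main__":
    best = pick_operating_point(SATURATION)
    print(f"operating point: conc {best.conc} at {best.tok_s} tok/s")
```

On this sweep conc 256 would win on raw tok/s alone, so the goodput floor does not change the answer. It matters when an overloaded level still posts the highest throughput while missing the SLO.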
Supporting drills
- Continuous batching (make phase5-batching): a short request's ttft_p95 rose only 0.053s → 0.069s (1.3x) with eight long requests in flight - versus ~50x for the same drill on the CPU tier. That ratio is the measured value of continuous batching (vLLM admits new requests into the running batch instead of queueing them).
- Prefill (make phase5-prefill): ttft_p95 rose 0.049s → 0.150s as input grew 16 → 1024 tokens; tpot stayed ~0.022s. Input length is a TTFT (prefill) cost.
- Decode (make phase5-decode): e2e grew with generated tokens while tpot (~0.022s) and ttft stayed flat - e2e ≈ ttft + tpot × gen. Output length is a duration cost.
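The decode relation e2e ≈ ttft + tpot × gen can be turned into a quick estimator. The sketch below is an approximation under that stated model; estimate_e2e is a hypothetical helper, not something the harness provides, and the inputs are the conc-4 medians from the baseline sweep.

```python
def estimate_e2e(ttft_s: float, tpot_s: float, gen_tokens: int) -> float:
    """Approximate end-to-end latency in seconds.

    Model: one prefill wait (ttft) plus one decode step (tpot) per
    generated token, as in the relation e2e ≈ ttft + tpot × gen.
    """
    return ttft_s + tpot_s * gen_tokens


if __name__ == "__main__":
    # Baseline conc-4 row: ttft_p50 = 0.067 s, tpot_p50 = 0.0221 s, gen = 51.
    print(f"estimated e2e: {estimate_e2e(0.067, 0.0221, 51):.3f} s")
```

Because the inputs are medians, the estimate (about 1.19 s) should sit below the reported e2e_p95 of 1.316 s for that row. The tables do not report a median e2e, so this is a plausibility check and not an exact validation.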
What this proves - and does not
Proves real single-GPU serving for this card+model: throughput scaling, the saturation knee and
operating point, continuous batching protecting TTFT, and the prefill/decode split - all in real
numbers. It does not prove multi-replica routing, multi-node scale, NVLink/topology effects,
or sharing-performance under sustained load.
Ledger: fake-vs-real-limitations.md.