Inference Benchmark Report

✅ STATUS: VALIDATED - captured on a real NVIDIA RTX A6000 (48 GB), 2026-06-27. Served with vLLM and driven by the Lesson 4 harness (make phase5-bench and the drill targets). Benchmark numbers are only meaningful when they come from a real GPU run; every number below does.

Environment

GPU           NVIDIA RTX A6000 (48 GB)
Model served  Qwen/Qwen2.5-7B-Instruct via vLLM (served name: local)
vLLM          vllm/vllm-openai:latest (v0.23.0) on k3s, runtimeClassName: nvidia, nvidia.com/gpu: 1
Harness       portfolio-lab/04-inference-serving/harness/loadgen.py (stdlib)
Date          2026-06-27

Baseline sweep (make phase5-bench)

 conc  reqs  err  gen  ttft_p50  ttft_p95  tpot_p50  e2e_p95  e2e_p99   tok/s  goodput%
    1    12    0   53     0.047     1.121    0.0219    2.261    2.640    42.3    100.0
    2    12    0   51     0.067     0.068    0.0220    1.378    1.423    84.0    100.0
    4    12    0   51     0.067     0.068    0.0221    1.316    1.335   153.1    100.0
    8    12    0   51     0.171     0.177    0.0223    1.357    1.381   240.4    100.0

tok/s scales cleanly (42 → 240, about 5.7x from conc 1 to conc 8) with goodput pinned at 100% - plenty of headroom at this load. Latency columns (ttft, tpot, e2e) are in seconds. (The conc-1 ttft_p95 of 1.121s is first-request warmup.)
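
To put a number on "scales cleanly", the sketch below divides the tok/s speedup by the concurrency ratio for each baseline row. The `scaling_efficiency` helper and the `BASELINE` name are illustrative and not part of the harness; the values are copied from the baseline sweep table.

```python
# (concurrency, tok/s) pairs copied from the baseline sweep table.
BASELINE = [(1, 42.3), (2, 84.0), (4, 153.1), (8, 240.4)]


def scaling_efficiency(rows):
    """Map each concurrency to speedup divided by the concurrency ratio.

    1.0 means perfectly linear scaling relative to the first row.
    This helper is hypothetical; loadgen.py may report nothing like it.
    """
    base_conc, base_tps = rows[0]
    return {
        conc: (tps / base_tps) / (conc / base_conc)
        for conc, tps in rows
    }


for conc, eff in scaling_efficiency(BASELINE).items():
    print(f"conc={conc:<2} efficiency={eff:.2f}")
```

Efficiency stays near 1.0 through conc 2 and drifts to roughly 0.7 at conc 8. With only 12 requests per level, part of that drop may simply be the final partial wave of requests rather than GPU contention, so it is best read as a lower bound.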

Saturation - the knee (CONCURRENCY=64,128,256,512 REQUESTS_PER_LEVEL=512)

 conc  reqs  err  gen  ttft_p50  ttft_p95  tpot_p50  e2e_p95  e2e_p99    tok/s  goodput%
   64   512    0   51     0.087     0.163    0.0275    1.678    1.792   2127.6    100.0
  128   512    0   52     0.126     0.325    0.0360    2.291    2.421   3138.1    100.0
  256   512    0   52     0.299     0.517    0.0531    3.535    3.784   4039.9    100.0
  512   512    0   52     1.043     3.564    0.0564    5.828    5.946   3896.1     50.0

Operating point: conc 256 (~4040 tok/s at 100% goodput). Throughput peaks there and falls at conc 512 (3896 tok/s) while ttft_p95 spikes 0.52s → 3.56s and goodput falls off a cliff to 50%. Past the knee, more load buys no throughput and halves the share of requests that meet the SLO - the classic throughput-vs-latency trade-off, on real hardware.
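
The operating-point choice can be reproduced from the table with a simple rule: among levels that hold the goodput floor, take the one with the highest tok/s. This is a minimal sketch under that assumed rule; `Level`, `operating_point`, and `marginal_gain` are hypothetical names, not harness functions, and the rows are copied from the saturation table.

```python
from typing import NamedTuple


class Level(NamedTuple):
    conc: int
    ttft_p95: float  # seconds
    tok_s: float
    goodput: float   # percent


# Rows copied from the saturation sweep table.
SATURATION = [
    Level(64, 0.163, 2127.6, 100.0),
    Level(128, 0.325, 3138.1, 100.0),
    Level(256, 0.517, 4039.9, 100.0),
    Level(512, 3.564, 3896.1, 50.0),
]


def operating_point(levels, min_goodput=100.0):
    """Highest-throughput level whose goodput meets the floor (assumed rule)."""
    eligible = [lv for lv in levels if lv.goodput >= min_goodput]
    if not eligible:
        raise ValueError("no level meets the goodput floor")
    return max(eligible, key=lambda lv: lv.tok_s)


def marginal_gain(levels):
    """Fractional tok/s change between consecutive levels."""
    return [
        (b.conc, b.tok_s / a.tok_s - 1.0)
        for a, b in zip(levels, levels[1:])
    ]


best = operating_point(SATURATION)
print(f"operating point: conc {best.conc} at {best.tok_s:.0f} tok/s")
for conc, gain in marginal_gain(SATURATION):
    print(f"-> conc {conc}: {gain:+.1%} tok/s")
```

The marginal gains shrink from about +47% (64 → 128) to about +29% (128 → 256) and turn negative at 512, which is the knee expressed as a single series.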

Supporting drills

  • Continuous batching (make phase5-batching): a short request's ttft_p95 rose only 0.053s → 0.069s (1.3x) with eight long requests in flight - versus ~50x for the same drill on the CPU tier. That ratio is the measured value of continuous batching (vLLM admits new requests into the running batch instead of queueing them).
  • Prefill (make phase5-prefill): ttft_p95 rose 0.049s → 0.150s as input grew 16 → 1024 tokens; tpot stayed ~0.022s. Input length is a TTFT (prefill) cost.
  • Decode (make phase5-decode): e2e grew with generated tokens while tpot (~0.022s) and ttft stayed flat - e2e ≈ ttft + tpot × gen. Output length is a duration cost.
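
The relation e2e ≈ ttft + tpot × gen from the decode drill can be worked through with the drill numbers. This is an illustrative sketch, not harness code: `estimate_e2e` is a hypothetical helper, the ttft inputs are the p95 values from the prefill drill used as stand-ins, and the 102-token case is an assumed doubling of the ~51 generated tokens seen in the sweeps.

```python
def estimate_e2e(ttft_s, tpot_s, gen_tokens):
    """Approximate end-to-end latency in seconds: prefill plus decode."""
    return ttft_s + tpot_s * gen_tokens


TPOT = 0.022  # seconds per output token, roughly flat across the drills

short_prompt = estimate_e2e(0.049, TPOT, 51)    # 16-token input
long_prompt = estimate_e2e(0.150, TPOT, 51)     # 1024-token input
double_output = estimate_e2e(0.049, TPOT, 102)  # assumed 2x output length

print(f"short prompt, 51 tokens out:  {short_prompt:.3f}s")
print(f"long prompt, 51 tokens out:   {long_prompt:.3f}s")
print(f"short prompt, 102 tokens out: {double_output:.3f}s")
```

Growing the input from 16 to 1024 tokens adds about 0.1s once, while doubling the output adds about 1.1s, which is why input length shows up in TTFT and output length in total duration.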

What this proves - and does not

Proves real single-GPU serving for this card+model: throughput scaling, the saturation knee and operating point, continuous batching protecting TTFT, and the prefill/decode split - all in real numbers. It does not prove multi-replica routing, multi-node scale, NVLink/topology effects, or sharing-performance under sustained load. Ledger: fake-vs-real-limitations.md.