Why Is Inference So Slow?

A 70-billion-parameter model can chew through thousands of tokens per second during training, spread across a cluster of GPUs processing entire sequences in parallel. But at inference time, that same model generates roughly 30 to 50 tokens per second per request. That's a dramatic slowdown, and the reason isn't what most people assume. It isn't that the GPU is too weak or the model is too large in some abstract sense. The bottleneck is far more specific, and understanding it is the key to every optimisation technique in this track.

During training, we already know the full target sequence (the ground-truth text the model is learning from). A technique called teacher forcing lets us feed the entire sequence into the model at once: all positions are computed in a single forward pass, and the loss for every token is computed in parallel. The model reads, say, 2,048 tokens and produces 2,048 predictions simultaneously. The GPU's thousands of cores stay busy.

Inference is fundamentally different. Language models are autoregressive: each token depends on all the tokens that came before it. We generate token 1, append it to the context, generate token 2, append it, and so on. There is no way around this sequential dependency. And here's the critical insight: during this sequential generation, the GPU is mostly idle. The bottleneck isn't compute — it's memory bandwidth. Each token generation step requires reading the model's entire weight set from GPU memory, but only performs a tiny amount of computation with those weights. The GPU spends most of its time waiting for data to arrive from memory, not doing math.

💡 Think of it like a factory with 10,000 workers (the GPU cores) but a single narrow door to the warehouse (memory bandwidth). During training, each worker gets a large crate of parts to assemble (many tokens per weight load). During decode, each worker gets a single screw (one token per weight load) — they finish instantly and then stand idle waiting for the next delivery through that narrow door.

This track is about understanding and fixing that bottleneck. We'll start here by quantifying exactly why decode is memory-bound, then work through the key techniques: reusing computation with KV caching (article 2), compressing model weights via quantisation (article 3), serving multiple requests with continuous batching (article 4), speculating ahead with draft models (article 5), and making attention itself faster (article 6).

Prefill vs Decode: Two Very Different Phases

When a user sends a prompt to a language model, inference actually happens in two distinct phases that behave completely differently from the GPU's perspective. Understanding this split is essential, because the optimisation strategy for each phase is different.

The prefill phase processes the entire user prompt at once. All prompt tokens are known ahead of time, so we can compute attention for all positions in a single large batched matrix multiply. If the prompt has 1,000 tokens, the GPU processes all 1,000 in one pass, filling every core with useful work. This phase is compute-bound — the GPU's arithmetic units are the bottleneck, not memory bandwidth. The GPU is busy, and that's the regime it was designed for.

The decode phase generates tokens one at a time. Each step produces exactly one new token, which means we read the model's entire weight set from high-bandwidth memory (HBM) just to perform a single token's worth of matrix-vector multiplications. This phase is memory-bandwidth-bound — the GPU finishes its tiny amount of arithmetic almost instantly and then waits for the next batch of weights to arrive from memory.

We can quantify this difference using arithmetic intensity — the ratio of computation to memory traffic:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Transferred}}$$

This metric tells us whether a workload is compute-bound (high intensity — lots of FLOPs per byte loaded) or memory-bound (low intensity — few FLOPs per byte loaded). For a deeper treatment of how this connects to hardware limits, see the roofline model in the GPU track.

During prefill with $n$ prompt tokens, every weight matrix that we load from memory gets multiplied by an $n$-row input matrix. We perform $O(n)$ FLOPs for each byte of weights we transfer, so arithmetic intensity scales linearly with $n$. With a 1,000-token prompt, we do 1,000 times more useful work per byte loaded than with a single token. That pushes us firmly into the compute-bound regime where the GPU's TFLOPS are the limiting factor.

During decode, $n = 1$. We load every weight matrix to multiply it by a single vector. We do $O(1)$ FLOPs per byte transferred — the arithmetic intensity is as low as it can possibly get. The GPU's massive parallel compute sits idle while memory fetches dominate wall-clock time.
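A quick sanity check of this scaling (a sketch that counts only the weight-matrix multiply, ignoring activations and the KV cache): for an $n$-token input against a $d \times d$ FP16 weight matrix, the $d^2$ factors cancel and intensity depends only on $n$.

```python
def arithmetic_intensity(n_tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte for multiplying an (n_tokens x d) input by a (d x d) weight.

    FLOPs = 2 * n_tokens * d * d   (one multiply-accumulate per weight entry)
    Bytes = bytes_per_param * d * d (the weight matrix, loaded from HBM once)
    The d*d factor cancels, so intensity depends only on n_tokens.
    """
    return (2 * n_tokens) / bytes_per_param

print(f"prefill, 1000-token prompt: {arithmetic_intensity(1000):.0f} FLOPs/byte")
print(f"decode, 1 token:            {arithmetic_intensity(1):.0f} FLOPs/byte")
```

The 1,000-to-1 ratio between the two regimes is exactly the factor the prose describes.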

Let's make this concrete. Consider a 7B-parameter model stored in FP16 (2 bytes per parameter = 14 GB of weights). Generating a single token requires reading all 14 GB from HBM to the GPU's compute units. On an NVIDIA A100, HBM bandwidth is roughly 2 TB/s. So the minimum time per token is:

$$t_{\text{token}} = \frac{14 \times 10^9 \text{ bytes}}{2 \times 10^{12} \text{ bytes/s}} = 7 \text{ ms}$$

That's about 140 tokens per second — the theoretical maximum for batch size 1. But the A100 has 312 TFLOPS of FP16 compute. How much of that are we actually using? Generating one token takes roughly 2 FLOPs per parameter — about 14 billion FLOPs across the whole model — while the GPU could perform 312 trillion per second. We're using well under 1% of the available compute. The GPU is starved for data.
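Plugging in the A100 numbers above makes the utilisation gap explicit (a rough estimate assuming ~2 FLOPs per parameter per token; real kernels differ in detail):

```python
params = 7e9                    # 7B-parameter model
bytes_fp16 = 2 * params         # 14 GB of weights at 2 bytes/param
bandwidth = 2e12                # A100 HBM bandwidth, bytes/s
peak_flops = 312e12             # A100 FP16 tensor-core peak

t_token = bytes_fp16 / bandwidth      # memory-limited time per token
flops_per_token = 2 * params          # ~2 FLOPs per parameter per token
achieved = flops_per_token / t_token  # FLOP/s actually sustained

print(f"time per token:   {t_token * 1e3:.1f} ms")
print(f"achieved compute: {achieved / 1e12:.1f} TFLOP/s")
print(f"utilisation:      {achieved / peak_flops:.1%}")
```

Around 2 TFLOP/s sustained against a 312 TFLOP/s peak — under 1% utilisation.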

💡 This is exactly why batching helps so much during decode. If we process 32 requests simultaneously, we load the weights from memory once but perform 32 matrix-vector multiplies (effectively a matrix-matrix multiply). That multiplies the arithmetic intensity by 32, pushing us back toward the compute-bound regime. We get 32 times the throughput at roughly the same latency per token — one of the most important optimisations in inference serving, which we'll cover in article 4.
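A toy roofline sketch shows the shift: each decode step pays the weight-load time once, while compute time grows with batch size, so throughput scales with batch until the compute bound takes over. (Same assumed A100 numbers as above; this ignores KV-cache traffic, which grows with batch size in real systems.)

```python
def decode_throughput(batch, params=7e9, bw=2e12, peak=312e12, bpp=2):
    """Tokens/s for one decode step, taking the slower of the two bounds."""
    t_mem = bpp * params / bw              # load all weights once per step
    t_compute = batch * 2 * params / peak  # ~2*P FLOPs per sequence in the batch
    return batch / max(t_mem, t_compute)

for batch in [1, 8, 32, 128, 512]:
    print(f"batch {batch:4d}: {decode_throughput(batch):8.0f} tok/s")
```

With these numbers the crossover sits somewhere above batch 128: below it, doubling the batch doubles throughput almost for free.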

Where Does the Time Go?

Now that we know decode is memory-bandwidth-bound, let's trace exactly what happens during a single decode step and see where the time is spent. For each new token, the model must execute the following:

  • Step 1: Load the token embedding for the newly generated token (small — a single vector lookup).
  • Step 2: For each transformer layer: load the Q/K/V projection weight matrices from HBM, compute attention between the new token and all previous tokens (reading the KV cache), load the output projection weights, load the FFN weights (two large matrices), and compute the FFN.
  • Step 3: Load the language model head (often a large vocabulary projection matrix), compute logits over the full vocabulary, apply softmax, and sample the next token.

The dominant cost is step 2, repeated across every layer. A 7B model has around 32 layers, and each layer's weight matrices must be read from HBM. A 70B model has around 80 layers with larger hidden dimensions, multiplying the data transfer proportionally. The time per token is overwhelmingly determined by how fast we can stream those weights from HBM into the GPU's on-chip SRAM, where the actual arithmetic happens.
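To see where those bytes live, here's a rough FP16 byte budget for a LLaMA-7B-style layer stack. The dimensions (hidden size 4096, 32 layers, FFN width 11008, vocabulary 32,000) are illustrative assumptions, not exact specs:

```python
# Rough FP16 weight-byte budget per decode step (assumed LLaMA-7B-like dims)
h, n_layers, ffn, vocab = 4096, 32, 11008, 32000
bpp = 2  # bytes per FP16 parameter

attn = 4 * h * h                  # Q, K, V and output projection matrices
mlp = 3 * h * ffn                 # gated FFN: up, gate, down matrices
per_layer = (attn + mlp) * bpp    # bytes streamed from HBM per layer
lm_head = h * vocab * bpp

total = n_layers * per_layer + 2 * lm_head  # + input embeddings and LM head
print(f"per layer:  {per_layer / 1e6:.0f} MB read from HBM each decode step")
print(f"all layers: {n_layers * per_layer / 1e9:.1f} GB")
print(f"total:      {total / 1e9:.1f} GB (close to the 14 GB figure above)")
```

Roughly 400 MB per layer, and the FFN matrices account for about two-thirds of it — which is why step 2 dominates.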

This gives us a remarkably clean formula for estimating decode latency at batch size 1:

$$t_{\text{token}} \approx \frac{2P}{B_{\text{mem}}}$$

Let's unpack every symbol:

  • $P$ — the number of parameters in the model (e.g. $7 \times 10^9$ for a 7B model). This determines the total amount of weight data.
  • $2P$ — the model size in bytes, assuming FP16 storage (2 bytes per parameter). If we quantise to INT8, this becomes $1P$; for INT4, it becomes $0.5P$. The factor of 2 is specific to the precision — it's the only thing quantisation changes in this formula, which is why quantisation is such a powerful inference optimisation (article 3).
  • $B_{\text{mem}}$ — the memory bandwidth of the GPU in bytes per second. For an A100 SXM this is approximately $2 \times 10^{12}$ bytes/s (2 TB/s). For an H100 SXM it's roughly $3.35 \times 10^{12}$ bytes/s (3.35 TB/s).

Why is this formula only approximate? Because it ignores the attention computation (which reads the KV cache — proportional to sequence length, not model size), kernel launch overhead, and other smaller costs. But for short-to-medium sequences at batch size 1, weight loading dominates and this formula is surprisingly accurate.

Let's walk through the boundary cases:

  • 7B model on A100 (2 TB/s): $t = \frac{14 \times 10^9}{2 \times 10^{12}} = 7$ ms per token, or roughly 140 tokens/s. Fast enough for interactive use, but we're wasting over 99% of the GPU's 312 TFLOPS.
  • 70B model on A100 (2 TB/s): $t = \frac{140 \times 10^9}{2 \times 10^{12}} = 70$ ms per token, or roughly 14 tokens/s. Still usable but noticeably slow — and at 140 GB, the FP16 weights exceed a single A100's 80 GB of HBM entirely, so the model must be split across at least two GPUs.
  • 70B model on H100 (3.35 TB/s): $t = \frac{140 \times 10^9}{3.35 \times 10^{12}} \approx 42$ ms per token, or roughly 24 tokens/s. The 1.7$\times$ bandwidth improvement of H100 over A100 translates almost directly into a 1.7$\times$ speedup — exactly what you'd expect from a memory-bound workload.
  • 70B model quantised to INT4 on A100: $t = \frac{35 \times 10^9}{2 \times 10^{12}} \approx 17.5$ ms per token, or roughly 57 tokens/s. Quantisation from FP16 to INT4 cuts the numerator by 4$\times$, quadrupling throughput. This is the single biggest lever in inference optimisation.
📌 These are theoretical maxima — real systems are always slower. Attention computation grows with sequence length and isn't captured in this formula. KV cache reads add memory traffic. Kernel launch overhead, layer normalisations, and non-overlapped memory transfers all contribute. Typical real-world throughput is 50-70% of the theoretical maximum from this formula.

The table below computes theoretical maximum tokens per second for several model sizes and GPUs, using the formula above:

# Model sizes in billions of parameters
models = [
    ("1.3B",   1.3e9),
    ("7B",     7e9),
    ("13B",   13e9),
    ("70B",   70e9),
    ("405B", 405e9),
]

# GPUs: name, bandwidth in bytes/s, HBM capacity in GB
gpus = [
    ("A100 (2.0 TB/s)",  2.0e12,  80),
    ("H100 (3.35 TB/s)", 3.35e12, 80),
    ("H200 (4.8 TB/s)",  4.8e12, 141),
]

headers = ["Model", "FP16 Size"] + [g[0] for g in gpus]
rows = []
for model_name, params in models:
    model_bytes_fp16 = params * 2  # FP16: 2 bytes per param
    model_gb = model_bytes_fp16 / 1e9
    row = [model_name, f"{model_gb:.0f} GB"]
    for gpu_name, bw, hbm_gb in gpus:
        if model_gb > hbm_gb:
            row.append("OOM")  # weights alone exceed HBM capacity
        else:
            tok_s = bw / model_bytes_fp16  # tok/s = bandwidth / bytes per token
            row.append(f"{tok_s:.0f} tok/s")
    rows.append(row)

print("Theoretical max decode throughput (batch_size=1, FP16)")
print("Formula: tok/s = bandwidth / (2 * params)")
print()
widths = [max(len(str(r[i])) for r in [headers] + rows) for i in range(len(headers))]
for r in [headers] + rows:
    print("  ".join(str(c).ljust(w) for c, w in zip(r, widths)))
print()
print("Key insight: throughput is purely a function of model size")
print("and memory bandwidth. Compute (TFLOPS) barely matters here.")
💡 Notice how neither the 70B nor the 405B model fits on a single A100 or H100 in FP16. This is why tensor parallelism (splitting the model across multiple GPUs) is mandatory for large models, and why quantisation (article 3) is so attractive — an INT4-quantised 70B model is only 35 GB, comfortably fitting on a single 80 GB GPU while also quadrupling decode throughput.

Why Can't We Just Parallelise?

If decode is so painfully sequential, why can't we parallelise it the way training parallelises across the sequence? After all, GPUs are parallel machines — surely we can do better than processing one token at a time?

The answer lies in the autoregressive dependency. During training, all target tokens are known (teacher forcing), so computing the prediction at position 500 doesn't require waiting for the prediction at position 499 — they're all computed from the ground truth. But during inference, the input to position 500 is the prediction from position 499. We can't know what to feed into step $t$ until step $t-1$ has finished. This is a true data dependency that no amount of hardware parallelism can break.
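The dependency is visible in the shape of any decode loop: each iteration consumes its predecessor's output. Here `toy_model` is an arbitrary stand-in for the full forward pass, not a real model:

```python
def toy_model(context):
    # Stand-in for a forward pass: the next "token" is a function of the
    # entire context, so it cannot be computed before the context exists.
    return sum(context) % 97

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        nxt = toy_model(tokens)  # depends on every token so far...
        tokens.append(nxt)       # ...including ones we just generated
    return tokens

print(generate([5, 12, 31], 4))
```

No rearrangement of this loop lets iteration $t$ start before iteration $t-1$ finishes — that `append` is the data dependency.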

So what can we parallelise? Three dimensions remain open:

  • Batch dimension: serve multiple users simultaneously. If 32 requests arrive, we can process them together by batching their matrix-vector products into a single matrix-matrix product. This doesn't make any individual request faster, but it dramatically improves throughput (tokens per second across all requests) because we amortise the weight-loading cost across multiple tokens. This is the core idea behind continuous batching (article 4).
  • Model dimension (tensor parallelism): split a model's weight matrices across multiple GPUs. Each GPU holds a slice of each layer, and they communicate partial results via fast interconnects (NVLink). This reduces the per-GPU memory load, which allows larger models to serve at all, and can improve latency if the inter-GPU communication is fast enough. The tradeoff is synchronisation overhead.
  • Speculative dimension: instead of generating one token at a time, use a small, fast draft model to guess multiple tokens ahead, then verify them all in parallel with the large model. If the guesses are correct (which they often are for predictable tokens), we get multiple tokens for the cost of one large-model forward pass. This is speculative decoding (article 5).

But the sequential token generation for a single request remains the fundamental constraint. We cannot generate token 50 until we've generated token 49. This is the reason inference optimisation exists as a field — we're trying to squeeze maximum throughput out of massively parallel hardware that's shackled to a sequential workload.

Every technique in this track attacks the problem from a different angle. The diagram below maps them to the bottleneck they address:

  • KV caching (article 2): avoid redundant computation by caching intermediate results from previous tokens, so each decode step only computes the new token's contribution.
  • Quantisation (article 3): shrink the numerator of $t_{\text{token}} = 2P / B_{\text{mem}}$ by reducing bytes per parameter. INT4 means $0.5P$ instead of $2P$ — a 4$\times$ speedup in the memory-bound regime.
  • Continuous batching (article 4): increase arithmetic intensity by processing many requests per weight load, shifting from the memory-bound to the compute-bound regime.
  • Speculative decoding (article 5): break the one-token-at-a-time constraint by guessing ahead and verifying in parallel.
  • Efficient attention (article 6): reduce the memory and compute cost of the attention mechanism itself, which grows with sequence length and becomes a secondary bottleneck for long contexts.
💡 In practice, production serving systems combine nearly all of these techniques simultaneously. A typical setup might use an INT4-quantised model with KV caching, continuous batching of 64+ concurrent requests, tensor parallelism across 2-4 GPUs, and FlashAttention for the attention kernels. Each technique compounds on the others.

With the bottleneck now clearly identified — memory bandwidth during sequential decode — we're ready to tackle the first and most fundamental optimisation: reusing computation across decode steps with the KV cache, which is the subject of the next article.

Quiz

Test your understanding of the inference bottleneck and why decode is fundamentally different from training.

During autoregressive decode with batch_size=1, what is the primary bottleneck?

Why is the prefill phase compute-bound while the decode phase is memory-bound?

A 13B-parameter model in FP16 is served on a GPU with 3 TB/s memory bandwidth. What is the approximate theoretical maximum decode throughput at batch_size=1?

Why can't we parallelise token generation for a single request the way training parallelises across the sequence?