Why Static Batching Wastes GPUs
How do you serve an LLM to many users at once? The naive answer is static batching: collect $N$ incoming requests, pad every sequence to the length of the longest one, and push the entire batch through the model together. This is exactly how training works (fixed-size batches, uniform sequence length), so it's a natural first attempt at serving.
But generation is fundamentally different from training. During training, every sequence in a batch is processed in a single forward pass. During generation, each request produces tokens one at a time in an autoregressive loop, and different requests finish at wildly different times. One user might ask "What is 2+2?" (a few tokens) while another asks for a 500-word essay. With static batching, the short request finishes in a handful of decode steps, but its slot in the batch remains occupied (filled with padding tokens) until the longest request in the batch completes. The GPU is doing real multiply-accumulate operations on those pad tokens, but producing nothing useful.
How bad is the waste? Consider a batch of $N$ requests where the maximum output length is $L_{\text{max}}$ but the average output length is $\bar{L}$. Every request is padded to $L_{\text{max}}$, so the total compute spent on padding is proportional to $N \cdot (L_{\text{max}} - \bar{L})$, while the useful compute is proportional to $N \cdot \bar{L}$. The fraction of compute wasted on padding is:

$$\text{waste} = \frac{N \cdot (L_{\text{max}} - \bar{L})}{N \cdot L_{\text{max}}} = 1 - \frac{\bar{L}}{L_{\text{max}}}$$
Let's check the boundaries. If every request happens to produce exactly the same number of tokens, $\bar{L} = L_{\text{max}}$ and waste $= 0$ (no padding needed, the ideal case). If $L_{\text{max}} = 2048$ but the average response is 200 tokens ($\bar{L} = 200$), then waste $= 1 - 200/2048 \approx 0.90$. Ninety percent of the GPU's FLOPs are spent computing attention over padding tokens that produce no output. In practice, output length distributions are heavily right-skewed (most responses are short, a few are very long), so the longest request in any reasonably sized batch tends to be far longer than the average.
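The boundary checks above are easy to verify directly; a toy calculation, not tied to any particular serving stack:

```python
def padding_waste(l_max: int, l_avg: float) -> float:
    """Fraction of decode compute spent on padding under static batching:
    waste = 1 - l_avg / l_max."""
    return 1 - l_avg / l_max

# Boundary check: equal-length requests waste nothing
print(padding_waste(2048, 2048))           # 0.0
# Realistic skew: 200-token average against a 2048-token straggler
print(round(padding_waste(2048, 200), 2))  # 0.9
```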
There's a second, subtler problem: latency coupling. The user who asked "What is 2+2?" has to wait for the essay to finish, because the batch doesn't release any request until all requests complete. Throughput (tokens/second across all requests) is capped by the slowest request, and per-request latency (time-to-first-token and time-to-last-token) balloons for every short request unlucky enough to share a batch with a long one.
In summary: static batching treats the batch as an atomic unit. No request can enter or leave mid-generation. This wastes compute on padding and couples every request's latency to the worst case. We need a way to let requests come and go independently.
Continuous Batching (In-Flight Batching)
The fix is surprisingly simple in concept: instead of operating at the request level (wait for the entire batch to finish), operate at the iteration level (check after every single decode step). This idea was formalised as iteration-level scheduling in the Orca paper (Yu et al., 2022), and the serving community now calls it continuous batching or in-flight batching.
Here's how it works. The system maintains a running batch of up to $B$ active requests. After each decode step (one token generated per active request), the scheduler checks: has any request emitted an end-of-sequence token or hit its maximum length? If so, it evicts that request from the batch and immediately inserts a new request from the waiting queue into the freed slot. The batch stays full (or as full as the queue allows) at every iteration, and no request waits for another to finish.
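The loop just described can be sketched in a few lines of Python. This is a toy model with hypothetical names; real schedulers also handle memory admission, priorities, and the prefill phase:

```python
from collections import deque

def continuous_batching(requests, batch_size):
    """Toy iteration-level scheduler. `requests` is a list of
    (request_id, output_length) pairs; returns ids in finish order."""
    queue = deque(requests)
    active = {}        # request id -> tokens still to generate
    finished = []
    while queue or active:
        # Refill freed slots from the waiting queue before every step
        while len(active) < batch_size and queue:
            rid, length = queue.popleft()
            active[rid] = length
        # One decode step: every active request emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:   # EOS: evict immediately, slot frees up
                del active[rid]
                finished.append(rid)
    return finished

# The short requests finish without waiting for the essay
print(continuous_batching([("essay", 500), ("2+2", 4), ("hi", 2)], 2))
# ['2+2', 'hi', 'essay']
```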
An analogy makes this concrete. Static batching is like a tour bus: everyone boards at the same time, the bus drives the full route, and nobody gets off until the last stop. Passengers with nearby destinations sit idle for the entire ride. Continuous batching is like a subway: passengers board and exit at every station. As soon as someone gets off, their seat is available for the next person waiting on the platform. The train is always full as long as there are passengers waiting.
The throughput gain comes from two sources. First, no compute is wasted on padding, because each active slot contains a real, in-progress request generating real tokens. Second, the batch stays full: as soon as a short request finishes, a new request from the queue takes its place, so the GPU never idles waiting for the slowest request. If the queue is deep enough (more incoming requests than batch slots), every decode step processes exactly $B$ real tokens.
We can quantify the improvement. In static batching, throughput over one "batch cycle" is $\sum_{i=1}^{N} L_i$ useful tokens generated in $L_{\text{max}}$ decode steps (because every step processes $N$ sequences but only $\sum_i \mathbf{1}[t \leq L_i]$ are doing useful work at step $t$). In continuous batching with a deep queue, the throughput over $L_{\text{max}}$ decode steps is $B \cdot L_{\text{max}}$ useful tokens, because every slot is occupied at every step. The ratio is:

$$\text{ratio} = \frac{B \cdot L_{\text{max}}}{\sum_{i=1}^{N} L_i} = \frac{B \cdot L_{\text{max}}}{N \cdot \bar{L}}$$
Here we assume $B = N$ (same batch size). When $\bar{L} = L_{\text{max}}$ (all requests are the same length), the ratio is 1: continuous batching offers no advantage, because there's no waste to eliminate. When $\bar{L} = 200$ and $L_{\text{max}} = 2048$, the ratio is $2048/200 \approx 10\times$. That tenfold throughput improvement is not from faster hardware or a smaller model. It's purely from eliminating idle compute. In production workloads with high length variance, continuous batching commonly delivers 2-10x throughput gains over static batching.
Latency improves as well. A short request no longer waits for the longest request in its batch. It enters the batch, generates its tokens, and exits, with its total latency determined only by its own length plus any time spent waiting in the queue. The scheduler can also apply priority policies (e.g., prioritise interactive chat over batch summarisation) that were impossible with static batching's all-or-nothing model.
But continuous batching introduces a new problem. Requests now have different context lengths and different lifetimes. The KV cache (the key-value tensors stored from previous tokens so we don't recompute attention from scratch at each step) must be managed dynamically: allocated when a request enters, grown as it generates tokens, and freed when it exits. How do we manage this memory efficiently?
The KV Cache Fragmentation Problem
During autoregressive generation, each transformer layer caches the key and value projections of every token generated so far. This KV cache is what makes generation $O(n)$ per step instead of $O(n^2)$: at step $t$, we only compute the query for the new token and attend over the cached keys and values from steps $1$ through $t-1$, rather than reprocessing the entire sequence. For a model with $L$ layers, $h$ attention heads, and head dimension $d_h$, the KV cache for a single request at sequence position $t$ occupies:

$$2 \cdot L \cdot h \cdot d_h \cdot t \cdot s \text{ bytes,}$$

where $s$ is the number of bytes per element.
The factor of 2 accounts for both keys and values. For a 7B-parameter model (e.g. Llama-2-7B with $L=32$, $h=32$, $d_h=128$) using float16 (2 bytes per element), a single request at maximum context length $t = 4096$ requires $2 \times 32 \times 32 \times 128 \times 4096 \times 2 \approx 2.1$ GB. For a batch of 32 requests all at max length, that's nearly 69 GB of KV cache alone, which would not fit on most GPUs. And this is a 7B model; larger models need proportionally more.
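These numbers are easy to reproduce from the formula (the layer and head counts are Llama-2-7B's configuration as given above):

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """2 (K and V) x layers x heads x head_dim x tokens x dtype size."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

per_request = kv_cache_bytes(32, 32, 128, 4096)  # Llama-2-7B, fp16, full context
print(f"{per_request / 1e9:.1f} GB per request")
print(f"{32 * per_request / 1e9:.0f} GB for a batch of 32")
```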
The naive approach to managing this memory is to pre-allocate $2 \times L \times h \times d_h \times L_{\text{max}} \times s$ bytes for every request when it enters the batch, reserving space for the maximum possible sequence length. This is what HuggingFace's default generate() function does. It's simple, but catastrophically wasteful when combined with continuous batching, because most requests won't use anywhere near $L_{\text{max}}$ tokens.
This creates two kinds of fragmentation, directly analogous to classic operating-system memory problems:
- Internal fragmentation: each request has a contiguous block of $L_{\text{max}}$ slots reserved, but it only uses $L_i \ll L_{\text{max}}$ of them. The unused slots within each allocation are wasted. If $L_{\text{max}} = 4096$ and a request only generates 150 tokens, $96\%$ of its reserved memory is never touched.
- External fragmentation: when requests finish and their memory is freed, the freed blocks may not be contiguous with each other. A new request needing a large contiguous allocation might fail even though the total free memory is sufficient, because it's scattered across non-adjacent gaps between still-active requests.
How much memory does this actually waste? With pre-allocation, the memory reserved for a batch of $N$ requests is $N \cdot L_{\text{max}} \cdot c$ (where $c$ is the per-token KV cache size), but the memory actually used is $\sum_{i=1}^{N} L_i \cdot c = N \cdot \bar{L} \cdot c$. The utilisation ratio is:

$$\text{utilisation} = \frac{N \cdot \bar{L} \cdot c}{N \cdot L_{\text{max}} \cdot c} = \frac{\bar{L}}{L_{\text{max}}}$$
This is the same ratio we saw for compute waste, and the same boundaries apply: $100\%$ utilisation only when all requests use the full context length, and as low as $5\text{--}10\%$ in realistic workloads. Since GPU memory directly limits how many requests can be in the batch simultaneously, wasting $90\%$ of KV cache memory means the effective batch size is roughly $10\times$ smaller than what the hardware could support if memory were managed efficiently. Smaller batches mean lower throughput and higher queuing latency.
To summarise the situation: continuous batching solved the compute waste problem (no more padding), but it exposed a memory waste problem (KV caches for variable-length requests are hard to manage without fragmentation). We need a memory management scheme that allocates exactly what each request needs, no more, and can reclaim and reuse memory without contiguity constraints. Operating systems solved this exact problem decades ago with virtual memory and paging. Can we apply the same idea to KV caches?
PagedAttention: Virtual Memory for KV Cache
The answer is yes, and the technique is called PagedAttention, introduced in the vLLM paper (Kwon et al., 2023). The core insight is a direct transplant from operating-system design: treat KV cache memory the way an OS treats RAM, using paging. Instead of allocating one contiguous block per request, divide all KV cache memory into fixed-size blocks (pages), and allocate them on demand as each request grows.
Here's the mechanism in detail. The physical KV cache memory on the GPU is divided into blocks, each holding the keys and values for a fixed number of tokens (typically $P = 16$ tokens per block). A single block for one layer and one head stores $2 \times P \times d_h$ elements (keys and values, $P$ token positions, head dimension $d_h$). Across all layers and heads, one logical "block" for a request stores:

$$2 \times L \times h \times P \times d_h \text{ elements} = 2 \cdot L \cdot h \cdot P \cdot d_h \cdot s \text{ bytes}$$

For the 7B model above ($L = 32$, $h = 32$, $d_h = 128$, $s = 2$), that's 8 MB per block.
Each active request maintains a page table: a mapping from logical block indices (block 0 holds tokens 0-15, block 1 holds tokens 16-31, etc.) to physical block addresses in GPU memory. When a request is first scheduled, it receives enough blocks to hold its prompt. As it generates tokens and fills its last block, a new block is allocated from the free pool and appended to its page table. When a request finishes, all its blocks are returned to the free pool immediately.
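A toy block allocator makes the mechanism concrete. This is a sketch of the idea only; vLLM's real block manager also handles reference counting, copy-on-write, and GPU/CPU swapping:

```python
class PagedKVCache:
    """Toy paged allocator: fixed-size blocks, one page table per request."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_pool = list(range(num_blocks))  # physical block ids
        self.page_tables = {}  # request id -> list of physical block ids
        self.lengths = {}      # request id -> tokens cached so far

    def append_token(self, rid: str):
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:  # last block full (or first token)
            if not self.free_pool:
                raise MemoryError("out of blocks: queue or preempt the request")
            self.page_tables.setdefault(rid, []).append(self.free_pool.pop())
        self.lengths[rid] = n + 1

    def free(self, rid: str):
        # All blocks return to the pool; contiguity never matters
        self.free_pool.extend(self.page_tables.pop(rid, []))
        self.lengths.pop(rid, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):            # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-A")
print(len(cache.page_tables["req-A"]), len(cache.free_pool))  # 2 2
cache.free("req-A")
print(len(cache.free_pool))                                   # 4
```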
Why does this solve both fragmentation problems? Internal fragmentation is bounded: the only wasted space is in the last block of each request, which wastes $P/2$ token slots on average and at most $P - 1$. If $P = 16$, that's never more than 15 token slots per request; compare that to wasting $L_{\text{max}} - L_i$ slots (potentially thousands) under pre-allocation. For a request generating $L_i$ tokens, the waste fraction per request is:

$$\text{waste}(L_i) = \frac{P \cdot \lceil L_i / P \rceil - L_i}{L_i} < \frac{P}{L_i}$$
When $L_i = 200$ and $P = 16$, waste is at most $16/200 = 8\%$. When $L_i = 2000$, waste drops to $0.8\%$. Compare this to the pre-allocation waste of $1 - L_i / L_{\text{max}}$, which for $L_i = 200$ and $L_{\text{max}} = 4096$ is $95\%$. The improvement is dramatic. External fragmentation disappears entirely, because blocks are all the same size and any free block can be allocated to any request regardless of where it sits in physical memory. There's no need for contiguous allocation.
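The two waste modes can be compared directly with the numbers above (toy arithmetic; the realised paged waste sits below the $P/L_i$ bound):

```python
import math

P, L_MAX = 16, 4096

def paged_waste(l_i: int) -> float:
    """Empty slots in the request's final block, relative to tokens used."""
    return (P * math.ceil(l_i / P) - l_i) / l_i

def prealloc_waste(l_i: int) -> float:
    """Reserved-but-unused slots under contiguous L_MAX pre-allocation."""
    return 1 - l_i / L_MAX

# A 200-token response: paged waste is 4% (below the 8% bound),
# while pre-allocation wastes 95% of its reservation.
print(round(paged_waste(200), 3), round(prealloc_waste(200), 3))  # 0.04 0.951
```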
The total number of blocks that fit on the GPU determines the system's capacity. If the GPU has $M$ bytes of memory available for KV cache and each block is $b$ bytes, then the system has $\lfloor M / b \rfloor$ blocks. With pre-allocation, each request reserves $\lceil L_{\text{max}} / P \rceil$ blocks regardless of actual usage. With paged allocation, each request uses only $\lceil L_i / P \rceil$ blocks. The maximum batch size under paging is:

$$B_{\text{paged}} = \left\lfloor \frac{\lfloor M / b \rfloor}{\lceil \bar{L} / P \rceil} \right\rfloor \quad \text{versus} \quad B_{\text{pre-alloc}} = \left\lfloor \frac{\lfloor M / b \rfloor}{\lceil L_{\text{max}} / P \rceil} \right\rfloor$$
The ratio $B_{\text{paged}} / B_{\text{pre-alloc}} \approx L_{\text{max}} / \bar{L}$, which can be $10\text{--}20\times$ for typical workloads. More requests in the batch means higher throughput, and this comes at zero cost: we're just using memory that was previously wasted. The vLLM paper reports that PagedAttention achieves near-optimal memory utilisation: typically $> 95\%$ of allocated KV cache memory is actively storing useful key-value pairs, compared to $20\text{--}60\%$ with naive pre-allocation.
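A back-of-envelope capacity calculation makes the gap concrete. The 40 GB free-memory figure is an illustrative assumption; the ~0.5 MB/token block cost follows from the 7B model above:

```python
import math

P, L_MAX, L_AVG = 16, 4096, 200
block_bytes = P * 512 * 1024           # 16 tokens x ~0.5 MB/token (7B, fp16)
num_blocks = int(40e9) // block_bytes  # assume 40 GB left for KV cache

b_prealloc = num_blocks // math.ceil(L_MAX / P)  # reserve worst case per request
b_paged = num_blocks // math.ceil(L_AVG / P)     # allocate only what's used

print(b_prealloc, b_paged)             # 18 366
print(f"{b_paged / b_prealloc:.1f}x")  # 20.3x
```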
PagedAttention enables one more important optimisation: copy-on-write (CoW) sharing. In many serving scenarios, multiple requests share the same prefix: a system prompt that's prepended to every user message, or multiple candidate responses generated from the same input (beam search, parallel sampling). Under paging, these requests can share the same physical blocks for their common prefix. Each request's page table points to the same physical blocks for the shared tokens, and only when a request diverges (generates a different token) does it get its own copy of the diverging block. This is exactly how Unix fork() works with process memory.
For a system prompt of $S$ tokens shared by $N$ concurrent requests, copy-on-write saves $N \cdot S \cdot c - S \cdot c = (N-1) \cdot S \cdot c$ bytes (where $c$ is the per-token cache size). If $S = 500$, $N = 32$, and $c = 0.5$ MB per token (the per-token KV size of the 7B model above), the saving is $31 \times 500 \times 0.5$ MB $\approx 7.8$ GB, enough reclaimed KV cache memory to admit many additional requests to the batch. For larger models and longer system prompts, the savings scale proportionally.
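The reference-counting logic behind CoW can be sketched at the block level. This is a hypothetical structure for illustration, tracking only block ids, not actual tensors:

```python
class SharedPrefixCache:
    """Toy copy-on-write sharing of KV cache blocks for a common prefix."""

    def __init__(self):
        self.refcount = {}   # physical block id -> number of holders
        self.next_block = 0

    def alloc(self):
        bid = self.next_block
        self.next_block += 1
        self.refcount[bid] = 1
        return bid

    def share(self, blocks):
        # New page table points at the same physical blocks
        for bid in blocks:
            self.refcount[bid] += 1
        return list(blocks)

    def write(self, table, i):
        bid = table[i]
        if self.refcount[bid] > 1:    # shared: copy before writing (CoW)
            self.refcount[bid] -= 1
            table[i] = self.alloc()
        return table[i]

cache = SharedPrefixCache()
prefix = [cache.alloc() for _ in range(4)]  # 4 blocks of shared system prompt
req_a = cache.share(prefix)
req_b = cache.share(prefix)
cache.write(req_b, 3)                       # req_b diverges in its last block
print(req_a[3] == req_b[3])   # False: req_b got a private copy
print(req_a[:3] == req_b[:3]) # True: the first 3 blocks are still shared
```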
Putting It Together: vLLM Architecture
vLLM (Kwon et al., 2023) combines continuous batching and PagedAttention with optimised CUDA kernels into a complete serving system. Understanding the full request lifecycle through vLLM ties together everything we've discussed.
A request arrives via the API and enters a waiting queue. The scheduler runs at every iteration and decides which requests to promote from the queue into the active batch. It checks: are there enough free KV cache blocks to admit a new request? If so, it allocates enough blocks from the free pool to hold the prompt, runs the prefill phase (processing the full input prompt in a single forward pass to populate the KV cache), and the request begins decoding. At each subsequent decode step, the request generates one token. If its current block is full, the scheduler allocates a new block. When the request emits an end-of-sequence token, all its blocks are returned to the free pool and the slot opens for the next queued request.
The scheduler also handles preemption for memory pressure. If the free pool runs out of blocks (all memory is in use by active requests), the scheduler can swap out a low-priority request: copy its KV cache blocks to CPU memory, free the GPU blocks, and admit a higher-priority request. When GPU memory becomes available again, the swapped request's blocks are copied back and it resumes from where it left off. This is exactly analogous to OS page swapping between RAM and disk.
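In sketch form, preemption is just an extension of the block-pool bookkeeping (hypothetical structure; vLLM can also recompute a preempted request's KV cache later instead of swapping it):

```python
def admit_or_preempt(free_blocks, needed, active, swap_out):
    """Toy admission check: swap out the lowest-priority active request
    until `needed` blocks are free (or nothing is left to evict).
    `active` is a list of (priority, request_id, num_blocks), sorted
    lowest priority first; `swap_out` copies a request's KV blocks to CPU."""
    while free_blocks < needed and active:
        priority, rid, blocks = active.pop(0)  # victim: lowest priority
        swap_out(rid)                          # KV blocks move to CPU memory
        free_blocks += blocks                  # GPU blocks rejoin the pool
    return free_blocks >= needed

swapped = []
running = [(0, "background-summary", 40), (5, "chat-1", 10)]
ok = admit_or_preempt(8, 32, running, swapped.append)
print(ok, swapped)  # True ['background-summary']
```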
Here's how to set up a vLLM server and query it:
# ── Terminal 1: Start the vLLM server ──────────────────────────────────
# pip install vllm
# This launches an OpenAI-compatible API server with continuous batching
# and PagedAttention enabled by default.
# vllm serve meta-llama/Llama-2-7b-chat-hf \
# --max-model-len 4096 \
# --gpu-memory-utilization 0.90
# ── Terminal 2: Send requests ──────────────────────────────────────────
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="unused", # vLLM doesn't require a real key
)
# These requests are batched continuously by the server:
# short ones finish and free their KV cache blocks immediately,
# while long ones keep generating without blocking anyone.
response = client.chat.completions.create(
model="meta-llama/Llama-2-7b-chat-hf",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain PagedAttention in two sentences."},
],
max_tokens=256,
temperature=0.7,
)
print(response.choices[0].message.content)
You can also use vLLM's offline (non-server) API directly for batch inference:
from vllm import LLM, SamplingParams
# Load the model; PagedAttention and continuous batching
# are handled internally.
llm = LLM(
model="meta-llama/Llama-2-7b-chat-hf",
max_model_len=4096,
gpu_memory_utilization=0.90,
)
prompts = [
"What is 2 + 2?", # very short response
"Write a 500-word essay on climate change.", # long response
"Translate 'hello' to French.", # short response
"Explain quantum computing in detail.", # medium response
]
params = SamplingParams(temperature=0.7, max_tokens=1024)
# vLLM handles scheduling, KV cache paging, and eviction
# internally. Short requests finish first and free memory
# for new work: no padding, no waiting.
outputs = llm.generate(prompts, params)
for output in outputs:
prompt = output.prompt
generated = output.outputs[0].text
print(f"Prompt: {prompt[:60]}...")
print(f"Response: {generated[:100]}...")
print(f"Tokens: {len(output.outputs[0].token_ids)}")
print()
The performance numbers are striking. The vLLM paper benchmarks against HuggingFace Transformers' generate() (static batching, pre-allocated KV cache) and reports 2-4x throughput improvement on short-output workloads and up to 24x on long-output, high-variance workloads. The gains are largest when output lengths are highly variable (maximising the waste that static batching incurs) and when the server is under heavy load (keeping the continuous batch full).
To build intuition for where the gains come from, the simulation below models static vs continuous batching for a set of requests with variable output lengths. It tracks GPU slot utilisation (fraction of batch slots doing useful work) at each decode step:
import json, random
random.seed(42)
BATCH_SIZE = 8
NUM_REQUESTS = 24
# Generate requests with variable output lengths (right-skewed)
request_lengths = [random.randint(10, 40) if random.random() < 0.7
else random.randint(80, 150) for _ in range(NUM_REQUESTS)]
# ── Static batching simulation ─────────────────────────────────────
static_util = []
idx = 0
while idx < NUM_REQUESTS:
batch = request_lengths[idx:idx + BATCH_SIZE]
max_len = max(batch)
for step in range(max_len):
active = sum(1 for l in batch if step < l)
static_util.append(active / BATCH_SIZE)
idx += BATCH_SIZE
# ── Continuous batching simulation ─────────────────────────────────
cont_util = []
active_requests = []
queue = list(request_lengths)
step_count = 0
max_steps = len(static_util)
while step_count < max_steps and (active_requests or queue):
# Fill empty slots from queue
while len(active_requests) < BATCH_SIZE and queue:
active_requests.append(queue.pop(0))
if not active_requests:
break
# Record utilisation
cont_util.append(len(active_requests) / BATCH_SIZE)
# Decrement remaining lengths, remove finished
active_requests = [r - 1 for r in active_requests]
active_requests = [r for r in active_requests if r > 0]
step_count += 1
# Compute average utilisation over each schedule's own duration.
# (Padding both lists to a common length first would make the two
# averages equal by construction, hiding the difference.)
avg_static = sum(static_util) / len(static_util)
avg_cont = sum(cont_util) / len(cont_util)
# Pad to same length for plotting
max_plot = max(len(static_util), len(cont_util))
static_util += [0] * (max_plot - len(static_util))
cont_util += [0] * (max_plot - len(cont_util))
# Sample every few steps for readability
step = max(1, max_plot // 60)
x_labels = [str(i) for i in range(0, max_plot, step)]
static_sampled = [round(static_util[i], 2) for i in range(0, max_plot, step)]
cont_sampled = [round(cont_util[i], 2) for i in range(0, max_plot, step)]
plot_data = [
{
"title": "GPU Slot Utilisation: Static vs Continuous Batching",
"x_label": "Decode Step",
"y_label": "Utilisation (fraction of slots active)",
"x_data": x_labels,
"lines": [
{"label": f"Static (avg {avg_static:.0%})", "data": static_sampled, "color": "#ef4444"},
{"label": f"Continuous (avg {avg_cont:.0%})", "data": cont_sampled, "color": "#22c55e"},
]
}
]
plot_data_json = json.dumps(plot_data)  # serialised for an external plotting widget
total_tokens = sum(request_lengths)
static_steps = sum(max(request_lengths[i:i+BATCH_SIZE])
for i in range(0, NUM_REQUESTS, BATCH_SIZE))
print(f"Requests: {NUM_REQUESTS}, Batch size: {BATCH_SIZE}")
print(f"Output lengths: min={min(request_lengths)}, max={max(request_lengths)}, "
f"avg={sum(request_lengths)/len(request_lengths):.0f}")
print(f"Static batching: {static_steps} total decode steps, "
f"avg utilisation {avg_static:.1%}")
print(f"Continuous batching: {len([u for u in cont_util if u > 0])} total decode steps, "
f"avg utilisation {avg_cont:.1%}")
print(f"Throughput ratio: {avg_cont/avg_static:.2f}x")
The plot shows the fundamental difference: static batching has periodic drops in utilisation as short requests finish but their slots remain locked until the batch completes (the sawtooth pattern), while continuous batching maintains near-100% utilisation by immediately refilling completed slots from the queue.
Today, virtually every production LLM serving framework uses continuous batching with some form of paged KV cache management. vLLM popularised the idea, but the same principles appear in TensorRT-LLM (NVIDIA), TGI (HuggingFace), and DeepSpeed-FastGen (Microsoft). The details differ (block size, scheduling policy, kernel implementation), but the architecture is the same: iteration-level scheduling, dynamic memory allocation via paging, and copy-on-write sharing for common prefixes.
Quiz
Test your understanding of continuous batching and PagedAttention.
In static batching with $L_{\text{max}} = 2048$ and an average output length of $\bar{L} = 256$, what fraction of GPU compute is wasted on padding?
What is the key scheduling difference between static batching and continuous batching?
Why does PagedAttention virtually eliminate external fragmentation in KV cache memory?
How does copy-on-write (CoW) in PagedAttention save memory when multiple requests share a system prompt?