From Single-GPU to Production
Everything we've covered so far in this track has been about making a single request fast on a single GPU. The KV cache avoids redundant attention computation. Quantisation shrinks the model weights so they transfer faster from memory. Continuous batching keeps the GPU full by letting requests enter and leave the batch independently. FlashAttention makes the attention kernel itself faster by fusing operations and staying in SRAM. These are all single-machine optimisations. But production LLM serving looks nothing like one user typing into a notebook.
In production, thousands of concurrent users are sending requests with wildly different prompt lengths and output requirements. There are latency SLAs — a chatbot might require time-to-first-token under 500ms and inter-token latency under 50ms, while a batch summarisation job cares only about total throughput. The model itself may not fit on a single GPU: a Llama-3-70B in float16 requires roughly 140 GB just for weights, while an H100 SXM — a top-end datacenter GPU — has 80 GB of memory. And even if it fits, one GPU can't serve the request volume a popular product demands.
This article covers the systems-level decisions that bridge that gap: how to split a model across multiple GPUs when it won't fit on one, how to separate the prefill and decode phases so they don't interfere with each other, the fundamental tradeoff between throughput and latency, and how the major open-source serving frameworks compare. These are the decisions that turn a working prototype into a service handling 10,000 requests per second.
Tensor Parallelism vs Pipeline Parallelism
When a model doesn't fit on a single GPU, we have to split it across multiple GPUs. But how? There are two fundamental strategies, and they make very different tradeoffs between latency and communication cost.
Tensor Parallelism (TP) splits individual layers across GPUs. Each GPU holds a slice of every weight matrix. When a token passes through a layer, every GPU computes its slice of the matrix multiplication simultaneously, then all GPUs exchange partial results via an all-reduce operation to produce the full output. The key insight is that every GPU works on every token at the same time — the parallelism happens within each layer, not across layers.
What does this buy us? Latency stays roughly the same as a single GPU (all GPUs compute in parallel), and memory is split evenly (each GPU stores $1/t$ of the weights, where $t$ is the TP degree). But there's a heavy communication cost: an all-reduce operation after every layer in every forward pass. For a 32-layer model with TP=8, that's 32 all-reduce operations per token, each exchanging data across 8 GPUs. This is why TP is almost always used within a single node, where GPUs are connected by NVLink (900 GB/s bidirectional on H100) rather than across nodes, where InfiniBand tops out around 50-100 GB/s. The interconnect bandwidth is the bottleneck.
How much data does each all-reduce actually move? For a single transformer layer processing one token, each GPU must send and receive a volume proportional to the hidden dimension. The total all-reduce volume per layer per token is:

$$V_{\text{layer}} = \frac{2(t-1)}{t} \times d_{\text{hidden}} \times b_{\text{precision}}$$
Here $t$ is the TP degree (number of GPUs), $d_{\text{hidden}}$ is the hidden dimension, and $b_{\text{precision}}$ is bytes per element (2 for float16, 1 for int8). The factor $2(t-1)/t$ comes from the ring all-reduce algorithm: each GPU sends and receives $(t-1)/t$ of the data, and the factor of 2 accounts for the reduce-scatter and all-gather phases. Let's check the boundaries. When $t = 1$ (single GPU), the numerator is $2(1-1)/1 = 0$ — no communication needed, which is correct: there's nothing to synchronise. When $t = 2$, we get $2(1)/2 = 1$, meaning each GPU transfers one full copy of $d_{\text{hidden}} \times b_{\text{precision}}$ bytes. When $t = 8$, we get $2(7)/8 = 1.75$, meaning the total traffic is 1.75 times the data size. As $t \to \infty$, the factor approaches 2, so communication cost asymptotes — you never pay more than twice the data size regardless of how many GPUs you use.
For a concrete example, consider Llama-3-70B ($d_{\text{hidden}} = 8192$) with TP=8 and float16. Per layer per token, the all-reduce transfers $1.75 \times 8192 \times 2 \approx 28{,}672$ bytes, about 28 KB. Across 80 layers, that's roughly 2.2 MB per token. On NVLink at 900 GB/s, moving that data takes about 2.5 microseconds — negligible. Across InfiniBand at 50 GB/s, the bandwidth cost alone rises to about 45 microseconds, and on top of that each of the 80 all-reduce operations pays per-operation launch and network latency — typically tens of microseconds across a network fabric — which can push the total into the millisecond range per token. At 30 tokens per second decode speed, a few milliseconds of communication overhead consumes 10% or more of the 33ms time budget per token. That's why TP across nodes is impractical for latency-sensitive serving.
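The arithmetic above can be sketched in a few lines. This is a back-of-envelope calculator, not a benchmark: the bandwidth figures are the ones quoted in the text, and it accounts only for data volume, not per-operation latency.

```python
# Back-of-envelope all-reduce cost for tensor parallelism.
def allreduce_bytes_per_layer(t, d_hidden, bytes_per_elem=2):
    """Ring all-reduce traffic per layer per token: 2(t-1)/t * d_hidden * b."""
    return 2 * (t - 1) / t * d_hidden * bytes_per_elem

t, d_hidden, layers = 8, 8192, 80      # Llama-3-70B with TP=8, float16
per_layer = allreduce_bytes_per_layer(t, d_hidden)   # ~28 KB
per_token = per_layer * layers                       # ~2.2 MB per decoded token

for name, bw_gbs in [("NVLink", 900), ("InfiniBand", 50)]:
    micros = per_token / (bw_gbs * 1e9) * 1e6
    print(f"{name}: {per_token/1e6:.2f} MB/token -> {micros:.1f} us (bandwidth only)")
```

Note the boundary behaviour matches the formula: at $t = 1$ the function returns zero (no communication), and as $t$ grows the factor asymptotes to 2.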
Pipeline Parallelism (PP) takes the opposite approach: instead of splitting each layer across GPUs, it assigns entire layers to different GPUs. GPU 0 runs layers 1-20, GPU 1 runs layers 21-40, GPU 2 runs layers 41-60, and GPU 3 runs layers 61-80. A token's activations flow sequentially through the pipeline: GPU 0 computes its layers, sends the output activations to GPU 1, which computes its layers, sends to GPU 2, and so on.
The communication cost is much lower than TP. Instead of an all-reduce after every layer, PP only transfers activation tensors between pipeline stages — once per stage boundary, not once per layer. For a model with hidden dimension $d_{\text{hidden}}$ and a batch of $B$ tokens, each inter-stage transfer is just $B \times d_{\text{hidden}} \times b_{\text{precision}}$ bytes. With $d_{\text{hidden}} = 8192$, $B = 1$, and float16, that's about 16 KB per stage boundary — tiny compared to all-reduce across 8 GPUs.
But PP has a fundamental latency problem: the pipeline is sequential. A token must pass through every stage in order, and while GPU 0 is working, GPUs 1-3 are idle (and vice versa). This creates a pipeline bubble — idle time where GPUs wait for their predecessors to finish. During training, micro-batching tricks can partially fill the bubble, but during decode (generating one token at a time), the bubble is unavoidable. With 4 pipeline stages, each decode step has roughly 75% idle time across GPUs, because only one stage is active at a time.
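The bubble argument is simple enough to state as a one-line model. This is a simplified sketch that assumes exactly one token in flight (pure decode, no micro-batching) and equal-sized stages:

```python
# Decode-time pipeline bubble (simplified): with one token in flight,
# only one of p stages is active at a time, so each GPU idles (p-1)/p
# of every decode step.
def decode_idle_fraction(p_stages):
    return (p_stages - 1) / p_stages

for p in (2, 4, 8):
    print(f"{p} stages: {decode_idle_fraction(p):.0%} idle per decode step")
```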
For serving, the choice is clear: TP is preferred because it minimises latency per request. All GPUs work simultaneously on every token, so per-token latency is close to what a single (impossibly large) GPU would achieve. PP is used when you run out of NVLink-connected GPUs. A typical H100 node has 8 GPUs connected by NVLink, giving TP up to 8. If you need more than 8 GPUs (say, for a 405B model), you use TP=8 within the node and PP across nodes, combining both strategies. TP handles the latency-critical intra-layer parallelism over fast NVLink, while PP handles the cross-node communication over slower InfiniBand, where the lower bandwidth is acceptable because far less data moves.
With TP degree $t$, the memory per GPU for the model weights is approximately:

$$M_{\text{weights per GPU}} \approx \frac{P \times b_{\text{precision}}}{t}$$

where $P$ is the parameter count.
The KV cache is not divided by $t$ in the general case — it depends on the attention architecture. With standard multi-head attention, each GPU stores $1/t$ of the KV heads, so the cache is also split. But with grouped-query attention (GQA), which uses fewer KV heads than query heads, the KV cache is already small and may be replicated across GPUs for simplicity. The key takeaway is that TP divides the weight memory cleanly, but the KV cache behaviour depends on the attention design.
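The weight-memory formula is easy to sanity-check numerically. A minimal sketch, using the 70B-parameter, float16 figures from earlier in the article:

```python
# Per-GPU weight memory under tensor parallelism: P * b / t.
def weights_per_gpu_gb(n_params, bytes_per_param, tp_degree):
    return n_params * bytes_per_param / tp_degree / 1e9

# Llama-3-70B in float16 (2 bytes per parameter)
for t in (1, 2, 4, 8):
    print(f"TP={t}: {weights_per_gpu_gb(70e9, 2, t):.1f} GB of weights per GPU")
```

At TP=1 this reproduces the 140 GB figure that doesn't fit on an 80 GB H100; at TP=8 each GPU holds 17.5 GB of weights, leaving most of its memory free for KV cache.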
import json, js
# Compare TP and PP characteristics
rows = [
["What is split", "Each layer's weight matrices (columns/rows)", "Entire layers assigned to different GPUs"],
["Communication pattern", "All-reduce after every layer", "Activation transfer between stages only"],
["Communication volume", "2(t-1)/t * d_hidden * b per layer per token", "B * d_hidden * b per stage boundary"],
["Latency impact", "Low (all GPUs work simultaneously)", "High (sequential pipeline, bubble overhead)"],
["Interconnect requirement", "Fast (NVLink: 900 GB/s)", "Moderate (InfiniBand: 50-100 GB/s OK)"],
["Typical deployment", "Within a node (2-8 GPUs)", "Across nodes"],
["GPU utilisation during decode", "High (all GPUs active per step)", "Low (only 1 stage active at a time)"],
["Best for serving?", "Yes (latency-optimised)", "Fallback when TP alone is insufficient"],
]
js.window.py_table_data = json.dumps({
"headers": ["Property", "Tensor Parallelism (TP)", "Pipeline Parallelism (PP)"],
"rows": rows
})
print("TP splits within layers (horizontal cut), PP splits across layers (vertical cut).")
print("For serving: always prefer TP within the node, add PP only when you need more GPUs than NVLink connects.")
Prefill-Decode Disaggregation
We established in article 1 that prefill (processing the input prompt) is compute-bound while decode (generating tokens one at a time) is memory-bandwidth-bound. They stress different hardware resources, they take very different amounts of time, and — critically — when they run on the same GPU, they interfere with each other. This interference is the central problem that prefill-decode disaggregation solves.
Here's the problem in concrete terms. Suppose a GPU is in the middle of decoding for 32 active requests — each decode step takes roughly 30ms, giving users a smooth 33 tokens/second stream. Now a new request arrives with a 4,000-token prompt. The serving system must prefill those 4,000 tokens before the new request can join the decode batch. That prefill might take 200-500ms, depending on the model and GPU. During that entire time, the 32 existing decode requests are stalled — no new tokens are generated for any of them. From the user's perspective, the token stream freezes for half a second, then resumes. This is called a prefill stall or generation stutter, and it directly violates latency SLAs.
The solution, formalised in the DistServe paper (Zhong et al., 2024), is to physically separate the two phases onto different GPUs. Prefill GPUs handle only prompt processing: they receive the input, compute attention across all prompt tokens, produce the KV cache, and pass it to a decode GPU. Decode GPUs handle only autoregressive generation: they receive the KV cache from a prefill GPU and generate tokens one at a time until the request completes. No decode GPU ever runs a prefill, and no prefill GPU ever runs a decode loop.
Why does this help? Three reasons:
- No prefill stalls during decode. Since decode GPUs never run prefill operations, the token stream for active requests is never interrupted. Inter-token latency becomes predictable, which is exactly what SLAs require.
- Hardware-appropriate scheduling. Prefill is compute-bound, so prefill GPUs benefit from high FLOPS (they want to crunch through 4,000 tokens of matrix multiplies as fast as possible). Decode is memory-bandwidth-bound, so decode GPUs benefit from high memory bandwidth. In principle, you could use different GPU types for each phase, though in practice most deployments use the same hardware and simply dedicate different pools to each role.
- Independent scaling. If prefill demand spikes (many long prompts arriving at once), you can scale up the prefill pool without affecting the decode pool, and vice versa. This decoupling makes capacity planning simpler.
The main cost of disaggregation is KV cache transfer. After prefill completes, the full KV cache for that request must be sent from the prefill GPU to a decode GPU. For a model with $L$ layers, $n_{\text{kv}}$ KV heads, head dimension $d_h$, and a prompt of $s$ tokens, the transfer size is:

$$\text{KV bytes} = 2 \times L \times n_{\text{kv}} \times d_h \times s \times b_{\text{precision}}$$

where the leading factor of 2 accounts for storing both keys and values.
For Llama-3-70B ($L = 80$, $n_{\text{kv}} = 8$ with GQA, $d_h = 128$) with a 4,000-token prompt in float16: $2 \times 80 \times 8 \times 128 \times 4000 \times 2 \approx 1.31$ GB. Over InfiniBand at 50 GB/s, that transfer takes about 26ms — added directly to the time-to-first-token. Over NVLink within a node, it's under 2ms. This is why disaggregated systems prefer to keep prefill and decode GPUs close together (ideally in the same rack or connected by high-speed fabric).
Let's check the formula at the boundaries. With $s = 1$ (a single-token prompt), the transfer is negligible — there's barely any KV cache to move, and disaggregation offers little benefit since prefill is nearly instantaneous anyway. With $s = 128{,}000$ (a very long context), the transfer grows to roughly 42 GB, which at 50 GB/s takes almost a second. For extremely long contexts, the transfer cost becomes significant, and system designers must weigh it against the benefit of avoiding prefill stalls.
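The transfer-size formula and both boundary cases can be reproduced in a short script. A sketch using the Llama-3-70B figures from the text; the 50 GB/s figure is the InfiniBand bandwidth assumed above:

```python
# KV cache transfer cost for prefill-decode disaggregation:
# bytes = 2 (K and V) * layers * kv_heads * head_dim * prompt_tokens * bytes/elem
def kv_transfer_gb(layers, kv_heads, head_dim, prompt_tokens, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * prompt_tokens * bytes_per_elem / 1e9

# Llama-3-70B with GQA: L=80, n_kv=8, d_h=128, float16
for s in (1, 4_000, 128_000):
    gb = kv_transfer_gb(80, 8, 128, s)
    ms_ib = gb / 50 * 1000          # time over InfiniBand at 50 GB/s
    print(f"{s:>7}-token prompt: {gb:6.2f} GB -> {ms_ib:7.1f} ms over 50 GB/s")
```

The 4,000-token case lands on the ~1.31 GB / ~26ms numbers above, and the 128,000-token case shows the near-one-second transfer that makes disaggregation costly for extreme contexts.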
Throughput vs Latency: The Serving Tradeoff
When evaluating an LLM serving system, three metrics dominate:
- Time to First Token (TTFT): how long from when the user sends a request until the first output token appears. This is dominated by prefill time (processing the entire input prompt), plus any queuing delay.
- Time Per Output Token (TPOT): the inter-token latency during decode — how long between successive output tokens. This is what determines the "streaming speed" the user perceives. A TPOT of 30ms means roughly 33 tokens per second, which feels fast and fluid. A TPOT of 100ms (10 tokens/second) feels sluggish.
- Throughput: total tokens generated per second across all requests. This is the metric that determines how many users the system can serve simultaneously and, ultimately, the cost per token.
These three metrics are in fundamental tension. Every serving decision that improves one tends to hurt another. The clearest example is batch size. With a larger batch, more requests share each weight-loading operation, so total throughput goes up (more tokens per second across all users). But each individual decode step takes longer, because the GPU must read more KV cache entries and do more work per step, so TPOT goes up (each user sees slower token streaming).
We can express this more precisely. With batch size $B$, the throughput of the system is:

$$\text{Throughput}(B) = \frac{B}{t_{\text{token}}(B)}$$
where $t_{\text{token}}(B)$ is the time per decode step when the batch contains $B$ active requests. This function is not linear in $B$. At small batch sizes, the workload is memory-bandwidth-bound: each decode step reads all the model weights regardless of batch size, so $t_{\text{token}}(B)$ barely increases as we add more requests to the batch (the weight-reading cost dominates and is amortised). Throughput grows nearly linearly with $B$ in this regime — the GPU was idle anyway, and now it's doing useful work. At large batch sizes, the workload becomes compute-bound: the matrix multiplications for $B$ tokens actually saturate the GPU's arithmetic units, and $t_{\text{token}}(B)$ starts growing proportionally with $B$. Throughput flattens out — adding more requests just makes everyone slower without increasing total output.
Let's check the boundaries. When $B = 1$, we have minimum throughput but also minimum TPOT — a single user gets the GPU's full attention. When $B \to \infty$, throughput saturates at the GPU's peak FLOPS divided by the FLOPs per token, and TPOT grows without bound (each user's tokens take longer and longer because they're sharing the GPU with an ever-growing crowd). The optimal operating point is somewhere in between: the batch size where throughput is high but TPOT still meets the SLA. Finding this operating point is one of the most important tuning decisions in production serving.
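The two regimes can be made concrete with a toy model of $t_{\text{token}}(B)$: a flat memory-bound floor (reading the weights) plus a per-request compute term that eventually dominates. The constants below are illustrative, not measured on any real GPU:

```python
# Toy model of the batch-size tradeoff. t_token(B) is roughly flat while
# memory-bound (weight reads dominate), then grows linearly once the batch
# saturates the arithmetic units. All constants are illustrative.
def t_token_ms(B, t_weights=25.0, t_compute_per_req=0.5, t_kv_per_req=0.05):
    return max(t_weights, B * t_compute_per_req) + t_kv_per_req * B

for B in (1, 8, 32, 128, 512):
    tpot = t_token_ms(B)
    throughput = B / (tpot / 1000)   # tokens/sec summed across all requests
    print(f"B={B:>3}: TPOT={tpot:6.1f} ms, throughput={throughput:8.0f} tok/s")
```

Running this shows exactly the shape described above: throughput climbs steeply while $B$ is small (TPOT barely moves), then flattens once the compute term takes over, while TPOT keeps growing — so past the knee, larger batches only make every user slower.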
The other major tradeoff involves context length. Longer input prompts mean longer prefill time, which increases TTFT. They also mean larger KV caches per request, which means fewer requests fit in the batch (less GPU memory available for concurrent KV caches), which reduces throughput. And quantisation sits on yet another axis: lower-precision weights transfer faster from memory (improving TPOT) and use less memory (allowing larger batches, improving throughput), but may reduce output quality.
Different applications land in very different spots on this tradeoff surface:
import json, js
rows = [
["Interactive chatbot", "< 500ms", "< 50ms (20+ tok/s)", "Moderate", "Optimise for latency (small batch, fast prefill, disaggregation)"],
["Code completion", "< 200ms", "< 30ms (33+ tok/s)", "Moderate", "Ultra-low latency (speculative decoding, small models)"],
["Batch summarisation", "Don't care", "Don't care", "Maximum", "Optimise for throughput (large batch, high utilisation)"],
["Document analysis (long ctx)", "< 2s", "< 60ms", "Moderate", "Balance prefill cost vs decode speed"],
["Agentic workflows", "< 1s", "< 40ms", "High", "Many short requests; optimise for both TTFT and throughput"],
]
js.window.py_table_data = json.dumps({
"headers": ["Use Case", "TTFT Target", "TPOT Target", "Throughput Priority", "Serving Strategy"],
"rows": rows
})
print("Each use case demands a different balance of TTFT, TPOT, and throughput.")
print("There is no single 'best' configuration — the SLA determines the operating point.")
The Serving Framework Landscape
You don't have to build any of this from scratch. Several open-source frameworks package continuous batching, tensor parallelism, quantisation, and optimised attention kernels into ready-to-deploy serving systems. Each makes different tradeoffs, and understanding those tradeoffs is essential for choosing the right tool.
vLLM (Kwon et al., 2023) is the most widely adopted open-source LLM serving engine. Its signature innovation is PagedAttention (covered in article 4), which manages KV cache memory using virtual-memory-style paging to eliminate fragmentation. vLLM supports continuous batching, tensor parallelism (up to 8-way and beyond), a wide range of quantisation formats (GPTQ, AWQ, FP8, GGUF), and prefix caching for shared system prompts. It's the default choice for most teams: it works out of the box, handles the widest range of models, and has the largest community. Its main weakness is raw per-request latency — the Python-heavy scheduler can add overhead compared to more tightly compiled solutions.
TGI (Text Generation Inference) (HuggingFace, 2023) is HuggingFace's serving solution. Written in Rust with a Python model layer, it integrates tightly with the HuggingFace model hub: point it at a model ID and it handles downloading, sharding, and serving. TGI supports continuous batching, tensor parallelism, quantisation (GPTQ, AWQ, EETQ, bitsandbytes), and Flash Attention. Its strength is ease of deployment, especially for teams already in the HuggingFace ecosystem. Its throughput is competitive with vLLM for most workloads, though it has fewer advanced scheduling options.
TensorRT-LLM (NVIDIA, 2023) is NVIDIA's solution. It compiles models into optimised TensorRT engines with custom CUDA kernels, achieving the lowest per-request latency of any framework on NVIDIA hardware. The compilation step fuses operations, optimises memory layouts, and applies hardware-specific optimisations that general-purpose frameworks can't match. The tradeoff is complexity: models must be explicitly converted (not all architectures are supported out of the box), the build process is slow, and debugging is harder. TensorRT-LLM is the right choice when you need maximum performance on NVIDIA GPUs and can invest the engineering time.
SGLang (Zheng et al., 2024) takes a co-design approach: instead of treating the frontend (how you write prompts and structure LLM calls) and backend (how requests are batched and executed) as separate concerns, it optimises them together. Its key innovation is RadixAttention, which automatically shares KV cache across requests that have common prefixes using a radix tree data structure. If 100 requests share the same system prompt, the system prompt's KV cache is computed once and shared across all 100. SGLang is particularly strong for structured generation tasks (where many requests share long prefixes or follow tree-like branching patterns) and for agentic workloads where the same model is called repeatedly with overlapping contexts.
Ollama / llama.cpp occupy a different niche entirely. llama.cpp is a C/C++ inference engine optimised for running quantised models on CPUs (and Apple Silicon GPUs), using the GGUF quantisation format. Ollama wraps llama.cpp in a user-friendly CLI and API. Neither is designed for production-scale serving with thousands of concurrent users — they lack continuous batching, tensor parallelism across multiple GPUs, and the scheduling sophistication of the frameworks above. But for local development, edge deployment, and single-user inference on consumer hardware, they are unmatched in simplicity.
import json, js
rows = [
["vLLM", "High-throughput serving", "Yes (multi-node)", "GPTQ, AWQ, FP8, GGUF", "Continuous + PagedAttention", "Largest community, widest model support"],
["TGI", "HuggingFace ecosystem", "Yes", "GPTQ, AWQ, EETQ, bnb", "Continuous", "Easy HF Hub integration, Rust scheduler"],
["TensorRT-LLM", "Max per-request speed", "Yes (multi-node)", "FP8, INT8, INT4 (via TRT)", "Continuous (in-flight)", "Compiled kernels, lowest latency on NVIDIA"],
["SGLang", "Structured / agentic", "Yes", "GPTQ, AWQ, FP8", "Continuous + RadixAttention", "KV cache sharing across common prefixes"],
["Ollama / llama.cpp", "Local / edge / dev", "No (single GPU/CPU)", "GGUF (Q4, Q5, Q8, etc.)", "Basic / static", "Runs on CPU, Apple Silicon; simple CLI"],
]
js.window.py_table_data = json.dumps({
"headers": ["Framework", "Best For", "Tensor Parallelism", "Quantisation Support", "Batching Strategy", "Distinguishing Feature"],
"rows": rows
})
print("No single framework wins on every axis.")
print("vLLM is the safe default. TensorRT-LLM if you need lowest latency.")
print("SGLang if you have heavy prefix sharing. Ollama for local development.")
Everything Composes
Throughout this track, we've examined each optimisation in isolation. But the power of these techniques is that they compose. A production LLM serving system doesn't choose between quantisation and continuous batching — it uses both, plus FlashAttention, plus tensor parallelism, plus prefill-decode disaggregation. Each technique targets a different bottleneck, and because the bottlenecks are largely independent, the speedups stack multiplicatively.
Here is the full optimisation stack for a state-of-the-art production serving system, and the bottleneck each layer addresses:
import json, js
rows = [
["1. Model architecture", "GQA (Grouped-Query Attention)", "KV cache size", "Trained-in: fewer KV heads = smaller cache per token"],
["2. Weight compression", "AWQ / GPTQ INT4 quantisation", "Weight memory + bandwidth", "4x less memory, 2-4x faster weight loading"],
["3. Attention kernel", "FlashAttention + FlashDecoding", "Attention compute + memory", "Fused kernel avoids materialising attention matrix"],
["4. KV cache management", "PagedAttention + INT8 cache quantisation", "KV cache memory", "No fragmentation + 2x cache compression"],
["5. Batch scheduling", "Continuous batching", "GPU utilisation", "No idle slots; requests enter/leave per iteration"],
["6. Model parallelism", "Tensor parallelism (within node)", "Single-GPU memory limit", "Split weights across GPUs; near-linear memory scaling"],
["7. Phase scheduling", "Prefill-decode disaggregation", "Prefill-decode interference", "Predictable decode latency; no prefill stalls"],
["8. Generation strategy", "Speculative decoding (optional)", "Autoregressive bottleneck", "Draft model proposes multiple tokens; verified in parallel"],
]
js.window.py_table_data = json.dumps({
"headers": ["Layer", "Technique", "Bottleneck Addressed", "Effect"],
"rows": rows
})
print("All 8 layers are active simultaneously in a production system.")
print("They compose because each targets a different bottleneck:")
print(" - Layers 1-2 reduce what needs to be stored and transferred")
print(" - Layers 3-4 make attention and caching efficient")
print(" - Layers 5-7 scale across requests, GPUs, and phases")
print(" - Layer 8 attacks the sequential token-by-token generation itself")
To appreciate how these multiply, consider a concrete (simplified) example. Start with a naive baseline: a 70B float16 model running with static batching, standard attention, and no quantisation — the model.generate() call you'd write in a notebook. Suppose this achieves 5 tokens/second for a single user, and the setup can handle a batch of 4 before running out of memory.
- INT4 quantisation cuts weight size by 4x, speeding up the memory-bound decode phase by roughly 2-3x (not 4x, because of quantisation/dequantisation overhead). Single-user speed: ~12 tokens/second. Memory freed allows a larger batch.
- Continuous batching + PagedAttention eliminates padding waste and memory fragmentation, allowing a batch of 32 instead of 4. Total throughput: 32 requests at ~10 tokens/second each = ~320 tokens/second total (vs the baseline's ~20).
- Tensor parallelism across 8 GPUs divides weight memory by 8, freeing room for even larger batches (say, 128 concurrent requests) and reducing per-step latency. Total throughput approaches ~1,000+ tokens/second.
- Prefill-decode disaggregation removes latency spikes, ensuring the 50th-percentile TPOT is close to the 99th-percentile — consistent streaming speed even under heavy load.
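The multiplicative stacking in the walkthrough above can be traced step by step. The factors below are the illustrative numbers from the bullets (e.g. 12 vs 5 tok/s after quantisation), not benchmarks, and the tensor-parallelism factor is an assumed rough estimate:

```python
# Sketch of how the layer speedups compound, using the illustrative
# figures from the walkthrough above (not measured benchmarks).
baseline_tps = 5 * 4        # 5 tok/s per user * batch of 4 = 20 tok/s total
steps = [
    ("INT4 quantisation", 12 / 5),                       # ~2.4x per-token decode speed
    ("continuous batching (batch 4 -> 32)", (32 / 4) * (10 / 12)),  # 8x batch, slightly slower per user
    ("tensor parallelism (batch 32 -> 128)", (128 / 32) * 0.8),     # assumed: 4x batch, some per-step cost
]
tps = baseline_tps
for name, factor in steps:
    tps *= factor
    print(f"after {name}: ~{tps:.0f} tok/s total")
```

The product lands near the ~320 and ~1,000+ tokens/second milestones quoted above — a roughly 50x throughput gain over the baseline without changing a single model weight.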
The gap between "model.generate() on a single GPU" and "production serving at 10,000 requests per second" is not about a different model or a different architecture. The model weights are identical. It's entirely about the serving system: how memory is managed, how requests are scheduled, how the model is distributed across hardware, and how the two fundamentally different phases of inference are orchestrated. Every technique in this track contributes one piece of that puzzle.
Understanding these layers doesn't just help you deploy models — it helps you understand why deployed models behave the way they do. Why does time-to-first-token increase with prompt length? Prefill is compute-bound and scales with input size. Why does streaming slow down when the service is under heavy load? The batch size grew, increasing per-step decode time. Why does a 70B model respond faster than expected? It's likely quantised to INT4 and running on 8 GPUs with tensor parallelism. The serving stack is not a black box — it's a set of well-understood engineering decisions, each motivated by a specific bottleneck analysis.
Quiz
Test your understanding of production LLM serving at scale.
Why is tensor parallelism (TP) preferred over pipeline parallelism (PP) for latency-sensitive LLM serving?
What is the main problem that prefill-decode disaggregation solves?
As batch size $B$ increases during decode, what happens to per-user TPOT and total throughput?
Which serving framework's key innovation is RadixAttention for sharing KV cache across requests with common prefixes?