The Optimisation Playbook
Articles 1–5 covered GPU hardware, the CUDA programming model, the software stack, and performance analysis. This article connects those concepts to practical optimisation — the techniques that kernel authors (and compilers like Triton) use to write fast GPU code. If the previous articles built the mental model, this one puts it to work.
The optimisation playbook rests on four pillars:
- Maximise arithmetic intensity — do more compute per byte transferred. This pushes your kernel toward the compute-bound side of the roofline, where it can actually use the FLOPS the hardware provides.
- Maximise occupancy — keep enough warps active to hide memory and instruction latency. When one warp stalls on a memory request, another warp should be ready to execute.
- Use memory efficiently — coalesce global memory accesses, tile data into shared memory, and minimise host-device transfers. Memory bandwidth is almost always the bottleneck.
- Minimise overhead — fuse kernels to reduce launch overhead, use CUDA Graphs for inference, and avoid unnecessary host-device synchronisation points.
Each of the following sections dives into one of these techniques. Together, they represent the toolkit that frameworks like PyTorch, compilers like Triton, and libraries like cuBLAS use under the hood to extract maximum performance from the hardware.
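Before diving in, the first pillar can be made concrete with a back-of-envelope calculation. Using the approximate H100 figures cited later in this article (~990 TFLOPS FP16, ~3.35 TB/s HBM bandwidth), the roofline's ridge point — the arithmetic intensity at which a kernel stops being memory-bound — falls out directly:

```python
# Roofline ridge point: the arithmetic intensity (FLOPs/byte) at which a
# kernel stops being memory-bound and starts being compute-bound.
peak_flops = 990e12   # H100 FP16 Tensor Core throughput (FLOPs/s), approximate
hbm_bw = 3.35e12      # H100 HBM bandwidth (bytes/s), approximate

ridge = peak_flops / hbm_bw  # FLOPs per byte needed to saturate compute

# A simple element-wise op (e.g. FP16 ReLU): 1 FLOP per element,
# 2 bytes read + 2 bytes written per element.
relu_intensity = 1 / 4

print(f"Ridge point: {ridge:.0f} FLOPs/byte")
print(f"FP16 ReLU intensity: {relu_intensity:.2f} FLOPs/byte (memory-bound)")
```

An element-wise op sits two to three orders of magnitude below the ridge, which is why the techniques below focus so heavily on reducing memory traffic.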
Kernel Fusion: The Biggest Win
Kernel fusion is arguably the single most impactful optimisation for deep learning workloads. The idea is straightforward: instead of launching separate kernels for each operation — say, a matrix multiplication, a bias addition, and a ReLU activation — you fuse them into a single kernel that performs all three operations in one pass over the data.
Why does fusion help so much? Three reasons:
- Fewer memory round-trips: without fusion, intermediate results must be written to HBM after each kernel and read back for the next one. With fusion, those intermediates stay in registers or shared memory — never touching HBM at all. For a 3-operation sequence, this can cut HBM traffic by roughly 2/3.
- Fewer kernel launches: each kernel launch incurs approximately 5–10 μs of CPU-side overhead (argument validation, driver calls, GPU scheduling). Fusing 10 tiny kernels into 1 saves around 50 μs — which matters a great deal when your model runs thousands of such sequences per second.
- Higher arithmetic intensity: the fused kernel does more FLOPs per byte loaded from HBM, moving it closer to (or past) the roofline's ridge point. A kernel that was memory-bound as three separate launches may become compute-bound when fused.
Let's quantify the memory savings for a concrete example: a fused matmul + bias + ReLU versus three separate kernels.
batch, dim = 1024, 4096
elem_size = 2     # bytes per FP16 element
hbm_bw = 3.35e12  # H100 HBM bandwidth (bytes/s)

# Unfused: 3 separate kernels
# Op 1: matmul (read W, X; write Y)
# Op 2: add bias (read Y, bias; write Z)
# Op 3: relu (read Z; write out)
unfused_reads = (batch * dim + dim * dim + batch * dim + dim + batch * dim) * elem_size
unfused_writes = (batch * dim + batch * dim + batch * dim) * elem_size
unfused_total = unfused_reads + unfused_writes

# Fused: 1 kernel
# Read W, X, bias once; write out once
fused_reads = (batch * dim + dim * dim + dim) * elem_size
fused_writes = batch * dim * elem_size
fused_total = fused_reads + fused_writes

savings = 1 - fused_total / unfused_total

# Collect rows for an aligned comparison
rows = [
    ("Unfused (3 kernels)", unfused_total),
    ("Fused (1 kernel)", fused_total),
]
w_name = max(len(r[0]) for r in rows)
w_mb = max(len(f"{r[1] / 1e6:.1f}") for r in rows)
w_ms = max(len(f"{r[1] / hbm_bw * 1000:.3f}") for r in rows)

print("Kernel Fusion: matmul + bias + relu")
print(f"  Shape: ({batch}, {dim}) × ({dim}, {dim})")
print()
for name, total in rows:
    mb = total / 1e6
    ms = total / hbm_bw * 1000
    print(f"  {name:<{w_name}}  HBM traffic: {mb:>{w_mb}.1f} MB   Time at peak BW: {ms:>{w_ms}.3f} ms")
print()
print(f"  Memory traffic reduction: {savings:.0%}")
print("  This is why torch.compile exists — it fuses operations automatically.")
The numbers speak for themselves. Fusion doesn't change the amount of useful compute — the same FLOPs happen either way — but it dramatically reduces the memory traffic needed to perform them. For memory-bound operations (which most element-wise and reduction operations are), this is the difference between a kernel that bottlenecks on bandwidth and one that runs near peak throughput.
Tiled Matrix Multiplication
Large matrix multiplications cannot fit entirely in registers or shared memory — the matrices are simply too big. The standard approach is tiling: load a small block (tile) of each input matrix into shared memory, compute the partial result for that tile, accumulate it in registers, and repeat until the full computation is done.
Here is the conceptual structure of a tiled matmul kernel:
# Pseudocode for tiled matmul C = A @ B
# A is (M, K), B is (K, N), C is (M, N)
# Tile sizes: BLOCK_M × BLOCK_K and BLOCK_K × BLOCK_N
C_partial = 0  # accumulator lives in registers
for tile_k in range(0, K, BLOCK_K):
    # 1. Load tiles from global memory (HBM) → shared memory
    A_tile = A[row_start:row_end, tile_k:tile_k+BLOCK_K]  # → shared mem
    B_tile = B[tile_k:tile_k+BLOCK_K, col_start:col_end]  # → shared mem
    __syncthreads()  # wait for all threads to finish loading

    # 2. Compute partial matmul from shared memory (fast!)
    C_partial += A_tile @ B_tile  # stays in registers
    __syncthreads()  # wait before loading the next tile

# 3. Write accumulated result to global memory
C[row_start:row_end, col_start:col_end] = C_partial
Why does tiling work so well? Shared memory has roughly 5-cycle latency compared to approximately 300 cycles for HBM. By loading a tile once into shared memory and reusing it for many multiply-accumulate operations, we drastically increase arithmetic intensity. A $K \times K$ tile is loaded once but used for $K$ MAC operations per element, giving an intensity that grows proportionally with the tile dimension.
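That scaling is easy to verify with a few lines of arithmetic. The sketch below (using a hypothetical square tile of edge BLOCK, in FP16) shows intensity growing linearly with the tile edge:

```python
# Arithmetic intensity of one square-tile step of a tiled matmul in FP16.
# Loading a BLOCK x BLOCK tile of A and of B moves 2 * BLOCK^2 elements;
# multiplying them performs 2 * BLOCK^3 FLOPs (BLOCK^3 multiply-accumulates).
elem_size = 2  # bytes per FP16 element

for block in (16, 32, 64, 128):
    bytes_moved = 2 * block * block * elem_size
    flops = 2 * block ** 3
    intensity = flops / bytes_moved  # simplifies to block / elem_size
    print(f"BLOCK={block:>3}: {intensity:.0f} FLOPs/byte")
```

Doubling the tile edge doubles the FLOPs performed per byte loaded, which is why kernels use the largest tiles that registers and shared memory can hold.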
The outer loop over tile_k sweeps across the K dimension in chunks of BLOCK_K. At each step, both tiles are loaded cooperatively by all threads in the block, a __syncthreads() barrier ensures the loads are complete, and then every thread computes its portion of the partial product using the fast shared memory. The result accumulates in registers across all iterations, and only the final result is written back to HBM.
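The pseudocode maps directly onto a runnable (deliberately naive) NumPy version. Shared memory and __syncthreads() have no host-side analogue, so the tiles here are just array slices, but the loop structure and the register-resident accumulator are the same:

```python
import numpy as np

def tiled_matmul(A, B, block_m=32, block_n=32, block_k=32):
    """Tiled C = A @ B, mirroring the kernel's loop structure on the CPU."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for row in range(0, M, block_m):
        for col in range(0, N, block_n):
            # Plays the role of the register accumulator C_partial.
            C_partial = np.zeros(
                (min(block_m, M - row), min(block_n, N - col)), dtype=A.dtype
            )
            for k in range(0, K, block_k):
                A_tile = A[row:row + block_m, k:k + block_k]  # "shared mem"
                B_tile = B[k:k + block_k, col:col + block_n]  # "shared mem"
                C_partial += A_tile @ B_tile
            C[row:row + block_m, col:col + block_n] = C_partial
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((96, 80)).astype(np.float32)
B = rng.standard_normal((80, 64)).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

On a GPU the tile loads are cooperative and the two barriers are essential; the CPU sketch only illustrates the blocking and accumulation pattern.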
Mixed Precision: More Compute, Less Memory
Modern GPUs are dramatically faster at FP16/BF16 math than FP32. The H100, for example, delivers approximately 990 TFLOPS at FP16 but only around 67 TFLOPS at FP32 — a roughly 15× difference. Mixed precision training exploits this gap by using lower-precision arithmetic where it is safe to do so, while retaining full precision where it matters.
The standard mixed precision recipe has three parts:
- Store weights in FP32 — the master copy of each parameter stays in full precision. This ensures that small gradient updates (which might be lost in FP16's limited precision) are accumulated accurately over many training steps.
- Cast to FP16/BF16 for forward and backward passes — this is where the speed comes from. Tensor Cores operate on FP16/BF16 inputs, and the reduced data size means less memory traffic (FP16 is half the bytes of FP32, effectively doubling memory bandwidth).
- Accumulate in FP32 — Tensor Cores natively perform multiply-accumulate operations with FP16 inputs and FP32 accumulators. This preserves numerical stability during the large reductions that happen in matrix multiplications.
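The third point is easy to demonstrate with NumPy (a sketch of the rounding effect, not a Tensor Core simulation): summing many small FP16 values in an FP16 accumulator stalls once the running total is large enough that each addend falls below half a unit in the last place, while an FP32 accumulator is unaffected.

```python
import numpy as np

values = np.full(10_000, 1e-3, dtype=np.float16)  # 10,000 copies of ~0.001

fp16_sum = np.float16(0)
for v in values:
    fp16_sum = np.float16(fp16_sum + v)  # accumulate in FP16: rounds each step

fp32_sum = np.sum(values, dtype=np.float32)  # FP16 inputs, FP32 accumulator

print(f"FP16 accumulator: {float(fp16_sum):.3f}")  # far below the true ~10.0
print(f"FP32 accumulator: {float(fp32_sum):.3f}")  # ~10.0, as expected
```

Matrix multiplications perform exactly this kind of long reduction along the K dimension, which is why FP32 accumulation is non-negotiable.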
The benefits are twofold: faster compute (Tensor Cores operate on FP16/BF16 data) and lower memory traffic (half the bytes means, in effect, double the usable memory bandwidth). In PyTorch, enabling mixed precision is straightforward:
# PyTorch automatic mixed precision
scaler = torch.amp.GradScaler()  # create once, outside the training loop

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)             # forward in FP16
    loss = criterion(output, target)

scaler.scale(loss).backward()  # backward in FP16, on the scaled loss
scaler.step(optimizer)         # unscale gradients, update in FP32
scaler.update()                # adjust the scale factor for the next step
The GradScaler deserves a brief explanation. FP16 has a limited dynamic range — roughly $6 \times 10^{-8}$ to $65504$ — which means small gradients can underflow to zero during the backward pass. The GradScaler solves this by multiplying the loss by a large scale factor before calling .backward(), which keeps the gradient values within FP16's representable range. Before the optimizer step, the gradients are divided by the same scale factor to restore their true magnitude. If overflow is detected (any gradient becomes inf or NaN), that step is skipped entirely and the scale factor is reduced for the next iteration. Over time, the scaler finds a stable scale factor that avoids both underflow and overflow.
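That control loop can be sketched in a few lines of plain Python. This is a toy model of the behaviour just described, not PyTorch's implementation; the growth and backoff constants are illustrative:

```python
import math

class ToyGradScaler:
    """Toy dynamic loss scaling: grow the scale while steps succeed,
    halve it whenever a scaled gradient overflows to inf/NaN."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, grads):
        """grads: scaled gradients. Returns unscaled grads, or None to skip."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale *= self.backoff_factor  # overflow: shrink and skip
            self._good_steps = 0
            return None
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # stable: try a larger scale
            self._good_steps = 0
        return [g / self.scale for g in grads]  # restore true magnitude

scaler = ToyGradScaler()
assert scaler.step([float('inf')]) is None  # overflow: step is skipped
assert scaler.scale == 32768.0              # scale was halved
assert scaler.step([32768.0]) == [1.0]      # unscale restores true magnitude
```

The real GradScaler applies the same policy tensor-wise on the GPU, but the skip-and-halve / grow-when-stable logic is the whole idea.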
CUDA Graphs: Amortising Launch Overhead
Each kernel launch incurs approximately 5–10 μs of CPU-side overhead: argument validation, driver calls, and GPU scheduling. For large kernels that run for milliseconds, this overhead is negligible. But for inference workloads with many small kernels — attention projections, layer norms, bias additions, activation functions — launch overhead can become a significant fraction of total execution time.
CUDA Graphs solve this by recording a sequence of kernel launches once, capturing the entire dependency graph, and then replaying the whole sequence with a single launch command. The CPU overhead drops from $N \times 5\mu s$ to $1 \times 5\mu s$ — a substantial win when $N$ is in the hundreds.
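The arithmetic behind that claim, using the ~5 μs per-launch figure from above and an illustrative kernel count:

```python
# CPU-side launch overhead, with vs without CUDA Graphs.
launch_overhead = 5e-6  # seconds per kernel launch (approximate)
n_kernels = 200         # kernels per forward pass (illustrative)

without_graphs = n_kernels * launch_overhead
with_graphs = 1 * launch_overhead  # one replay launches the whole graph

print(f"Without graphs: {without_graphs * 1e6:.0f} us of launch overhead")
print(f"With graphs:    {with_graphs * 1e6:.0f} us")
print(f"Saved per pass: {(without_graphs - with_graphs) * 1e6:.0f} us")
```

For a model whose kernels each run only a few microseconds, that saved millisecond per pass can be a double-digit percentage of total latency.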
Here is the typical usage pattern in PyTorch:
# Record a CUDA Graph (after a few warm-up iterations of the model)
static_input = torch.randn(32, 512, device='cuda')

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)  # capture fixes the output buffer

# Replay — no CPU-side kernel launch overhead
for batch in dataloader:
    static_input.copy_(batch)       # copy new data into the recorded buffer
    g.replay()                      # replays all kernels in one go
    result = static_output.clone()  # copy out before the next replay overwrites it
Notice that the input and output buffers are fixed — during recording, CUDA captures the exact memory addresses used, and during replay it reuses those same addresses. New data is fed in by copying into the captured input buffer with .copy_(), and the result appears in the captured output buffer after .replay().
Connecting Back to PyTorch
Everything in this track comes together in torch.compile. When you call torch.compile(model), TorchDynamo traces your Python code, TorchInductor lowers it to optimised kernels, and the resulting compiled model incorporates many of the techniques we've discussed:
- Kernel fusion (article 4, this article): TorchInductor fuses element-wise operations, reductions, and pointwise ops into combined Triton kernels, eliminating intermediate HBM round-trips.
- Tiled matmul: cuBLAS handles standard matrix multiplications with heavily optimised tiled kernels; Triton generates tiled kernels for fused operations that include matmuls.
- Mixed precision: torch.autocast enables FP16/BF16 computation with FP32 accumulation, unlocking Tensor Core throughput and halving memory traffic.
- CUDA Graphs: torch.cuda.CUDAGraph captures and replays kernel sequences, eliminating per-kernel launch overhead for fixed-shape inference workloads.
- Memory coalescing, shared memory, warp management: Triton handles these automatically when generating kernels from TorchInductor's output. The programmer specifies what to compute; Triton decides how to map it onto the hardware.
The GPU track has given you the mental model to understand why these optimisations work. Kernel fusion reduces HBM traffic, moving memory-bound operations toward higher arithmetic intensity. Mixed precision doubles effective bandwidth and enables Tensor Cores. CUDA Graphs eliminate launch overhead. Tiling puts data in shared memory where it can be reused many times before being evicted. None of these are magic — they are direct consequences of the hardware architecture covered in articles 1 through 5.
When someone says a model is "optimised for GPU," what they really mean is that the code is structured to exploit these properties: high arithmetic intensity, high occupancy, efficient memory access patterns, and minimal overhead. Now you know what each of those means — and, more importantly, why each one matters.
Quiz
Test your understanding of practical GPU optimisation techniques — kernel fusion, tiled matmul, mixed precision, and CUDA Graphs.
Why does kernel fusion improve performance for memory-bound operations?
In tiled matrix multiplication, why is data loaded into shared memory before computing?
What two benefits does mixed precision (FP16/BF16) provide?
When are CUDA Graphs most beneficial?