The Optimisation Playbook
Articles 1–5 covered GPU hardware, the CUDA programming model, the software stack, and performance analysis. This article connects those concepts to practical optimisation — the techniques that kernel authors (and compilers like Triton) use to write fast GPU code. If the previous articles built the mental model, this one puts it to work.
The optimisation playbook rests on four pillars:
- Maximise arithmetic intensity — do more compute per byte transferred. This pushes your kernel toward the compute-bound side of the roofline, where it can actually use the FLOPS the hardware provides.
- Maximise occupancy — keep enough warps active to hide memory and instruction latency. When one warp stalls on a memory request, another warp should be ready to execute.
- Use memory efficiently — coalesce global memory accesses, tile data into shared memory, and minimise host-device transfers. Memory bandwidth is almost always the bottleneck.
- Minimise overhead — fuse kernels to reduce launch overhead, use CUDA Graphs for inference, and avoid unnecessary host-device synchronisation points.
Each of the following sections dives into one of these techniques. Together, they represent the toolkit that frameworks like PyTorch, compilers like Triton, and libraries like cuBLAS use under the hood to extract maximum performance from the hardware.
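Before diving in, the first pillar can be made concrete with a back-of-envelope calculation. Using the approximate H100 figures cited later in this article (~990 TFLOPS FP16, ~3.35 TB/s HBM bandwidth), the roofline's ridge point — the arithmetic intensity at which a kernel stops being memory-bound — falls out directly:

```python
# Roofline ridge point: the arithmetic intensity (FLOPs/byte) at which a
# kernel stops being memory-bound and starts being compute-bound.
peak_flops = 990e12   # H100 FP16 Tensor Core throughput (FLOPs/s), approximate
hbm_bw = 3.35e12      # H100 HBM bandwidth (bytes/s), approximate

ridge = peak_flops / hbm_bw  # FLOPs per byte needed to saturate compute

# A simple element-wise op (e.g. FP16 ReLU): 1 FLOP per element,
# 2 bytes read + 2 bytes written per element.
relu_intensity = 1 / 4

print(f"Ridge point: {ridge:.0f} FLOPs/byte")
print(f"FP16 ReLU intensity: {relu_intensity:.2f} FLOPs/byte (memory-bound)")
```

An element-wise op sits two to three orders of magnitude below the ridge, which is why the techniques below focus so heavily on reducing memory traffic.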
Kernel Fusion: The Biggest Win
Kernel fusion is arguably the single most impactful optimisation for deep learning workloads. The idea is straightforward: instead of launching separate kernels for each operation — say, a matrix multiplication, a bias addition, and a ReLU activation — you fuse them into a single kernel that performs all three operations in one pass over the data.
Why does fusion help so much? Three reasons:
- Fewer memory round-trips: without fusion, intermediate results must be written to HBM after each kernel and read back for the next one. With fusion, those intermediates stay in registers or shared memory — never touching HBM at all. For a 3-operation sequence, this can cut HBM traffic by roughly 2/3.
- Fewer kernel launches: each kernel launch incurs approximately 5–10 μs of CPU-side overhead (argument validation, driver calls, GPU scheduling). Fusing 10 tiny kernels into 1 saves around 50 μs — which matters a great deal when your model runs thousands of such sequences per second.
- Higher arithmetic intensity: the fused kernel does more FLOPs per byte loaded from HBM, moving it closer to (or past) the roofline's ridge point. A kernel that was memory-bound as three separate launches may become compute-bound when fused.
Let's quantify the memory savings for a concrete example: a fused matmul + bias + ReLU versus three separate kernels.
batch, dim = 1024, 4096
elem_size = 2     # bytes per FP16 element
hbm_bw = 3.35e12  # H100 HBM bandwidth (bytes/s)

# Unfused: 3 separate kernels
# Op 1: matmul (read W, X; write Y)
# Op 2: add bias (read Y, bias; write Z)
# Op 3: relu (read Z; write out)
unfused_reads = (batch * dim + dim * dim + batch * dim + dim + batch * dim) * elem_size
unfused_writes = (batch * dim + batch * dim + batch * dim) * elem_size
unfused_total = unfused_reads + unfused_writes

# Fused: 1 kernel
# Read W, X, bias once; write out once
fused_reads = (batch * dim + dim * dim + dim) * elem_size
fused_writes = batch * dim * elem_size
fused_total = fused_reads + fused_writes

savings = 1 - fused_total / unfused_total

# Collect rows for an aligned comparison
rows = [
    ("Unfused (3 kernels)", unfused_total),
    ("Fused (1 kernel)", fused_total),
]
w_name = max(len(r[0]) for r in rows)
w_mb = max(len(f"{r[1] / 1e6:.1f}") for r in rows)
w_ms = max(len(f"{r[1] / hbm_bw * 1000:.3f}") for r in rows)

print("Kernel Fusion: matmul + bias + relu")
print(f"  Shape: ({batch}, {dim}) × ({dim}, {dim})")
print()
for name, total in rows:
    mb = total / 1e6
    ms = total / hbm_bw * 1000
    print(f"  {name:<{w_name}}  HBM traffic: {mb:>{w_mb}.1f} MB   Time at peak BW: {ms:>{w_ms}.3f} ms")
print()
print(f"  Memory traffic reduction: {savings:.0%}")
print("  This is why torch.compile exists — it fuses operations automatically.")
The numbers speak for themselves. Fusion doesn't change the amount of useful compute — the same FLOPs happen either way — but it dramatically reduces the memory traffic needed to perform them. For memory-bound operations (which most element-wise and reduction operations are), this is the difference between a kernel that bottlenecks on bandwidth and one that runs near peak throughput.
Tiled Matrix Multiplication
Large matrix multiplications cannot fit entirely in registers or shared memory — the matrices are simply too big. The standard approach is tiling: load a small block (tile) of each input matrix into shared memory, compute the partial result for that tile, accumulate it in registers, and repeat until the full computation is done.
Here is the conceptual structure of a tiled matmul kernel:
# Pseudocode for tiled matmul C = A @ B
# A is (M, K), B is (K, N), C is (M, N)
# Tile sizes: BLOCK_M × BLOCK_K and BLOCK_K × BLOCK_N
C_partial = 0  # accumulator lives in registers
for tile_k in range(0, K, BLOCK_K):
    # 1. Load tiles from global memory (HBM) → shared memory
    A_tile = A[row_start:row_end, tile_k:tile_k+BLOCK_K]  # → shared mem
    B_tile = B[tile_k:tile_k+BLOCK_K, col_start:col_end]  # → shared mem
    __syncthreads()  # wait for all threads to finish loading

    # 2. Compute partial matmul from shared memory (fast!)
    C_partial += A_tile @ B_tile  # stays in registers
    __syncthreads()  # wait before loading the next tile

# 3. Write accumulated result to global memory
C[row_start:row_end, col_start:col_end] = C_partial
Why does tiling work so well? Shared memory has roughly 5-cycle latency compared to approximately 300 cycles for HBM. By loading a tile once into shared memory and reusing it for many multiply-accumulate operations, we drastically increase arithmetic intensity. A $K \times K$ tile is loaded once but used for $K$ MAC operations per element, giving an intensity that grows proportionally with the tile dimension.
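That scaling is easy to verify with a few lines of arithmetic. The sketch below (using a hypothetical square tile of edge BLOCK, in FP16) shows intensity growing linearly with the tile edge:

```python
# Arithmetic intensity of one square-tile step of a tiled matmul in FP16.
# Loading a BLOCK x BLOCK tile of A and of B moves 2 * BLOCK^2 elements;
# multiplying them performs 2 * BLOCK^3 FLOPs (BLOCK^3 multiply-accumulates).
elem_size = 2  # bytes per FP16 element

for block in (16, 32, 64, 128):
    bytes_moved = 2 * block * block * elem_size
    flops = 2 * block ** 3
    intensity = flops / bytes_moved  # simplifies to block / elem_size
    print(f"BLOCK={block:>3}: {intensity:.0f} FLOPs/byte")
```

Doubling the tile edge doubles the FLOPs performed per byte loaded, which is why kernels use the largest tiles that registers and shared memory can hold.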
The outer loop over tile_k sweeps across the K dimension in chunks of BLOCK_K. At each step, both tiles are loaded cooperatively by all threads in the block, a __syncthreads() barrier ensures the loads are complete, and then every thread computes its portion of the partial product using the fast shared memory. The result accumulates in registers across all iterations, and only the final result is written back to HBM.
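The pseudocode maps directly onto a runnable (deliberately naive) NumPy version. Shared memory and __syncthreads() have no host-side analogue, so the tiles here are just array slices, but the loop structure and the register-resident accumulator are the same:

```python
import numpy as np

def tiled_matmul(A, B, block_m=32, block_n=32, block_k=32):
    """Tiled C = A @ B, mirroring the kernel's loop structure on the CPU."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for row in range(0, M, block_m):
        for col in range(0, N, block_n):
            # Plays the role of the register accumulator C_partial.
            C_partial = np.zeros(
                (min(block_m, M - row), min(block_n, N - col)), dtype=A.dtype
            )
            for k in range(0, K, block_k):
                A_tile = A[row:row + block_m, k:k + block_k]  # "shared mem"
                B_tile = B[k:k + block_k, col:col + block_n]  # "shared mem"
                C_partial += A_tile @ B_tile
            C[row:row + block_m, col:col + block_n] = C_partial
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((96, 80)).astype(np.float32)
B = rng.standard_normal((80, 64)).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

On a GPU the tile loads are cooperative and the two barriers are essential; the CPU sketch only illustrates the blocking and accumulation pattern.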
Mixed Precision: More Compute, Less Memory
Modern GPUs are dramatically faster at FP16/BF16 math than FP32. The H100, for example, delivers approximately 990 TFLOPS at FP16 but only around 67 TFLOPS at FP32 — a roughly 15× difference. Mixed precision training exploits this gap by using lower-precision arithmetic where it is safe to do so, while retaining full precision where it matters.
The standard mixed precision recipe has three parts:
- Store weights in FP32 — the master copy of each parameter stays in full precision. This ensures that small gradient updates (which might be lost in FP16's limited precision) are accumulated accurately over many training steps.
- Cast to FP16/BF16 for forward and backward passes — this is where the speed comes from. Tensor Cores operate on FP16/BF16 inputs, and the reduced data size means less memory traffic (FP16 is half the bytes of FP32, effectively doubling memory bandwidth).
- Accumulate in FP32 — Tensor Cores natively perform multiply-accumulate operations with FP16 inputs and FP32 accumulators. This preserves numerical stability during the large reductions that happen in matrix multiplications.
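The third point is easy to demonstrate with NumPy (a sketch of the rounding effect, not a Tensor Core simulation): summing many small FP16 values in an FP16 accumulator stalls once the running total is large enough that each addend falls below half a unit in the last place, while an FP32 accumulator is unaffected.

```python
import numpy as np

values = np.full(10_000, 1e-3, dtype=np.float16)  # 10,000 copies of ~0.001

fp16_sum = np.float16(0)
for v in values:
    fp16_sum = np.float16(fp16_sum + v)  # accumulate in FP16: rounds each step

fp32_sum = np.sum(values, dtype=np.float32)  # FP16 inputs, FP32 accumulator

print(f"FP16 accumulator: {float(fp16_sum):.3f}")  # far below the true ~10.0
print(f"FP32 accumulator: {float(fp32_sum):.3f}")  # ~10.0, as expected
```

Matrix multiplications perform exactly this kind of long reduction along the K dimension, which is why FP32 accumulation is non-negotiable.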
The benefits are twofold: faster compute (Tensor Cores operate on FP16/BF16 data) and lower memory traffic (half the bytes means, in effect, double the usable memory bandwidth). In PyTorch, enabling mixed precision is straightforward:
# PyTorch automatic mixed precision
scaler = torch.amp.GradScaler()  # create once, outside the training loop

with torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model(input)             # forward in FP16
    loss = criterion(output, target)

scaler.scale(loss).backward()  # backward in FP16, on the scaled loss
scaler.step(optimizer)         # unscale gradients, update in FP32
scaler.update()                # adjust the scale factor for the next step
The GradScaler deserves a brief explanation. FP16 has a limited dynamic range — roughly $6 \times 10^{-8}$ to $65504$ — which means small gradients can underflow to zero during the backward pass. The GradScaler solves this by multiplying the loss by a large scale factor before calling .backward(), which keeps the gradient values within FP16's representable range. Before the optimizer step, the gradients are divided by the same scale factor to restore their true magnitude. If overflow is detected (any gradient becomes inf or NaN), that step is skipped entirely and the scale factor is reduced for the next iteration. Over time, the scaler finds a stable scale factor that avoids both underflow and overflow.
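That control loop can be sketched in a few lines of plain Python. This is a toy model of the behaviour just described, not PyTorch's implementation; the growth and backoff constants are illustrative:

```python
import math

class ToyGradScaler:
    """Toy dynamic loss scaling: grow the scale while steps succeed,
    halve it whenever a scaled gradient overflows to inf/NaN."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, grads):
        """grads: scaled gradients. Returns unscaled grads, or None to skip."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale *= self.backoff_factor  # overflow: shrink and skip
            self._good_steps = 0
            return None
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor   # stable: try a larger scale
            self._good_steps = 0
        return [g / self.scale for g in grads]  # restore true magnitude

scaler = ToyGradScaler()
assert scaler.step([float('inf')]) is None  # overflow: step is skipped
assert scaler.scale == 32768.0              # scale was halved
assert scaler.step([32768.0]) == [1.0]      # unscale restores true magnitude
```

The real GradScaler applies the same policy tensor-wise on the GPU, but the skip-and-halve / grow-when-stable logic is the whole idea.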
CUDA Graphs: Amortising Launch Overhead
Each kernel launch incurs approximately 5–10 μs of CPU-side overhead: argument validation, driver calls, and GPU scheduling. For large kernels that run for milliseconds, this overhead is negligible. But for inference workloads with many small kernels — attention projections, layer norms, bias additions, activation functions — launch overhead can become a significant fraction of total execution time.
CUDA Graphs solve this by recording a sequence of kernel launches once, capturing the entire dependency graph, and then replaying the whole sequence with a single launch command. The CPU overhead drops from $N \times 5\mu s$ to $1 \times 5\mu s$ — a substantial win when $N$ is in the hundreds.
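The arithmetic behind that claim, using the ~5 μs per-launch figure from above and an illustrative kernel count:

```python
# CPU-side launch overhead, with vs without CUDA Graphs.
launch_overhead = 5e-6  # seconds per kernel launch (approximate)
n_kernels = 200         # kernels per forward pass (illustrative)

without_graphs = n_kernels * launch_overhead
with_graphs = 1 * launch_overhead  # one replay launches the whole graph

print(f"Without graphs: {without_graphs * 1e6:.0f} us of launch overhead")
print(f"With graphs:    {with_graphs * 1e6:.0f} us")
print(f"Saved per pass: {(without_graphs - with_graphs) * 1e6:.0f} us")
```

For a model whose kernels each run only a few microseconds, that saved millisecond per pass can be a double-digit percentage of total latency.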
Here is the typical usage pattern in PyTorch:
# Record a CUDA Graph (after a few warm-up iterations of the model)
static_input = torch.randn(32, 512, device='cuda')

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)  # capture fixes the output buffer

# Replay — no CPU-side kernel launch overhead
for batch in dataloader:
    static_input.copy_(batch)       # copy new data into the recorded buffer
    g.replay()                      # replays all kernels in one go
    result = static_output.clone()  # copy out before the next replay overwrites it
Notice that the input and output buffers are fixed — during recording, CUDA captures the exact memory addresses used, and during replay it reuses those same addresses. New data is fed in by copying into the captured input buffer with .copy_(), and the result appears in the captured output buffer after .replay().
Connecting Back to PyTorch
Everything in this track comes together in torch.compile. When you call torch.compile(model), TorchDynamo traces your Python code, TorchInductor lowers it to optimised kernels, and the resulting compiled model incorporates many of the techniques we've discussed:
- Kernel fusion (article 4, this article): TorchInductor fuses element-wise operations, reductions, and pointwise ops into combined Triton kernels, eliminating intermediate HBM round-trips.
- Tiled matmul: cuBLAS handles standard matrix multiplications with heavily optimised tiled kernels; Triton generates tiled kernels for fused operations that include matmuls.
- Mixed precision: torch.autocast enables FP16/BF16 computation with FP32 accumulation, unlocking Tensor Core throughput and halving memory traffic.
- CUDA Graphs: torch.cuda.CUDAGraph captures and replays kernel sequences, eliminating per-kernel launch overhead for fixed-shape inference workloads.
- Memory coalescing, shared memory, warp management: Triton handles these automatically when generating kernels from TorchInductor's output. The programmer specifies what to compute; Triton decides how to map it onto the hardware.
The GPU track has given you the mental model to understand why these optimisations work. Kernel fusion reduces HBM traffic, moving memory-bound operations toward higher arithmetic intensity. Mixed precision doubles effective bandwidth and enables Tensor Cores. CUDA Graphs eliminate launch overhead. Tiling puts data in shared memory where it can be reused many times before being evicted. None of these are magic — they are direct consequences of the hardware architecture covered in articles 1 through 5.
When someone says a model is "optimised for GPU," what they really mean is that the code is structured to exploit these properties: high arithmetic intensity, high occupancy, efficient memory access patterns, and minimal overhead. Now you know what each of those means — and, more importantly, why each one matters.
Quiz
Test your understanding of practical GPU optimisation techniques — kernel fusion, tiled matmul, mixed precision, and CUDA Graphs.
Why does kernel fusion improve performance for memory-bound operations?
In tiled matrix multiplication, why is data loaded into shared memory before computing?
What two benefits does mixed precision (FP16/BF16) provide?
When are CUDA Graphs most beneficial?