Two Bottlenecks: Compute and Memory
Every GPU kernel is limited by one of two resources: compute (how fast the cores can crunch numbers) or memory bandwidth (how fast data can flow between memory and cores).
- Compute-bound: the cores are the bottleneck. Making memory faster wouldn't help — the cores can't process data any faster. Example: a large matmul where the data fits in cache and the Tensor Cores are fully utilised.
- Memory-bound: the bandwidth is the bottleneck. Adding more cores wouldn't help — they're waiting for data. Example: element-wise operations like ReLU or layer normalisation, where each element requires very little compute but must be loaded from and written back to HBM.
Most deep learning operations are memory-bound. This is counterintuitive — GPUs have thousands of cores! But modern GPUs have so much compute capacity (hundreds of TFLOPS) that only the most arithmetic-intensive operations (large matmuls, convolutions) can keep the cores busy.
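To see the imbalance concretely, compare how long a memory-bound operation like ReLU would take if it were limited by compute versus by bandwidth. A back-of-the-envelope sketch, assuming rough H100 SXM numbers (~67 TFLOPS FP32 without Tensor Cores, ~3.35 TB/s HBM3 — both approximations):

```python
# Rough estimate: ReLU over 1B FP32 elements on assumed H100 SXM specs.
peak_flops = 67e12   # ~67 TFLOPS FP32 (non-Tensor-Core), assumed
peak_bw = 3.35e12    # ~3.35 TB/s HBM3 bandwidth, assumed

n = 1_000_000_000          # elements
flops = n                  # ~1 op per element (compare/select)
bytes_moved = 2 * 4 * n    # read + write, 4 bytes each (FP32)

t_compute = flops / peak_flops    # time if compute were the limit
t_memory = bytes_moved / peak_bw  # time if bandwidth is the limit

print(f"Compute time: {t_compute * 1e6:8.1f} us")
print(f"Memory time:  {t_memory * 1e6:8.1f} us")
print(f"Memory is the bottleneck by ~{t_memory / t_compute:.0f}x")
```

Even with the modest FP32 figure, the cores could finish the arithmetic two orders of magnitude faster than the memory system can deliver the data.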
Arithmetic Intensity
How do you tell whether a kernel is compute-bound or memory-bound? Compute the arithmetic intensity:

$I = \dfrac{\text{FLOPs}}{\text{Bytes transferred}}$

FLOPs is the number of floating-point operations the kernel performs. Bytes transferred is the total data moved between GPU memory and the SMs (reads + writes). The ratio $I$ tells us how much computation we do per byte of data moved, measured in FLOPs/byte.
Consider two extremes:

- Low $I$ (e.g., 0.5 FLOPs/byte): very little compute per byte. The kernel spends most of its time waiting for data. Example: vector addition — 1 add per 12 bytes moved (4 bytes for each of 2 inputs + 4 bytes written), so $I = 1/12 \approx 0.08$.
- High $I$ (e.g., 100 FLOPs/byte): lots of compute per byte. The kernel keeps the cores busy. Example: large matmul — for $n \times n$ matrices, $2n^3$ FLOPs but only $3 \times 4n^2$ bytes (read two matrices + write one, at FP32). $I = 2n^3 / 12n^2 = n/6$. For $n = 4096$, $I \approx 683$ FLOPs/byte.
Let's compute the arithmetic intensity for several common operations:
n = 4096
elem = 4  # bytes per float32
vec_n = 1_000_000
seq, d = 2048, 128

# (name, FLOPs, bytes moved) for each operation
rows = [
    ("Vector add (1M)", vec_n, 3 * vec_n * elem),
    ("ReLU (1M)", vec_n, 2 * vec_n * elem),
    ("LayerNorm (1M)", 5 * vec_n, 2 * vec_n * elem),
    (f"Matmul ({n}×{n})", 2 * n**3, 3 * n**2 * elem),
    (f"Attention QK^T ({seq}×{d})", 2 * seq**2 * d, (2 * seq * d + seq**2) * elem),
]

# Compute max widths per column for aligned output
w_name = max(len(r[0]) for r in rows)
w_flops = max(len(f"{r[1]:,.0f}") for r in rows)
w_bytes = max(len(f"{r[2]:,.0f}") for r in rows)
w_ratio = max(len(f"{r[1] / r[2]:.1f}") for r in rows)

print("Arithmetic Intensity Examples (FP32):")
print()
for name, flops, bts in rows:
    I = flops / bts
    print(f"  {name:<{w_name}}  {flops:>{w_flops},.0f} FLOPs / "
          f"{bts:>{w_bytes},.0f} bytes = {I:>{w_ratio}.1f} FLOPs/byte")
print()
print("Low intensity  → memory-bound (waiting for data)")
print("High intensity → compute-bound (cores are the limit)")
The Roofline Model
The roofline model (Williams et al., 2008) is a visual tool that shows, at a glance, whether a kernel is compute-bound or memory-bound. It is arguably the most useful back-of-the-envelope model for GPU performance analysis (see the Modal GPU Glossary for an interactive treatment).
The model plots two hardware limits:
- Compute ceiling (horizontal line): the GPU's peak compute throughput in FLOP/s (e.g., ~990 TFLOPS FP16 for H100 SXM).
- Memory bandwidth slope (diagonal line): the GPU's peak memory bandwidth in bytes/s (e.g., ~3.35 TB/s for H100 HBM3).
A kernel's achievable performance is given by:

$P = \min(\text{Peak FLOP/s},\; I \times \text{Peak BW})$
This formula captures the two regimes:

- If $I \times \text{Peak BW} < \text{Peak FLOP/s}$: the kernel is memory-bound. Performance is limited by how fast data arrives, not by compute. Increasing $I$ (e.g., by fusing operations) would help.
- If $I \times \text{Peak BW} \geq \text{Peak FLOP/s}$: the kernel is compute-bound. It's already saturating the cores. Only faster hardware or algorithmic improvements help.

The ridge point is the arithmetic intensity where the two roofs meet: $I_{\text{ridge}} = \text{Peak FLOP/s} / \text{Peak BW}$. Kernels below this intensity are memory-bound; above it, compute-bound.
Let's compute the ridge point for an H100 and classify several common operations:
import json, js  # `js` bridges to the browser runtime (Pyodide) that renders this page

# H100 SXM specs (approximate)
peak_flops = 990e12  # ~990 TFLOPS (FP16 Tensor Core)
peak_bw = 3.35e12    # ~3.35 TB/s (HBM3)
ridge_point = peak_flops / peak_bw

print("H100 SXM Roofline:")
print(f"  Peak compute:   {peak_flops / 1e12:.0f} TFLOPS (FP16)")
print(f"  Peak bandwidth: {peak_bw / 1e12:.2f} TB/s")
print(f"  Ridge point:    {ridge_point:.0f} FLOPs/byte")
print()

# Classify operations by arithmetic intensity (FLOPs/byte)
ops = [
    ("Vector add", 0.08),
    ("ReLU", 0.12),
    ("LayerNorm", 0.62),
    ("Softmax", 1.5),
    ("Attention QK^T", 57),
    ("Matmul 4096²", 683),
]
rows = []
for name, intensity in ops:
    achievable = min(peak_flops, intensity * peak_bw)
    pct = achievable / peak_flops * 100
    bound = "memory" if intensity < ridge_point else "compute"
    rows.append([name, f"{intensity:.1f}", bound, f"{pct:.1f}%"])

# Hand the table to the page for rendering
js.window.py_table_data = json.dumps({
    "headers": ["Operation", "Intensity (FLOPs/byte)", "Bottleneck", "% of Peak"],
    "rows": rows,
})

print("Most DL ops (element-wise, normalization, softmax) are memory-bound.")
print("Only large matmuls and convolutions cross the ridge point.")
Little's Law: Parallelism Hides Latency
Little's Law, borrowed from queueing theory, provides the fundamental relationship between throughput, parallelism, and latency:

$\text{Parallelism} = \text{Throughput} \times \text{Latency}$
To achieve a given throughput, you need enough parallelism to keep the pipeline full while individual operations complete. If a memory request takes 300 clock cycles (latency), and you want to issue one request per cycle (throughput = 1/cycle), you need 300 requests in flight simultaneously (parallelism = 300).
This is exactly why GPUs need thousands of concurrent threads: they hide the ~300-cycle memory latency by having enough warps ready to execute while others wait. If the SM has only a few active warps, it stalls frequently (low occupancy). With many warps, there's always one ready to run (high occupancy = good latency hiding).
mem_latency_cycles = 300  # typical HBM latency
# (H100 SM clock is ~1.8 GHz; not needed for the cycle-based math below)

# How many warps are needed to hide memory latency,
# assuming each warp has one memory request in flight at a time?
requests_per_warp = 1
warps_needed = mem_latency_cycles // requests_per_warp

print("Little's Law on a GPU SM:")
print(f"  Memory latency:   {mem_latency_cycles} cycles")
print(f"  To hide latency:  need {warps_needed} requests in flight")
print(f"  Max warps per SM: 64 (H100)")
print(f"  Warps needed / max warps: {warps_needed / 64:.0%}")
print("  (>100% means occupancy alone can't fully hide the latency)")
print()
print("This is why occupancy matters: more active warps = more")
print("choices for the scheduler = fewer stall cycles.")
print()

# Model the stall fraction at several occupancy levels
rows = []
for occupancy_pct in [25, 50, 75, 100]:
    active_warps = int(64 * occupancy_pct / 100)
    hiding_ratio = min(1.0, active_warps / warps_needed)
    stall_pct = (1 - hiding_ratio) * 100
    rows.append((occupancy_pct, active_warps, stall_pct))

w_stall = max(len(f"{r[2]:.0f}") for r in rows)
for occ, warps, stall in rows:
    print(f"  {occ:>3}% occupancy ({warps:>2} warps): "
          f"~{stall:>{w_stall}.0f}% potential stall cycles")
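One caveat to the warp arithmetic above: it assumes each warp has a single outstanding memory request. Real kernels can issue several independent loads per warp before stalling (memory-level parallelism), which divides the number of warps Little's Law demands. A sketch under that assumption:

```python
# Sketch: with memory-level parallelism (MLP), each warp keeps several
# independent loads in flight, dividing the warp count Little's Law demands.
mem_latency_cycles = 300   # assumed HBM latency
max_warps_per_sm = 64      # H100

for loads_in_flight in [1, 2, 4, 8]:
    warps_needed = mem_latency_cycles / loads_in_flight
    feasible = "yes" if warps_needed <= max_warps_per_sm else "no"
    print(f"  {loads_in_flight} loads/warp -> need {warps_needed:.0f} warps "
          f"(fits in {max_warps_per_sm}? {feasible})")
```

This is why well-written kernels unroll loops and batch their loads: eight independent loads per warp brings the requirement down to ~38 warps, comfortably within an SM's limit.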
Common Bottlenecks
Below is a checklist of the most common performance problems on GPUs. Each represents a different way that hardware resources are left idle or underutilised.
- Register pressure: the kernel uses too many registers per thread, so fewer active threads fit on the SM, which lowers occupancy and reduces the SM's ability to hide memory latency. Fix: simplify the kernel, use __launch_bounds__ to hint the compiler, or accept the tradeoff if each thread does enough compute to compensate.
- Bank conflicts: when multiple threads in a warp access the same shared memory bank simultaneously, the accesses are serialised. A 32-way bank conflict (all threads hitting the same bank) makes shared memory 32× slower. Fix: pad shared memory arrays or restructure access patterns so adjacent threads access different banks.
- Warp divergence: threads in a warp take different branches, causing both paths to execute sequentially with inactive threads masked off. Fix: restructure code so threads in the same warp take the same path (e.g., sort data by category before processing).
- Uncoalesced memory access: threads access scattered global memory addresses, triggering many small transactions instead of a few large ones. Fix: use Structure of Arrays (SoA) instead of Array of Structures (AoS), and ensure adjacent threads access adjacent memory addresses.
- Kernel launch overhead: each kernel launch has roughly 5–10 µs of overhead (CPU → driver → GPU scheduling). For tiny kernels, this overhead dominates the actual compute time. Fix: fuse kernels (torch.compile does this automatically) or use CUDA Graphs to batch launches into a single submission.
- Host-device transfer: copying data between CPU and GPU (cudaMemcpy) goes through PCIe (~32 GB/s for Gen4 x16), which is orders of magnitude slower than GPU memory bandwidth (~3 TB/s). Fix: minimise transfers, use pinned memory, and overlap compute with transfers using CUDA streams.
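To get a feel for when launch overhead dominates, here is a back-of-the-envelope sketch for a memory-bound element-wise kernel; the 5 µs launch cost and 3.35 TB/s bandwidth are assumed round numbers:

```python
# Break-even sketch: per-launch overhead vs. kernel runtime for an
# element-wise FP32 kernel that reads and writes each element once.
launch_overhead_s = 5e-6   # assumed per-launch cost (~5 us)
peak_bw = 3.35e12          # assumed HBM bandwidth, bytes/s

for n in [10_000, 1_000_000, 100_000_000]:
    bytes_moved = 2 * 4 * n              # read + write, FP32
    kernel_time_s = bytes_moved / peak_bw
    overhead_pct = launch_overhead_s / (launch_overhead_s + kernel_time_s) * 100
    print(f"  n={n:>11,}: kernel {kernel_time_s * 1e6:7.2f} us, "
          f"launch overhead = {overhead_pct:4.1f}% of total")
```

At 10K elements the launch costs vastly more than the kernel itself; only around 100M elements does the overhead fade into the noise, which is why fusing many small kernels into one pays off so dramatically.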
Quiz
Test your understanding of GPU performance analysis — arithmetic intensity, the roofline model, latency hiding, and common bottlenecks.
A kernel performs 100 FLOPs per element and transfers 8 bytes per element. What is its arithmetic intensity?
In the roofline model, what determines whether a kernel is compute-bound or memory-bound?
According to Little's Law, what must increase to hide higher memory latency?
Why does register pressure reduce performance even though each thread runs faster with more registers?