Two Bottlenecks: Compute and Memory
Every GPU kernel is limited by one of two resources: compute (how fast the cores can crunch numbers) or memory bandwidth (how fast data can flow between memory and cores).
- Compute-bound: the cores are the bottleneck. Making memory faster wouldn't help — the cores can't process data any faster. Example: a large matmul where the data fits in cache and the Tensor Cores are fully utilised.
- Memory-bound: the bandwidth is the bottleneck. Adding more cores wouldn't help — they're waiting for data. Example: element-wise operations like ReLU or layer normalisation, where each element requires very little compute but must be loaded from and written back to HBM.
Most deep learning operations are memory-bound. This is counterintuitive — GPUs have thousands of cores! But modern GPUs have so much compute capacity (hundreds of TFLOPS) that only the most arithmetic-intensive operations (large matmuls, convolutions) can keep the cores busy.
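To see the imbalance concretely, compare how long a memory-bound operation like ReLU would take if it were limited by compute versus by bandwidth. A back-of-the-envelope sketch, assuming rough H100 SXM numbers (~67 TFLOPS FP32 without Tensor Cores, ~3.35 TB/s HBM3 — both approximations):

```python
# Rough estimate: ReLU over 1B FP32 elements on assumed H100 SXM specs.
peak_flops = 67e12   # ~67 TFLOPS FP32 (non-Tensor-Core), assumed
peak_bw = 3.35e12    # ~3.35 TB/s HBM3 bandwidth, assumed

n = 1_000_000_000          # elements
flops = n                  # ~1 op per element (compare/select)
bytes_moved = 2 * 4 * n    # read + write, 4 bytes each (FP32)

t_compute = flops / peak_flops    # time if compute were the limit
t_memory = bytes_moved / peak_bw  # time if bandwidth is the limit

print(f"Compute time: {t_compute * 1e6:8.1f} us")
print(f"Memory time:  {t_memory * 1e6:8.1f} us")
print(f"Memory is the bottleneck by ~{t_memory / t_compute:.0f}x")
```

Even with the modest FP32 figure, the cores could finish the arithmetic two orders of magnitude faster than the memory system can deliver the data.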
Arithmetic Intensity
How do you tell whether a kernel is compute-bound or memory-bound? Compute the arithmetic intensity:

$I = \dfrac{\text{FLOPs}}{\text{Bytes transferred}}$

FLOPs is the number of floating-point operations the kernel performs. Bytes transferred is the total data moved between GPU memory and the SMs (reads + writes). The ratio $I$ tells us how much computation we do per byte of data moved, measured in FLOPs/byte.
Consider two extremes:

- Low $I$ (e.g., 0.5 FLOPs/byte): very little compute per byte. The kernel spends most of its time waiting for data. Example: vector addition — 1 add per 12 bytes moved (4 bytes for each of 2 inputs + 4 bytes written), so $I = 1/12 \approx 0.08$.
- High $I$ (e.g., 100 FLOPs/byte): lots of compute per byte. The kernel keeps the cores busy. Example: large matmul — for $n \times n$ matrices, $2n^3$ FLOPs but only $3 \times 4n^2$ bytes (read two matrices + write one, at FP32). $I = 2n^3 / 12n^2 = n/6$. For $n = 4096$, $I \approx 683$ FLOPs/byte.
Let's compute the arithmetic intensity for several common operations:
n = 4096
elem = 4  # bytes per float32
vec_n = 1_000_000
seq, d = 2048, 128

# (name, FLOPs, bytes moved) for each operation
rows = [
    ("Vector add (1M)", vec_n, 3 * vec_n * elem),
    ("ReLU (1M)", vec_n, 2 * vec_n * elem),
    ("LayerNorm (1M)", 5 * vec_n, 2 * vec_n * elem),
    (f"Matmul ({n}×{n})", 2 * n**3, 3 * n**2 * elem),
    (f"Attention QK^T ({seq}×{d})", 2 * seq**2 * d, (2 * seq * d + seq**2) * elem),
]

# Compute max widths per column for aligned output
w_name = max(len(r[0]) for r in rows)
w_flops = max(len(f"{r[1]:,.0f}") for r in rows)
w_bytes = max(len(f"{r[2]:,.0f}") for r in rows)
w_ratio = max(len(f"{r[1] / r[2]:.1f}") for r in rows)

print("Arithmetic Intensity Examples (FP32):")
print()
for name, flops, bts in rows:
    I = flops / bts
    print(f"  {name:<{w_name}}  {flops:>{w_flops},.0f} FLOPs / "
          f"{bts:>{w_bytes},.0f} bytes = {I:>{w_ratio}.1f} FLOPs/byte")
print()
print("Low intensity  → memory-bound (waiting for data)")
print("High intensity → compute-bound (cores are the limit)")
The Roofline Model
The roofline model (Williams et al., 2008) is a visual tool that shows, at a glance, whether a kernel is compute-bound or memory-bound. It is arguably the most useful back-of-the-envelope model for GPU performance analysis (see the Modal GPU Glossary for an interactive treatment).
The model plots two hardware limits:
- Compute ceiling (horizontal line): the GPU's peak compute throughput in FLOP/s (e.g., ~990 TFLOPS FP16 for H100 SXM).
- Memory bandwidth slope (diagonal line): the GPU's peak memory bandwidth in bytes/s (e.g., ~3.35 TB/s for H100 HBM3).
A kernel's achievable performance is given by:

$P = \min(\text{Peak FLOP/s},\; I \times \text{Peak BW})$
This formula captures the two regimes:

- If $I \times \text{Peak BW} < \text{Peak FLOP/s}$: the kernel is memory-bound. Performance is limited by how fast data arrives, not by compute. Increasing $I$ (e.g., by fusing operations) would help.
- If $I \times \text{Peak BW} \geq \text{Peak FLOP/s}$: the kernel is compute-bound. It's already saturating the cores. Only faster hardware or algorithmic improvements help.

The ridge point is the arithmetic intensity where the two roofs meet: $I_{\text{ridge}} = \text{Peak FLOP/s} / \text{Peak BW}$. Kernels below this intensity are memory-bound; above it, compute-bound.
Let's compute the ridge point for an H100 and classify several common operations:
import json, js  # `js` bridges to the browser runtime (Pyodide) that renders this page

# H100 SXM specs (approximate)
peak_flops = 990e12  # ~990 TFLOPS (FP16 Tensor Core)
peak_bw = 3.35e12    # ~3.35 TB/s (HBM3)
ridge_point = peak_flops / peak_bw

print("H100 SXM Roofline:")
print(f"  Peak compute:   {peak_flops / 1e12:.0f} TFLOPS (FP16)")
print(f"  Peak bandwidth: {peak_bw / 1e12:.2f} TB/s")
print(f"  Ridge point:    {ridge_point:.0f} FLOPs/byte")
print()

# Classify operations by arithmetic intensity (FLOPs/byte)
ops = [
    ("Vector add", 0.08),
    ("ReLU", 0.12),
    ("LayerNorm", 0.62),
    ("Softmax", 1.5),
    ("Attention QK^T", 57),
    ("Matmul 4096²", 683),
]
rows = []
for name, intensity in ops:
    achievable = min(peak_flops, intensity * peak_bw)
    pct = achievable / peak_flops * 100
    bound = "memory" if intensity < ridge_point else "compute"
    rows.append([name, f"{intensity:.1f}", bound, f"{pct:.1f}%"])

# Hand the table to the page for rendering
js.window.py_table_data = json.dumps({
    "headers": ["Operation", "Intensity (FLOPs/byte)", "Bottleneck", "% of Peak"],
    "rows": rows,
})

print("Most DL ops (element-wise, normalization, softmax) are memory-bound.")
print("Only large matmuls and convolutions cross the ridge point.")
Little's Law: Parallelism Hides Latency
Little's Law, borrowed from queueing theory, provides the fundamental relationship between throughput, parallelism, and latency:

$\text{Parallelism} = \text{Throughput} \times \text{Latency}$
To achieve a given throughput, you need enough parallelism to keep the pipeline full while individual operations complete. If a memory request takes 300 clock cycles (latency), and you want to issue one request per cycle (throughput = 1/cycle), you need 300 requests in flight simultaneously (parallelism = 300).
This is exactly why GPUs need thousands of concurrent threads: they hide the ~300-cycle memory latency by having enough warps ready to execute while others wait. If the SM has only a few active warps, it stalls frequently (low occupancy). With many warps, there's always one ready to run (high occupancy = good latency hiding).
mem_latency_cycles = 300  # typical HBM latency
# (H100 SM clock is ~1.8 GHz; not needed for the cycle-based math below)

# How many warps are needed to hide memory latency,
# assuming each warp has one memory request in flight at a time?
requests_per_warp = 1
warps_needed = mem_latency_cycles // requests_per_warp

print("Little's Law on a GPU SM:")
print(f"  Memory latency:   {mem_latency_cycles} cycles")
print(f"  To hide latency:  need {warps_needed} requests in flight")
print(f"  Max warps per SM: 64 (H100)")
print(f"  Warps needed / max warps: {warps_needed / 64:.0%}")
print("  (>100% means occupancy alone can't fully hide the latency)")
print()
print("This is why occupancy matters: more active warps = more")
print("choices for the scheduler = fewer stall cycles.")
print()

# Model the stall fraction at several occupancy levels
rows = []
for occupancy_pct in [25, 50, 75, 100]:
    active_warps = int(64 * occupancy_pct / 100)
    hiding_ratio = min(1.0, active_warps / warps_needed)
    stall_pct = (1 - hiding_ratio) * 100
    rows.append((occupancy_pct, active_warps, stall_pct))

w_stall = max(len(f"{r[2]:.0f}") for r in rows)
for occ, warps, stall in rows:
    print(f"  {occ:>3}% occupancy ({warps:>2} warps): "
          f"~{stall:>{w_stall}.0f}% potential stall cycles")
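One caveat to the warp arithmetic above: it assumes each warp has a single outstanding memory request. Real kernels can issue several independent loads per warp before stalling (memory-level parallelism), which divides the number of warps Little's Law demands. A sketch under that assumption:

```python
# Sketch: with memory-level parallelism (MLP), each warp keeps several
# independent loads in flight, dividing the warp count Little's Law demands.
mem_latency_cycles = 300   # assumed HBM latency
max_warps_per_sm = 64      # H100

for loads_in_flight in [1, 2, 4, 8]:
    warps_needed = mem_latency_cycles / loads_in_flight
    feasible = "yes" if warps_needed <= max_warps_per_sm else "no"
    print(f"  {loads_in_flight} loads/warp -> need {warps_needed:.0f} warps "
          f"(fits in {max_warps_per_sm}? {feasible})")
```

This is why well-written kernels unroll loops and batch their loads: eight independent loads per warp brings the requirement down to ~38 warps, comfortably within an SM's limit.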
Common Bottlenecks
Below is a checklist of the most common performance problems on GPUs. Each represents a different way that hardware resources are left idle or underutilised.
- Register pressure: the kernel uses too many registers per thread, so fewer active threads fit on the SM, which lowers occupancy and reduces the SM's ability to hide memory latency. Fix: simplify the kernel, use __launch_bounds__ to hint the compiler, or accept the tradeoff if each thread does enough compute to compensate.
- Bank conflicts: when multiple threads in a warp access the same shared memory bank simultaneously, the accesses are serialised. A 32-way bank conflict (all threads hitting the same bank) makes shared memory 32× slower. Fix: pad shared memory arrays or restructure access patterns so adjacent threads access different banks.
- Warp divergence: threads in a warp take different branches, causing both paths to execute sequentially with inactive threads masked off. Fix: restructure code so threads in the same warp take the same path (e.g., sort data by category before processing).
- Uncoalesced memory access: threads access scattered global memory addresses, triggering many small transactions instead of a few large ones. Fix: use Structure of Arrays (SoA) instead of Array of Structures (AoS), and ensure adjacent threads access adjacent memory addresses.
- Kernel launch overhead: each kernel launch has roughly 5–10 µs of overhead (CPU → driver → GPU scheduling). For tiny kernels, this overhead dominates the actual compute time. Fix: fuse kernels (torch.compile does this automatically) or use CUDA Graphs to batch launches into a single submission.
- Host-device transfer: copying data between CPU and GPU (cudaMemcpy) goes through PCIe (~32 GB/s for Gen4 x16), which is orders of magnitude slower than GPU memory bandwidth (~3 TB/s). Fix: minimise transfers, use pinned memory, and overlap compute with transfers using CUDA streams.
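To get a feel for when launch overhead dominates, here is a back-of-the-envelope sketch for a memory-bound element-wise kernel; the 5 µs launch cost and 3.35 TB/s bandwidth are assumed round numbers:

```python
# Break-even sketch: per-launch overhead vs. kernel runtime for an
# element-wise FP32 kernel that reads and writes each element once.
launch_overhead_s = 5e-6   # assumed per-launch cost (~5 us)
peak_bw = 3.35e12          # assumed HBM bandwidth, bytes/s

for n in [10_000, 1_000_000, 100_000_000]:
    bytes_moved = 2 * 4 * n              # read + write, FP32
    kernel_time_s = bytes_moved / peak_bw
    overhead_pct = launch_overhead_s / (launch_overhead_s + kernel_time_s) * 100
    print(f"  n={n:>11,}: kernel {kernel_time_s * 1e6:7.2f} us, "
          f"launch overhead = {overhead_pct:4.1f}% of total")
```

At 10K elements the launch costs vastly more than the kernel itself; only around 100M elements does the overhead fade into the noise, which is why fusing many small kernels into one pays off so dramatically.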
Quiz
Test your understanding of GPU performance analysis — arithmetic intensity, the roofline model, latency hiding, and common bottlenecks.
A kernel performs 100 FLOPs per element and transfers 8 bytes per element. What is its arithmetic intensity?
In the roofline model, what determines whether a kernel is compute-bound or memory-bound?
According to Little's Law, what must increase to hide higher memory latency?
Why does register pressure reduce performance even though each thread runs faster with more registers?