The Thread Hierarchy
CUDA organises parallel work into a three-level hierarchy: threads → thread blocks → grids. Understanding this hierarchy is essential because it maps directly to the hardware: threads execute on CUDA Cores, thread blocks map to Streaming Multiprocessors (SMs), and a grid spans the entire GPU. If the mapping feels mechanical, that's by design — CUDA's programming model is a thin abstraction over the silicon, and keeping it close to the metal is what makes GPUs fast.
Grid (the entire GPU)
├── Block (0,0)                  ← maps to an SM
│   ├── Warp 0 (threads 0-31)   ← 32 threads execute in lockstep
│   ├── Warp 1 (threads 32-63)
│   └── ...
├── Block (0,1)                  ← maps to another SM
│   ├── Warp 0
│   └── ...
└── Block (1,0)
    └── ...
Let's unpack each level from the bottom up.
Thread: the smallest unit of execution. Each thread has its own registers and program counter, but it executes the same code as every other thread — just with different data. This is NVIDIA's SIMT (Single Instruction, Multiple Threads) model: one instruction stream, many data streams. If you've seen SIMD on CPUs, SIMT is a close cousin, but with the crucial difference that each thread has its own control flow (more on this in the divergence section below).
Warp: a group of exactly 32 threads that execute in lockstep. The hardware always schedules and executes warps, never individual threads. When a warp scheduler issues an instruction, all 32 threads in the warp execute it simultaneously. You can think of the warp as the true atomic unit of GPU execution — everything below it is a programming convenience, and everything above it is an organisational tool.
Thread Block (Cooperative Thread Array / CTA): a group of warps — up to 1,024 threads in most architectures — that execute on the same SM. Threads within a block can cooperate via shared memory and synchronisation barriers (__syncthreads()). This is the level where inter-thread communication happens. Threads in different blocks cannot directly communicate — they are independent by design, which is what allows the GPU scheduler to distribute blocks freely across SMs.
Grid: the collection of all thread blocks launched by a single kernel call. The GPU's block scheduler distributes blocks across the available SMs. A grid can contain millions of threads, and the hardware handles all the scheduling — the programmer simply declares the grid dimensions and lets the GPU figure out when and where each block runs.
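To make the hierarchy concrete, here is a minimal Python sketch (not real CUDA — the launch is simulated with nested loops, and the function names are invented for illustration) showing how every thread derives its unique global index from its block and thread coordinates, which is exactly the formula a real 1D CUDA kernel uses:

```python
def launch_kernel_1d(kernel, grid_dim, block_dim, *args):
    """Simulate a 1D CUDA launch: every (block, thread) pair runs the kernel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def add_one(block_idx, block_dim, thread_idx, data):
    # Canonical CUDA indexing: global_id = blockIdx.x * blockDim.x + threadIdx.x
    global_id = block_idx * block_dim + thread_idx
    if global_id < len(data):  # guard: the grid is rounded up past the data
        data[global_id] += 1

data = [0] * 100
launch_kernel_1d(add_one, 4, 32, data)  # 4 blocks × 32 threads cover 100 elements
print(sum(data))  # → 100 (each element incremented exactly once)
```

Note the bounds check: 4 × 32 = 128 threads run, but only the first 100 touch data — the same guard pattern every real CUDA kernel uses when the grid is rounded up past the problem size.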
Why 32? The Warp as Execution Unit
The number 32 is baked into NVIDIA GPU hardware. Every instruction is issued to exactly 32 threads simultaneously — no more, no less. This seemingly arbitrary constant has profound implications for how you write and optimise GPU code.
When you launch a kernel with, say, 100 threads in a block, the hardware doesn't execute exactly 100 threads. It rounds up to the next multiple of 32 — in this case, 4 warps (128 threads). The last 28 thread slots in warp 3 are masked off: they go through the motions of executing each instruction, but their results are discarded. This means launching 33 threads occupies as many warp slots as launching 64 — a fact that surprises many newcomers to GPU programming.
Let's quantify the waste for different thread counts:
def warps_needed(num_threads):
    return (num_threads + 31) // 32  # ceiling division by 32

# Collect all row data first
thread_counts = [1, 32, 33, 64, 100, 128, 256, 1000]
rows = []
for n in thread_counts:
    warps = warps_needed(n)
    total = warps * 32
    wasted = total - n
    efficiency = n / total * 100
    rows.append((n, warps, total, wasted, efficiency))

# Compute max width of the efficiency column for alignment
w_eff = max(len(f"{r[4]:.0f}") for r in rows)

# Print aligned table
for n, warps, total, wasted, eff in rows:
    print(f" {n:>4} threads → {warps:>2} warps ({total:>4} slots, {wasted:>3} wasted, {eff:>{w_eff}.0f}% efficient)")
print()
print("Takeaway: always use multiples of 32 for thread counts.")
print("33 threads wastes 48% of warp capacity!")
The practical rule is simple: always choose thread counts that are multiples of 32. Common choices are 128, 256, or 512 threads per block. Larger blocks (up to 1,024 threads) can improve occupancy, but each block then demands more registers and shared memory from its SM, so the optimal choice depends on your kernel's resource usage.
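Putting the rule into practice: you pick a block size that is a multiple of 32, then derive the grid size with ceiling division so every element is covered even when the problem size isn't a multiple of the block size. A small sketch (the helper name and problem sizes are arbitrary):

```python
def launch_config(n_elements, threads_per_block=256):
    # Ceiling division: enough blocks to cover every element,
    # even when n_elements is not a multiple of the block size.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

for n in (1000, 1024, 1025):
    blocks, tpb = launch_config(n)
    print(f"N={n:>5}: {blocks} blocks × {tpb} threads = {blocks * tpb} total ({blocks * tpb - n} idle)")
```

The idle threads are handled by the usual bounds check inside the kernel; their cost is small as long as the block size itself is warp-aligned.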
Warp Divergence
Since all 32 threads in a warp execute the same instruction simultaneously, what happens when different threads need to take different branches? This situation — called warp divergence — is one of the most important performance pitfalls in GPU programming.
Consider this pseudocode for a CUDA kernel:
# CUDA kernel (pseudocode):
if threadIdx.x % 2 == 0:
    result = expensive_path(data)  # even threads
else:
    result = cheap_path(data)      # odd threads
The hardware cannot execute both branches simultaneously. Instead, it serialises them: first, all 32 threads execute expensive_path (with odd threads masked off), then all 32 execute cheap_path (with even threads masked off). The execution time is the sum of both branches, not the maximum — effectively doubling the cost compared to a warp where all threads take the same path.
import numpy as np
# Simulate 32 threads in a warp
warp_size = 32
thread_ids = np.arange(warp_size)
# Divergent: even/odd threads take different paths
even_mask = (thread_ids % 2 == 0)
odd_mask = ~even_mask
# Cost without divergence (all take same path)
cost_uniform = 1 # one pass through the branch
# Cost with divergence (serialised)
cost_divergent = 2 # pass 1: even threads, pass 2: odd threads
print(f"Warp of {warp_size} threads:")
print(f" Even threads: {even_mask.sum()}, Odd threads: {odd_mask.sum()}")
print(f" Without divergence: {cost_uniform} pass (all threads same branch)")
print(f" With divergence: {cost_divergent} passes (serialised branches)")
print(f" Slowdown: {cost_divergent}×")
print()
print("Worst case: 32 threads, each taking a different branch → 32 serial passes")
print("Best practice: restructure code so all threads in a warp take the same path")
Divergence is most harmful when branches are unbalanced (one path is expensive, the other cheap) or when threads within the same warp frequently take different paths. A common mitigation strategy is to restructure data so that threads in the same warp naturally follow the same control flow — for example, sorting work items by type before distributing them to threads.
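The sorting mitigation can be illustrated with a toy cost model (an assumption for illustration: each warp pays one serialised pass per distinct branch taken within it):

```python
import numpy as np

rng = np.random.default_rng(0)
warp_size = 32
branch = rng.integers(0, 4, size=1024)  # each thread takes one of 4 branches

def warp_passes(branches):
    # Toy cost model: a warp pays one pass per distinct branch inside it.
    warps = branches.reshape(-1, warp_size)
    return sum(len(np.unique(w)) for w in warps)

print("Unsorted:", warp_passes(branch), "passes")          # most warps pay 4 passes
print("Sorted:  ", warp_passes(np.sort(branch)), "passes") # most warps pay 1 pass
```

After sorting, each warp contains (almost entirely) threads of a single branch type, so only the few warps straddling a boundary between branch types still diverge.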
The Memory Hierarchy
GPU memory is a hierarchy of progressively larger but slower levels, not unlike the cache hierarchy on CPUs — but with some important differences. On a GPU, some levels are explicitly managed by the programmer, which gives you more control but also more responsibility. From fastest to slowest:
┌──────────────────┬───────────┬────────────────┬─────────┐
│ Registers        │ ~0 cycles │ Per-thread     │ ~256 KB │
│ (fastest)        │           │ (private)      │ per SM  │
├──────────────────┼───────────┼────────────────┼─────────┤
│ Shared Memory    │ ~5 cycles │ Per-block      │ ~164 KB │
│ / L1 Cache       │           │ (shared)       │ per SM  │
├──────────────────┼───────────┼────────────────┼─────────┤
│ L2 Cache         │ ~30 cyc.  │ All SMs        │ ~50 MB  │
├──────────────────┼───────────┼────────────────┼─────────┤
│ Global Memory    │ ~300 cyc. │ All SMs        │ 80 GB   │
│ (HBM / slowest)  │           │ (GPU RAM)      │ ~3 TB/s │
└──────────────────┴───────────┴────────────────┴─────────┘
Each level serves a distinct purpose:
- Registers: zero-latency, per-thread storage. The compiler allocates local variables to registers automatically. They're the fastest memory on the chip, but they're limited — on most architectures, each SM has roughly 65,536 32-bit registers shared among all its active threads. Complex kernels that need many variables per thread can cause register pressure , forcing the compiler to spill values to slower local memory.
- Shared memory: low-latency (~5 cycles), per-block storage that is explicitly managed by the programmer. Threads in the same block can read, write, and synchronise through shared memory. It is the workhorse of tiled algorithms: load a tile of data from global memory into shared memory, compute on it, then write results back. On Hopper (H100), each SM has up to 228 KB of combined shared memory and L1 cache.
- L2 cache: medium latency (~30 cycles), shared across all SMs. This is hardware-managed — the programmer doesn't control it directly. It acts as a buffer between the SMs and global memory, caching recently accessed data. The H100 has around 50 MB of L2.
- Global memory (HBM): highest latency (~300 clock cycles), largest capacity. This is the GPU's main RAM — 80 GB on the H100. All data starts and ends here. Bandwidth is impressive (~3 TB/s on H100), but the latency is roughly 60× higher than shared memory. Most kernels are bottlenecked by how fast they can move data from HBM to the compute units, not by the compute itself.
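The tiled pattern described in the shared-memory bullet can be sketched in NumPy. This is a CPU-side sketch, not GPU code: the small per-tile arrays stand in for on-chip shared-memory buffers, and the tile size of 32 is an arbitrary choice:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Tiled matrix multiply: each output tile accumulates products of small
    A and B tiles — the tiles play the role of shared-memory buffers."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, n, tile):
                # "Load" one tile of A and one of B into fast on-chip storage...
                a_tile = A[i:i+tile, k:k+tile]
                b_tile = B[k:k+tile, j:j+tile]
                # ...and compute on it before touching global memory again.
                acc += a_tile @ b_tile
            C[i:i+tile, j:j+tile] = acc
    return C

n = 128
A = np.random.default_rng(1).random((n, n))
B = np.random.default_rng(2).random((n, n))
print(np.allclose(tiled_matmul(A, B), A @ B))  # → True
```

The payoff on a GPU: each element of a tile is loaded from global memory once but reused `tile` times from shared memory, cutting global-memory traffic by a factor of the tile size.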
The overarching rule is: keep data as close to the cores as possible . Moving data from HBM to registers is, in many workloads, the true bottleneck — not the arithmetic. This is precisely why the roofline model (which we'll explore later in this track) plots performance against arithmetic intensity: if your kernel does too little math per byte loaded, you're memory-bound, and no amount of faster cores will help.
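As a quick sanity check of that claim, we can compute the arithmetic intensity (FLOPs per byte) of a float32 vector add and compare it with the machine balance implied by the H100 numbers quoted above (~3 TB/s HBM bandwidth; the ~67 TFLOP/s FP32 peak used here is an assumption for illustration):

```python
# Vector add: C = A + B, float32
n = 1 << 20
flops = n                # one add per element
bytes_moved = 3 * n * 4  # read A, read B, write C (4 bytes each)
intensity = flops / bytes_moved
print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")

# Machine balance: FLOPs the GPU can do per byte it can stream from HBM
peak_flops = 67e12  # ~67 TFLOP/s FP32 (assumed)
bandwidth = 3e12    # ~3 TB/s HBM
balance = peak_flops / bandwidth
print(f"Machine balance:      {balance:.1f} FLOP/byte")
print("Memory-bound!" if intensity < balance else "Compute-bound!")
```

At roughly 0.08 FLOP/byte against a balance of ~22 FLOP/byte, vector add is deep in memory-bound territory: the cores sit idle waiting on HBM regardless of how fast they are.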
Memory Coalescing
When threads in a warp access global memory, the hardware tries to coalesce their individual requests into a smaller number of wide memory transactions. This is arguably the single most important optimisation concept in GPU programming, because it can make the difference between using 3% of available bandwidth and 100%.
Here's how it works. If all 32 threads in a warp access consecutive 4-byte addresses — thread 0 reads address 0, thread 1 reads address 4, thread 2 reads address 8, and so on up to thread 31 reading address 124 — the hardware combines all 32 requests into a single 128-byte transaction. That's 1 transaction instead of 32 , a massive bandwidth saving.
If threads access scattered or strided addresses, each request may trigger a separate transaction, wasting most of the bytes in each one. This is why memory access patterns matter enormously on GPUs — the same algorithm can run 10–30× faster simply by reorganising how threads access memory.
import numpy as np
warp_size = 32
element_size = 4 # bytes (float32)
transaction_size = 128 # bytes (GPU memory transaction width)
elements_per_txn = transaction_size // element_size # 32
# Coalesced: consecutive addresses
coalesced_addrs = np.arange(warp_size) * element_size
txns_coalesced = 1 # all fit in one 128-byte transaction
# Strided: every other element (stride = 2)
strided_addrs = np.arange(warp_size) * element_size * 2
txns_strided = 2 # spans 256 bytes → 2 transactions
# Random: scattered addresses
np.random.seed(42)
random_addrs = np.random.choice(10000, warp_size, replace=False) * element_size
# Worst case: up to 32 separate transactions
unique_txn_blocks = len(set(addr // transaction_size for addr in random_addrs))
txns_random = unique_txn_blocks
# Collect rows for aligned output
rows = [
    ("Coalesced (consecutive)", txns_coalesced, "baseline"),
    ("Strided (every other)", txns_strided, "slower"),
    ("Random (scattered)", txns_random, "slower"),
]
# Compute max width per column for alignment
w_name = max(len(r[0]) for r in rows)
w_txn = max(len(str(r[1])) for r in rows)

print("Memory Coalescing: 32 threads accessing global memory")
print(f" Transaction size: {transaction_size} bytes (fits {elements_per_txn} float32s)")
print()
for name, txns, label in rows:
    suffix = "transaction" if txns == 1 else "transactions"
    print(f" {name:<{w_name}} {txns:>{w_txn}} {suffix:<12} → {txns:>{w_txn}}× {label}")
print()
print("Coalesced access uses bandwidth efficiently.")
print("Scattered access wastes most of each transaction.")
The lesson is clear: when designing data structures for GPU computation, favour Structure of Arrays (SoA) over Array of Structures (AoS) . If each thread needs one field from a record, SoA layout means those fields are contiguous in memory and can be coalesced. AoS layout interleaves fields from different records, producing strided access patterns that waste bandwidth.
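The difference between the two layouts can be made concrete by counting how many 128-byte transaction blocks a warp touches when each of its 32 threads reads one float32 field (a record with four float32 fields is assumed for illustration):

```python
warp_size = 32
field_size = 4         # float32
fields_per_record = 4  # e.g. a record {x, y, z, w}
txn_size = 128         # bytes per memory transaction

# AoS: records interleaved — thread i reads field 0 of record i (stride 16 B)
aos_addrs = [i * fields_per_record * field_size for i in range(warp_size)]
# SoA: field 0 of all records stored contiguously (stride 4 B)
soa_addrs = [i * field_size for i in range(warp_size)]

def transactions(addrs):
    # Count distinct 128-byte blocks touched by the warp's addresses
    return len({a // txn_size for a in addrs})

print(f"AoS: {transactions(aos_addrs)} transactions")  # → 4
print(f"SoA: {transactions(soa_addrs)} transactions")  # → 1
```

With four fields per record, the AoS warp spreads its 128 requested bytes across 512 bytes of memory (4 transactions), while the SoA warp reads exactly one transaction — a 4× bandwidth difference from layout alone.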
Quiz
Test your understanding of the CUDA programming model — thread hierarchies, warps, memory levels, and access patterns.
Why does CUDA launch threads in warps of exactly 32?
What happens when threads in a warp take different branches (warp divergence)?
What is shared memory used for?
Why is memory coalescing important?