The Thread Hierarchy
CUDA organises parallel work into a three-level hierarchy: threads → thread blocks → grids. Understanding this hierarchy is essential because it maps directly to the hardware: threads execute on CUDA Cores, thread blocks map to Streaming Multiprocessors (SMs), and a grid spans the entire GPU. If the mapping feels mechanical, that's by design — CUDA's programming model is a thin abstraction over the silicon, and keeping it close to the metal is what makes GPUs fast.
Grid (the entire GPU)
├── Block (0,0)                  ← maps to an SM
│   ├── Warp 0 (threads 0-31)   ← 32 threads execute in lockstep
│   ├── Warp 1 (threads 32-63)
│   └── ...
├── Block (0,1)                  ← maps to another SM
│   ├── Warp 0
│   └── ...
└── Block (1,0)
    └── ...
Let's unpack each level from the bottom up.
Thread: the smallest unit of execution. Each thread has its own registers and program counter, but it executes the same code as every other thread — just with different data. This is NVIDIA's SIMT (Single Instruction, Multiple Threads) model: one instruction stream, many data streams. If you've seen SIMD on CPUs, SIMT is a close cousin, but with the crucial difference that each thread has its own control flow (more on this in the divergence section below).
Warp: a group of exactly 32 threads that execute in lockstep. The hardware always schedules and executes warps, never individual threads. When a warp scheduler issues an instruction, all 32 threads in the warp execute it simultaneously. You can think of the warp as the true atomic unit of GPU execution — everything below it is a programming convenience, and everything above it is an organisational tool.
Thread Block (Cooperative Thread Array / CTA): a group of warps — up to 1,024 threads in most architectures — that execute on the same SM. Threads within a block can cooperate via shared memory and synchronisation barriers (__syncthreads()). This is the level where inter-thread communication happens. Threads in different blocks cannot directly communicate — they are independent by design, which is what allows the GPU scheduler to distribute blocks freely across SMs.
Grid: the collection of all thread blocks launched by a single kernel call. The GPU's block scheduler distributes blocks across the available SMs. A grid can contain millions of threads, and the hardware handles all the scheduling — the programmer simply declares the grid dimensions and lets the GPU figure out when and where each block runs.
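To make the hierarchy concrete, here is a minimal Python sketch (not real CUDA — the launch is simulated with nested loops, and the function names are invented for illustration) showing how every thread derives its unique global index from its block and thread coordinates, which is exactly the formula a real 1D CUDA kernel uses:

```python
def launch_kernel_1d(kernel, grid_dim, block_dim, *args):
    """Simulate a 1D CUDA launch: every (block, thread) pair runs the kernel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def add_one(block_idx, block_dim, thread_idx, data):
    # Canonical CUDA indexing: global_id = blockIdx.x * blockDim.x + threadIdx.x
    global_id = block_idx * block_dim + thread_idx
    if global_id < len(data):  # guard: the grid is rounded up past the data
        data[global_id] += 1

data = [0] * 100
launch_kernel_1d(add_one, 4, 32, data)  # 4 blocks × 32 threads cover 100 elements
print(sum(data))  # → 100 (each element incremented exactly once)
```

Note the bounds check: 4 × 32 = 128 threads run, but only the first 100 touch data — the same guard pattern every real CUDA kernel uses when the grid is rounded up past the problem size.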
Why 32? The Warp as Execution Unit
The number 32 is baked into NVIDIA GPU hardware. Every instruction is issued to exactly 32 threads simultaneously — no more, no less. This seemingly arbitrary constant has profound implications for how you write and optimise GPU code.
When you launch a kernel with, say, 100 threads in a block, the hardware doesn't execute exactly 100 threads. It rounds up to the next multiple of 32 — in this case, 4 warps (128 threads). The last 28 thread slots in warp 3 are masked off: they go through the motions of executing each instruction, but their results are discarded. This means launching 33 threads occupies as many warp slots as launching 64 — a fact that surprises many newcomers to GPU programming.
Let's quantify the waste for different thread counts:
def warps_needed(num_threads):
    return (num_threads + 31) // 32  # ceiling division by 32

# Collect all row data first
thread_counts = [1, 32, 33, 64, 100, 128, 256, 1000]
rows = []
for n in thread_counts:
    warps = warps_needed(n)
    total = warps * 32
    wasted = total - n
    efficiency = n / total * 100
    rows.append((n, warps, total, wasted, efficiency))

# Compute max width of the efficiency column for alignment
w_eff = max(len(f"{r[4]:.0f}") for r in rows)

# Print aligned table
for n, warps, total, wasted, eff in rows:
    print(f" {n:>4} threads → {warps:>2} warps ({total:>4} slots, {wasted:>3} wasted, {eff:>{w_eff}.0f}% efficient)")
print()
print("Takeaway: always use multiples of 32 for thread counts.")
print("33 threads wastes 48% of warp capacity!")
The practical rule is simple: always choose thread counts that are multiples of 32. Common choices are 128, 256, or 512 threads per block. Larger blocks (up to 1,024 threads) can improve occupancy, but each block then demands more registers and shared memory from its SM, so the optimal choice depends on your kernel's resource usage.
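Putting the rule into practice: you pick a block size that is a multiple of 32, then derive the grid size with ceiling division so every element is covered even when the problem size isn't a multiple of the block size. A small sketch (the helper name and problem sizes are arbitrary):

```python
def launch_config(n_elements, threads_per_block=256):
    # Ceiling division: enough blocks to cover every element,
    # even when n_elements is not a multiple of the block size.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

for n in (1000, 1024, 1025):
    blocks, tpb = launch_config(n)
    print(f"N={n:>5}: {blocks} blocks × {tpb} threads = {blocks * tpb} total ({blocks * tpb - n} idle)")
```

The idle threads are handled by the usual bounds check inside the kernel; their cost is small as long as the block size itself is warp-aligned.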
Warp Divergence
Since all 32 threads in a warp execute the same instruction simultaneously, what happens when different threads need to take different branches? This situation — called warp divergence — is one of the most important performance pitfalls in GPU programming.
Consider this pseudocode for a CUDA kernel:
# CUDA kernel (pseudocode):
if threadIdx.x % 2 == 0:
    result = expensive_path(data)  # even threads
else:
    result = cheap_path(data)      # odd threads
The hardware cannot execute both branches simultaneously. Instead, it serialises them: first, all 32 threads execute expensive_path (with odd threads masked off), then all 32 execute cheap_path (with even threads masked off). The execution time is the sum of both branches, not the maximum — effectively doubling the cost compared to a warp where all threads take the same path.
import numpy as np
# Simulate 32 threads in a warp
warp_size = 32
thread_ids = np.arange(warp_size)
# Divergent: even/odd threads take different paths
even_mask = (thread_ids % 2 == 0)
odd_mask = ~even_mask
# Cost without divergence (all take same path)
cost_uniform = 1 # one pass through the branch
# Cost with divergence (serialised)
cost_divergent = 2 # pass 1: even threads, pass 2: odd threads
print(f"Warp of {warp_size} threads:")
print(f" Even threads: {even_mask.sum()}, Odd threads: {odd_mask.sum()}")
print(f" Without divergence: {cost_uniform} pass (all threads same branch)")
print(f" With divergence: {cost_divergent} passes (serialised branches)")
print(f" Slowdown: {cost_divergent}×")
print()
print("Worst case: 32 threads, each taking a different branch → 32 serial passes")
print("Best practice: restructure code so all threads in a warp take the same path")
Divergence is most harmful when branches are unbalanced (one path is expensive, the other cheap) or when threads within the same warp frequently take different paths. A common mitigation strategy is to restructure data so that threads in the same warp naturally follow the same control flow — for example, sorting work items by type before distributing them to threads.
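The sorting mitigation can be illustrated with a toy cost model (an assumption for illustration: each warp pays one serialised pass per distinct branch taken within it):

```python
import numpy as np

rng = np.random.default_rng(0)
warp_size = 32
branch = rng.integers(0, 4, size=1024)  # each thread takes one of 4 branches

def warp_passes(branches):
    # Toy cost model: a warp pays one pass per distinct branch inside it.
    warps = branches.reshape(-1, warp_size)
    return sum(len(np.unique(w)) for w in warps)

print("Unsorted:", warp_passes(branch), "passes")          # most warps pay 4 passes
print("Sorted:  ", warp_passes(np.sort(branch)), "passes") # most warps pay 1 pass
```

After sorting, each warp contains (almost entirely) threads of a single branch type, so only the few warps straddling a boundary between branch types still diverge.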
The Memory Hierarchy
GPU memory is a hierarchy of progressively larger but slower levels, not unlike the cache hierarchy on CPUs — but with some important differences. On a GPU, some levels are explicitly managed by the programmer, which gives you more control but also more responsibility. From fastest to slowest:
┌──────────────────┬───────────┬────────────────┬─────────┐
│ Registers        │ ~0 cycles │ Per-thread     │ ~256 KB │
│ (fastest)        │           │ (private)      │ per SM  │
├──────────────────┼───────────┼────────────────┼─────────┤
│ Shared Memory    │ ~5 cycles │ Per-block      │ ~164 KB │
│ / L1 Cache       │           │ (shared)       │ per SM  │
├──────────────────┼───────────┼────────────────┼─────────┤
│ L2 Cache         │ ~30 cyc.  │ All SMs        │ ~50 MB  │
├──────────────────┼───────────┼────────────────┼─────────┤
│ Global Memory    │ ~300 cyc. │ All SMs        │ 80 GB   │
│ (HBM / slowest)  │           │ (GPU RAM)      │ ~3 TB/s │
└──────────────────┴───────────┴────────────────┴─────────┘
Each level serves a distinct purpose:
- Registers: zero-latency, per-thread storage. The compiler allocates local variables to registers automatically. They're the fastest memory on the chip, but they're limited — on most architectures, each SM has roughly 65,536 32-bit registers shared among all its active threads. Complex kernels that need many variables per thread can cause register pressure , forcing the compiler to spill values to slower local memory.
- Shared memory: low-latency (~5 cycles), per-block storage that is explicitly managed by the programmer. Threads in the same block can read, write, and synchronise through shared memory. It is the workhorse of tiled algorithms: load a tile of data from global memory into shared memory, compute on it, then write results back. On Hopper (H100), each SM has up to 228 KB of combined shared memory and L1 cache.
- L2 cache: medium latency (~30 cycles), shared across all SMs. This is hardware-managed — the programmer doesn't control it directly. It acts as a buffer between the SMs and global memory, caching recently accessed data. The H100 has around 50 MB of L2.
- Global memory (HBM): highest latency (~300 clock cycles), largest capacity. This is the GPU's main RAM — 80 GB on the H100. All data starts and ends here. Bandwidth is impressive (~3 TB/s on H100), but the latency is roughly 60× higher than shared memory. Most kernels are bottlenecked by how fast they can move data from HBM to the compute units, not by the compute itself.
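The tiled pattern described in the shared-memory bullet can be sketched in NumPy. This is a CPU-side sketch, not GPU code: the small per-tile arrays stand in for on-chip shared-memory buffers, and the tile size of 32 is an arbitrary choice:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Tiled matrix multiply: each output tile accumulates products of small
    A and B tiles — the tiles play the role of shared-memory buffers."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, n, tile):
                # "Load" one tile of A and one of B into fast on-chip storage...
                a_tile = A[i:i+tile, k:k+tile]
                b_tile = B[k:k+tile, j:j+tile]
                # ...and compute on it before touching global memory again.
                acc += a_tile @ b_tile
            C[i:i+tile, j:j+tile] = acc
    return C

n = 128
A = np.random.default_rng(1).random((n, n))
B = np.random.default_rng(2).random((n, n))
print(np.allclose(tiled_matmul(A, B), A @ B))  # → True
```

The payoff on a GPU: each element of a tile is loaded from global memory once but reused `tile` times from shared memory, cutting global-memory traffic by a factor of the tile size.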
The overarching rule is: keep data as close to the cores as possible . Moving data from HBM to registers is, in many workloads, the true bottleneck — not the arithmetic. This is precisely why the roofline model (which we'll explore later in this track) plots performance against arithmetic intensity: if your kernel does too little math per byte loaded, you're memory-bound, and no amount of faster cores will help.
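As a quick sanity check of that claim, we can compute the arithmetic intensity (FLOPs per byte) of a float32 vector add and compare it with the machine balance implied by the H100 numbers quoted above (~3 TB/s HBM bandwidth; the ~67 TFLOP/s FP32 peak used here is an assumption for illustration):

```python
# Vector add: C = A + B, float32
n = 1 << 20
flops = n                # one add per element
bytes_moved = 3 * n * 4  # read A, read B, write C (4 bytes each)
intensity = flops / bytes_moved
print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")

# Machine balance: FLOPs the GPU can do per byte it can stream from HBM
peak_flops = 67e12  # ~67 TFLOP/s FP32 (assumed)
bandwidth = 3e12    # ~3 TB/s HBM
balance = peak_flops / bandwidth
print(f"Machine balance:      {balance:.1f} FLOP/byte")
print("Memory-bound!" if intensity < balance else "Compute-bound!")
```

At roughly 0.08 FLOP/byte against a balance of ~22 FLOP/byte, vector add is deep in memory-bound territory: the cores sit idle waiting on HBM regardless of how fast they are.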
Memory Coalescing
When threads in a warp access global memory, the hardware tries to coalesce their individual requests into a smaller number of wide memory transactions. This is arguably the single most important optimisation concept in GPU programming, because it can make the difference between using 3% of available bandwidth and 100%.
Here's how it works. If all 32 threads in a warp access consecutive 4-byte addresses — thread 0 reads address 0, thread 1 reads address 4, thread 2 reads address 8, and so on up to thread 31 reading address 124 — the hardware combines all 32 requests into a single 128-byte transaction. That's 1 transaction instead of 32 , a massive bandwidth saving.
If threads access scattered or strided addresses, each request may trigger a separate transaction, wasting most of the bytes in each one. This is why memory access patterns matter enormously on GPUs — the same algorithm can run 10–30× faster simply by reorganising how threads access memory.
import numpy as np
warp_size = 32
element_size = 4 # bytes (float32)
transaction_size = 128 # bytes (GPU memory transaction width)
elements_per_txn = transaction_size // element_size # 32
# Coalesced: consecutive addresses
coalesced_addrs = np.arange(warp_size) * element_size
txns_coalesced = 1 # all fit in one 128-byte transaction
# Strided: every other element (stride = 2)
strided_addrs = np.arange(warp_size) * element_size * 2
txns_strided = 2 # spans 256 bytes → 2 transactions
# Random: scattered addresses
np.random.seed(42)
random_addrs = np.random.choice(10000, warp_size, replace=False) * element_size
# Worst case: up to 32 separate transactions
unique_txn_blocks = len(set(addr // transaction_size for addr in random_addrs))
txns_random = unique_txn_blocks
# Collect rows for aligned output
rows = [
    ("Coalesced (consecutive)", txns_coalesced, "baseline"),
    ("Strided (every other)", txns_strided, "slower"),
    ("Random (scattered)", txns_random, "slower"),
]
# Compute max width per column for alignment
w_name = max(len(r[0]) for r in rows)
w_txn = max(len(str(r[1])) for r in rows)

print("Memory Coalescing: 32 threads accessing global memory")
print(f" Transaction size: {transaction_size} bytes (fits {elements_per_txn} float32s)")
print()
for name, txns, label in rows:
    suffix = "transaction" if txns == 1 else "transactions"
    print(f" {name:<{w_name}} {txns:>{w_txn}} {suffix:<12} → {txns:>{w_txn}}× {label}")
print()
print("Coalesced access uses bandwidth efficiently.")
print("Scattered access wastes most of each transaction.")
The lesson is clear: when designing data structures for GPU computation, favour Structure of Arrays (SoA) over Array of Structures (AoS) . If each thread needs one field from a record, SoA layout means those fields are contiguous in memory and can be coalesced. AoS layout interleaves fields from different records, producing strided access patterns that waste bandwidth.
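The difference between the two layouts can be made concrete by counting how many 128-byte transaction blocks a warp touches when each of its 32 threads reads one float32 field (a record with four float32 fields is assumed for illustration):

```python
warp_size = 32
field_size = 4         # float32
fields_per_record = 4  # e.g. a record {x, y, z, w}
txn_size = 128         # bytes per memory transaction

# AoS: records interleaved — thread i reads field 0 of record i (stride 16 B)
aos_addrs = [i * fields_per_record * field_size for i in range(warp_size)]
# SoA: field 0 of all records stored contiguously (stride 4 B)
soa_addrs = [i * field_size for i in range(warp_size)]

def transactions(addrs):
    # Count distinct 128-byte blocks touched by the warp's addresses
    return len({a // txn_size for a in addrs})

print(f"AoS: {transactions(aos_addrs)} transactions")  # → 4
print(f"SoA: {transactions(soa_addrs)} transactions")  # → 1
```

With four fields per record, the AoS warp spreads its 128 requested bytes across 512 bytes of memory (4 transactions), while the SoA warp reads exactly one transaction — a 4× bandwidth difference from layout alone.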
Quiz
Test your understanding of the CUDA programming model — thread hierarchies, warps, memory levels, and access patterns.
Why does CUDA launch threads in warps of exactly 32?
What happens when threads in a warp take different branches (warp divergence)?
What is shared memory used for?
Why is memory coalescing important?