Inside the Streaming Multiprocessor

Anatomy of an SM

A Streaming Multiprocessor (SM) is the fundamental compute building block of every NVIDIA GPU. Article 1 introduced the SM at a high level — a self-contained processor with its own execution units, registers, and local memory. Now we open the lid and look at what is actually inside one.

An SM contains several types of execution units, each specialised for a different class of operations. The diagram below shows the layout of an SM in the H100 (Hopper) architecture . The exact numbers vary across GPU generations — earlier architectures have fewer CUDA Cores, different register file sizes, or no Tensor Cores at all — but the structural principles remain remarkably consistent.

┌─────────────────────────────────────────────────────────────┐
│                    Streaming Multiprocessor                   │
├──────────────┬──────────────┬──────────────┬────────────────┤
│  Warp Sched. │  Warp Sched. │  Warp Sched. │  Warp Sched.   │
│  + Dispatch  │  + Dispatch  │  + Dispatch  │  + Dispatch    │
├──────────────┴──────────────┴──────────────┴────────────────┤
│                                                              │
│  ┌────────────┐ ┌────────────┐ ┌─────┐ ┌─────┐ ┌────────┐  │
│  │ 32 CUDA    │ │ 32 CUDA    │ │ SFU │ │ SFU │ │ LSU ×8 │  │
│  │ Cores (FP32)│ │ Cores (FP32)│ │     │ │     │ │        │  │
│  └────────────┘ └────────────┘ └─────┘ └─────┘ └────────┘  │
│  ┌────────────┐ ┌────────────┐ ┌─────┐ ┌─────┐ ┌────────┐  │
│  │ 32 CUDA    │ │ 32 CUDA    │ │ SFU │ │ SFU │ │ LSU ×8 │  │
│  │ Cores (FP32)│ │ Cores (FP32)│ │     │ │     │ │        │  │
│  └────────────┘ └────────────┘ └─────┘ └─────┘ └────────┘  │
│                                                              │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ Tensor Core │ │ Tensor Core │ │ Tensor Core │ ...×4      │
│  │ (MMA 4×4)   │ │ (MMA 4×4)   │ │ (MMA 4×4)   │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐    │
│  │              256 KB Register File                     │    │
│  │         (65,536 × 32-bit registers)                   │    │
│  └──────────────────────────────────────────────────────┘    │
│  ┌──────────────────────────────────────────────────────┐    │
│  │         256 KB L1 Data Cache / Shared Memory          │    │
│  │      (configurable split, e.g., 192 KB / 64 KB)      │    │
│  └──────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────┘

Let's walk through each component:

Warp Schedulers (×4): each scheduler picks one ready warp per cycle and dispatches its next instruction to the appropriate execution unit. Four schedulers operating in parallel is what allows the SM to keep multiple execution pipelines fed simultaneously.
CUDA Cores (128 total, 4 groups of 32): general-purpose floating-point and integer execution units. Each performs one FP32 multiply-add per clock cycle. These are the workhorses of general computation.
Tensor Cores (×4): specialised matrix multiply-accumulate units. Each computes a $4 \times 4$ matrix multiply-add in a single cycle — roughly 128 multiply-adds where a CUDA Core does one.
SFUs — Special Function Units (×4): dedicated hardware for transcendental math (sin, cos, exp, log, reciprocal square root). These operations would take many cycles on general-purpose ALUs.
LSUs — Load/Store Units (×16): handle memory requests to global memory (HBM) and shared memory. Memory coalescing — combining many thread-level requests into fewer wide transactions — happens at this stage.
Register File (256 KB): the fastest memory available, with zero-latency access. Private to each thread, divided among all active threads on the SM.
L1 Cache / Shared Memory (256 KB): fast on-chip SRAM with a configurable split between hardware-managed L1 cache and programmer-managed shared memory.

The rest of this article examines each component in more detail, starting with the CUDA Cores.

CUDA Cores: The Workhorses

CUDA Cores are the general-purpose floating-point and integer execution units inside the SM. Each one can perform one FP32 multiply-add per clock cycle. An H100 SM has 128 CUDA Cores arranged in 4 groups of 32, giving it 128 FP32 operations per cycle.

Why 32 per group? Because a warp is 32 threads, and each warp scheduler dispatches one instruction to 32 CUDA Cores simultaneously — one core per thread. The 4 groups map to the 4 warp schedulers: each scheduler can issue one warp instruction per cycle, and the 4 operate in parallel. This is the fundamental link between the SIMT execution model and the physical hardware.

CUDA Cores handle a wide range of operations: addition, multiplication, fused multiply-add (FMA), integer arithmetic, comparison, and bit operations. If your kernel is doing scalar math — adding vectors, computing element-wise functions, running control flow — those operations are likely flowing through CUDA Cores. They are the default execution path for most GPU workloads.

💡 The term "CUDA Core" is NVIDIA's marketing name for what is essentially a single-precision floating-point ALU. Each core is far simpler than a CPU core — no branch predictor, no out-of-order engine, no private cache. The SM as a whole manages scheduling and memory; the individual core just executes the operation it receives.

(See also "CUDA Core" in the Modal GPU Glossary for a concise hardware-level definition.)

Tensor Cores: Matrix Multiply Machines

Tensor Cores are specialised hardware for one specific operation: small matrix multiply-accumulate (MMA). In a single cycle, a Tensor Core computes:

D = A \times B + C

where $A$ is $4 \times 4$, $B$ is $4 \times 4$, $C$ is $4 \times 4$ (the accumulator), and $D$ is the $4 \times 4$ result. That is roughly 128 multiply-add operations in a single cycle — compared to a CUDA Core doing exactly 1.

Let's break down each part of that equation:

$A$ and $B$: the input matrices, typically stored in FP16 or BF16 for efficiency. Lower precision means each element is half the size of FP32, so twice as many values fit in the same register space, and the multiply circuits are smaller and faster.
$C$: the accumulator, often stored in FP32 for precision. You multiply in low precision but accumulate in high precision — this is the basis of mixed-precision training . The idea is that individual multiply errors from FP16 are small, but they compound across thousands of additions. Accumulating in FP32 keeps the running sum accurate.
$D$: the result, which overwrites $C$ for the next tile iteration. In a large matmul, $C$ accumulates results from many tile multiplications before being written back.

Why $4 \times 4$? It appears to be a hardware design sweet spot — small enough to implement efficiently in silicon (the number of wires and multiply units scales with the square of the dimension), large enough that tiling a large matmul into $4 \times 4$ blocks amortises control overhead. In practice, the programmer rarely thinks in $4 \times 4$ tiles directly. Libraries like cuBLAS and Triton tile large matrices into blocks (often $16 \times 16$ or larger) and map each block down to sequences of Tensor Core MMA instructions.

An H100 SM has 4 Tensor Cores. At FP16 precision, each performs roughly 128 multiply-adds per cycle, so the SM delivers:

4 \text{ Tensor Cores} \times 128 \text{ ops/cycle} = 512 \text{ FP16 multiply-adds per cycle per SM}

Across the full H100 GPU with 132 SMs at roughly 1.8 GHz:

132 \times 512 \times 1.8 \times 10^9 \approx 122 \text{ TFLOPS (FP16, without sparsity)}

This is why Tensor Cores matter so much for deep learning. Matrix multiplication dominates the compute in both training and inference (attention, linear layers, convolutions expressed as matmuls), and Tensor Cores accelerate it by roughly two orders of magnitude over CUDA Cores alone.

The simulation below demonstrates the $4 \times 4$ MMA operation. We use FP16 inputs and an FP32 accumulator, mirroring what the hardware actually does:

import numpy as np

# Tensor Core operation: D = A @ B + C (4×4 matrices)
A = np.array([[1, 0, 2, 1],
              [0, 1, 1, 0],
              [2, 0, 1, 1],
              [1, 1, 0, 2]], dtype=np.float16)

B = np.array([[1, 2, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 2, 1],
              [2, 1, 0, 1]], dtype=np.float16)

C = np.zeros((4, 4), dtype=np.float32)  # accumulator in FP32

# One Tensor Core cycle: D = A @ B + C
D = (A.astype(np.float32) @ B.astype(np.float32)) + C

print("Tensor Core MMA: D = A × B + C")
print(f"\nA (FP16):\n{A}")
print(f"\nB (FP16):\n{B}")
print(f"\nC (FP32 accumulator):\n{C}")
print(f"\nD = A @ B + C (FP32 result):\n{D}")
print(f"\nOperations in one cycle: {4*4*(2*4-1)} multiply-adds")
print(f"A CUDA Core does 1 per cycle. A Tensor Core does ~128.")

(See also "Tensor Core" in the Modal GPU Glossary for more on how these units map to the warp-level MMA instructions.)

Special Function Units and Load/Store Units

Two more execution unit types round out the SM's computational toolkit.

SFUs (Special Function Units) handle transcendental math — sin , cos , exp , log , and reciprocal square root ( rsqrt ). Computing these with plain multiply-adds would require iterative approximations (Taylor series, CORDIC algorithms) taking many cycles. Dedicated SFU hardware handles them in significantly fewer cycles — typically around 8 cycles per operation, compared to dozens for a software implementation. An SM typically has 4 SFUs.

Where do SFUs show up in deep learning? Activation functions like GELU and Swish involve exp and tanh ; softmax requires exp across an entire row of logits; layer normalisation uses rsqrt . These operations are not the bottleneck in most models (matmuls dominate), but without SFUs they would be noticeably slower.

LSUs (Load/Store Units) handle memory requests. When a thread needs data from global memory (GPU HBM) or shared memory, an LSU issues the request and manages the transaction. An SM has roughly 32 LSUs, allowing it to handle multiple memory requests per cycle.

Memory coalescing happens at this level: when many threads in a warp request addresses that fall within the same 128-byte cache line, the LSUs can combine those into a single wide memory transaction rather than issuing 32 separate requests. This is one of the most important performance considerations in CUDA programming — coalesced access patterns can be an order of magnitude faster than scattered ones. We will return to coalescing in detail in a later article on memory access patterns.

The Register File: Fastest Memory

The register file is the fastest memory available to threads — zero additional latency, accessible every cycle, and private to each thread. An H100 SM has 256 KB of registers, which amounts to 65,536 individual 32-bit registers. This is a substantial amount of storage — larger than the entire L1 cache of most CPU cores — and it needs to be, because it is shared among all active threads on the SM.

This creates a fundamental tradeoff. If each thread uses many registers (common in complex kernels that keep intermediate results close at hand), then fewer threads can be active simultaneously, reducing occupancy — the ratio of active warps to the maximum the SM supports. Lower occupancy means the warp schedulers have fewer warps to choose from when one stalls on a memory request, reducing the SM's ability to hide latency by switching between warps.

This is register pressure — one of the most subtle performance cliffs in GPU programming. A kernel that uses 128 registers per thread might run at 25% occupancy, while one that uses 24 registers per thread can fill the SM entirely. Whether the reduced occupancy actually hurts performance depends on whether the kernel is compute-bound or memory-bound. We will explore this tradeoff in depth in article 5.

The following simulation shows how the register budget constrains active threads:

import json, js

total_registers = 65536
max_threads = 2048

scenarios = [
    ("Simple kernel",  24),
    ("Complex kernel", 128),
]
rows = []
for name, regs in scenarios:
    threads = min(total_registers // regs, max_threads)
    warps = threads // 32
    occupancy = threads / max_threads
    rows.append([name, str(regs), str(threads), str(warps), f"{occupancy:.0%}"])

print(f"Total registers: {total_registers}")
print(f"Max threads/SM:  {max_threads}")
print()
print("Fewer active warps → less latency hiding → potential stalls.")
print("This is the register pressure problem (article 5).")

js.window.py_table_data = json.dumps({
    "headers": ["Kernel", "Regs/Thread", "Threads", "Warps", "Occupancy"],
    "rows": rows
})

Notice the dramatic difference: 100% occupancy vs. 25%. The simple kernel gives the warp schedulers 64 warps to cycle through, while the complex kernel gives them only 16. Whether those 16 warps are enough to keep the execution units busy depends on the kernel's arithmetic intensity — a topic we will formalise with the roofline model later in the track.

L1 Cache and Shared Memory

Each SM has a combined 256 KB of fast on-chip SRAM that serves a dual purpose:

L1 Data Cache: automatically caches data fetched from global memory (HBM). The hardware manages what gets cached and what gets evicted — the programmer does not control it directly. This helps with irregular access patterns where threads in different warps happen to touch the same memory region.
Shared Memory: explicitly programmer-managed memory that is shared among all threads in a thread block (not across the entire SM — only within a single block). Threads can read, write, and synchronise through shared memory, enabling cooperative algorithms like tiled matrix multiplication and parallel reductions.

The split between L1 and shared memory is configurable. On the H100, the 256 KB can be divided as, for example, 192 KB shared memory / 64 KB L1, or shifted the other way depending on the kernel's needs. The CUDA runtime provides an API ( cudaFuncSetAttribute ) to request a preferred split before launching a kernel.

The rule of thumb is straightforward: kernels that need thread cooperation — tiled matmul, reductions, stencil computations — benefit from allocating more shared memory. Kernels with irregular, data-dependent access patterns benefit from a larger L1 cache. Most deep learning kernels fall into the first category, which is why shared memory is so central to high-performance GPU programming.

💡 Shared memory has much lower latency than global memory (roughly 20–30 cycles vs. 200–400 cycles for an L1 miss going to HBM), but it is still slower than registers (which have zero additional latency). The memory hierarchy inside an SM, from fastest to slowest, is: registers → shared memory / L1 → L2 cache (shared across all SMs) → HBM.

(See also "Shared Memory" in the Modal GPU Glossary for more on bank conflicts and optimal access patterns.)

Warp Schedulers: Keeping the SM Busy

Each SM has 4 warp schedulers. Every clock cycle, each scheduler can:

1. Pick one of its assigned warps that is ready to execute — meaning it is not waiting for memory data to arrive, not stalled on a synchronisation barrier, and has its next instruction decoded and ready.
2. Issue one instruction from that warp to the appropriate execution units — CUDA Cores for arithmetic, Tensor Cores for MMA, SFUs for transcendentals, or LSUs for memory operations.

With 4 schedulers operating in parallel, an SM can issue up to 4 warp instructions per cycle — potentially to different execution units. For instance, in a single cycle, one scheduler might send a warp to the CUDA Cores for an FMA, another might dispatch a warp to a Tensor Core for an MMA, a third might issue a memory load through the LSUs, and the fourth might send a warp to an SFU for an exp computation. This parallelism across execution unit types is a key source of throughput.

This is the core mechanism behind latency hiding . When warp A issues a memory load and must wait hundreds of cycles for the data to arrive from HBM, the scheduler does not idle — it picks warp B, which is ready to execute. When warp B stalls on a __syncthreads() barrier, it picks warp C. As long as there are enough ready warps to cycle through, the execution units stay busy and the SM wastes very few cycles.

This is why occupancy matters . More active warps means more candidates for the scheduler to choose from, and fewer cycles where every warp is stalled simultaneously. A fully occupied SM with 64 warps can tolerate long memory latencies because there is almost always at least one warp that is ready. An SM with only 4 active warps has far less room to manoeuvre — if all 4 are waiting on memory, the execution units sit completely idle until data arrives.

That said, higher occupancy does not always translate to higher performance. A compute-bound kernel that keeps the execution units busy with arithmetic may perform just as well at 50% occupancy as at 100%. The benefit of high occupancy is specifically in hiding memory latency — and it only helps when there is latency to hide.

(See also "Warp Scheduler" in the Modal GPU Glossary for more on scheduling policies and dual-issue capabilities.)

Quiz

Test your understanding of the Streaming Multiprocessor's internal architecture.

How many FP32 operations can 128 CUDA Cores perform per clock cycle?

32 (one per warp scheduler)

128 (one per core)

512 (four per core)

256 (two per core)

What operation does a Tensor Core perform in one cycle?

A single FP16 multiply-add

A 16×16 matrix transpose

A 4×4 matrix multiply-accumulate (D = A×B + C), computing ~128 multiply-adds

A convolution over a 3×3 filter

Why does using more registers per thread reduce occupancy?

More registers make each thread slower to execute

The compiler limits threads when register usage is high

The SM has a fixed register budget; more registers per thread means fewer threads can be active simultaneously

Registers conflict with shared memory allocation

How do warp schedulers hide memory latency?

By prefetching data from HBM into registers before it is needed

By switching to a different ready warp when the current warp stalls on a memory request

By compressing memory requests to reduce transfer time

By caching all global memory accesses in shared memory