Latency vs Throughput
CPUs are optimised for latency — executing a single thread of instructions as fast as possible. They invest transistors in branch prediction, speculative execution, out-of-order pipelines, and large caches (often 50% or more of the die area). The result is that one thread runs very fast, but there is limited parallelism available.
GPUs make the opposite bet: throughput. They sacrifice single-thread performance for massive parallelism. Instead of making one thread fast, they run thousands of threads simultaneously, each slower individually but collectively processing far more work per second. As the Modal GPU Glossary puts it, GPUs are throughput-oriented processors — they are designed to maximise the total amount of work completed per unit of time rather than minimise the time any single task takes.
A helpful analogy: a CPU is like a sports car — very fast for one passenger. A GPU is like a bus — slower per passenger, but it moves 50 people at once. For workloads that process millions of independent data points (pixels, matrix elements, token embeddings), the bus wins decisively.
We can illustrate this throughput advantage with a simple comparison. A Python for loop processes elements one at a time (like a CPU executing sequentially), while a vectorised numpy operation applies the same transformation to all elements simultaneously (analogous to what a GPU does natively).
import numpy as np
import time
# Simulate: CPU-style sequential vs GPU-style parallel
n = 1_000_000
data = np.random.randn(n).astype(np.float32)
# "Sequential" (one element at a time)
start = time.time()
result_seq = np.array([x * 2.0 + 1.0 for x in data])
seq_time = time.time() - start
# "Parallel" (vectorised — this is what GPUs do natively)
start = time.time()
result_par = data * 2.0 + 1.0
par_time = time.time() - start
print(f"Sequential (Python loop): {seq_time:.3f}s")
print(f"Parallel (numpy vectorised): {par_time:.4f}s")
print(f"Speedup: {seq_time/par_time:.0f}×")
print(f"\nThis is a simplified analogy for the CPU vs GPU difference:")
print(f"GPUs apply the SAME operation to MANY data points simultaneously.")
The vectorised path is typically 50–200× faster, even on a CPU — and it only hints at the scale of parallelism a real GPU provides. An H100, for example, has 16,896 CUDA Cores, each capable of executing a floating-point operation every clock cycle.
The Physical Hierarchy
A GPU is organised into a nested hierarchy of processing units. This hierarchy exists because of fundamental physical constraints — you cannot wire every core directly to every byte of memory. Instead, the chip groups cores together at progressively larger scales, with each level providing its own memory and scheduling resources. From the top level down, the hierarchy looks like this:
- GPC (Graphics/GPU Processing Cluster): the largest processing block. An H100 has 8 GPCs. Each GPC is an independent cluster with its own raster engine and memory interface.
- TPC (Texture Processing Cluster): each GPC contains several TPCs (9 per GPC on the H100). Each TPC pairs two Streaming Multiprocessors and groups them together for scheduling purposes.
- SM (Streaming Multiprocessor): the core compute unit. An H100 has 132 SMs total. Each SM contains CUDA Cores, Tensor Cores, caches, and a register file. This is where the actual math happens.
- Cores: inside each SM sit the execution units — CUDA Cores (for general floating-point math), Tensor Cores (for matrix operations), Special Function Units or SFUs (for sin, cos, exp), and Load/Store Units or LSUs (for memory access).
The following diagram shows this hierarchy for the NVIDIA H100 (see the Modal GPU Glossary for additional detail on each component):
GPU (e.g., H100)
├── GPC 0 ← 8 GPCs
│ ├── TPC 0 ← 9 TPCs per GPC
│ │ ├── SM 0 ← 2 SMs per TPC
│ │ │ ├── 128 CUDA Cores ← FP32 arithmetic
│ │ │ ├── 4 Tensor Cores ← matrix multiply-accumulate
│ │ │ ├── 4 SFUs ← sin, cos, exp, rsqrt
│ │ │ ├── 32 LSUs ← load/store from memory
│ │ │ ├── 4 Warp Schedulers ← pick warps to execute
│ │ │ ├── 256 KB Register File ← per-thread fast storage
│ │ │ └── 256 KB L1/Shared Memory
│ │ └── SM 1 ...
│ └── TPC 1 ...
├── GPC 1 ...
├── ...
├── L2 Cache (50 MB on H100)
└── HBM3 GPU RAM (80 GB on H100, ~3 TB/s bandwidth)
The nested structure serves two purposes. First, it provides progressively larger but slower memory at each level: registers (fastest, per-thread) → L1/shared memory (per-SM) → L2 cache (chip-wide) → HBM (off-chip, highest capacity). Second, it groups cores together for efficient scheduling — a warp scheduler inside an SM can dispatch work to its local cores without coordinating with the rest of the chip.
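One way to see the capacity side of this trade-off is to total the sizes quoted above across all 132 SMs. This is a back-of-envelope sketch using only the figures from this section; access latencies differ by orders of magnitude between levels and are omitted here.

```python
# Total capacity at each level of the H100 memory hierarchy (figures from the text)
sms = 132
levels = [
    ("registers (per-thread)", sms * 256 / 1024),  # 256 KB per SM -> MB total
    ("L1/shared (per-SM)",     sms * 256 / 1024),  # 256 KB per SM -> MB total
    ("L2 cache (chip-wide)",   50.0),              # 50 MB, shared by all SMs
    ("HBM3 (off-chip)",        80 * 1024.0),       # 80 GB, expressed in MB
]
for name, mb in levels:
    print(f"{name:24s} {mb:>10,.1f} MB")
```

The fastest storage (registers and L1 combined, ~66 MB chip-wide) is a thousand times smaller than HBM — which is exactly why keeping data close to the cores matters so much for GPU performance.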
Streaming Multiprocessors: The Core Compute Unit
The SM is where all computation happens. Think of it as a miniature processor, but one designed very differently from a CPU core.
A CPU core is complex: it predicts branches, reorders instructions, speculates on future work, and maintains a large cache hierarchy — all to make a single thread run as fast as possible. An SM is simpler: it runs instructions in order, does not predict branches, and has a relatively small cache. Instead, it manages up to 2,048 concurrent threads (64 warps of 32 threads each) and hides memory latency by switching between warps every clock cycle.
When one warp is waiting for data from memory — a latency that can take hundreds of cycles — the SM immediately switches to another warp that has data ready. This technique, often called latency hiding through occupancy , is the fundamental trick that makes GPUs fast. They do not make memory faster; they just do useful work while waiting for it. The Modal GPU Glossary describes this as the GPU's answer to the memory wall: rather than reducing latency, it tolerates latency by keeping enough work in flight that there is always something to execute.
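The amount of work needed to hide a stall follows Little's law: concurrency required = latency × issue rate. A rough sketch of that arithmetic — the latency and issue-rate figures below are illustrative order-of-magnitude assumptions, not measured H100 specifications:

```python
# Little's law sketch: how much work must be in flight to hide memory latency.
# Illustrative numbers (assumptions, not measured H100 values):
memory_latency_cycles = 400   # rough HBM round-trip latency, in cycles
issues_per_cycle = 4          # one instruction per warp scheduler, 4 schedulers/SM

# Independent instructions that must be in flight to keep the SM fully busy:
in_flight = memory_latency_cycles * issues_per_cycle
print(f"~{in_flight} independent instructions in flight per SM")

# Spread across 64 resident warps, each warp only needs a handful of
# independent instructions between memory accesses:
warps = 64
print(f"~{in_flight / warps:.0f} instructions per warp")
```

Under these assumptions, 64 resident warps each need only ~25 independent instructions in flight — which is why high occupancy lets the SM tolerate latencies that would stall a CPU core.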
Here are the concrete numbers for the NVIDIA H100:
- 132 SMs × 128 CUDA Cores = 16,896 CUDA Cores total
- 132 SMs × 4 Tensor Cores = 528 Tensor Cores total
- 132 SMs × 2,048 max threads = 270,336 concurrent threads
- Each SM: 256 KB register file, 256 KB L1/shared memory
Compare that to a high-end CPU like the AMD EPYC 9654 — it has 96 cores with 2 threads each, for a total of 192 concurrent threads. The H100 manages roughly 1,400× more concurrent threads. Individually each GPU thread is far simpler and slower, but the aggregate throughput is what matters for data-parallel workloads.
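The thread-count comparison above is simple arithmetic, and worth verifying directly from the per-SM figures:

```python
# Concurrent-thread capacity: H100 vs AMD EPYC 9654 (figures from the text)
h100_threads = 132 * 2048   # 132 SMs x 2,048 resident threads each
cpu_threads = 96 * 2        # 96 cores x 2 SMT threads each

print(h100_threads)                            # 270336
print(f"{h100_threads / cpu_threads:.0f}x")    # 1408x
```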
Why This Matters for Deep Learning
Neural network training and inference are dominated by matrix multiplications — and matrix multiplication is embarrassingly parallel . Each output element is an independent dot product that can be computed without waiting for any other element.
Consider a matrix multiply of two 4096×4096 matrices. This involves roughly $2 \times 4096^3 \approx 137$ billion FP16 floating-point operations (one multiply and one add per inner-product term). On a CPU, this might take hundreds of milliseconds. On an H100 with 528 Tensor Cores running at approximately 1,000 TFLOPS (FP16), it can complete in under a millisecond. That is the throughput advantage in practice.
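The timing estimate in the previous paragraph is easy to reproduce. This sketch divides the operation count by the quoted ~1,000 TFLOPS peak, so it gives an ideal compute-bound time that ignores memory traffic and kernel-launch overhead:

```python
# Back-of-envelope: ideal time for a 4096x4096 FP16 matmul on an H100
n = 4096
flops = 2 * n**3              # one multiply + one add per inner-product term
peak_flops = 1000e12          # ~1,000 TFLOPS dense FP16 Tensor Core throughput

ideal_s = flops / peak_flops
print(f"{flops / 1e9:.0f} billion ops")               # -> 137 billion
print(f"ideal compute time: {ideal_s * 1e3:.3f} ms")  # -> 0.137 ms
```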
Let's see this with a small example. We'll multiply two small matrices and verify that every output element is an independent dot product — exactly the kind of work that maps perfectly onto thousands of parallel GPU threads.
import numpy as np
# Matrix multiply: each output element is an independent dot product
M, K, N = 4, 3, 2
A = np.random.randn(M, K).round(1)
B = np.random.randn(K, N).round(1)
C = A @ B
print(f"A ({M}×{K}) @ B ({K}×{N}) = C ({M}×{N})")
print(f"\nA:\n{A}")
print(f"\nB:\n{B}")
print(f"\nC = A @ B:\n{C}")
print(f"\nEach element of C is an independent dot product:")
for i in range(M):
    for j in range(N):
        dot = sum(A[i,k] * B[k,j] for k in range(K))
        print(f"  C[{i},{j}] = {' + '.join(f'{A[i,k]:.1f}×{B[k,j]:.1f}' for k in range(K))} = {dot:.2f}")
total_ops = M * N * (2 * K - 1)
print(f"\nTotal operations: {total_ops} (all independent → perfect for GPUs)")
print(f"For 4096×4096 matmul: ~{2 * 4096**3 / 1e9:.0f} billion ops")
In real deep learning workloads, this pattern repeats at every layer of the network. A single forward pass through a large language model may involve hundreds of matrix multiplications, each with dimensions in the thousands. Without the massive parallelism of a GPU, modern deep learning would simply not be practical — training runs that take days on GPUs would take years on CPUs.
Quiz
Test your understanding of GPU architecture fundamentals — the throughput model, physical hierarchy, and why GPUs are well-suited for deep learning workloads.
- What is the fundamental design difference between a CPU core and a GPU SM?
- How does a GPU Streaming Multiprocessor (SM) hide memory latency?
- In the GPU physical hierarchy, what components does a Streaming Multiprocessor (SM) contain?
- Why is matrix multiplication particularly well-suited for GPU execution?