The Five Layers
When you write torch.matmul(A, B), a surprisingly deep stack of software transforms that one-line Python call into machine code running on a GPU. There are at least five distinct layers between your script and the transistors that do the actual arithmetic, and each one exists for a good reason. This article traces the full path — from Python to silicon — so you can see exactly where your computation lives at each stage.
┌───────────────────────────────────────────────────────────┐
│ Layer 1: Python Frontend │
│ torch.matmul(A, B) │
├───────────────────────────────────────────────────────────┤
│ Layer 2: pybind11 Binding │
│ Python objects → C++ objects (memory pointers, shapes) │
├───────────────────────────────────────────────────────────┤
│ Layer 3: C++ Backend (ATen) │
│ at::matmul → dispatcher → route by device/dtype │
├───────────────────────────────────────────────────────────┤
│ Layer 4: Kernel Libraries │
│ cuBLAS (linear algebra) / cuDNN (DNN ops) │
│ Pre-compiled SASS binaries shipped with pip install torch │
├───────────────────────────────────────────────────────────┤
│ Layer 5: Hardware │
│ SASS → microcode → electrical signals → ALU transistors │
└───────────────────────────────────────────────────────────┘
Here is the one-sentence summary of each layer:
- Layer 1 (Python Frontend) is the familiar API you interact with — it validates inputs and records autograd operations but performs no math.
- Layer 2 (pybind11) is the language bridge that translates Python objects into C++ objects, passing memory pointers and tensor metadata across the boundary.
- Layer 3 (ATen + Dispatcher) is the C++ routing layer that examines each tensor's device, dtype, and layout and selects the right kernel implementation.
- Layer 4 (Kernel Libraries) is where the actual computation happens — vendor-optimised libraries like cuBLAS and cuDNN that ship as pre-compiled GPU assembly.
- Layer 5 (Hardware) is the physical substrate: GPU assembly instructions decoded into electrical signals that toggle transistors in the chip's arithmetic units.
The rest of this article walks through each layer in detail, building a mental model of what happens from the moment you press Enter to the moment a result appears in your tensor.
Layers 1 and 2: Python Frontend and the pybind11 Bridge
torch.matmul(A, B) is a Python function defined in PyTorch's Python package. It lives in a file you can inspect yourself (torch/functional.py or similar, depending on the version), and it does almost no computation. Its primary job is to validate inputs: are the shapes compatible for matrix multiplication? Are both tensors on the same device? Is the dtype supported? If something is wrong, you get a clear Python-level error message rather than a cryptic segfault from C++.
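The validation step can be sketched in plain Python. The function below is a hypothetical stand-in, not PyTorch's actual source; it just illustrates the kind of checks the frontend performs before any C++ code runs.

```python
# Hypothetical sketch of frontend-style validation -- not PyTorch's
# actual code, just the shape/device checks the frontend performs.
def validate_matmul(a_shape, b_shape, a_device, b_device):
    if a_shape[-1] != b_shape[-2]:
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({a_shape[-2]}x{a_shape[-1]} and {b_shape[-2]}x{b_shape[-1]})"
        )
    if a_device != b_device:
        raise ValueError(
            f"Expected all tensors to be on the same device, "
            f"but found {a_device} and {b_device}"
        )
    # Result shape: leading dims kept, inner dims contract away.
    return (*a_shape[:-1], b_shape[-1])

print(validate_matmul((1024, 512), (512, 256), "cuda:0", "cuda:0"))
# -> (1024, 256)
```

Because these checks run before crossing into C++, a mistake surfaces as an ordinary Python exception with a readable message.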
Once validation passes, the Python function calls into C++ via pybind11. pybind11 is a lightweight header-only library that generates Python bindings for C++ code. It handles the messy work of translating Python objects (torch.Tensor) into their C++ counterparts (at::Tensor), passing across the language boundary everything the C++ side needs: raw memory pointers to the tensor's storage, shape and stride metadata, dtype information, and the device tag (CPU, CUDA, MPS, etc.). This binding layer is why PyTorch feels like Python — you write Pythonic code and get Pythonic errors — but runs at C++ speed.
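To get a feel for what crossing the language boundary involves, here is a toy ctypes example (ctypes, not pybind11, and not PyTorch's binding code): the C side receives only a raw pointer and a byte count, while the "shape" of the data travels as separate metadata, much as pybind11 passes a data pointer plus shape/stride/dtype for each tensor.

```python
import ctypes

# Toy illustration of a Python -> C boundary crossing. This is ctypes,
# not pybind11, but the essentials are the same: the C routine sees only
# a raw pointer plus a byte count; shape and dtype are separate metadata.
src = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)  # contiguous C float[4]
dst = (ctypes.c_float * 4)()                     # zero-initialised

# memmove is the C library routine: (dest pointer, src pointer, size_t).
ctypes.memmove(dst, src, ctypes.sizeof(src))

print(list(dst))  # -> [1.0, 2.0, 3.0, 4.0]
```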
Layer 3: The C++ Dispatcher
Inside the C++ backend, the function call lands in ATen (short for "A Tensor library") — PyTorch's core tensor operations library. ATen provides the canonical C++ signature for every operation (matmul, add, conv2d, and hundreds more), but it does not implement matmul for every device/dtype combination directly. Instead, it delegates to a dispatcher.
The dispatcher is essentially a routing table. When at::matmul is called, the dispatcher examines the tensor's metadata — device, dtype, layout, and whether autograd tracing is active — and selects the right kernel implementation. The same C++ entry point at::matmul can route to:
- A CPU implementation (using Intel MKL or OpenBLAS) if the tensors live in main memory
- A cuBLAS call if the tensors are on a CUDA GPU
- A different kernel entirely if mixed precision, sparse layout, or a quantised dtype is involved
This design is what makes PyTorch extensible. New backends — a TPU backend, an Apple MPS backend, a custom accelerator from a hardware startup — can be added by registering new dispatch keys. The frontend Python code doesn't change at all. The user still writes torch.matmul(A, B); the dispatcher handles routing it to the right place based on where A and B happen to live. This separation of interface from implementation is one of the most important architectural decisions in PyTorch's codebase.
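To make the routing-table idea concrete, here is a toy dispatcher in Python. The registry, decorator, and kernel names are illustrative stand-ins, not PyTorch's real dispatcher (which lives in C++ and handles far more keys):

```python
# Minimal sketch of a dispatch table keyed by (device, dtype).
# The registry and kernel names are illustrative, not PyTorch's.
_matmul_kernels = {}

def register(device, dtype):
    def wrap(fn):
        _matmul_kernels[(device, dtype)] = fn
        return fn
    return wrap

@register("cpu", "float32")
def matmul_cpu_f32(a, b):
    return "mkl_sgemm"          # stand-in for a real MKL call

@register("cuda", "float32")
def matmul_cuda_f32(a, b):
    return "cublas_sgemm"       # stand-in for a real cuBLAS call

def dispatch_matmul(a, b, device, dtype):
    try:
        kernel = _matmul_kernels[(device, dtype)]
    except KeyError:
        raise NotImplementedError(f"no matmul kernel for {device}/{dtype}")
    return kernel(a, b)

print(dispatch_matmul(None, None, "cuda", "float32"))  # -> cublas_sgemm
```

Adding a new backend in this model is just another register(...) call; callers of dispatch_matmul never change, which mirrors how new dispatch keys extend PyTorch without touching the frontend.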
The dispatch mechanism also handles composable transforms. Autograd, torch.compile's tracing, vmap (vectorised batching), and functionalization are all implemented as dispatch keys that can be stacked. When you call torch.matmul on a tensor that requires gradients and is being compiled, the dispatcher chains through the Autograd key (which records the operation for backpropagation), then the compilation key (which captures the operation into a graph), and finally the device-specific key (which executes the actual math). Each key does its job and forwards to the next.
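The key-stacking behaviour can be modelled as a chain of handlers, each doing its work and then forwarding to the next. This is a didactic toy, not the real C++ implementation; the key names are illustrative.

```python
# Toy model of stacked dispatch keys: each key does its work, then
# forwards to the next. Names are illustrative, not PyTorch internals.
log = []

def autograd_key(next_fn):
    def handler(a, b):
        log.append("autograd: record op for backward")
        return next_fn(a, b)
    return handler

def compile_key(next_fn):
    def handler(a, b):
        log.append("compile: capture op into graph")
        return next_fn(a, b)
    return handler

def cuda_key(a, b):
    log.append("cuda: launch cuBLAS kernel")
    return "result"

# Stack: Autograd -> Compile -> CUDA (outermost key runs first).
matmul = autograd_key(compile_key(cuda_key))
matmul(None, None)
print(log)
```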
Layer 4: CUDA and the Kernel Libraries
For GPU tensors, the dispatcher routes to CUDA kernels. PyTorch doesn't implement most CUDA kernels from scratch — it calls into NVIDIA's vendor-optimised libraries, which have been tuned over many years for each GPU microarchitecture. The two most important libraries are:
cuBLAS (CUDA Basic Linear Algebra Subprograms) provides highly optimised implementations of fundamental linear algebra operations: matrix multiplication, dot products, matrix-vector multiplies, triangular solves, and more. These are the workhorses behind every linear layer in a neural network. When you compute output = weight @ input + bias, the matrix multiplication ultimately lands in a cuBLAS sgemm (single-precision general matrix multiply) or hgemm (half-precision) call. NVIDIA engineers have spent years optimising these routines, and they typically achieve over 80% of the GPU's theoretical peak FLOPS — far better than a naive CUDA kernel would.
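A quick back-of-the-envelope calculation shows why those peak-FLOPS percentages matter. An N×N matmul costs 2·N³ floating-point operations (N multiplies and roughly N adds per output element, for N² outputs); the 19.5 TFLOPS figure below is a hypothetical FP32 peak in the ballpark of a modern data-centre GPU, not a number from the article.

```python
# Back-of-the-envelope: FLOPs and runtime for a 1024x1024 matmul.
# The 19.5 TFLOPS peak is a hypothetical modern-GPU FP32 figure.
N = 1024
flops = 2 * N**3                      # ~2.1 billion floating-point ops

peak_flops_per_s = 19.5e12            # hypothetical FP32 peak
efficiency = 0.80                     # cuBLAS often reaches ~80% of peak

runtime_s = flops / (peak_flops_per_s * efficiency)
print(f"{flops:,} FLOPs -> {runtime_s * 1e6:.0f} us")
```

Even at 80% efficiency this lands in the hundred-microsecond range, consistent with the timings discussed later in the article.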
cuDNN (CUDA Deep Neural Network library) provides optimised implementations of higher-level deep learning primitives: convolution, pooling, batch normalisation, softmax, and (in newer versions) multi-head attention. These are the workhorses behind CNN layers and transformer attention blocks. cuDNN doesn't just implement these operations naively — it selects among multiple algorithms (e.g., Winograd convolution, FFT-based convolution, direct convolution) and benchmarks them at runtime to find the fastest one for your specific tensor shapes and GPU.
Both cuBLAS and cuDNN ship as pre-compiled SASS binaries inside the torch Python package. When you run pip install torch, you're downloading not just Python code but hundreds of megabytes of pre-compiled GPU assembly — optimised machine code ready to run on your GPU without any compilation step on your end.
# What "pip install torch" puts on your system:
#
# torch/
# ├── __init__.py ← Python frontend
# ├── _C.so ← C++ backend (compiled via pybind11)
# ├── lib/
# │ ├── libtorch.so ← ATen, autograd, dispatcher
# │ ├── libtorch_cuda.so ← CUDA dispatch implementations
# │ ├── libcublas.so ← cuBLAS (pre-compiled SASS)
# │ ├── libcudnn.so ← cuDNN (pre-compiled SASS)
# │ └── ...
# └── nn/, optim/, ... ← High-level Python modules
This explains why pip install torch downloads over 2 GB: you're getting a complete numerical computing stack, from Python wrappers all the way down to GPU machine code, in a single package.
Layer 5: From Source Code to Silicon
The cuBLAS and cuDNN binaries didn't appear from thin air — someone at NVIDIA wrote C++ CUDA source code and compiled it through a multi-stage pipeline. Understanding this pipeline is valuable because the same stages apply whenever anyone writes a custom CUDA kernel (including the kernels that torch.compile generates via Triton, which we'll cover in a later article).
Source code NVCC compiler Hardware
┌──────────┐ ┌──────────────────────────┐ ┌──────────────┐
│ C++ CUDA │ → │ .cu → PTX → SASS │ → │ SASS → μcode │
│ (.cu) │ │ │ │ → transistors │
└──────────┘ │ PTX: portable IR │ └──────────────┘
│ (like LLVM IR for GPUs)│
│ │
│ SASS: GPU-specific ASM │
│ (different per chip: │
│ Ampere, Hopper, etc.) │
└──────────────────────────┘
PTX (Parallel Thread Execution) is NVIDIA's virtual instruction set architecture. It's a portable intermediate representation — the same PTX code can, in principle, run on any NVIDIA GPU, regardless of generation. Think of it as something analogous to LLVM IR or Java bytecode: a stable abstraction layer that decouples the source language from the target hardware. PTX instructions describe operations like "multiply two floats" or "load from global memory" without specifying exactly which hardware units will execute them.
SASS (Streaming ASSembly) is the actual GPU machine code, specific to a particular GPU microarchitecture (identified by an SM version number). PTX is compiled to SASS by ptxas, NVIDIA's assembler (part of the CUDA Toolkit). Different GPU architectures produce different SASS from the same PTX — an Ampere GPU (SM 8.0) and a Hopper GPU (SM 9.0) will generate different instruction sequences, each tuned for that chip's pipeline widths, register file sizes, and memory hierarchy.
Microcode is the final stage. At load time, the GPU's control unit decodes SASS instructions into electrical signals that toggle specific transistors in the Streaming Multiprocessor's ALU (Arithmetic Logic Unit) and register files. This is where computation becomes physics: a multiply-accumulate instruction turns into voltage changes that ripple through transistor gates, producing the sum of products that makes up a matrix multiplication.
Why two compilation stages (PTX then SASS) instead of compiling directly to SASS? The answer is a classic tradeoff between portability and performance. PTX gives portability — write your kernel once, and it can run on any current or future NVIDIA GPU. SASS gives performance — the instruction schedule, register allocation, and memory access patterns are optimised for the specific chip. NVIDIA ships both PTX and SASS in their libraries. If your GPU architecture wasn't anticipated at compile time (e.g., you're running a library compiled for Ampere on a newer Hopper GPU), the driver JIT-compiles the embedded PTX into SASS at runtime. You get a working kernel immediately, though the first launch may be slightly slower while the JIT runs.
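The driver's load-time decision can be sketched as: if the binary embeds SASS for your GPU's SM version, load it directly; otherwise JIT-compile the embedded PTX. Here is a toy model in Python; the data structures and strings are illustrative, not the CUDA driver's real internals.

```python
# Toy model of the CUDA driver's load-time decision: prefer embedded
# SASS for this GPU's SM version, fall back to JIT-compiling the PTX.
# Data structures are illustrative, not the driver's real internals.
fat_binary = {
    "ptx": "portable IR for the kernel",
    "sass": {"sm_80": "ampere machine code"},   # compiled ahead of time
}

def load_kernel(fat_binary, gpu_sm):
    if gpu_sm in fat_binary["sass"]:
        return f"load embedded SASS for {gpu_sm}"
    # Architecture not anticipated at compile time: JIT the PTX.
    return f"JIT-compile PTX -> SASS for {gpu_sm} (first launch is slower)"

print(load_kernel(fat_binary, "sm_80"))  # Ampere: SASS was shipped
print(load_kernel(fat_binary, "sm_90"))  # Hopper: falls back to PTX JIT
```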
Putting It All Together
Let's trace a single call end-to-end, from the moment you press Enter to the moment a result appears in your tensor. This is the full journey of result = torch.matmul(A, B) where A and B are 1024×1024 float32 tensors on a CUDA GPU:
You write: result = torch.matmul(A, B)
1. Python: torch.matmul validates shapes, calls C++ via pybind11
2. C++: at::matmul dispatches based on device=CUDA, dtype=float32
3. Dispatch: Routes to cuBLAS sgemm (single-precision general matrix multiply)
4. cuBLAS: Executes pre-compiled SASS on the GPU's Streaming Multiprocessors
5. Hardware: SASS instructions toggle ALU transistors, multiplying and accumulating
6. Return: Result tensor's storage is filled, metadata set, returned to Python
Total wall time: ~0.1 ms for a 1024×1024 matmul on a modern GPU
Python overhead: ~5 μs (the pybind11 call and dispatch)
Actual compute: ~95 μs (GPU execution)
The numbers above are approximate, but the ratio is what matters: the Python-side overhead (argument validation, pybind11 crossing, dispatch lookup) is typically around 5 microseconds, while the GPU compute for a reasonably sized operation is tens to hundreds of microseconds or more. The Python layer takes roughly 5% of the total wall time for a 1024×1024 matmul, and an even smaller fraction for larger operations.
This ratio explains an important practical observation: eager mode works well enough for most training workloads. When each operation takes hundreds of microseconds on the GPU, the 5-microsecond Python overhead is noise. You only start feeling the Python overhead when individual operations are very small (a few microseconds of GPU work each) and very numerous — thousands of tiny operations per forward pass. In that regime, the Python overhead per operation starts to dominate, and you can accumulate milliseconds of wasted time per iteration.
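The crossover is easy to quantify. With a roughly constant ~5 μs of per-op Python-side overhead, the wasted fraction depends only on how much GPU work each op does:

```python
# Per-op Python overhead vs. GPU compute: when does overhead dominate?
overhead_us = 5.0                     # ~constant per-op Python-side cost

for compute_us in (95.0, 20.0, 5.0, 2.0):
    frac = overhead_us / (overhead_us + compute_us)
    print(f"{compute_us:5.1f} us of GPU work -> {frac:5.1%} overhead")
```

At 95 μs of compute the overhead is 5%; at 2 μs of compute it is over 70%, which is exactly the many-tiny-ops regime where graph capture and fusion start to pay off.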
There's one more subtlety worth noting. GPU operations are asynchronous by default. When Python calls into cuBLAS, it doesn't wait for the GPU to finish — it enqueues the operation on the GPU's command queue (called a CUDA stream) and returns immediately. Python can then continue executing the next line of code (typically enqueuing the next operation) while the GPU is still working on the previous one. The GPU and CPU run in parallel, like a pipeline. You only pay a synchronisation cost if you explicitly read the result back to the CPU (e.g., result.item() or print(result)), which forces the CPU to wait until the GPU is done. This asynchronous execution model further reduces the practical impact of Python overhead, because the CPU is usually just feeding the GPU's queue faster than the GPU can drain it.
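The enqueue-now, run-later behaviour can be modelled with a toy stream: launches only append work to a queue, and a synchronising read drains it. This is a didactic sketch; real CUDA streams are managed by the driver and run truly in parallel with the CPU.

```python
# Toy model of asynchronous execution on a CUDA-like stream: launches
# enqueue work and return immediately; reading a result forces a sync.
# A didactic sketch -- real streams are managed by the CUDA driver.
class ToyStream:
    def __init__(self):
        self.pending = []

    def launch(self, op):
        self.pending.append(op)       # enqueue and return immediately

    def synchronize(self):
        results = [op() for op in self.pending]   # drain the queue
        self.pending.clear()
        return results

stream = ToyStream()
stream.launch(lambda: "matmul done")      # returns instantly
stream.launch(lambda: "bias add done")    # CPU keeps feeding the queue
print(len(stream.pending))                # -> 2 (nothing has "run" yet)

# The equivalent of result.item(): force the CPU to wait for the GPU.
print(stream.synchronize())               # -> ['matmul done', 'bias add done']
```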
Quiz
Test your understanding of PyTorch's execution stack, from the Python frontend to GPU hardware.
What is the role of pybind11 in the PyTorch stack?
How does the ATen dispatcher decide which kernel to run for a given operation?
What is the difference between PTX and SASS in the CUDA compilation pipeline?
Why does pip install torch include pre-compiled SASS binaries (cuBLAS, cuDNN)?