The Five Layers
When you write torch.matmul(A, B), a surprisingly deep stack of software transforms that one-line Python call into machine code running on a GPU. There are at least five distinct layers between your script and the transistors that do the actual arithmetic, and each one exists for a good reason. This article traces the full path — from Python to silicon — so you can see exactly where your computation lives at each stage.
┌───────────────────────────────────────────────────────────┐
│ Layer 1: Python Frontend │
│ torch.matmul(A, B) │
├───────────────────────────────────────────────────────────┤
│ Layer 2: pybind11 Binding │
│ Python objects → C++ objects (memory pointers, shapes) │
├───────────────────────────────────────────────────────────┤
│ Layer 3: C++ Backend (ATen) │
│ at::matmul → dispatcher → route by device/dtype │
├───────────────────────────────────────────────────────────┤
│ Layer 4: Kernel Libraries │
│ cuBLAS (linear algebra) / cuDNN (DNN ops) │
│ Pre-compiled SASS binaries shipped with pip install torch │
├───────────────────────────────────────────────────────────┤
│ Layer 5: Hardware │
│ SASS → microcode → electrical signals → ALU transistors │
└───────────────────────────────────────────────────────────┘
Here is the one-sentence summary of each layer:
- Layer 1 (Python Frontend) is the familiar API you interact with — it validates inputs and records autograd operations but performs no math.
- Layer 2 (pybind11) is the language bridge that translates Python objects into C++ objects, passing memory pointers and tensor metadata across the boundary.
- Layer 3 (ATen + Dispatcher) is the C++ routing layer that examines each tensor's device, dtype, and layout and selects the right kernel implementation.
- Layer 4 (Kernel Libraries) is where the actual computation happens — vendor-optimised libraries like cuBLAS and cuDNN that ship as pre-compiled GPU assembly.
- Layer 5 (Hardware) is the physical substrate: GPU assembly instructions decoded into electrical signals that toggle transistors in the chip's arithmetic units.
The rest of this article walks through each layer in detail, building a mental model of what happens from the moment you press Enter to the moment a result appears in your tensor.
Layers 1 and 2: Python Frontend and the pybind11 Bridge
torch.matmul(A, B) is a Python function defined in PyTorch's Python package. It lives in a file you can inspect yourself (torch/functional.py or similar, depending on the version), and it does almost no computation. Its primary job is to validate inputs: are the shapes compatible for matrix multiplication? Are both tensors on the same device? Is the dtype supported? If something is wrong, you get a clear Python-level error message rather than a cryptic segfault from C++.
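The validation step can be sketched in plain Python. The function below is a hypothetical stand-in, not PyTorch's actual source; it just illustrates the kind of checks the frontend performs before any C++ code runs.

```python
# Hypothetical sketch of frontend-style validation -- not PyTorch's
# actual code, just the shape/device checks the frontend performs.
def validate_matmul(a_shape, b_shape, a_device, b_device):
    if a_shape[-1] != b_shape[-2]:
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({a_shape[-2]}x{a_shape[-1]} and {b_shape[-2]}x{b_shape[-1]})"
        )
    if a_device != b_device:
        raise ValueError(
            f"Expected all tensors to be on the same device, "
            f"but found {a_device} and {b_device}"
        )
    # Result shape: leading dims kept, inner dims contract away.
    return (*a_shape[:-1], b_shape[-1])

print(validate_matmul((1024, 512), (512, 256), "cuda:0", "cuda:0"))
# -> (1024, 256)
```

Because these checks run before crossing into C++, a mistake surfaces as an ordinary Python exception with a readable message.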
Once validation passes, the Python function calls into C++ via pybind11. pybind11 is a lightweight header-only library that generates Python bindings for C++ code. It handles the messy work of translating Python objects (torch.Tensor) into their C++ counterparts (at::Tensor), passing across the language boundary everything the C++ side needs: raw memory pointers to the tensor's storage, shape and stride metadata, dtype information, and the device tag (CPU, CUDA, MPS, etc.). This binding layer is why PyTorch feels like Python — you write Pythonic code and get Pythonic errors — but runs at C++ speed.
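To get a feel for what crossing the language boundary involves, here is a toy ctypes example (ctypes, not pybind11, and not PyTorch's binding code): the C side receives only a raw pointer and a byte count, while the "shape" of the data travels as separate metadata, much as pybind11 passes a data pointer plus shape/stride/dtype for each tensor.

```python
import ctypes

# Toy illustration of a Python -> C boundary crossing. This is ctypes,
# not pybind11, but the essentials are the same: the C routine sees only
# a raw pointer plus a byte count; shape and dtype are separate metadata.
src = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)  # contiguous C float[4]
dst = (ctypes.c_float * 4)()                     # zero-initialised

# memmove is the C library routine: (dest pointer, src pointer, size_t).
ctypes.memmove(dst, src, ctypes.sizeof(src))

print(list(dst))  # -> [1.0, 2.0, 3.0, 4.0]
```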
Layer 3: The C++ Dispatcher
Inside the C++ backend, the function call lands in ATen (short for "A Tensor library") — PyTorch's core tensor operations library. ATen provides the canonical C++ signature for every operation (matmul, add, conv2d, and hundreds more), but it does not implement matmul for every device/dtype combination directly. Instead, it delegates to a dispatcher.
The dispatcher is essentially a routing table. When at::matmul is called, the dispatcher examines the tensor's metadata — device, dtype, layout, and whether autograd tracing is active — and selects the right kernel implementation. The same C++ entry point at::matmul can route to:
- A CPU implementation (using Intel MKL or OpenBLAS) if the tensors live in main memory
- A cuBLAS call if the tensors are on a CUDA GPU
- A different kernel entirely if mixed precision, sparse layout, or a quantised dtype is involved
This design is what makes PyTorch extensible. New backends — a TPU backend, an Apple MPS backend, a custom accelerator from a hardware startup — can be added by registering new dispatch keys. The frontend Python code doesn't change at all. The user still writes torch.matmul(A, B); the dispatcher handles routing it to the right place based on where A and B happen to live. This separation of interface from implementation is one of the most important architectural decisions in PyTorch's codebase.
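To make the routing-table idea concrete, here is a toy dispatcher in Python. The registry, decorator, and kernel names are illustrative stand-ins, not PyTorch's real dispatcher (which lives in C++ and handles far more keys):

```python
# Minimal sketch of a dispatch table keyed by (device, dtype).
# The registry and kernel names are illustrative, not PyTorch's.
_matmul_kernels = {}

def register(device, dtype):
    def wrap(fn):
        _matmul_kernels[(device, dtype)] = fn
        return fn
    return wrap

@register("cpu", "float32")
def matmul_cpu_f32(a, b):
    return "mkl_sgemm"          # stand-in for a real MKL call

@register("cuda", "float32")
def matmul_cuda_f32(a, b):
    return "cublas_sgemm"       # stand-in for a real cuBLAS call

def dispatch_matmul(a, b, device, dtype):
    try:
        kernel = _matmul_kernels[(device, dtype)]
    except KeyError:
        raise NotImplementedError(f"no matmul kernel for {device}/{dtype}")
    return kernel(a, b)

print(dispatch_matmul(None, None, "cuda", "float32"))  # -> cublas_sgemm
```

Adding a new backend in this model is just another register(...) call; callers of dispatch_matmul never change, which mirrors how new dispatch keys extend PyTorch without touching the frontend.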
The dispatch mechanism also handles composable transforms. Autograd, torch.compile's tracing, vmap (vectorised batching), and functionalization are all implemented as dispatch keys that can be stacked. When you call torch.matmul on a tensor that requires gradients and is being compiled, the dispatcher chains through the Autograd key (which records the operation for backpropagation), then the compilation key (which captures the operation into a graph), and finally the device-specific key (which executes the actual math). Each key does its job and forwards to the next.
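The key-stacking behaviour can be modelled as a chain of handlers, each doing its work and then forwarding to the next. This is a didactic toy, not the real C++ implementation; the key names are illustrative.

```python
# Toy model of stacked dispatch keys: each key does its work, then
# forwards to the next. Names are illustrative, not PyTorch internals.
log = []

def autograd_key(next_fn):
    def handler(a, b):
        log.append("autograd: record op for backward")
        return next_fn(a, b)
    return handler

def compile_key(next_fn):
    def handler(a, b):
        log.append("compile: capture op into graph")
        return next_fn(a, b)
    return handler

def cuda_key(a, b):
    log.append("cuda: launch cuBLAS kernel")
    return "result"

# Stack: Autograd -> Compile -> CUDA (outermost key runs first).
matmul = autograd_key(compile_key(cuda_key))
matmul(None, None)
print(log)
```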
Layer 4: CUDA and the Kernel Libraries
For GPU tensors, the dispatcher routes to CUDA kernels. PyTorch doesn't implement most CUDA kernels from scratch — it calls into NVIDIA's vendor-optimised libraries, which have been tuned over many years for each GPU microarchitecture. The two most important libraries are:
cuBLAS (CUDA Basic Linear Algebra Subprograms) provides highly optimised implementations of fundamental linear algebra operations: matrix multiplication, dot products, matrix-vector multiplies, triangular solves, and more. These are the workhorses behind every linear layer in a neural network. When you compute output = weight @ input + bias, the matrix multiplication ultimately lands in a cuBLAS sgemm (single-precision general matrix multiply) or hgemm (half-precision) call. NVIDIA engineers have spent years optimising these routines, and they typically achieve over 80% of the GPU's theoretical peak FLOPS — far better than a naive CUDA kernel would.
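A quick back-of-the-envelope calculation shows why those peak-FLOPS percentages matter. An N×N matmul costs 2·N³ floating-point operations (N multiplies and roughly N adds per output element, for N² outputs); the 19.5 TFLOPS figure below is a hypothetical FP32 peak in the ballpark of a modern data-centre GPU, not a number from the article.

```python
# Back-of-the-envelope: FLOPs and runtime for a 1024x1024 matmul.
# The 19.5 TFLOPS peak is a hypothetical modern-GPU FP32 figure.
N = 1024
flops = 2 * N**3                      # ~2.1 billion floating-point ops

peak_flops_per_s = 19.5e12            # hypothetical FP32 peak
efficiency = 0.80                     # cuBLAS often reaches ~80% of peak

runtime_s = flops / (peak_flops_per_s * efficiency)
print(f"{flops:,} FLOPs -> {runtime_s * 1e6:.0f} us")
```

Even at 80% efficiency this lands in the hundred-microsecond range, consistent with the timings discussed later in the article.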
cuDNN (CUDA Deep Neural Network library) provides optimised implementations of higher-level deep learning primitives: convolution, pooling, batch normalisation, softmax, and (in newer versions) multi-head attention. These are the workhorses behind CNN layers and transformer attention blocks. cuDNN doesn't just implement these operations naively — it selects among multiple algorithms (e.g., Winograd convolution, FFT-based convolution, direct convolution) and benchmarks them at runtime to find the fastest one for your specific tensor shapes and GPU.
Both cuBLAS and cuDNN ship as pre-compiled SASS binaries inside the torch Python package. When you run pip install torch, you're downloading not just Python code but hundreds of megabytes of pre-compiled GPU assembly — optimised machine code ready to run on your GPU without any compilation step on your end.
# What "pip install torch" puts on your system:
#
# torch/
# ├── __init__.py ← Python frontend
# ├── _C.so ← C++ backend (compiled via pybind11)
# ├── lib/
# │ ├── libtorch.so ← ATen, autograd, dispatcher
# │ ├── libtorch_cuda.so ← CUDA dispatch implementations
# │ ├── libcublas.so ← cuBLAS (pre-compiled SASS)
# │ ├── libcudnn.so ← cuDNN (pre-compiled SASS)
# │ └── ...
# └── nn/, optim/, ... ← High-level Python modules
This explains why pip install torch downloads over 2 GB: you're getting a complete numerical computing stack, from Python wrappers all the way down to GPU machine code, in a single package.
Layer 5: From Source Code to Silicon
The cuBLAS and cuDNN binaries didn't appear from thin air — someone at NVIDIA wrote C++ CUDA source code and compiled it through a multi-stage pipeline. Understanding this pipeline is valuable because the same stages apply whenever anyone writes a custom CUDA kernel (including the kernels that torch.compile generates via Triton, which we'll cover in a later article).
Source code NVCC compiler Hardware
┌──────────┐ ┌──────────────────────────┐ ┌──────────────┐
│ C++ CUDA │ → │ .cu → PTX → SASS │ → │ SASS → μcode │
│ (.cu) │ │ │ │ → transistors │
└──────────┘ │ PTX: portable IR │ └──────────────┘
│ (like LLVM IR for GPUs)│
│ │
│ SASS: GPU-specific ASM │
│ (different per chip: │
│ Ampere, Hopper, etc.) │
└──────────────────────────┘
PTX (Parallel Thread Execution) is NVIDIA's virtual instruction set architecture. It's a portable intermediate representation — the same PTX code can, in principle, run on any NVIDIA GPU, regardless of generation. Think of it as something analogous to LLVM IR or Java bytecode: a stable abstraction layer that decouples the source language from the target hardware. PTX instructions describe operations like "multiply two floats" or "load from global memory" without specifying exactly which hardware units will execute them.
SASS (Streaming ASSembly) is the actual GPU machine code, specific to a particular GPU microarchitecture (identified by an SM version number). PTX is compiled to SASS by ptxas, NVIDIA's assembler (part of the CUDA Toolkit). Different GPU architectures produce different SASS from the same PTX — an Ampere GPU (SM 8.0) and a Hopper GPU (SM 9.0) will generate different instruction sequences, each tuned for that chip's pipeline widths, register file sizes, and memory hierarchy.
Microcode is the final stage. At load time, the GPU's control unit decodes SASS instructions into electrical signals that toggle specific transistors in the Streaming Multiprocessor's ALU (Arithmetic Logic Unit) and register files. This is where computation becomes physics: a multiply-accumulate instruction turns into voltage changes that ripple through transistor gates, producing the sum of products that makes up a matrix multiplication.
Why two compilation stages (PTX then SASS) instead of compiling directly to SASS? The answer is a classic tradeoff between portability and performance. PTX gives portability — write your kernel once, and it can run on any current or future NVIDIA GPU. SASS gives performance — the instruction schedule, register allocation, and memory access patterns are optimised for the specific chip. NVIDIA ships both PTX and SASS in their libraries. If your GPU architecture wasn't anticipated at compile time (e.g., you're running a library compiled for Ampere on a newer Hopper GPU), the driver JIT-compiles the embedded PTX into SASS at runtime. You get a working kernel immediately, though the first launch may be slightly slower while the JIT runs.
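The driver's load-time decision can be sketched as: if the binary embeds SASS for your GPU's SM version, load it directly; otherwise JIT-compile the embedded PTX. Here is a toy model in Python; the data structures and strings are illustrative, not the CUDA driver's real internals.

```python
# Toy model of the CUDA driver's load-time decision: prefer embedded
# SASS for this GPU's SM version, fall back to JIT-compiling the PTX.
# Data structures are illustrative, not the driver's real internals.
fat_binary = {
    "ptx": "portable IR for the kernel",
    "sass": {"sm_80": "ampere machine code"},   # compiled ahead of time
}

def load_kernel(fat_binary, gpu_sm):
    if gpu_sm in fat_binary["sass"]:
        return f"load embedded SASS for {gpu_sm}"
    # Architecture not anticipated at compile time: JIT the PTX.
    return f"JIT-compile PTX -> SASS for {gpu_sm} (first launch is slower)"

print(load_kernel(fat_binary, "sm_80"))  # Ampere: SASS was shipped
print(load_kernel(fat_binary, "sm_90"))  # Hopper: falls back to PTX JIT
```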
Putting It All Together
Let's trace a single call end-to-end, from the moment you press Enter to the moment a result appears in your tensor. This is the full journey of result = torch.matmul(A, B) where A and B are 1024×1024 float32 tensors on a CUDA GPU:
You write: result = torch.matmul(A, B)
1. Python: torch.matmul validates shapes, calls C++ via pybind11
2. C++: at::matmul dispatches based on device=CUDA, dtype=float32
3. Dispatch: Routes to cuBLAS sgemm (single-precision general matrix multiply)
4. cuBLAS: Executes pre-compiled SASS on the GPU's Streaming Multiprocessors
5. Hardware: SASS instructions toggle ALU transistors, multiplying and accumulating
6. Return: Result tensor's storage is filled, metadata set, returned to Python
Total wall time: ~0.1 ms for a 1024×1024 matmul on a modern GPU
Python overhead: ~5 μs (the pybind11 call and dispatch)
Actual compute: ~95 μs (GPU execution)
The numbers above are approximate, but the ratio is what matters: the Python-side overhead (argument validation, pybind11 crossing, dispatch lookup) is typically around 5 microseconds, while the GPU compute for a reasonably sized operation is tens to hundreds of microseconds or more. The Python layer takes roughly 5% of the total wall time for a 1024×1024 matmul, and an even smaller fraction for larger operations.
This ratio explains an important practical observation: eager mode works well enough for most training workloads. When each operation takes hundreds of microseconds on the GPU, the 5-microsecond Python overhead is noise. You only start feeling the Python overhead when individual operations are very small (a few microseconds of GPU work each) and very numerous — thousands of tiny operations per forward pass. In that regime, the Python overhead per operation starts to dominate, and you can accumulate milliseconds of wasted time per iteration.
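The crossover is easy to quantify. With a roughly constant ~5 μs of per-op Python-side overhead, the wasted fraction depends only on how much GPU work each op does:

```python
# Per-op Python overhead vs. GPU compute: when does overhead dominate?
overhead_us = 5.0                     # ~constant per-op Python-side cost

for compute_us in (95.0, 20.0, 5.0, 2.0):
    frac = overhead_us / (overhead_us + compute_us)
    print(f"{compute_us:5.1f} us of GPU work -> {frac:5.1%} overhead")
```

At 95 μs of compute the overhead is 5%; at 2 μs of compute it is over 70%, which is exactly the many-tiny-ops regime where graph capture and fusion start to pay off.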
There's one more subtlety worth noting. GPU operations are asynchronous by default. When Python calls into cuBLAS, it doesn't wait for the GPU to finish — it enqueues the operation on the GPU's command queue (called a CUDA stream) and returns immediately. Python can then continue executing the next line of code (typically enqueuing the next operation) while the GPU is still working on the previous one. The GPU and CPU run in parallel, like a pipeline. You only pay a synchronisation cost if you explicitly read the result back to the CPU (e.g., result.item() or print(result)), which forces the CPU to wait until the GPU is done. This asynchronous execution model further reduces the practical impact of Python overhead, because the CPU is usually just feeding the GPU's queue faster than the GPU can drain it.
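The enqueue-now, run-later behaviour can be modelled with a toy stream: launches only append work to a queue, and a synchronising read drains it. This is a didactic sketch; real CUDA streams are managed by the driver and run truly in parallel with the CPU.

```python
# Toy model of asynchronous execution on a CUDA-like stream: launches
# enqueue work and return immediately; reading a result forces a sync.
# A didactic sketch -- real streams are managed by the CUDA driver.
class ToyStream:
    def __init__(self):
        self.pending = []

    def launch(self, op):
        self.pending.append(op)       # enqueue and return immediately

    def synchronize(self):
        results = [op() for op in self.pending]   # drain the queue
        self.pending.clear()
        return results

stream = ToyStream()
stream.launch(lambda: "matmul done")      # returns instantly
stream.launch(lambda: "bias add done")    # CPU keeps feeding the queue
print(len(stream.pending))                # -> 2 (nothing has "run" yet)

# The equivalent of result.item(): force the CPU to wait for the GPU.
print(stream.synchronize())               # -> ['matmul done', 'bias add done']
```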
Quiz
Test your understanding of PyTorch's execution stack, from the Python frontend to GPU hardware.
What is the role of pybind11 in the PyTorch stack?
How does the ATen dispatcher decide which kernel to run for a given operation?
What is the difference between PTX and SASS in the CUDA compilation pipeline?
Why does pip install torch include pre-compiled SASS binaries (cuBLAS, cuDNN)?