Three Layers of CUDA
The word "CUDA" is overloaded. Depending on context, it can refer to a device architecture, a programming model, or a software platform. This article focuses on the software platform — the layers of code that sit between your Python script (or C++ program) and the GPU hardware. Understanding this stack matters because every performance debugging session eventually bottoms out in one of these layers, and knowing which layer you're looking at determines whether you need to update a driver, swap a library, or rewrite a kernel.
The stack is organised into three layers, each building on the one below it:
- Driver layer: nvidia.ko (the Linux kernel module) and libcuda.so (the CUDA Driver API). This is the boundary between the operating system and the GPU silicon.
- Runtime layer: libcudart.so (the CUDA Runtime API), nvcc (the compiler driver), and nvrtc (the runtime compiler). This is the layer most CUDA C++ developers interact with directly.
- Library layer: cuBLAS, cuDNN, cuFFT, cuSPARSE, and others. Pre-built, hand-tuned kernels for common operations. This is the layer that PyTorch, TensorFlow, and JAX call into.
The following diagram shows how these layers stack (see the Modal GPU Glossary on the CUDA software platform for additional context):
Your Code (Python / C++ CUDA)
│
▼
┌───────────────────────────────────────────────┐
│ Library Layer │
│ cuBLAS (matmul) · cuDNN (conv, attn) · ... │
├───────────────────────────────────────────────┤
│ Runtime Layer │
│ libcudart.so · nvcc · nvrtc │
├───────────────────────────────────────────────┤
│ Driver Layer │
│ libcuda.so · nvidia.ko (kernel module) │
├───────────────────────────────────────────────┤
│ Hardware │
│ GPU (SMs, memory, interconnect) │
└───────────────────────────────────────────────┘
Each layer has a well-defined responsibility. The driver layer manages the GPU as a hardware device — allocating memory, scheduling work, handling interrupts. The runtime layer provides a more ergonomic programming API and the compilation toolchain that turns .cu source files into executable GPU code. The library layer offers pre-compiled, battle-tested kernels for the operations that dominate deep learning and scientific computing workloads — matrix multiplies, convolutions, FFTs, and so on.
The separation is not merely conceptual. Each layer ships as a distinct set of shared libraries with its own versioning. A driver update (nvidia.ko + libcuda.so) does not require recompiling your CUDA code; a cuDNN upgrade does not require a new driver. This decoupling is one reason CUDA has been so durable — it allows NVIDIA to improve any layer independently, and it allows users to pin specific versions when stability matters.
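This decoupling is easy to observe in practice. Below is a minimal sketch (an assumption of a Linux system with ctypes; it degrades gracefully when no driver or toolkit is installed) that queries the driver layer and runtime layer versions independently:

```python
import ctypes
import ctypes.util

def query_version(libname, symbol):
    """Load a CUDA shared library and query its version, if present."""
    path = ctypes.util.find_library(libname)
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    version = ctypes.c_int(0)
    if getattr(lib, symbol)(ctypes.byref(version)) != 0:  # 0 == success
        return None
    return version.value  # encoded as 1000 * major + 10 * minor

# Driver layer: libcuda.so ships with the driver package
driver = query_version("cuda", "cuDriverGetVersion")
# Runtime layer: libcudart.so ships with the CUDA toolkit
runtime = query_version("cudart", "cudaRuntimeGetVersion")

for name, v in [("driver (libcuda)", driver), ("runtime (libcudart)", runtime)]:
    if v is None:
        print(f"{name}: not installed")
    else:
        print(f"{name}: {v // 1000}.{(v % 1000) // 10}")
```

On a working system the driver version must be at least as new as the runtime version; a mismatch in the other direction is the classic "CUDA driver version is insufficient" error.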
The Driver Layer
The driver layer is the lowest software layer in the CUDA stack — it talks directly to the GPU hardware and mediates all access to it. Everything above (the runtime API, the libraries, your kernels) eventually funnels through this layer.
nvidia.ko: the Linux kernel module. When loaded, it registers the GPU as a device with the operating system, manages GPU memory allocation at the physical level, schedules kernel launches on the GPU's command processors, and handles hardware interrupts. Without this module loaded (lsmod | grep nvidia will tell you), the GPU is invisible to the OS — it's just an unknown PCIe device. On systems with multiple GPUs, a single instance of the module manages all of them.
libcuda.so (CUDA Driver API): the user-space shared library that communicates with nvidia.ko via ioctl system calls. It provides low-level functions for every step of GPU programming: create a CUDA context (analogous to a process on the CPU), allocate device memory, load compiled kernels from PTX or SASS binaries, set kernel arguments, and launch kernels. The driver API is verbose and explicit — you manage contexts, modules, and memory manually. Most developers never touch it directly; the runtime API (libcudart.so) wraps it with a friendlier interface. But understanding that the driver API exists is important because it's the layer where version compatibility is enforced: the CUDA driver version determines the maximum CUDA toolkit version your system supports.
nvidia-smi: a command-line tool built on NVML (libnvml.so) that shows GPU status at a glance — temperature, memory usage, GPU utilisation, power draw, running processes, and the driver version. Think of it as the top or htop command for GPUs. It's typically the first tool you reach for when debugging GPU issues: "Is the GPU visible? Is anything running on it? How much memory is free?"
The Runtime Layer
The runtime layer sits above the driver and provides two things: a friendlier programming API, and the compilation toolchain that turns CUDA source code into executable GPU binaries.
libcudart.so (CUDA Runtime API): wraps the driver API with simpler, more automatic functions. Instead of manually creating CUDA contexts, loading modules, and setting kernel arguments one by one, you write cudaMalloc to allocate device memory, cudaMemcpy to transfer data between host and device, and the kernel<<<grid, block>>>(args) triple-chevron syntax to launch kernels. The runtime API manages CUDA contexts automatically (one primary context per device, shared across host threads) and handles lazy initialisation, so you don't need boilerplate setup code. The vast majority of CUDA C++ code uses the runtime API, not the driver API.
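The "no boilerplate" claim can be seen directly by calling the runtime through ctypes. This is a hedged sketch, not how you would normally use the runtime (that's CUDA C++); it simply shows that allocation and copies need no explicit context setup, and it degrades to a message when libcudart or a GPU is absent:

```python
import ctypes
import ctypes.util

path = ctypes.util.find_library("cudart")
if path is None:
    print("libcudart not found (CUDA toolkit not installed)")
else:
    cudart = ctypes.CDLL(path)
    dev_ptr = ctypes.c_void_p()
    # cudaMalloc(void **devPtr, size_t size); the first runtime call also
    # triggers lazy initialisation of the primary context.
    err = cudart.cudaMalloc(ctypes.byref(dev_ptr), ctypes.c_size_t(1 << 20))
    if err != 0:  # e.g. cudaErrorNoDevice on a machine without a GPU
        print(f"cudaMalloc failed with error code {err}")
    else:
        host = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
        # cudaMemcpy(dst, src, count, kind); kind 1 == cudaMemcpyHostToDevice
        cudart.cudaMemcpy(dev_ptr, host, ctypes.sizeof(host), 1)
        cudart.cudaFree(dev_ptr)
        print("allocated, copied, freed, with no explicit context code")
```

The equivalent driver-API program would need cuInit, cuDeviceGet, cuCtxCreate, and explicit cleanup before any of these calls.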
nvcc (NVIDIA CUDA Compiler Driver): compiles .cu files — C++ source with CUDA extensions — into executable code. But nvcc is not a single monolithic compiler. It's an orchestrator that splits the work between the host compiler and NVIDIA's device compiler pipeline:
your_kernel.cu (C++ CUDA source)
│
▼ nvcc separates host and device code
├── Host code → GCC/Clang → CPU executable
└── Device code → cicc → PTX → ptxas → SASS
│
▼
Embedded in the executable
(or .cubin / .fatbin file)
The process works as follows. nvcc reads the .cu file and separates it into two streams: host code (regular C++ — everything outside of __global__ and __device__ functions) and device code (the GPU kernels). Host code is passed to the system's C++ compiler — GCC or Clang on Linux, MSVC on Windows — and compiled to a normal CPU object file. Device code goes through NVIDIA's own compiler pipeline: first to cicc (which produces PTX, a portable intermediate representation), then to ptxas (which compiles PTX to SASS, the actual GPU machine code for a specific architecture). The two are then linked together into a single executable, with the SASS binary embedded as a data section.
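The device half of this pipeline can be driven by hand, which is useful for inspecting the intermediate PTX. A sketch under stated assumptions (the saxpy kernel and sm_80 target are illustrative; the commands only run when the CUDA toolkit is on PATH, and are printed otherwise):

```python
import os
import shutil
import subprocess
import tempfile

# A minimal valid kernel to feed through the pipeline
KERNEL = r"""
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "saxpy.cu")
with open(src, "w") as f:
    f.write(KERNEL)

ptx = os.path.join(workdir, "saxpy.ptx")
cubin = os.path.join(workdir, "saxpy.cubin")
steps = [
    # Stage 1: nvcc --ptx stops after the device front end, emitting PTX
    (["nvcc", "--ptx", src, "-o", ptx], "cicc emits portable PTX"),
    # Stage 2: ptxas assembles PTX into SASS for one specific architecture
    (["ptxas", "--gpu-name=sm_80", ptx, "-o", cubin], "ptxas emits SM 8.0 SASS"),
]

prev_ok = True
for cmd, why in steps:
    if not prev_ok or shutil.which(cmd[0]) is None:
        print(f"skipped ({why}): {' '.join(cmd)}")
        prev_ok = False
        continue
    subprocess.run(cmd, check=True)
    print(f"ran {cmd[0]}: {why}")
```

The intermediate saxpy.ptx is human-readable assembly-like text, which makes it a good place to check whether the compiler generated the instructions you expected.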
nvrtc (Runtime Compiler): allows compiling CUDA C++ kernels at runtime rather than ahead-of-time with nvcc. You pass a string of CUDA source code to nvrtc, and it returns compiled PTX (or SASS) that you can then load and launch via the driver API. This is the mechanism behind just-in-time (JIT) compilation in the CUDA ecosystem. Libraries that generate GPU kernels dynamically — like Triton, XLA, and TVM — rely heavily on nvrtc (or on PTX JIT compilation in the driver). JIT compilation adds startup latency, but it enables generating kernels specialised for specific shapes, dtypes, or hardware features at runtime, which can yield better performance than static compilation.
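The nvrtc workflow (source string in, PTX out) can be sketched through ctypes. This is an illustrative sketch, assuming libnvrtc is discoverable and using the documented C entry points nvrtcCreateProgram, nvrtcCompileProgram, and nvrtcGetPTX; it prints a message when the library is absent:

```python
import ctypes
import ctypes.util

# CUDA C++ source held as an ordinary string, compiled at runtime
KERNEL = b"""
extern "C" __global__ void scale(float *x, float a) {
    x[threadIdx.x] *= a;
}
"""

path = ctypes.util.find_library("nvrtc")
if path is None:
    print("libnvrtc not found (CUDA toolkit not installed)")
else:
    nvrtc = ctypes.CDLL(path)
    prog = ctypes.c_void_p()
    # nvrtcCreateProgram(prog, src, name, numHeaders, headers, includeNames)
    assert nvrtc.nvrtcCreateProgram(
        ctypes.byref(prog), KERNEL, b"scale.cu", 0, None, None) == 0
    # Targeting a virtual architecture (compute_70) yields PTX, not SASS
    opts = (ctypes.c_char_p * 1)(b"--gpu-architecture=compute_70")
    assert nvrtc.nvrtcCompileProgram(prog, 1, opts) == 0
    size = ctypes.c_size_t()
    nvrtc.nvrtcGetPTXSize(prog, ctypes.byref(size))
    buf = ctypes.create_string_buffer(size.value)
    nvrtc.nvrtcGetPTX(prog, buf)
    print(buf.value.decode()[:200])  # the head of the generated PTX
```

The returned PTX string is exactly what you would then hand to the driver API (cuModuleLoadData) to load and launch, which is the loop that JIT-compiling frameworks run continuously.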
The Library Layer: cuBLAS and cuDNN
The library layer provides pre-compiled, highly optimised GPU kernels for common computational patterns. These libraries are the real workhorses behind deep learning frameworks — when you call torch.matmul or nn.Conv2d on a GPU tensor, PyTorch dispatches to one of these libraries rather than running a naive CUDA kernel.
cuBLAS (CUDA Basic Linear Algebra Subprograms): the GPU implementation of the BLAS standard. It provides matrix multiplication (sgemm for FP32, dgemm for FP64, hgemm for FP16), dot products, matrix-vector products, triangular solves, and other linear algebra primitives. Every torch.matmul, torch.mm, and nn.Linear forward pass eventually calls cuBLAS on GPU.
The naming convention follows the BLAS tradition: cublas[S|D|H]gemm, where S = single precision (FP32), D = double (FP64), H = half (FP16). gemm stands for General Matrix Multiply:

C ← α · (A × B) + β · C

The α and β scalars allow fusing a scale and accumulate into the matmul itself, avoiding a separate kernel launch. In deep learning, α = 1 and β = 0 is the common case (plain matrix multiply), but the general form is useful for operations like adding a bias or accumulating gradients.
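The gemm contract is simple enough to mirror on the CPU. In the sketch below, NumPy stands in for the GPU: cuBLAS computes the same expression, just on device memory and with hand-tuned kernels:

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """General Matrix Multiply: C <- alpha * (A @ B) + beta * C."""
    return alpha * (A @ B) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)).astype(np.float32)  # FP32, as in sgemm
B = rng.standard_normal((3, 5)).astype(np.float32)
C = rng.standard_normal((4, 5)).astype(np.float32)

# The common deep learning case: alpha=1, beta=0 is a plain matmul
assert np.allclose(gemm(1.0, A, B, 0.0, C), A @ B)
# beta=1 accumulates into the existing C, e.g. for gradient accumulation
assert np.allclose(gemm(1.0, A, B, 1.0, C), A @ B + C)
print("gemm semantics verified")
```

Fusing the scale and accumulate into one call matters on GPU because each extra elementwise kernel launch would re-read and re-write C from device memory.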
cuDNN (CUDA Deep Neural Network library): provides GPU implementations of the operations that dominate neural network computation — convolutions (with multiple algorithms: Winograd, FFT-based, implicit GEMM), pooling, batch normalisation, softmax, activation functions, and in newer versions, fused multi-head attention (including FlashAttention-style implementations). Every nn.Conv2d, F.softmax, and nn.BatchNorm2d eventually calls cuDNN on GPU.
Why pre-compiled rather than compiled at install time? Because performance at this level requires hand-tuning by teams of GPU architects who exploit hardware-specific features — Tensor Core warp-level matrix operations, asynchronous memory copy instructions, shared memory swizzling patterns, register-level optimisations — that automated compilers generally cannot discover. The resulting SASS binaries are often 10–100× faster than a naive CUDA implementation of the same operation. NVIDIA effectively amortises the cost of this expert-level tuning across every user of the library.
Compute Capability and PTX Compatibility
Not all NVIDIA GPUs support the same features. Tensor Cores, BF16 arithmetic, FP8, and hardware-accelerated memory copy instructions were each introduced in a specific GPU generation. NVIDIA tracks this with Compute Capability — a versioning scheme that tells software exactly which hardware features are available on a given GPU.
The major milestones relevant to modern deep learning:
- SM 7.0 (Volta — V100): introduced first-generation Tensor Cores (FP16 matrix multiply-accumulate). This was the GPU that made mixed-precision training practical.
- SM 8.0 (Ampere — A100): added BF16 support, TF32 (a 19-bit format for transparent FP32 speedups), 3rd-generation Tensor Cores, and sparsity support.
- SM 8.9 (Ada Lovelace — RTX 4090): added FP8 (E4M3 and E5M2 formats) and 4th-generation Tensor Cores. FP8 further halves the memory footprint relative to FP16.
- SM 9.0 (Hopper — H100): added the Tensor Memory Accelerator (TMA) for asynchronous bulk data movement, warp-group level matrix operations (WGMMA), and the Transformer Engine for dynamic FP8/FP16 switching.
Why does this matter for the software stack? Because SASS — the final GPU machine code — is specific to a compute capability. SASS compiled for SM 8.0 will not run on an SM 7.0 GPU, and it also won't automatically use SM 9.0 features if you run it on an H100. PTX, by contrast, is forward-compatible: PTX compiled for SM 7.0 can be JIT-compiled by the driver to SASS on an SM 9.0 GPU. The trade-off is startup latency — JIT compilation happens at load time — and potentially missed optimisations that the driver's JIT compiler doesn't know about.
This is why fat binaries exist. A single .fatbin file can contain SASS for multiple compute capabilities plus PTX as a fallback. At runtime, the driver selects the best available SASS for the current GPU, falling back to JIT-compiling the PTX if no matching SASS is found. This ensures the code runs on any CUDA-capable GPU — with optimal performance on architectures that have pre-compiled SASS, and slightly slower JIT-compiled performance on others. It also explains why CUDA libraries like cuBLAS are so large: they bundle SASS for every supported GPU generation.
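The driver's selection rule can be sketched in a few lines. This is a simplified model (it uses the documented rule that SASS is binary-compatible within a major compute capability version but never across majors; the fat binary contents are hypothetical):

```python
def select_binary(fatbin_sass, device_cc):
    """Pick the best embedded SASS for a device; otherwise fall back to
    JIT-compiling the embedded PTX. SASS for sm_80 can run on sm_86
    (same major, equal-or-newer minor), but never on a different major."""
    dev_major, dev_minor = device_cc
    candidates = [
        (maj, mnr) for (maj, mnr) in fatbin_sass
        if maj == dev_major and mnr <= dev_minor
    ]
    if candidates:
        maj, mnr = max(candidates)
        return f"SASS sm_{maj}{mnr}"
    return "JIT-compile embedded PTX"

# A hypothetical fat binary carrying SASS for Volta, Ampere, and Hopper
fatbin = [(7, 0), (8, 0), (9, 0)]

print(select_binary(fatbin, (8, 6)))   # RTX 3090: reuses the sm_80 SASS
print(select_binary(fatbin, (9, 0)))   # H100: exact sm_90 SASS
print(select_binary(fatbin, (12, 0)))  # future GPU: PTX JIT fallback
```

The PTX fallback branch is what keeps old binaries working on GPUs that didn't exist when they were compiled, at the cost of load-time JIT compilation.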
# Compute capabilities and their key features
architectures = [
("SM 7.0", "Volta (V100)", 2017, "1st-gen Tensor Cores, FP16 MMA"),
("SM 7.5", "Turing (RTX 2080)", 2018, "INT8 Tensor Cores, RT Cores"),
("SM 8.0", "Ampere (A100)", 2020, "BF16, TF32, 3rd-gen Tensor Cores"),
("SM 8.6", "Ampere (RTX 3090)", 2020, "Consumer Ampere, same TC gen"),
("SM 8.9", "Ada Lovelace (4090)", 2022, "FP8, 4th-gen Tensor Cores"),
("SM 9.0", "Hopper (H100)", 2022, "TMA, WGMMA, Transformer Engine"),
]
print("NVIDIA Compute Capability Timeline")
print("=" * 72)
print(f"{'SM':>6} {'Architecture':<25} {'Year':<6} {'Key Features'}")
print("-" * 72)
for sm, arch, year, features in architectures:
print(f"{sm:>6} {arch:<25} {year:<6} {features}")
print()
print("Compatibility rules:")
print(" SASS: runs ONLY on the exact SM version it was compiled for")
print(" PTX: forward-compatible (SM 7.0 PTX → JIT-compiled on SM 9.0)")
print(" Fat binary: bundles multiple SASS + PTX fallback → runs everywhere")
Profiling: nvidia-smi, Nsight, and CUPTI
Once your code runs on the GPU, the next question is whether it's running well. Is it using the Tensor Cores? Is it bottlenecked on memory bandwidth? Is it spending more time copying data than computing? The CUDA ecosystem provides a suite of profiling tools at different levels of granularity, and choosing the right one depends on what question you're asking.
- nvidia-smi: the quick health check. Shows GPU utilisation, memory usage, temperature, power draw, and which processes are using the GPU. It's the first tool you use — not for deep performance analysis, but to confirm the GPU is active and roughly how loaded it is. Run nvidia-smi dmon for continuous monitoring.
- NVML (libnvml.so): the C library behind nvidia-smi. Provides programmatic access to the same GPU metrics — useful for building dashboards, alerting systems, or logging GPU health in production. Libraries like pynvml wrap it for Python.
- Nsight Systems: a system-wide profiler that traces the full execution timeline — CPU activity, GPU kernel launches, memory transfers (H2D, D2H, D2D), CUDA API calls, and their temporal relationships. It answers macro-level questions: where is time being spent? Is the GPU idle between kernel launches? Are memory copies overlapping with compute? Is the CPU the bottleneck? The output is a timeline visualisation where you can zoom in on specific intervals.
- Nsight Compute: a kernel-level profiler that deep-dives into a single GPU kernel's performance. It measures occupancy (how many warps are active relative to the maximum), memory bandwidth utilisation, compute throughput, instruction mix, stall reasons, and warp-level statistics. It answers micro-level questions: is this kernel compute-bound or memory-bound? Where are the warps stalling? Is shared memory being used efficiently?
- CUPTI (CUDA Profiling Tools Interface): the low-level API that both Nsight Systems and Nsight Compute build on. CUPTI provides access to hardware performance counters, activity tracing, and callback mechanisms. Third-party tools (like PyTorch Profiler's CUDA tracing) also use CUPTI under the hood.
The typical profiling workflow goes from coarse to fine: start with nvidia-smi to verify the GPU is active, use Nsight Systems to find which kernels or memory operations dominate the runtime, and then use Nsight Compute to drill into the specific kernels that look suboptimal. Jumping straight to Nsight Compute without first understanding the system-level picture tends to produce optimisations that don't matter — you might perfect a kernel that only accounts for 2% of total runtime.
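The NVML route is the one to reach for when you want these health checks in code rather than in a terminal. A minimal sketch, assuming the pynvml package is available (it degrades to a message when it, or a working driver, is absent):

```python
try:
    import pynvml
except ImportError:
    pynvml = None

if pynvml is None:
    print("pynvml not installed (pip install nvidia-ml-py)")
else:
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as e:
        print(f"NVML init failed (driver not loaded?): {e}")
    else:
        # The same metrics nvidia-smi shows, now available to a dashboard
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(f"GPU {i}: {util.gpu}% util, "
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
        pynvml.nvmlShutdown()
```

This is the kind of loop that production GPU monitoring agents run on an interval, exporting the numbers to a metrics system instead of printing them.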
Quiz
Test your understanding of the CUDA software stack — the driver, runtime, and library layers, compilation pipeline, compute capabilities, and profiling tools.
What does nvcc do when compiling a .cu file?
Why are cuBLAS and cuDNN SASS binaries shipped pre-compiled rather than compiled at install time?
What is Compute Capability and why does it matter for the software stack?
What is the difference between Nsight Systems and Nsight Compute?