Three Layers of CUDA
The word "CUDA" is overloaded. Depending on context, it can refer to a device architecture, a programming model, or a software platform. This article focuses on the software platform — the layers of code that sit between your Python script (or C++ program) and the GPU hardware. Understanding this stack matters because every performance debugging session eventually bottoms out in one of these layers, and knowing which layer you're looking at determines whether you need to update a driver, swap a library, or rewrite a kernel.
The stack is organised into three layers, each building on the one below it:
- Driver layer: nvidia.ko (the Linux kernel module) and libcuda.so (the CUDA Driver API). This is the boundary between the operating system and the GPU silicon.
- Runtime layer: libcudart.so (the CUDA Runtime API), nvcc (the compiler driver), and nvrtc (the runtime compiler). This is the layer most CUDA C++ developers interact with directly.
- Library layer: cuBLAS, cuDNN, cuFFT, cuSPARSE, and others. Pre-built, hand-tuned kernels for common operations. This is the layer that PyTorch, TensorFlow, and JAX call into.
The following diagram shows how these layers stack (see the Modal GPU Glossary on the CUDA software platform for additional context):
Your Code (Python / C++ CUDA)
│
▼
┌───────────────────────────────────────────────┐
│ Library Layer │
│ cuBLAS (matmul) · cuDNN (conv, attn) · ... │
├───────────────────────────────────────────────┤
│ Runtime Layer │
│ libcudart.so · nvcc · nvrtc │
├───────────────────────────────────────────────┤
│ Driver Layer │
│ libcuda.so · nvidia.ko (kernel module) │
├───────────────────────────────────────────────┤
│ Hardware │
│ GPU (SMs, memory, interconnect) │
└───────────────────────────────────────────────┘
Each layer has a well-defined responsibility. The driver layer manages the GPU as a hardware device — allocating memory, scheduling work, handling interrupts. The runtime layer provides a more ergonomic programming API and the compilation toolchain that turns .cu source files into executable GPU code. The library layer offers pre-compiled, battle-tested kernels for the operations that dominate deep learning and scientific computing workloads — matrix multiplies, convolutions, FFTs, and so on.
The separation is not merely conceptual. Each layer ships as a distinct set of shared libraries with its own versioning. A driver update (nvidia.ko + libcuda.so) does not require recompiling your CUDA code; a cuDNN upgrade does not require a new driver. This decoupling is one reason CUDA has been so durable — it allows NVIDIA to improve any layer independently, and it allows users to pin specific versions when stability matters.
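This decoupling is easy to observe in practice. Below is a minimal sketch (an assumption of a Linux system with ctypes; it degrades gracefully when no driver or toolkit is installed) that queries the driver layer and runtime layer versions independently:

```python
import ctypes
import ctypes.util

def query_version(libname, symbol):
    """Load a CUDA shared library and query its version, if present."""
    path = ctypes.util.find_library(libname)
    if path is None:
        return None
    lib = ctypes.CDLL(path)
    version = ctypes.c_int(0)
    if getattr(lib, symbol)(ctypes.byref(version)) != 0:  # 0 == success
        return None
    return version.value  # encoded as 1000 * major + 10 * minor

# Driver layer: libcuda.so ships with the driver package
driver = query_version("cuda", "cuDriverGetVersion")
# Runtime layer: libcudart.so ships with the CUDA toolkit
runtime = query_version("cudart", "cudaRuntimeGetVersion")

for name, v in [("driver (libcuda)", driver), ("runtime (libcudart)", runtime)]:
    if v is None:
        print(f"{name}: not installed")
    else:
        print(f"{name}: {v // 1000}.{(v % 1000) // 10}")
```

On a working system the driver version must be at least as new as the runtime version; a mismatch in the other direction is the classic "CUDA driver version is insufficient" error.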
The Driver Layer
The driver layer is the lowest software layer in the CUDA stack — it talks directly to the GPU hardware and mediates all access to it. Everything above (the runtime API, the libraries, your kernels) eventually funnels through this layer.
nvidia.ko: the Linux kernel module. When loaded, it registers the GPU as a device with the operating system, manages GPU memory allocation at the physical level, schedules kernel launches on the GPU's command processors, and handles hardware interrupts. Without this module loaded (lsmod | grep nvidia will tell you), the GPU is invisible to the OS — it's just an unknown PCIe device. On systems with multiple GPUs, a single instance of the module manages all of them.
libcuda.so (CUDA Driver API): the user-space shared library that communicates with nvidia.ko via ioctl system calls. It provides low-level functions for every step of GPU programming: create a CUDA context (analogous to a process on the CPU), allocate device memory, load compiled kernels from PTX or SASS binaries, set kernel arguments, and launch kernels. The driver API is verbose and explicit — you manage contexts, modules, and memory manually. Most developers never touch it directly; the runtime API (libcudart.so) wraps it with a friendlier interface. But understanding that the driver API exists is important because it's the layer where version compatibility is enforced: the CUDA driver version determines the maximum CUDA toolkit version your system supports.
nvidia-smi: a command-line tool built on NVML (libnvml.so) that shows GPU status at a glance — temperature, memory usage, GPU utilisation, power draw, running processes, and the driver version. Think of it as the top or htop command for GPUs. It's typically the first tool you reach for when debugging GPU issues: "Is the GPU visible? Is anything running on it? How much memory is free?"
The Runtime Layer
The runtime layer sits above the driver and provides two things: a friendlier programming API, and the compilation toolchain that turns CUDA source code into executable GPU binaries.
libcudart.so (CUDA Runtime API): wraps the driver API with simpler, more automatic functions. Instead of manually creating CUDA contexts, loading modules, and setting kernel arguments one by one, you write cudaMalloc to allocate device memory, cudaMemcpy to transfer data between host and device, and the kernel<<<grid, block>>>(args) triple-chevron syntax to launch kernels. The runtime API manages CUDA contexts automatically (one primary context per device, shared across host threads) and handles lazy initialisation, so you don't need boilerplate setup code. The vast majority of CUDA C++ code uses the runtime API, not the driver API.
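The "no boilerplate" claim can be seen directly by calling the runtime through ctypes. This is a hedged sketch, not how you would normally use the runtime (that's CUDA C++); it simply shows that allocation and copies need no explicit context setup, and it degrades to a message when libcudart or a GPU is absent:

```python
import ctypes
import ctypes.util

path = ctypes.util.find_library("cudart")
if path is None:
    print("libcudart not found (CUDA toolkit not installed)")
else:
    cudart = ctypes.CDLL(path)
    dev_ptr = ctypes.c_void_p()
    # cudaMalloc(void **devPtr, size_t size); the first runtime call also
    # triggers lazy initialisation of the primary context.
    err = cudart.cudaMalloc(ctypes.byref(dev_ptr), ctypes.c_size_t(1 << 20))
    if err != 0:  # e.g. cudaErrorNoDevice on a machine without a GPU
        print(f"cudaMalloc failed with error code {err}")
    else:
        host = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
        # cudaMemcpy(dst, src, count, kind); kind 1 == cudaMemcpyHostToDevice
        cudart.cudaMemcpy(dev_ptr, host, ctypes.sizeof(host), 1)
        cudart.cudaFree(dev_ptr)
        print("allocated, copied, freed, with no explicit context code")
```

The equivalent driver-API program would need cuInit, cuDeviceGet, cuCtxCreate, and explicit cleanup before any of these calls.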
nvcc (NVIDIA CUDA Compiler Driver): compiles .cu files — C++ source with CUDA extensions — into executable code. But nvcc is not a single monolithic compiler. It's an orchestrator that splits the work between the host compiler and NVIDIA's device compiler pipeline:
your_kernel.cu (C++ CUDA source)
│
▼ nvcc separates host and device code
├── Host code → GCC/Clang → CPU executable
└── Device code → cicc → PTX → ptxas → SASS
│
▼
Embedded in the executable
(or .cubin / .fatbin file)
The process works as follows. nvcc reads the .cu file and separates it into two streams: host code (regular C++ — everything outside of __global__ and __device__ functions) and device code (the GPU kernels). Host code is passed to the system's C++ compiler — GCC or Clang on Linux, MSVC on Windows — and compiled to a normal CPU object file. Device code goes through NVIDIA's own compiler pipeline: first to cicc (which produces PTX, a portable intermediate representation), then to ptxas (which compiles PTX to SASS, the actual GPU machine code for a specific architecture). The two are then linked together into a single executable, with the SASS binary embedded as a data section.
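The device half of this pipeline can be driven by hand, which is useful for inspecting the intermediate PTX. A sketch under stated assumptions (the saxpy kernel and sm_80 target are illustrative; the commands only run when the CUDA toolkit is on PATH, and are printed otherwise):

```python
import os
import shutil
import subprocess
import tempfile

# A minimal valid kernel to feed through the pipeline
KERNEL = r"""
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "saxpy.cu")
with open(src, "w") as f:
    f.write(KERNEL)

ptx = os.path.join(workdir, "saxpy.ptx")
cubin = os.path.join(workdir, "saxpy.cubin")
steps = [
    # Stage 1: nvcc --ptx stops after the device front end, emitting PTX
    (["nvcc", "--ptx", src, "-o", ptx], "cicc emits portable PTX"),
    # Stage 2: ptxas assembles PTX into SASS for one specific architecture
    (["ptxas", "--gpu-name=sm_80", ptx, "-o", cubin], "ptxas emits SM 8.0 SASS"),
]

prev_ok = True
for cmd, why in steps:
    if not prev_ok or shutil.which(cmd[0]) is None:
        print(f"skipped ({why}): {' '.join(cmd)}")
        prev_ok = False
        continue
    subprocess.run(cmd, check=True)
    print(f"ran {cmd[0]}: {why}")
```

The intermediate saxpy.ptx is human-readable assembly-like text, which makes it a good place to check whether the compiler generated the instructions you expected.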
nvrtc (Runtime Compiler): allows compiling CUDA C++ kernels at runtime rather than ahead-of-time with nvcc. You pass a string of CUDA source code to nvrtc, and it returns compiled PTX (or SASS) that you can then load and launch via the driver API. This is the mechanism behind just-in-time (JIT) compilation in the CUDA ecosystem. Libraries that generate GPU kernels dynamically — like Triton, XLA, and TVM — rely heavily on nvrtc (or on PTX JIT compilation in the driver). JIT compilation adds startup latency, but it enables generating kernels specialised for specific shapes, dtypes, or hardware features at runtime, which can yield better performance than static compilation.
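The nvrtc workflow (source string in, PTX out) can be sketched through ctypes. This is an illustrative sketch, assuming libnvrtc is discoverable and using the documented C entry points nvrtcCreateProgram, nvrtcCompileProgram, and nvrtcGetPTX; it prints a message when the library is absent:

```python
import ctypes
import ctypes.util

# CUDA C++ source held as an ordinary string, compiled at runtime
KERNEL = b"""
extern "C" __global__ void scale(float *x, float a) {
    x[threadIdx.x] *= a;
}
"""

path = ctypes.util.find_library("nvrtc")
if path is None:
    print("libnvrtc not found (CUDA toolkit not installed)")
else:
    nvrtc = ctypes.CDLL(path)
    prog = ctypes.c_void_p()
    # nvrtcCreateProgram(prog, src, name, numHeaders, headers, includeNames)
    assert nvrtc.nvrtcCreateProgram(
        ctypes.byref(prog), KERNEL, b"scale.cu", 0, None, None) == 0
    # Targeting a virtual architecture (compute_70) yields PTX, not SASS
    opts = (ctypes.c_char_p * 1)(b"--gpu-architecture=compute_70")
    assert nvrtc.nvrtcCompileProgram(prog, 1, opts) == 0
    size = ctypes.c_size_t()
    nvrtc.nvrtcGetPTXSize(prog, ctypes.byref(size))
    buf = ctypes.create_string_buffer(size.value)
    nvrtc.nvrtcGetPTX(prog, buf)
    print(buf.value.decode()[:200])  # the head of the generated PTX
```

The returned PTX string is exactly what you would then hand to the driver API (cuModuleLoadData) to load and launch, which is the loop that JIT-compiling frameworks run continuously.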
The Library Layer: cuBLAS and cuDNN
The library layer provides pre-compiled, highly optimised GPU kernels for common computational patterns. These libraries are the real workhorses behind deep learning frameworks — when you call torch.matmul or nn.Conv2d on a GPU tensor, PyTorch dispatches to one of these libraries rather than running a naive CUDA kernel.
cuBLAS (CUDA Basic Linear Algebra Subprograms): the GPU implementation of the BLAS standard. It provides matrix multiplication (sgemm for FP32, dgemm for FP64, hgemm for FP16), dot products, matrix-vector products, triangular solves, and other linear algebra primitives. Every torch.matmul, torch.mm, and nn.Linear forward pass eventually calls cuBLAS on GPU.
The naming convention follows the BLAS tradition: cublas[S|D|H]gemm, where S = single precision (FP32), D = double (FP64), H = half (FP16). gemm stands for General Matrix Multiply:

C ← α · (A × B) + β · C

The α and β scalars allow fusing a scale and accumulate into the matmul itself, avoiding a separate kernel launch. In deep learning, α = 1 and β = 0 is the common case (plain matrix multiply), but the general form is useful for operations like adding a bias or accumulating gradients.
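The gemm contract is simple enough to mirror on the CPU. In the sketch below, NumPy stands in for the GPU: cuBLAS computes the same expression, just on device memory and with hand-tuned kernels:

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """General Matrix Multiply: C <- alpha * (A @ B) + beta * C."""
    return alpha * (A @ B) + beta * C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)).astype(np.float32)  # FP32, as in sgemm
B = rng.standard_normal((3, 5)).astype(np.float32)
C = rng.standard_normal((4, 5)).astype(np.float32)

# The common deep learning case: alpha=1, beta=0 is a plain matmul
assert np.allclose(gemm(1.0, A, B, 0.0, C), A @ B)
# beta=1 accumulates into the existing C, e.g. for gradient accumulation
assert np.allclose(gemm(1.0, A, B, 1.0, C), A @ B + C)
print("gemm semantics verified")
```

Fusing the scale and accumulate into one call matters on GPU because each extra elementwise kernel launch would re-read and re-write C from device memory.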
cuDNN (CUDA Deep Neural Network library): provides GPU implementations of the operations that dominate neural network computation — convolutions (with multiple algorithms: Winograd, FFT-based, implicit GEMM), pooling, batch normalisation, softmax, activation functions, and in newer versions, fused multi-head attention (including FlashAttention-style implementations). Every nn.Conv2d, F.softmax, and nn.BatchNorm2d eventually calls cuDNN on GPU.
Why pre-compiled rather than compiled at install time? Because performance at this level requires hand-tuning by teams of GPU architects who exploit hardware-specific features — Tensor Core warp-level matrix operations, asynchronous memory copy instructions, shared memory swizzling patterns, register-level optimisations — that automated compilers generally cannot discover. The resulting SASS binaries are often 10–100× faster than a naive CUDA implementation of the same operation. NVIDIA effectively amortises the cost of this expert-level tuning across every user of the library.
Compute Capability and PTX Compatibility
Not all NVIDIA GPUs support the same features. Tensor Cores, BF16 arithmetic, FP8, and hardware-accelerated memory copy instructions were each introduced in a specific GPU generation. NVIDIA tracks this with Compute Capability — a versioning scheme that tells software exactly which hardware features are available on a given GPU.
The major milestones relevant to modern deep learning:
- SM 7.0 (Volta — V100): introduced first-generation Tensor Cores (FP16 matrix multiply-accumulate). This was the GPU that made mixed-precision training practical.
- SM 8.0 (Ampere — A100): added BF16 support, TF32 (a 19-bit format for transparent FP32 speedups), 3rd-generation Tensor Cores, and sparsity support.
- SM 8.9 (Ada Lovelace — RTX 4090): added FP8 (E4M3 and E5M2 formats) and 4th-generation Tensor Cores. FP8 further halves the memory footprint relative to FP16.
- SM 9.0 (Hopper — H100): added the Tensor Memory Accelerator (TMA) for asynchronous bulk data movement, warp-group level matrix operations (WGMMA), and the Transformer Engine for dynamic FP8/FP16 switching.
Why does this matter for the software stack? Because SASS — the final GPU machine code — is specific to a compute capability. SASS compiled for SM 8.0 will not run on an SM 7.0 GPU, and it also won't automatically use SM 9.0 features if you run it on an H100. PTX, by contrast, is forward-compatible: PTX compiled for SM 7.0 can be JIT-compiled by the driver to SASS on an SM 9.0 GPU. The trade-off is startup latency — JIT compilation happens at load time — and potentially missed optimisations that the driver's JIT compiler doesn't know about.
This is why fat binaries exist. A single .fatbin file can contain SASS for multiple compute capabilities plus PTX as a fallback. At runtime, the driver selects the best available SASS for the current GPU, falling back to JIT-compiling the PTX if no matching SASS is found. This ensures the code runs on any CUDA-capable GPU — with optimal performance on architectures that have pre-compiled SASS, and slightly slower JIT-compiled performance on others. It also explains why CUDA libraries like cuBLAS are so large: they bundle SASS for every supported GPU generation.
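The driver's selection rule can be sketched in a few lines. This is a simplified model (it uses the documented rule that SASS is binary-compatible within a major compute capability version but never across majors; the fat binary contents are hypothetical):

```python
def select_binary(fatbin_sass, device_cc):
    """Pick the best embedded SASS for a device; otherwise fall back to
    JIT-compiling the embedded PTX. SASS for sm_80 can run on sm_86
    (same major, equal-or-newer minor), but never on a different major."""
    dev_major, dev_minor = device_cc
    candidates = [
        (maj, mnr) for (maj, mnr) in fatbin_sass
        if maj == dev_major and mnr <= dev_minor
    ]
    if candidates:
        maj, mnr = max(candidates)
        return f"SASS sm_{maj}{mnr}"
    return "JIT-compile embedded PTX"

# A hypothetical fat binary carrying SASS for Volta, Ampere, and Hopper
fatbin = [(7, 0), (8, 0), (9, 0)]

print(select_binary(fatbin, (8, 6)))   # RTX 3090: reuses the sm_80 SASS
print(select_binary(fatbin, (9, 0)))   # H100: exact sm_90 SASS
print(select_binary(fatbin, (12, 0)))  # future GPU: PTX JIT fallback
```

The PTX fallback branch is what keeps old binaries working on GPUs that didn't exist when they were compiled, at the cost of load-time JIT compilation.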
# Compute capabilities and their key features
architectures = [
("SM 7.0", "Volta (V100)", 2017, "1st-gen Tensor Cores, FP16 MMA"),
("SM 7.5", "Turing (RTX 2080)", 2018, "INT8 Tensor Cores, RT Cores"),
("SM 8.0", "Ampere (A100)", 2020, "BF16, TF32, 3rd-gen Tensor Cores"),
("SM 8.6", "Ampere (RTX 3090)", 2020, "Consumer Ampere, same TC gen"),
("SM 8.9", "Ada Lovelace (4090)", 2022, "FP8, 4th-gen Tensor Cores"),
("SM 9.0", "Hopper (H100)", 2022, "TMA, WGMMA, Transformer Engine"),
]
print("NVIDIA Compute Capability Timeline")
print("=" * 72)
print(f"{'SM':>6} {'Architecture':<25} {'Year':<6} {'Key Features'}")
print("-" * 72)
for sm, arch, year, features in architectures:
print(f"{sm:>6} {arch:<25} {year:<6} {features}")
print()
print("Compatibility rules:")
print(" SASS: runs ONLY on the exact SM version it was compiled for")
print(" PTX: forward-compatible (SM 7.0 PTX → JIT-compiled on SM 9.0)")
print(" Fat binary: bundles multiple SASS + PTX fallback → runs everywhere")
Profiling: nvidia-smi, Nsight, and CUPTI
Once your code runs on the GPU, the next question is whether it's running well. Is it using the Tensor Cores? Is it bottlenecked on memory bandwidth? Is it spending more time copying data than computing? The CUDA ecosystem provides a suite of profiling tools at different levels of granularity, and choosing the right one depends on what question you're asking.
- nvidia-smi: the quick health check. Shows GPU utilisation, memory usage, temperature, power draw, and which processes are using the GPU. It's the first tool you use — not for deep performance analysis, but to confirm the GPU is active and roughly how loaded it is. Run nvidia-smi dmon for continuous monitoring.
- NVML (libnvml.so): the C library behind nvidia-smi. Provides programmatic access to the same GPU metrics — useful for building dashboards, alerting systems, or logging GPU health in production. Libraries like pynvml wrap it for Python.
- Nsight Systems: a system-wide profiler that traces the full execution timeline — CPU activity, GPU kernel launches, memory transfers (H2D, D2H, D2D), CUDA API calls, and their temporal relationships. It answers macro-level questions: where is time being spent? Is the GPU idle between kernel launches? Are memory copies overlapping with compute? Is the CPU the bottleneck? The output is a timeline visualisation where you can zoom in on specific intervals.
- Nsight Compute: a kernel-level profiler that deep-dives into a single GPU kernel's performance. It measures occupancy (how many warps are active relative to the maximum), memory bandwidth utilisation, compute throughput, instruction mix, stall reasons, and warp-level statistics. It answers micro-level questions: is this kernel compute-bound or memory-bound? Where are the warps stalling? Is shared memory being used efficiently?
- CUPTI (CUDA Profiling Tools Interface): the low-level API that both Nsight Systems and Nsight Compute build on. CUPTI provides access to hardware performance counters, activity tracing, and callback mechanisms. Third-party tools (like PyTorch Profiler's CUDA tracing) also use CUPTI under the hood.
The typical profiling workflow goes from coarse to fine: start with nvidia-smi to verify the GPU is active, use Nsight Systems to find which kernels or memory operations dominate the runtime, and then use Nsight Compute to drill into the specific kernels that look suboptimal. Jumping straight to Nsight Compute without first understanding the system-level picture tends to produce optimisations that don't matter — you might perfect a kernel that only accounts for 2% of total runtime.
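The NVML route is the one to reach for when you want these health checks in code rather than in a terminal. A minimal sketch, assuming the pynvml package is available (it degrades to a message when it, or a working driver, is absent):

```python
try:
    import pynvml
except ImportError:
    pynvml = None

if pynvml is None:
    print("pynvml not installed (pip install nvidia-ml-py)")
else:
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as e:
        print(f"NVML init failed (driver not loaded?): {e}")
    else:
        # The same metrics nvidia-smi shows, now available to a dashboard
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(f"GPU {i}: {util.gpu}% util, "
                  f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
        pynvml.nvmlShutdown()
```

This is the kind of loop that production GPU monitoring agents run on an interval, exporting the numbers to a metrics system instead of printing them.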
Quiz
Test your understanding of the CUDA software stack — the driver, runtime, and library layers, compilation pipeline, compute capabilities, and profiling tools.
What does nvcc do when compiling a .cu file?
Why are cuBLAS and cuDNN SASS binaries shipped pre-compiled rather than compiled at install time?
What is Compute Capability and why does it matter for the software stack?
What is the difference between Nsight Systems and Nsight Compute?