Why Nonlinearity?
Without activation functions, a neural network — no matter how many layers — is just a single linear transformation. Stacking linear layers $y = W_2(W_1 x + b_1) + b_2$ simplifies to $y = W'x + b'$ — the depth adds no expressiveness. Activation functions break this linearity, allowing networks to learn curved decision boundaries, thresholds, and complex patterns.
To see why stacking collapses, expand the two-layer expression:

$$y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2) = W'x + b'$$
where $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$. This collapses to a single layer because matrix multiplication is associative and linear — multiplying two matrices just produces another matrix. No matter how many layers you stack, the result is always some $W'x + b'$, a single affine transformation. You could replace the entire network with one layer and get the same input–output mapping.
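This collapse is easy to verify numerically. The sketch below builds two random linear layers (arbitrary sizes chosen for illustration) and checks that the stacked mapping equals the single collapsed affine map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with arbitrary illustrative sizes: 4 -> 3 -> 2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Stacked linear layers
y_stacked = W2 @ (W1 @ x + b1) + b2

# Collapsed single affine layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ x + b_prime

# The two mappings agree up to floating-point error
assert np.allclose(y_stacked, y_single)
```

The same check succeeds for any depth: composing affine maps always yields another affine map.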
An activation function $\sigma$ between layers breaks this collapse. Consider what happens when we insert one:

$$y = W_2 \, \sigma(W_1 x + b_1) + b_2$$
The function $\sigma(W_1 x + b_1)$ is not linear, so the composition $W_2 \, \sigma(W_1 x + b_1) + b_2$ cannot be reduced to a single matrix multiply. The nonlinearity creates a fundamentally richer function space — each layer can now bend and reshape its input in ways that no single linear layer can replicate. This is what gives deep networks their power: not the depth itself, but the nonlinear transformations between layers.
ReLU: The Default Choice
The Rectified Linear Unit (ReLU) is the simplest widely used activation function:

$$\text{ReLU}(x) = \max(0, x)$$
The behaviour is straightforward:
- When $x > 0$: the output is just $x$ (the identity). The gradient is exactly 1, so the signal passes through with no attenuation. This avoids the vanishing gradient problem that plagued earlier activations.
- When $x \leq 0$: the output is 0. The gradient is also 0 — the neuron is "killed" for this input. This is the nonlinearity: ReLU zeroes out all negative pre-activations, effectively selecting which neurons fire and which don't.
Why this form? ReLU is arguably the simplest possible nonlinearity. It requires just one comparison ($x > 0$?), making it extremely fast to compute. Its gradient is either 0 or 1 — never a small fraction — so positive signals propagate backward without shrinking. These properties made ReLU the activation that finally enabled training of deep networks (Nair & Hinton, 2010; Glorot et al., 2011).
There is a well-known failure mode, however: the dying ReLU problem. If a neuron's weights drift such that its pre-activation $W x + b$ is always negative for every input in the training set, then the neuron's output is permanently 0. Since the gradient is also 0 in this region, the weights never receive an update signal — the neuron is effectively dead. Once dead, it stays dead. This tends to happen more often with high learning rates or poor initialisation, and it can silently reduce a network's effective capacity.
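A minimal illustration of a dead neuron, using hypothetical weights chosen so the pre-activation is negative for the whole batch (the inputs are non-negative, as they would be after a previous ReLU layer):

```python
import numpy as np

# A single neuron whose weights make W x + b negative for every input below
W = np.array([-2.0, -1.0])
b = -5.0

# A batch of non-negative inputs (e.g. outputs of a previous ReLU layer)
X = np.abs(np.random.default_rng(1).normal(size=(100, 2)))

pre_activation = X @ W + b                     # negative for every sample
output = np.maximum(0, pre_activation)         # ReLU
grad_mask = (pre_activation > 0).astype(float)  # gradient of ReLU w.r.t. its input

print(output.max())     # 0.0 — the neuron never fires
print(grad_mask.sum())  # 0.0 — no sample sends any gradient back, so W and b never update
```

Because the gradient mask is zero for every sample, no optimiser step can move the weights back into a regime where the neuron fires.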
The plot below shows ReLU and its gradient. Notice the sharp corner at $x = 0$ and the flat zero region for all negative inputs:
import numpy as np
import json
import js  # Pyodide bridge: hands the plot data to the hosting page

x = np.linspace(-5, 5, 200)
relu = np.maximum(0, x)
relu_grad = (x > 0).astype(float)

plot_data = [{
    "title": "ReLU and its Gradient",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x.tolist(),
    "lines": [
        {"label": "ReLU(x)", "data": relu.tolist(), "color": "#3b82f6"},
        {"label": "Gradient", "data": relu_grad.tolist(), "color": "#ef4444"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
Leaky ReLU and GELU
Leaky ReLU is a direct fix for the dying ReLU problem. Instead of outputting exactly 0 for negative inputs, it applies a small slope $\alpha$ (typically 0.01):

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$
When $x \leq 0$, the gradient is $\alpha$ instead of 0, so neurons on the negative side can still receive gradient updates and recover. The tradeoff is that the output for negative inputs is no longer exactly zero, which means the network can't "ignore" irrelevant features as cleanly as ReLU does. In practice, this tends to be a minor concern — keeping neurons alive is usually worth a slightly noisier negative region.
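A small sketch of Leaky ReLU and its gradient, showing that negative inputs still carry a gradient of $\alpha$ rather than 0:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 on the positive side, alpha (not 0) on the negative side
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))       # negative inputs are scaled by 0.01, positive pass through
print(leaky_relu_grad(x))  # 0.01 for negative inputs, 1.0 for positive
```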
GELU (Gaussian Error Linear Unit) (Hendrycks & Gimpel, 2016) is the activation used in GPT, BERT, and most modern transformers:

$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution — the probability that a standard normal random variable $Z \sim \mathcal{N}(0, 1)$ is less than $x$. This gives GELU an elegant probabilistic interpretation: it multiplies each input by the probability that the input is "large enough" to keep. The behaviour in different regimes:
- When $x \gg 0$: $\Phi(x) \approx 1$, so $\text{GELU}(x) \approx x$ — it behaves like the identity, just like ReLU.
- When $x \ll 0$: $\Phi(x) \approx 0$, so $\text{GELU}(x) \approx 0$ — it kills negative inputs, just like ReLU.
- Near $x = 0$: there is a smooth, continuous transition. Unlike ReLU's sharp corner at zero, GELU curves gently through the origin. This smooth gradient landscape tends to help optimisation, especially in deep networks where sharp gradient discontinuities can cause instability.
You can think of GELU as a smooth, probabilistic version of ReLU: "include $x$ with probability $\Phi(x)$". When the input is clearly positive, it's almost certainly included. When it's clearly negative, it's almost certainly zeroed out. In the ambiguous region near zero, the function smoothly interpolates.
In practice, evaluating the exact Gaussian CDF is expensive, so implementations use a tanh-based approximation:

$$\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right]\right)$$
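How good is the tanh approximation? A quick check against the exact CDF-based form (using Python's `math.erf`, since $\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$) shows the two agree closely over a typical input range:

```python
import math
import numpy as np

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return np.array([0.5 * v * (1 + math.erf(v / math.sqrt(2))) for v in x])

def gelu_tanh(x):
    # The tanh-based approximation used by most implementations
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5, 5, 1001)
max_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(max_err)  # small — well under 0.01 across this range
```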
The plot below compares all three activations. Notice how GELU and Leaky ReLU both allow some signal through for negative inputs, unlike the hard cutoff of standard ReLU:
import numpy as np
import json
import js  # Pyodide bridge: hands the plot data to the hosting page
from math import sqrt, pi

x = np.linspace(-3, 3, 300)
relu = np.maximum(0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)
# GELU tanh approximation
gelu = 0.5 * x * (1 + np.tanh(sqrt(2 / pi) * (x + 0.044715 * x**3)))

plot_data = [{
    "title": "ReLU vs Leaky ReLU vs GELU",
    "x_label": "x",
    "y_label": "f(x)",
    "x_data": x.tolist(),
    "lines": [
        {"label": "ReLU", "data": relu.tolist(), "color": "#3b82f6"},
        {"label": "Leaky ReLU (α=0.01)", "data": leaky_relu.tolist(), "color": "#f59e0b"},
        {"label": "GELU", "data": gelu.tolist(), "color": "#10b981"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)
Sigmoid and Tanh
Sigmoid and tanh are the classic activation functions — they dominated neural networks before ReLU and are still important in specific roles today, even if they have been largely replaced by ReLU-family activations in hidden layers.
Sigmoid squashes any real number into the range $(0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Let's trace through the behaviour:
- When $x \gg 0$: $e^{-x} \to 0$, so the denominator approaches 1, and $\sigma(x) \to 1$.
- When $x \ll 0$: $e^{-x} \to \infty$, so the denominator blows up, and $\sigma(x) \to 0$.
- When $x = 0$: $e^{0} = 1$, so $\sigma(0) = \frac{1}{2}$ — the midpoint of the output range.
The output is always positive and bounded between 0 and 1, which is exactly why sigmoid is used when you need a probability: binary classification output layers, gates in LSTMs and GRUs, and attention mechanisms. However, using sigmoid in hidden layers of deep networks is problematic because of its gradient:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
This gradient reaches its maximum at $x = 0$, where $\sigma'(0) = 0.5 \times 0.5 = 0.25$. For large $|x|$, the gradient vanishes toward zero. This is the vanishing gradient problem: in a network with $n$ sigmoid layers, gradients are multiplied through the chain rule, and since each sigmoid contributes at most a factor of 0.25, the gradient reaching the early layers decays roughly as $0.25^n$. With 10 layers, that's a factor of about $10^{-6}$ — the early layers essentially stop learning.
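The $0.25^n$ decay can be checked directly. This is a toy best-case calculation, assuming every sigmoid layer sits exactly at $x = 0$ and contributes its maximum gradient of 0.25:

```python
# Best case: each sigmoid layer multiplies the backward signal by at most 0.25
for n in [2, 5, 10, 20]:
    print(n, 0.25 ** n)
# 10 layers: 0.25**10 ≈ 9.5e-07 — gradients reaching the first layer are about a millionth
```

In a real network the per-layer factors are usually well below 0.25 (most pre-activations are not exactly 0), so the actual decay is even faster.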
Tanh is closely related to sigmoid but outputs values in $(-1, 1)$:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1$$
The second form shows that tanh is just a rescaled and shifted sigmoid. The key differences:
- Zero-centered: tanh outputs range from $-1$ to $1$, centred around zero. This is better for hidden layers because zero-centered activations don't systematically bias the gradient direction. With sigmoid (always positive outputs), the gradients on the weights all share the same sign within a layer, forcing the weight updates to zigzag toward the optimum.
- Stronger gradient: the derivative is $\tanh'(x) = 1 - \tanh^2(x)$, which reaches a maximum of 1.0 at $x = 0$ (compared to sigmoid's 0.25). Tanh still suffers from vanishing gradients at large $|x|$, but the problem is considerably less severe: near the origin each layer multiplies the gradient by a factor close to 1 rather than at most 0.25, so the signal does not shrink exponentially with depth.
The plots below show both functions and their gradients side by side. Notice how tanh's gradient peak is four times higher than sigmoid's:
import numpy as np
import json
import js  # Pyodide bridge: hands the plot data to the hosting page

x = np.linspace(-5, 5, 200)
sigmoid = 1 / (1 + np.exp(-x))
sigmoid_grad = sigmoid * (1 - sigmoid)
tanh = np.tanh(x)
tanh_grad = 1 - tanh**2

plot_data = [
    {
        "title": "Sigmoid and Tanh",
        "x_label": "x",
        "y_label": "f(x)",
        "x_data": x.tolist(),
        "lines": [
            {"label": "Sigmoid", "data": sigmoid.tolist(), "color": "#8b5cf6"},
            {"label": "Tanh", "data": tanh.tolist(), "color": "#ec4899"}
        ]
    },
    {
        "title": "Gradients: Sigmoid vs Tanh",
        "x_label": "x",
        "y_label": "f'(x)",
        "x_data": x.tolist(),
        "lines": [
            {"label": "Sigmoid gradient (max 0.25)", "data": sigmoid_grad.tolist(), "color": "#8b5cf6"},
            {"label": "Tanh gradient (max 1.0)", "data": tanh_grad.tolist(), "color": "#ec4899"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)
Swish / SiLU
Swish (also called SiLU — Sigmoid Linear Unit) was discovered through automated neural architecture search (Ramachandran et al., 2017):

$$\text{Swish}(x) = x \cdot \sigma(x)$$
The structure is similar to GELU — both multiply $x$ by a gating function — but Swish uses the sigmoid $\sigma(x)$ as the gate instead of the Gaussian CDF $\Phi(x)$. The behaviour follows the same pattern:
- When $x \gg 0$: $\sigma(x) \approx 1$, so $\text{Swish}(x) \approx x$ — the identity, like ReLU.
- When $x \ll 0$: $\sigma(x) \approx 0$, so $\text{Swish}(x) \approx 0$ — the input is suppressed, like ReLU.
- Near $x \approx -1.28$: Swish dips slightly below zero (reaching a minimum of about $-0.28$). This small negative region is a distinctive feature — unlike ReLU, which is strictly non-negative, Swish allows a small "undershoot" that appears to help optimisation by providing a richer gradient signal.
Swish is smooth and non-monotonic (because of that small dip below zero). The smooth gradient everywhere avoids the sharp discontinuity of ReLU at the origin, and the non-monotonicity provides a form of implicit regularisation — inputs near zero are treated differently from strongly positive or negative ones.
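The location and depth of Swish's dip can be found numerically — a quick grid-search sketch:

```python
import numpy as np

def swish(x):
    # x * sigmoid(x)
    return x / (1 + np.exp(-x))

x = np.linspace(-5, 5, 100001)
y = swish(x)
i = np.argmin(y)
print(x[i], y[i])  # minimum near x ≈ -1.278, value ≈ -0.278
```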
When to Use What
With so many activations available, here is a practical guide for choosing one:
- ReLU: the default for most hidden layers. Simple, fast, and it works well. Use it unless you have a specific reason not to.
- GELU: the standard in transformers (GPT, BERT, LLaMA). Its smooth gradients help with optimisation in very deep models, and it has become the de facto choice for attention-based architectures.
- Swish / SiLU: common in vision models (EfficientNet) and some LLM architectures (LLaMA uses SiLU in its feed-forward network). Very similar to GELU in practice.
- Sigmoid: output layer for binary classification (it produces a probability in $(0, 1)$). Also used as gates in recurrent architectures (LSTM, GRU) and in attention mechanisms.
- Tanh: output layer when you need values in $(-1, 1)$. Used in some normalisation schemes and older RNN architectures. Preferred over sigmoid for hidden layers when a saturating activation is needed, thanks to its zero-centered output.
- Leaky ReLU: useful when dying ReLU is a problem — typically in smaller networks, with aggressive learning rates, or with inputs that are predominantly negative.
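For reference, every activation discussed above fits in a few lines of NumPy. This is a sketch for experimentation; real frameworks ship optimised, numerically careful versions:

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def gelu(x):
    # tanh approximation of x * Phi(x)
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # negative inputs zeroed
print(sigmoid(x))  # values in (0, 1), 0.5 at x = 0
print(gelu(x))     # ≈ 0 for x = -2, exactly 0 at x = 0, ≈ 2 for x = 2
```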
Quiz
Test your understanding of activation functions and their role in neural networks.
Why do neural networks need activation functions?
What is the 'dying ReLU' problem?
Why is sigmoid problematic for hidden layers in deep networks?
What makes GELU suitable for transformers?