Why Do We Need Probability in Deep Learning?

Machine learning is, at its core, the mathematics of uncertainty. A language model doesn't output a single next word — it outputs a probability distribution over every word in its vocabulary, and we sample from that distribution to generate text. A classifier doesn't declare "this is a cat" — it assigns probabilities like $[0.92, 0.05, 0.03]$ across classes, and we pick the most likely one. Even during training, the loss function that guides learning (cross-entropy) is defined entirely in terms of probability: it measures how well the model's predicted distribution matches the true one.

Probability distributions aren't just outputs — they're woven into the fabric of how models are built and optimised. The softmax function converts raw logits into a valid probability distribution. Weight initialisation schemes (Glorot, He) sample initial parameter values from carefully chosen distributions. Diffusion models generate images by systematically adding and removing Gaussian noise. And QLoRA's NF4 quantization uses the inverse CDF of the normal distribution to place quantization levels where model weights are densest.

So what exactly is a probability distribution? How do we describe one mathematically? And how do we compute useful quantities from it — like the probability of a value falling in some range, or the value below which a certain fraction of outcomes lie? This article builds those foundations. We'll start with the distinction between discrete and continuous distributions, develop the core tools (PDF, CDF, and the quantile function), and focus in depth on the normal distribution — the single most important distribution in deep learning.

Discrete vs Continuous Distributions

Before diving into formulas, we need to distinguish two fundamentally different kinds of random variables, because they require different mathematical tools.

A discrete distribution has a finite (or countably infinite) set of possible outcomes. Rolling a die gives one of $\{1, 2, 3, 4, 5, 6\}$. Next-token prediction in a language model selects from a vocabulary of, say, 50,000 tokens. In these cases, we can assign a specific probability to each outcome: $P(X = 3) = \frac{1}{6}$ for a fair die, or $P(\text{next token} = \text{"the"}) = 0.12$ for a language model. This assignment of probabilities to individual outcomes is called a probability mass function (PMF).
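
To make the PMF idea concrete, here's a minimal sketch of the fair-die example using exact fractions (the `pmf` dict is just an illustration, not a library API):

```python
from fractions import Fraction

# PMF of a fair six-sided die: each outcome gets probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

print(pmf[3])             # P(X = 3) = 1/6
print(sum(pmf.values()))  # the probabilities sum to exactly 1
```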

A continuous distribution has outcomes over a real-valued range — an uncountable infinity of possible values. The weight values in a neural network are continuous. The noise added during diffusion is continuous. A temperature forecast is continuous. Here's where things get conceptually tricky: what is the probability that a weight is exactly $0.314159265...$, with infinitely many decimal places matching? It's zero. Not approximately zero — exactly zero. There are uncountably many possible real numbers, and if each one had even a tiny positive probability, the total would exceed 1 (violating the axiom that all probabilities must sum to 1). So $P(X = \text{any specific value}) = 0$ for continuous random variables.

This means we can't use a PMF for continuous distributions. We need a different tool: one that tells us where values are likely to cluster without assigning probability to any single point. That tool is the probability density function.

The Probability Density Function (PDF)

A probability density function (PDF), written $f(x)$, describes the relative likelihood of a continuous random variable taking on a given value. The key word is density — just as mass density tells you how much mass is packed into a unit of volume, probability density tells you how much probability is packed into a unit of $x$. High density at a point means values are likely to cluster near it. Low density means they're unlikely to be found there.

But here's the critical insight that trips up many students: $f(x)$ is not a probability. You cannot say "the probability of $X = 3$ is $f(3)$." Probabilities come from areas under the curve. To get an actual probability, you must integrate the PDF over an interval:

$$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$

Let's break this apart.

$P(a \leq X \leq b)$ is the probability that the random variable $X$ takes a value between $a$ and $b$. This is a proper probability — a number between 0 and 1.

$\int_a^b f(x) \, dx$ is the area under the curve $f(x)$ between $a$ and $b$. Geometrically, you're shading the region between the PDF curve and the $x$-axis from $a$ to $b$, and computing its area. The wider the interval or the taller the PDF in that region, the larger the probability.

Now let's trace the boundary behaviour. As the interval shrinks to a single point ($a = b$), the width of the region goes to zero, and so the area goes to zero: $\int_a^a f(x) \, dx = 0$. This confirms what we said earlier — $P(X = \text{exactly } c) = 0$ for any continuous distribution. No matter how tall the PDF is at a point, a region of zero width has zero area. On the other extreme, integrating over the entire real line must give 1, because the total probability of the random variable taking some value is certain:

$$\int_{-\infty}^{\infty} f(x) \, dx = 1$$

This leads to the two properties that every valid PDF must satisfy:

  • $f(x) \geq 0$ for all $x$ — density cannot be negative. Negative probability doesn't make sense.
  • $\int_{-\infty}^{\infty} f(x) \, dx = 1$ — the total probability across all possible values must equal 1. Something must happen.
📌 $f(x)$ CAN be greater than 1. This surprises many people, but remember: $f(x)$ is a density, not a probability. A very narrow, peaked distribution can have $f(x) > 1$ near its peak, as long as the total area under the curve is still exactly 1. For example, a uniform distribution on $[0, 0.5]$ has $f(x) = 2$ throughout its support — a density of 2, but the total area is $2 \times 0.5 = 1$. Perfectly valid.
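
Both properties are easy to check numerically. The sketch below uses a simple trapezoidal sum (the `integrate` helper is illustrative, not a library function) to verify that the standard normal PDF integrates to 1 over the whole line, and that integrating over an interval yields a proper probability:

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the normal distribution."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def integrate(f, a, b, n=100_000):
    """Trapezoidal-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    total += sum(f(a + i * h) for i in range(1, n))
    return total * h

# Total area: [-10, 10] is effectively the whole real line for N(0, 1)
print(integrate(norm_pdf, -10, 10))  # ≈ 1.0
# A proper probability: P(-1 <= X <= 1)
print(integrate(norm_pdf, -1, 1))    # ≈ 0.6827
```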

The plot below shows the PDF of the standard normal distribution (the bell curve). The shaded region between $x = -1$ and $x = 1$ represents $P(-1 \leq X \leq 1)$ — about 68% of the total area. Notice how the density peaks at $x = 0$ and falls off symmetrically in both directions.

import math, json, js

def norm_pdf(x, mu=0, sigma=1):
    """PDF of the normal distribution."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

# Generate x values from -4 to 4
x_vals = [i * 0.05 for i in range(-80, 81)]  # -4.0 to 4.0 in steps of 0.05

# Full PDF curve
y_full = [norm_pdf(x) for x in x_vals]

# Shaded region: PDF between -1 and 1, zero elsewhere
y_shaded = [norm_pdf(x) if -1 <= x <= 1 else 0 for x in x_vals]

plot_data = [
    {
        "title": "Standard Normal PDF with P(-1 ≤ X ≤ 1) ≈ 0.683 shaded",
        "x_label": "x",
        "y_label": "f(x)",
        "x_data": x_vals,
        "lines": [
            {"label": "N(0,1) PDF", "data": y_full, "color": "#3b82f6"},
            {"label": "P(-1 ≤ X ≤ 1)", "data": y_shaded, "color": "#93c5fd", "fill": True}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

The Normal (Gaussian) Distribution

The normal distribution is the most important distribution in statistics and deep learning. Its bell-shaped curve appears everywhere: in the distribution of measurement errors, in the initial values of neural network weights, in the noise added by diffusion models, and in the empirical distribution of pre-trained model weights (which is why NF4 quantization works so well). Understanding its formula is essential because it's the building block for so many techniques you'll encounter.

The normal distribution is parameterised by two values: the mean $\mu$ and the standard deviation $\sigma$. Its PDF is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

This formula looks intimidating, but every piece has a clear purpose. Let's unpack them one by one.

$\mu$ (mu) — the mean. This is the centre of the bell curve: the value where the peak sits. Shifting $\mu$ left or right slides the entire distribution along the $x$-axis without changing its shape. In deep learning, when we say weights are initialised from $\mathcal{N}(0, \sigma^2)$, the $\mu = 0$ means the initial weights are centred around zero.

$\sigma$ (sigma) — the standard deviation. This controls the width of the bell curve. A small $\sigma$ means values cluster tightly around the mean (a tall, narrow bell), while a large $\sigma$ means they spread out widely (a short, wide bell). The variance $\sigma^2$ appears in the formula as the natural unit of spread.

$\frac{1}{\sigma\sqrt{2\pi}}$ — the normalising constant. This factor ensures the total area under the curve is exactly 1 (making it a valid PDF). Without it, changing $\sigma$ would change the total area. When $\sigma$ is small, the bell is narrow but must still have area 1, so this factor makes it tall. When $\sigma$ is large, the bell is wide, so this factor makes it short. It's purely a bookkeeping term that normalises the total probability to 1.

$-\frac{(x - \mu)^2}{2\sigma^2}$ — the exponent. This is the heart of the formula. It's a squared distance from the mean, scaled by the variance, with a minus sign. $(x - \mu)^2$ measures how far $x$ is from the centre — it's always non-negative and symmetric (being 5 units above the mean is treated the same as 5 units below). Dividing by $2\sigma^2$ normalises this distance by the spread: a deviation of 1 in a distribution with $\sigma = 0.5$ is a big deal (exponent = $-2$), but in a distribution with $\sigma = 10$ it's negligible (exponent = $-0.005$). The minus sign means points far from $\mu$ get a very negative exponent, which drives $\exp(\cdot)$ toward zero.
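
Those two numbers are trivial to verify in code — a quick sketch with an illustrative `exponent` helper:

```python
def exponent(x, mu=0.0, sigma=1.0):
    """The exponent term of the normal PDF: -(x - mu)^2 / (2 sigma^2)."""
    return -((x - mu) ** 2) / (2 * sigma ** 2)

# The same deviation of 1 from the mean, under different spreads:
print(exponent(1, sigma=0.5))  # -2.0: a big deal, density falls off sharply
print(exponent(1, sigma=10))   # -0.005: negligible, barely any falloff
```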

Now let's trace the boundary behaviour:

  • At $x = \mu$: the exponent is $-\frac{(\mu - \mu)^2}{2\sigma^2} = 0$, so $\exp(0) = 1$, and $f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}$. This is the peak of the bell curve. Notice the peak height is inversely proportional to $\sigma$ — a tighter distribution has a taller peak.
  • As $|x - \mu| \to \infty$: the exponent goes to $-\infty$, so $f(x) \to 0$. The tails of the bell curve decay toward zero, but they never quite reach it — the normal distribution has infinite support (it's defined over all real numbers), though values far from the mean are astronomically unlikely.
  • As $\sigma \to 0$: the distribution becomes infinitely tall and infinitely narrow, concentrating all probability at a single point $\mu$. In the limit, this approaches a Dirac delta function — a "spike" of infinite height but zero width that still has area 1.
  • As $\sigma \to \infty$: the distribution flattens toward zero everywhere. The bell becomes so wide that the density at any given point is negligible. (Note there is no valid uniform distribution over the entire real line — it could not integrate to 1 — so, like the $\sigma \to 0$ case, this is a degenerate limit.)

A useful rule of thumb is the 68-95-99.7 rule: approximately 68% of values drawn from a normal distribution fall within $1\sigma$ of the mean, 95% within $2\sigma$, and 99.7% within $3\sigma$. This tells you that values more than $3\sigma$ from the mean are extremely rare (about 3 in 1,000).
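
The rule is easy to confirm empirically. A quick sketch that draws samples from $\mathcal{N}(0, 1)$ with Python's `random.gauss` and counts how many land within $k\sigma$ (the sample size and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]

for k in (1, 2, 3):
    frac = sum(1 for s in samples if abs(s) <= k) / n
    print(f"fraction within {k}σ: {frac:.4f}")  # ≈ 0.683, 0.954, 0.997
```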

The standard normal distribution is the special case with $\mu = 0$ and $\sigma = 1$, written $\mathcal{N}(0, 1)$. It serves as the reference distribution — any normal distribution $\mathcal{N}(\mu, \sigma^2)$ can be converted to it by the substitution $z = \frac{x - \mu}{\sigma}$, a process called standardisation.
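
Standardisation preserves probabilities: $P(a \leq X \leq b)$ under $\mathcal{N}(\mu, \sigma^2)$ equals the probability of the standardised interval under $\mathcal{N}(0, 1)$. A sketch using a simple trapezoidal integral of the PDF (the `prob_between` helper is illustrative):

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def prob_between(a, b, mu=0.0, sigma=1.0, n=100_000):
    """P(a <= X <= b) via a trapezoidal sum over the PDF."""
    h = (b - a) / n
    total = 0.5 * (norm_pdf(a, mu, sigma) + norm_pdf(b, mu, sigma))
    total += sum(norm_pdf(a + i * h, mu, sigma) for i in range(1, n))
    return total * h

# P(3 <= X <= 7) for X ~ N(5, 2^2) ...
p_original = prob_between(3, 7, mu=5, sigma=2)
# ... equals P(-1 <= Z <= 1) after standardising: z = (x - 5) / 2
p_standard = prob_between(-1, 1)
print(p_original, p_standard)  # both ≈ 0.6827
```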

Why does this distribution matter so much for deep learning?

  • Weight initialisation: schemes like Glorot (Xavier) and He/Kaiming sample initial weights from normal distributions with carefully chosen $\sigma$ to maintain stable gradient flow through deep networks.
  • Pre-trained weight distributions: empirically, the weights of trained neural networks follow approximately normal distributions. This observation is what makes NF4 quantization work — placing quantization levels at the quantiles of a normal distribution matches where the weights actually are.
  • Diffusion models: the forward process in diffusion-based image generation progressively adds Gaussian noise to an image until it becomes pure noise from $\mathcal{N}(0, I)$. The reverse process learns to denoise step by step.
  • The Central Limit Theorem: the sum of many independent random variables tends toward a normal distribution, largely regardless of the original distributions (the classical statement requires finite variance). Deep learning is full of sums — weighted sums of inputs, averages over batches, accumulated gradients — so normal distributions emerge naturally.

The plot below overlays three normal PDFs to illustrate how $\mu$ and $\sigma$ affect the shape. $\mathcal{N}(0, 1)$ is the standard bell curve. The curve with $\mu = 0$, $\sigma = 0.5$ (labelled N(0, 0.5)) is taller and narrower — a smaller $\sigma$ concentrates probability. The curve with $\mu = 2$, $\sigma = 1$ has the same shape as the standard normal but is shifted right (different $\mu$).

import math, json, js

def norm_pdf(x, mu=0, sigma=1):
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

x_vals = [i * 0.05 for i in range(-80, 101)]  # -4.0 to 5.0

y_standard = [norm_pdf(x, 0, 1) for x in x_vals]
y_narrow   = [norm_pdf(x, 0, 0.5) for x in x_vals]
y_shifted  = [norm_pdf(x, 2, 1) for x in x_vals]

plot_data = [
    {
        "title": "Normal PDFs with different μ and σ",
        "x_label": "x",
        "y_label": "f(x)",
        "x_data": x_vals,
        "lines": [
            {"label": "N(0, 1)",   "data": y_standard, "color": "#3b82f6"},
            {"label": "N(0, 0.5)", "data": y_narrow,   "color": "#f59e0b"},
            {"label": "N(2, 1)",   "data": y_shifted,   "color": "#10b981"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)
💡 Notice that the $\sigma = 0.5$ curve peaks near $f(x) = 0.8$ (its peak is $\frac{1}{0.5\sqrt{2\pi}} \approx 0.798$) — and it would exceed 1 if $\sigma$ were smaller still. This is a concrete example of a PDF taking values greater than 1 and being perfectly valid. The density is high because the probability is concentrated in a very narrow band, but the total area is still exactly 1.

The Cumulative Distribution Function (CDF)

The PDF tells us where probability is dense, but often we need a different question answered: what is the probability that $X$ is less than or equal to some value $x$? For instance, what fraction of neural network weights fall below 0.5? What's the probability that a random sample from $\mathcal{N}(0, 1)$ is negative? The cumulative distribution function (CDF) answers exactly this.

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt$$

Let's break this apart.

$F(x)$ is the CDF evaluated at point $x$. It gives the total probability accumulated from $-\infty$ up to $x$ — the area under the PDF curve to the left of $x$.

$\int_{-\infty}^{x} f(t) \, dt$ is the integral of the PDF from negative infinity up to $x$. We use $t$ as the dummy variable of integration (instead of $x$) to distinguish the variable being integrated over from the fixed upper limit $x$. As $x$ increases, we accumulate more area, so $F(x)$ grows — that's why it's called the "cumulative" distribution function.

The CDF has several important properties:

  • Monotonically non-decreasing: $F(x)$ can only go up or stay flat, never decrease. Once probability has been accumulated, it can't be un-accumulated.
  • $\lim_{x \to -\infty} F(x) = 0$: as we move infinitely far to the left, no probability has been accumulated yet.
  • $\lim_{x \to \infty} F(x) = 1$: as we move infinitely far to the right, all probability has been accumulated.
  • $F(x) \in [0, 1]$: the CDF is always a valid probability.

For the standard normal distribution $\mathcal{N}(0, 1)$, the CDF is traditionally denoted $\Phi(x)$:

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt$$

This integral has no closed-form solution — there is no combination of elementary functions that equals $\Phi(x)$. We must compute it numerically. Fortunately, $\Phi(x)$ can be expressed using the error function $\text{erf}(x)$, which is available in most programming languages:

$$\Phi(x) = \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$

Some key values worth memorising:

  • $\Phi(0) = 0.5$ — exactly half the distribution lies below the mean. By symmetry, exactly half lies above it.
  • $\Phi(1.96) \approx 0.975$ — 97.5% of the distribution lies below $x = 1.96$. This is the basis of the famous 95% confidence interval: the central 95% of the distribution lies between $-1.96$ and $+1.96$.
  • $\Phi(-1.96) \approx 0.025$ — only 2.5% of the distribution lies below $x = -1.96$, the mirror image of the previous value.

The plot below shows $\Phi(x)$ — the characteristic S-shaped (sigmoid-like) curve. It starts near 0 on the far left, passes through 0.5 at $x = 0$, and approaches 1 on the far right. The steepest part of the curve is at $x = 0$, where the PDF (the derivative of the CDF) is highest.

import math, json, js

def norm_cdf(x):
    """CDF of the standard normal distribution using math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x_vals = [i * 0.05 for i in range(-80, 81)]  # -4.0 to 4.0
y_cdf = [norm_cdf(x) for x in x_vals]

# Horizontal reference lines at key probabilities
y_half = [0.5 for _ in x_vals]
y_025 = [0.025 for _ in x_vals]
y_975 = [0.975 for _ in x_vals]

plot_data = [
    {
        "title": "Standard Normal CDF Φ(x)",
        "x_label": "x",
        "y_label": "Φ(x)",
        "x_data": x_vals,
        "lines": [
            {"label": "Φ(x)", "data": y_cdf, "color": "#8b5cf6"},
            {"label": "Φ = 0.5 (at x = 0)", "data": y_half, "color": "#9ca3af"},
            {"label": "Φ = 0.025 (at x ≈ -1.96)", "data": y_025, "color": "#f87171"},
            {"label": "Φ = 0.975 (at x ≈ 1.96)", "data": y_975, "color": "#f87171"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

The following code computes the CDF at several important $x$ values and displays them as a table. These values appear repeatedly in statistics and machine learning — they're worth having intuition for.

import math, json, js

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x_values = [-3.0, -2.0, -1.96, -1.0, 0.0, 1.0, 1.96, 2.0, 3.0]

rows = []
for x in x_values:
    cdf_val = norm_cdf(x)
    interpretation = ""
    if x == 0.0:
        interpretation = "Mean: exactly half below"
    elif abs(x) in (1.0, 2.0, 3.0):
        interpretation = f"{x:+.0f}σ from the mean"
    elif abs(x - 1.96) < 0.01 or abs(x + 1.96) < 0.01:
        interpretation = "95% confidence boundary"
    rows.append([f"{x:+.2f}", f"{cdf_val:.6f}", interpretation])

js.window.py_table_data = json.dumps({
    "headers": ["x", "Φ(x)", "Interpretation"],
    "rows": rows
})

print("Key insight: Φ(x) + Φ(-x) = 1 for all x (by symmetry)")
print(f"Verify: Φ(1.96) + Φ(-1.96) = {norm_cdf(1.96):.6f} + {norm_cdf(-1.96):.6f} = {norm_cdf(1.96) + norm_cdf(-1.96):.6f}")
💡 The CDF gives us probabilities for ranges too. Since $P(a \leq X \leq b) = \Phi(b) - \Phi(a)$, we can verify the 68-95-99.7 rule: $\Phi(1) - \Phi(-1) \approx 0.8413 - 0.1587 = 0.6827$, or about 68.3% within $\pm 1\sigma$. For $\pm 2\sigma$: $\Phi(2) - \Phi(-2) \approx 0.9772 - 0.0228 = 0.9545$, or 95.4%.
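
The same arithmetic as runnable code — a small sketch computing $\Phi(b) - \Phi(a)$ with `math.erf`:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_between(a, b):
    """P(a <= X <= b) for X ~ N(0, 1), as a CDF difference."""
    return norm_cdf(b) - norm_cdf(a)

for k in (1, 2, 3):
    print(f"P(-{k} <= X <= {k}) = {prob_between(-k, k):.4f}")
# P(-1 <= X <= 1) = 0.6827
# P(-2 <= X <= 2) = 0.9545
# P(-3 <= X <= 3) = 0.9973
```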

The Quantile Function (Inverse CDF)

The CDF answers "what fraction of the distribution lies below $x$?" But sometimes we need the reverse question: "at what value of $x$ does a given fraction $p$ of the distribution lie below it?" For example, what value separates the bottom 97.5% from the top 2.5%? This is what the quantile function (also called the inverse CDF or percent-point function) provides.

For the standard normal, the quantile function is written $\Phi^{-1}(p)$. Given a probability $p \in (0, 1)$, it returns the $z$-score where $\Phi(z) = p$ — the point where exactly $p$ fraction of the distribution lies to its left.

Some key examples:

  • $\Phi^{-1}(0.5) = 0$ — the median. Half the distribution lies below 0, which makes sense because the standard normal is symmetric about 0.
  • $\Phi^{-1}(0.975) \approx 1.96$ — the 97.5th percentile. This is the value above which only 2.5% of the distribution lies. It's the right boundary of the 95% confidence interval.
  • $\Phi^{-1}(0.025) \approx -1.96$ — the 2.5th percentile. Only 2.5% of the distribution lies below this value. It's the left boundary of the 95% confidence interval.

Now for the boundary behaviour:

  • $\Phi^{-1}(0) = -\infty$: no finite value captures 0% of the distribution. You'd have to go infinitely far left to have zero accumulated probability.
  • $\Phi^{-1}(1) = +\infty$: you'd need to include the entire infinite real line to capture 100% of the distribution.
  • $\Phi^{-1}(0.5) = 0$: the standard normal is symmetric about 0, so the median equals the mean.

This function has a direct and important application in deep learning. QLoRA's NF4 quantization uses the quantile function to design a 4-bit data type optimised for normally-distributed weights. The idea: divide the probability range $[0, 1]$ into $2^b = 16$ equal-probability bins and place each quantization level at the midpoint quantile of its bin:

$$q_i = \Phi^{-1}\!\left(\frac{2i + 1}{2 \cdot 2^b}\right)$$

This places levels densely near $\mu = 0$ (where most weights cluster) and sparsely in the tails (where few weights exist), minimising quantization error under the assumption that weights are approximately normally distributed. (The formula above is the core idea in simplified form; the actual NF4 data type adds refinements, such as an asymmetric construction that guarantees an exact zero level.)

The plot below shows the quantile function — an inverse S-curve. It's the CDF reflected across the line $y = x$: where the CDF maps $x$-values to probabilities, the quantile function maps probabilities back to $x$-values. Notice how it's steep near $p = 0$ and $p = 1$ (a small change in probability corresponds to a large change in $x$ in the tails) and relatively flat near $p = 0.5$ (probability accumulates quickly near the centre, so quantiles there sit close together).

import math, json, js

def norm_ppf(p):
    """Approximate inverse CDF of standard normal (Abramowitz & Stegun)."""
    if p <= 0:
        return -4.0  # clamp for plotting; the true limit is -infinity
    if p >= 1:
        return 4.0   # clamp for plotting; the true limit is +infinity
    if p > 0.5:
        return -norm_ppf(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    return -(t - (c0 + c1*t + c2*t*t) / (1 + d1*t + d2*t*t + d3*t*t*t))

# Probabilities from 0.01 to 0.99
p_vals = [i * 0.005 for i in range(2, 199)]  # 0.01 to 0.99
z_vals = [norm_ppf(p) for p in p_vals]

plot_data = [
    {
        "title": "Standard Normal Quantile Function Φ⁻¹(p)",
        "x_label": "Probability p",
        "y_label": "z-score (Φ⁻¹(p))",
        "x_data": p_vals,
        "lines": [
            {"label": "Φ⁻¹(p)", "data": z_vals, "color": "#ef4444"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

The following code computes quantile values at key probabilities and shows how NF4 uses them to place its 16 quantization levels:

import math, json, js

def norm_ppf(p):
    """Approximate inverse CDF of standard normal (Abramowitz & Stegun)."""
    if p <= 0: return -4.0
    if p >= 1: return 4.0
    if p > 0.5:
        return -norm_ppf(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    return -(t - (c0 + c1*t + c2*t*t) / (1 + d1*t + d2*t*t + d3*t*t*t))

# Key quantile values
rows_key = []
for p in [0.025, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.975]:
    z = norm_ppf(p)
    rows_key.append([f"{p:.3f}", f"{z:+.4f}", f"{p*100:.1f}th percentile"])

# NF4 levels
rows_nf4 = []
for i in range(16):
    p = (2 * i + 1) / 32
    z = norm_ppf(p)
    rows_nf4.append([str(i), f"{p:.5f}", f"{z:+.4f}"])

# Combine into two tables displayed sequentially
js.window.py_table_data = json.dumps({
    "headers": ["Probability p", "Φ⁻¹(p)", "Meaning"],
    "rows": rows_key
})

print("\nNF4 quantization levels (16 levels for 4-bit):")
print(f"{'Level':>5} {'Prob p':>10} {'Φ⁻¹(p)':>10}")
print("-" * 28)
for row in rows_nf4:
    print(f"{row[0]:>5} {row[1]:>10} {row[2]:>10}")
💡 The quantile function is the mathematical bridge between "equal probability bins" and "where to place quantization levels." Because the normal distribution concentrates its mass near the mean, equal-probability bins translate to tightly-spaced levels near zero and widely-spaced levels in the tails — exactly the design that minimises quantization error for normally-distributed weights.

Connecting It All Together

The PDF, CDF, and quantile function form a chain, each derived from the previous:

  • PDF $f(x)$: "How dense is probability at this point?" This is the starting point. It tells you where values are likely to cluster, but its values are densities, not probabilities.
  • CDF $F(x) = \int_{-\infty}^{x} f(t) \, dt$: "How much total probability lies to the left of this point?" Obtained by integrating the PDF. It converts density information into cumulative probabilities.
  • Quantile $F^{-1}(p)$: "At what point has this much probability accumulated?" Obtained by inverting the CDF. It converts probabilities back into values on the original scale.

These three functions are different views of the same underlying distribution. The CDF is the integral of the PDF. The PDF is the derivative of the CDF. The quantile function is the functional inverse of the CDF. Knowing any one of them fully determines the other two.
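
The derivative relationship can be checked numerically — a sketch comparing the PDF against a central finite difference of the CDF (the step size `h` is an arbitrary small choice for illustration):

```python
import math

def norm_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# PDF = derivative of CDF: (Φ(x+h) - Φ(x-h)) / 2h should match f(x)
h = 1e-5
for x in (-1.0, 0.0, 1.5):
    deriv = (norm_cdf(x + h) - norm_cdf(x - h)) / (2 * h)
    print(f"x={x:+.1f}  f(x)={norm_pdf(x):.6f}  dΦ/dx≈{deriv:.6f}")
```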

The plots below show all three functions for the standard normal distribution: the PDF and CDF share one chart, while the quantile function gets its own (its input is a probability, not an $x$-value). The blue PDF peaks at $x = 0$ with density $\approx 0.399$. The purple CDF passes through 0.5 at $x = 0$ (half the probability accumulated). And the red quantile function shows the inverse mapping from probabilities back to $z$-scores.

import math, json, js

def norm_pdf(x, mu=0, sigma=1):
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    if p <= 0: return -4.0
    if p >= 1: return 4.0
    if p > 0.5:
        return -norm_ppf(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    return -(t - (c0 + c1*t + c2*t*t) / (1 + d1*t + d2*t*t + d3*t*t*t))

x_vals = [i * 0.05 for i in range(-80, 81)]  # -4.0 to 4.0

y_pdf = [norm_pdf(x) for x in x_vals]
y_cdf = [norm_cdf(x) for x in x_vals]

# The quantile function takes a probability (not an x-value) as input,
# so it is plotted on its own chart with p on the horizontal axis.

plot_data = [
    {
        "title": "PDF and CDF of the Standard Normal Distribution",
        "x_label": "x",
        "y_label": "Value",
        "x_data": x_vals,
        "lines": [
            {"label": "PDF f(x)", "data": y_pdf, "color": "#3b82f6"},
            {"label": "CDF Φ(x)", "data": y_cdf, "color": "#8b5cf6"}
        ]
    },
    {
        "title": "Quantile Function Φ⁻¹(p)",
        "x_label": "Probability p",
        "y_label": "z-score",
        "x_data": [i * 0.005 for i in range(2, 199)],
        "lines": [
            {"label": "Φ⁻¹(p)", "data": [norm_ppf(i * 0.005) for i in range(2, 199)], "color": "#ef4444"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

The table below shows the correspondence between all three functions at key points, making the chain of relationships concrete:

import math, json, js

def norm_pdf(x, mu=0, sigma=1):
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coeff * math.exp(exponent)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf(p):
    if p <= 0: return -4.0
    if p >= 1: return 4.0
    if p > 0.5:
        return -norm_ppf(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    return -(t - (c0 + c1*t + c2*t*t) / (1 + d1*t + d2*t*t + d3*t*t*t))

x_values = [-3.0, -1.96, -1.0, 0.0, 1.0, 1.96, 3.0]
rows = []
for x in x_values:
    f_x = norm_pdf(x)
    phi_x = norm_cdf(x)
    # Verify round-trip: ppf(cdf(x)) should equal x
    roundtrip = norm_ppf(phi_x)
    rows.append([
        f"{x:+.2f}",
        f"{f_x:.4f}",
        f"{phi_x:.4f}",
        f"{roundtrip:+.4f}"
    ])

js.window.py_table_data = json.dumps({
    "headers": ["x", "PDF f(x)", "CDF Φ(x)", "Φ⁻¹(Φ(x)) ≈ x"],
    "rows": rows
})

print("The last column verifies the round-trip: applying Φ⁻¹ to Φ(x) recovers x.")
print("Small discrepancies are due to the rational approximation used for Φ⁻¹.")
💡 The round-trip property $\Phi^{-1}(\Phi(x)) = x$ is what makes these functions inverses. In practice, the rational approximation for $\Phi^{-1}$ introduces tiny errors (on the order of $10^{-4}$), but for applications like NF4 quantization these errors are negligible — the quantization levels don't need to be exact to sub-thousandth precision.

Quiz

Test your understanding of probability distributions, PDFs, CDFs, and the normal distribution.

If a probability density function has f(x) = 2.5 at some point x, is this a valid PDF?

What is Φ(0) for the standard normal distribution N(0, 1)?

What does Φ⁻¹(0.975) approximately equal for the standard normal?

Why does QLoRA use the quantile function Φ⁻¹ to compute NF4 quantization levels?