What If We Could Guess Multiple Tokens at Once?
Every optimisation we've covered so far — KV caching, quantisation, continuous batching — accepts a fundamental constraint: autoregressive decoding generates one token per forward pass of the large model. Each forward pass reads ALL model weights from GPU memory, performs a small matrix-vector multiply for a single token, and writes the result back. As we established in article 1, this makes decode memory-bandwidth-bound: the GPU spends most of its time shuttling weights from HBM rather than doing useful arithmetic. We've reduced the size of those weights (quantisation) and amortised the cost across requests (batching), but we're still generating one token at a time.
But what if we could break that one-token-per-pass constraint? What if a small, fast model could guess the next several tokens, and then the large model could verify all of those guesses in a single forward pass? Verifying multiple tokens at once is cheap — it looks just like prefill (processing a known sequence in one batched pass), which is compute-bound and efficient, unlike decode's memory-bound one-token-at-a-time grind.
That's speculative decoding. The small model drafts, the large model edits. Two independent papers introduced the idea nearly simultaneously: Leviathan et al. called it "speculative decoding" (Leviathan et al., 2023), while Chen et al. framed it as "speculative sampling" (Chen et al., 2023). Both arrived at the same core algorithm and, crucially, the same guarantee: the output distribution is mathematically identical to running the large model alone. There is no quality loss. It's as close to a free lunch as inference optimisation gets.
How Speculative Decoding Works
The algorithm runs in rounds. Each round has three steps: draft, verify, and accept-or-reject. Let's call the large model the target model (the one whose output quality we want) and the small model the draft model (the one that guesses ahead). We choose a speculation length $K$ — the number of tokens the draft model will guess per round.
Step 1 — Draft: the draft model generates $K$ candidate tokens autoregressively, one at a time: $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_K$. Because the draft model is small (perhaps 10-100x fewer parameters than the target), each of these $K$ forward passes is very fast. At each position $i$, the draft model also records the probability it assigned to its chosen token, $q_i(\hat{y}_i)$, from its output distribution $q_i$.
Step 2 — Verify: we append all $K$ draft tokens to the current context and feed the entire extended sequence through the target model in a single forward pass. This is the key insight: the target model processes all $K$ candidate positions in parallel, exactly like prefill. We read the model weights from memory once, and compute output distributions $p_1, p_2, \ldots, p_K$ (plus one extra position $p_{K+1}$ for the token after the last draft token) in that single pass. This one forward pass of the target model replaces what would otherwise be $K$ separate sequential forward passes.
Step 3 — Accept or reject: we scan the $K$ positions from left to right. At each position $i$, we compare the target model's probability $p_i(\hat{y}_i)$ against the draft model's probability $q_i(\hat{y}_i)$ using a rejection-sampling criterion (detailed in the next section). If the token is accepted, we move to position $i+1$. The moment a token is rejected, we stop — we resample that position from the target model's distribution (with a correction to preserve exactness), discard all subsequent draft tokens, and begin a new round from the corrected position.
The cost of one round is therefore:

$$C_{\text{round}} = K \cdot C_{\text{draft}} + C_{\text{target}}$$
where $C_{\text{draft}}$ is the cost of one draft model forward pass and $C_{\text{target}}$ is the cost of one target model forward pass. Since the draft model is much smaller, $C_{\text{draft}} \ll C_{\text{target}}$, so $K \cdot C_{\text{draft}}$ is often negligible compared to $C_{\text{target}}$. In the best case, all $K$ tokens are accepted, plus we get one bonus token from the target model's distribution at position $K+1$. That gives us $K + 1$ tokens for roughly the cost of one target forward pass — a $(K+1)\times$ speedup over standard decoding.
In the worst case, the very first draft token is rejected. We still get one token (resampled from the target model at position 1), which is exactly what standard decoding would have produced. The only overhead is the wasted draft computation $K \cdot C_{\text{draft}}$, which is small. So speculative decoding is never significantly worse than standard decoding and can be up to $(K+1)\times$ better.
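The cost accounting above can be made concrete with a toy sketch. The function name `round_speedup` and the cost ratio $c = C_{\text{draft}} / C_{\text{target}}$ below are illustrative assumptions, not figures from the papers:

```python
# Toy cost model for one speculative-decoding round (illustrative numbers,
# not measurements).
def round_speedup(tokens_per_round, K, c):
    """Tokens per round divided by round cost, where cost is measured in
    target-model forward passes and c = C_draft / C_target."""
    round_cost = K * c + 1.0  # K draft passes + 1 target pass
    return tokens_per_round / round_cost

K = 5
c = 0.05  # assume one draft pass costs 5% of one target pass

print(f"best case:  {round_speedup(K + 1, K, c):.2f}x")  # all K accepted + bonus
print(f"worst case: {round_speedup(1, K, c):.2f}x")      # first draft token rejected
```

With these assumed numbers the best case lands around 4.8x rather than the ideal $(K+1) = 6$x, and the worst case dips slightly below 1x: both gaps are exactly the wasted draft computation $K \cdot C_{\text{draft}}$.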
Here's pseudocode that captures the full loop:
import random
import numpy as np

def speculative_decode(target_model, draft_model, prompt, K, max_tokens):
    """Generate tokens using speculative decoding.

    Distributions are vectors over the vocabulary (numpy arrays);
    `sample(dist)` draws a token id from a distribution.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # ── Step 1: Draft ──────────────────────────────────────────
        draft_tokens = []
        draft_dists = []                  # full draft distributions q_i
        draft_context = list(tokens)
        for _ in range(K):
            q = draft_model.get_distribution(draft_context)
            y_hat = sample(q)
            draft_tokens.append(y_hat)
            draft_dists.append(q)
            draft_context.append(y_hat)
        # ── Step 2: Verify ─────────────────────────────────────────
        # Single forward pass: target processes all K draft tokens at once
        # and returns distributions p_1, ..., p_K, p_{K+1}
        target_dists = target_model.verify(tokens, draft_tokens)
        # ── Step 3: Accept / Reject ────────────────────────────────
        for i in range(K):
            p_i = target_dists[i][draft_tokens[i]]  # target prob of draft token
            q_i = draft_dists[i][draft_tokens[i]]   # draft prob of same token
            # Rejection sampling: accept with probability min(1, p/q)
            if random.random() < min(1.0, p_i / q_i):
                tokens.append(draft_tokens[i])
            else:
                # Resample from the corrected distribution (p - q)^+, renormalised
                corrected = np.maximum(0.0, target_dists[i] - draft_dists[i])
                corrected /= corrected.sum()
                tokens.append(sample(corrected))
                break  # discard remaining draft tokens
        else:
            # All K accepted — bonus token from p_{K+1}
            tokens.append(sample(target_dists[K]))
    return tokens
The Acceptance Criterion
The magic of speculative decoding is that despite using a cheap draft model, the final output distribution is provably identical to what the target model would produce on its own. No approximation, no quality loss. This guarantee comes from the acceptance criterion, which is a form of rejection sampling — a classical technique from computational statistics adapted here for autoregressive generation.
At each draft position $i$, the draft model proposed token $\hat{y}_i$ with probability $q_i(\hat{y}_i)$ from its distribution $q_i$, and the target model assigns probability $p_i(\hat{y}_i)$ from its distribution $p_i$ (computed during the verification pass). The acceptance probability is:

$$a_i = \min\left(1, \frac{p_i(\hat{y}_i)}{q_i(\hat{y}_i)}\right)$$
Let's examine the boundary cases. When $p_i(\hat{y}_i) \geq q_i(\hat{y}_i)$ — the target model assigns equal or higher probability to the drafted token than the draft model did — the ratio $p/q \geq 1$, so $\min(1, p/q) = 1$. The token is always accepted . This happens when the draft model and target model agree, or when the target is even more confident about this token than the draft was. There's no reason to reject a token the target model likes.
When $p_i(\hat{y}_i) < q_i(\hat{y}_i)$ — the target model assigns lower probability than the draft model — the ratio $p/q < 1$, so we accept with probability $p/q$. The draft model was overconfident about this token relative to the target, and we reject proportionally more often. If the draft assigned probability 0.8 but the target assigns 0.2, we accept with probability $0.2/0.8 = 0.25$ and reject 75% of the time. If the target assigns probability 0.0 (this token should never appear), we always reject ($0/0.8 = 0$). Conversely, if $p$ and $q$ are close, say 0.3 and 0.4, acceptance probability is $0.3/0.4 = 0.75$ — fairly high, reflecting mild disagreement.
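The worked examples above are one-liners in code — a trivial sketch of the criterion (`accept_prob` is an illustrative name, not an API from the papers):

```python
def accept_prob(p, q):
    """Probability of accepting a drafted token: min(1, p/q)."""
    return min(1.0, p / q)

print(f"{accept_prob(0.2, 0.8):.2f}")  # 0.25 — draft was overconfident
print(f"{accept_prob(0.0, 0.8):.2f}")  # 0.00 — target forbids the token
print(f"{accept_prob(0.3, 0.4):.2f}")  # 0.75 — mild disagreement
print(f"{accept_prob(0.5, 0.3):.2f}")  # 1.00 — target more confident: always accept
```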
When a token is rejected, we don't just sample from the target distribution $p_i$ directly (that would bias toward tokens the draft model already favoured). Instead, we sample from a corrected distribution that removes the probability mass the acceptance step has already had its chance to cover:

$$p'_i(y) = \frac{\max\left(0,\, p_i(y) - q_i(y)\right)}{\sum_{y'} \max\left(0,\, p_i(y') - q_i(y')\right)}$$
This corrected distribution upweights tokens where the target model assigns more probability than the draft model and completely zeroes out tokens where the draft model was more confident. The $(p - q)^+$ numerator captures the "residual" probability mass that the draft model underweighted, and normalisation turns it into a valid distribution. The mathematical proof that this combination of acceptance criterion plus corrected resampling preserves the exact target distribution can be found in both original papers — the intuition is that it's a standard rejection-sampling scheme where the proposal distribution is $q$ and the target distribution is $p$.
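Here is a minimal sketch of the corrected resampling step over a toy four-token vocabulary (the probability values are made up for illustration):

```python
import numpy as np

def corrected_distribution(p, q):
    """Residual distribution (p - q)^+ renormalised, used after a rejection."""
    residual = np.maximum(0.0, p - q)  # zero wherever the draft was more confident
    return residual / residual.sum()

p = np.array([0.4, 0.3, 0.2, 0.1])  # target distribution (toy numbers)
q = np.array([0.1, 0.1, 0.4, 0.4])  # draft distribution (toy numbers)
r = corrected_distribution(p, q)
print(r)  # ≈ [0.6, 0.4, 0.0, 0.0] — mass only where p > q
```

Note the degenerate case $p = q$ makes the residual sum to zero, but then the acceptance probability is 1 everywhere and a rejection never occurs, so the division is never reached.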
How many tokens do we expect to accept per round? If we denote the per-position acceptance rate as $\alpha_j$ (averaged over the token distribution), the expected number of accepted tokens is:

$$\mathbb{E}[\text{accepted}] = \sum_{i=1}^{K} \prod_{j=1}^{i} \alpha_j$$
This formula reflects the left-to-right sequential structure of acceptance: to accept position $i$, we must have also accepted all positions $1, 2, \ldots, i-1$. The product $\prod_{j=1}^{i} \alpha_j$ is the probability that the first $i$ tokens are all accepted, and summing over $i$ gives the expected count.
Let's check the boundaries. If the draft model is perfect ($q = p$ everywhere), then $\alpha_j = \min(1, p/q) = 1$ for all $j$, every product is 1, and the sum equals $K$. All $K$ tokens are always accepted, yielding a $(K+1)\times$ speedup (including the bonus token). This is the theoretical maximum, achieved when the draft model perfectly mimics the target.
If the draft model is terrible ($q$ very different from $p$, so $\alpha_j \approx 0$), then even the first product $\alpha_1 \approx 0$, the sum approaches 0, and we accept almost nothing. We still get 1 token per round (the resampled correction), which is what standard decoding produces. The only penalty is the wasted draft computation. So the speedup degrades gracefully from $(K+1)\times$ down toward $1\times$ as draft quality decreases — it never makes things significantly worse.
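The expectation formula is easy to evaluate directly. A minimal sketch, with assumed per-position acceptance rates that decay with lookahead distance (the rates are illustrative, not measured):

```python
from itertools import accumulate
from operator import mul

def expected_accepted(alphas):
    """E[accepted] = sum over i of prod_{j<=i} alpha_j."""
    return sum(accumulate(alphas, mul))

# Assumed per-position rates: guesses get harder further ahead
print(f"{expected_accepted([0.9, 0.8, 0.7, 0.6, 0.5]):.2f}")  # 2.58
```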
For a simplified analysis where $\alpha_j = \alpha$ is constant across positions, the expected accepted count becomes the geometric series:

$$\mathbb{E}[\text{accepted}] = \sum_{i=1}^{K} \alpha^i = \frac{\alpha\,(1 - \alpha^K)}{1 - \alpha}$$
At $\alpha = 0.9$ (a well-matched draft/target pair) with $K = 5$, this gives $0.9 + 0.81 + 0.729 + 0.6561 + 0.5905 \approx 3.69$ tokens accepted per round, plus the bonus token from the target model, for roughly 4.69 tokens per round — close to a $5\times$ speedup if we ignore draft overhead. At $\alpha = 0.5$ (mediocre draft) with $K = 5$, we get $0.5 + 0.25 + 0.125 + 0.0625 + 0.03125 \approx 0.97$ tokens, so about 2 tokens per round including the bonus — roughly $2\times$.
import json, js

K = 5  # speculation length

# Sweep alpha from 0 to 1
alphas = [i / 100 for i in range(0, 101)]
expected_accepted = []
speedups = []
for alpha in alphas:
    # E[accepted] = sum of powers alpha^1, alpha^2, ..., alpha^K
    total = sum(alpha**i for i in range(1, K + 1))
    expected_accepted.append(round(total, 3))
    # Total tokens per round = accepted + 1 (bonus token)
    # Speedup = tokens per round vs 1 token in standard decoding
    speedups.append(round(total + 1, 3))

# Sample every 5 points for readability
step = 5
x_labels = [f"{alphas[i]:.2f}" for i in range(0, len(alphas), step)]
y_accepted = [expected_accepted[i] for i in range(0, len(alphas), step)]
y_speedup = [speedups[i] for i in range(0, len(alphas), step)]

plot_data = [
    {
        "title": f"Speculative Decoding Speedup (K={K})",
        "x_label": "Acceptance Rate (alpha)",
        "y_label": "Tokens per Round",
        "x_data": x_labels,
        "lines": [
            {"label": "Expected Accepted Tokens", "data": y_accepted, "color": "#3b82f6"},
            {"label": "Total Tokens (accepted + bonus)", "data": y_speedup, "color": "#22c55e"},
        ],
    }
]
js.window.py_plot_data = json.dumps(plot_data)

print(f"Speculation length K = {K}")
print(f"alpha=0.0: {expected_accepted[0]:.1f} accepted, {speedups[0]:.1f}x total (worst case)")
print(f"alpha=0.5: {expected_accepted[50]:.2f} accepted, {speedups[50]:.2f}x total")
print(f"alpha=0.7: {expected_accepted[70]:.2f} accepted, {speedups[70]:.2f}x total")
print(f"alpha=0.9: {expected_accepted[90]:.2f} accepted, {speedups[90]:.2f}x total")
print(f"alpha=1.0: {expected_accepted[100]:.1f} accepted, {speedups[100]:.1f}x total (best case)")
The plot shows the nonlinear relationship between draft quality and speedup. At $\alpha = 0.7$, we already get roughly $2.9\times$ total tokens per round — a substantial gain. But below $\alpha = 0.5$, returns diminish sharply, and the speedup approaches the $1\times$ baseline. This curve makes the choice of draft model critically important: a small improvement in acceptance rate translates into a large improvement in throughput.
Choosing the Draft Model
The draft model must satisfy two competing requirements: it must be fast enough that $K$ draft forward passes don't dominate the cost of one target forward pass, and it must be accurate enough that a high fraction of its guesses are accepted. If the draft model is too slow, the overhead of drafting erodes the savings from parallel verification. If it's too inaccurate, most tokens get rejected and we fall back to essentially standard decoding.
The most common approach is to use a smaller model from the same family . For example, LLaMA-7B or LLaMA-13B can serve as a draft model for LLaMA-70B. Models from the same family share the same tokenizer and were trained on similar data, so their output distributions tend to agree on high-probability tokens (common continuations, factual completions, syntactic patterns). A smaller model from a completely different family with a different tokenizer won't work — the vocabulary mismatch alone makes comparison meaningless.
A second option is to use a quantised version of the target model itself as the draft model. If the target runs in FP16, a heavily quantised 2-bit or 3-bit version of the same model can serve as the drafter. The distributions are naturally well-aligned because it's the same model (just compressed), and as we covered in the quantisation article, aggressive quantisation dramatically reduces model size and memory bandwidth requirements, making each draft forward pass very cheap.
Other approaches go even simpler. An n-gram model or a retrieval-based lookup table can predict the next few tokens almost instantly (no neural network forward pass at all). These work well in repetitive domains like code generation, where the next several tokens are often predictable from local context (closing brackets, variable names, boilerplate patterns). Their acceptance rate is lower on creative or open-ended text, but the drafting cost is near zero, so even a moderate acceptance rate yields a net speedup.
Several factors affect the acceptance rate beyond just model architecture:
- Domain alignment: a draft model fine-tuned on the same domain as the target model's typical input will have much higher acceptance rates. A code-specialised draft model will agree with a code-specialised target model far more often than a generic one would.
- Temperature: lower sampling temperatures make both distributions more peaked (concentrated on the top token), which increases the probability that draft and target agree. At temperature 0 (greedy decoding), both models pick their argmax — if they agree on the top token, acceptance is certain. At high temperatures, the distributions spread out and disagreements become more frequent.
- Context predictability: some parts of a generation are more predictable than others. Function signatures, closing tags, and formulaic phrases have near-certain continuations that any reasonable draft model will get right. Creative completions, rare factual claims, and reasoning steps are harder to predict.
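The temperature effect can be checked numerically. For speculative sampling, the expected acceptance rate at a position works out to $\sum_y \min(p(y), q(y))$, the overlap between the two distributions (this identity comes from the original papers' analysis). Below is a sketch with made-up four-token logits where both models agree on the top token but rank the runners-up differently:

```python
import math

def softmax(logits, temperature):
    z = [l / temperature for l in logits]
    m = max(z)                             # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def acceptance_rate(p, q):
    """Expected acceptance E_q[min(1, p/q)] = sum_y min(p(y), q(y))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

target_logits = [3.0, 1.0, 2.0, 0.0]  # hypothetical logits, shared argmax
draft_logits  = [3.0, 2.0, 1.0, 0.0]

for T in (0.25, 0.5, 1.0):
    p = softmax(target_logits, T)
    q = softmax(draft_logits, T)
    print(f"T={T}: alpha = {acceptance_rate(p, q):.3f}")
```

One caveat: at very high temperatures both distributions flatten toward uniform, so the overlap rises again; the monotone "lower temperature, higher acceptance" trend holds in the regime where sharpening concentrates both models on their shared top token.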
In practice, the speed vs acceptance tradeoff means there's an optimal draft model size for each target model and workload. A very tiny draft (e.g. a 100M-parameter model drafting for a 70B target) has near-zero drafting cost but a low acceptance rate of perhaps 40-50%. A larger draft (e.g. 7B for 70B) has a higher acceptance rate of perhaps 70-80%, but its forward passes are no longer negligible — $K = 5$ draft passes through a 7B model is a meaningful fraction of one 70B pass. The sweet spot depends on the hardware, the domain, and the target model size. Typical end-to-end speedups reported in the literature range from 2-3x for well-matched draft/target pairs, with some configurations achieving up to 4x on predictable domains.
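This tradeoff can be sketched by combining the constant-$\alpha$ expectation from earlier with the round-cost model. The acceptance rates and cost ratios below are illustrative guesses for hypothetical draft sizes, not benchmarks:

```python
def effective_speedup(alpha, K, c):
    """Tokens per round divided by round cost (in target-pass units).
    alpha: per-position acceptance rate; c = C_draft / C_target."""
    accepted = sum(alpha**i for i in range(1, K + 1))
    return (accepted + 1) / (1 + K * c)  # +1 for the bonus token

K = 5
# Hypothetical draft options for a 70B-class target (made-up numbers)
for name, alpha, c in [("100M draft", 0.45, 0.002),
                       ("1B draft",   0.65, 0.02),
                       ("7B draft",   0.80, 0.10)]:
    print(f"{name}: {effective_speedup(alpha, K, c):.2f}x")
```

With these assumed numbers the 1B and 7B drafts come out nearly tied: the 7B draft accepts more tokens per round, but its higher drafting cost claws most of that back — exactly the sweet-spot behaviour described above.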
Self-Speculative Decoding and Medusa
The biggest practical headache with speculative decoding is maintaining a separate draft model. You need to load both models into GPU memory simultaneously, keep their tokenizers compatible, and figure out which draft model pairs well with your target. What if we could make the target model speculate about its own future tokens, eliminating the need for a separate model entirely?
Medusa (Cai et al., 2024) takes exactly this approach. Instead of using a separate draft model, Medusa adds multiple prediction heads to the target model itself. The standard language model head predicts the next token (position $t+1$). Medusa adds additional lightweight heads — typically small MLPs — that predict tokens at positions $t+2$, $t+3$, $t+4$, and so on, all from the same hidden state at position $t$. These extra heads are trained while the main model weights are either frozen or lightly fine-tuned, so the base model's quality is preserved.
Each Medusa head independently proposes candidate tokens at its respective position. With, say, 3 candidates per head across 4 heads, that creates a tree of candidate sequences rather than a single linear draft. Medusa then uses tree attention to verify all candidate paths in a single forward pass. The attention mask is structured so that each candidate token only attends to its own ancestral path in the tree, not to sibling branches. This lets the model evaluate many possible continuations simultaneously, and the best-matching path is accepted using the same rejection-sampling criterion from standard speculative decoding.
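To make tree attention concrete, here's a minimal sketch of the ancestor mask it relies on. The tree encoding (a flat list of parent indices) and the function name are illustrative assumptions, not Medusa's actual API:

```python
def tree_attention_mask(parents):
    """parents[i] = index of candidate node i's parent, or -1 if it hangs
    directly off the committed context. Returns an NxN boolean mask where
    mask[i][j] is True iff j is i itself or an ancestor of i."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up the ancestral path
            mask[i][j] = True
            j = parents[j]
    return mask

# Two candidates for position t+1 (nodes 0, 1); nodes 2 and 3 extend
# node 0, node 4 extends node 1 — siblings never see each other.
parents = [-1, -1, 0, 0, 1]
for row in tree_attention_mask(parents):
    print("".join("X" if v else "." for v in row))
```

In a real implementation this mask is combined with the ordinary causal mask over the committed context, so every candidate also attends to all previously accepted tokens.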
The advantage of Medusa is significant: no separate draft model to load, no tokenizer compatibility issues, and the prediction heads are naturally well-aligned with the target model (they share the same hidden representations). The downside is that training the extra heads requires some compute, and the heads' predictions are based on a single hidden state rather than the autoregressively refined states a draft model would produce, which can limit accuracy for positions far ahead.
EAGLE (Li et al., 2024) takes a different self-speculative approach. Instead of independent heads per position, EAGLE uses a lightweight feature-based predictor — a small autoregressive module that operates on the target model's hidden features (rather than raw token embeddings) to predict future tokens. Because it autoregressively conditions each predicted token on the features of previous predictions, EAGLE captures inter-token dependencies that Medusa's independent heads miss. EAGLE-2 further improves on this by dynamically adjusting the speculation tree structure based on confidence scores, expanding branches that look promising and pruning ones that don't. The authors report speedups of 3-4x while maintaining the exact target distribution.
A conceptually distinct approach is lookahead decoding, which uses Jacobi iteration to generate multiple tokens in parallel without any draft model or extra heads at all. The idea is to initialise multiple future token positions with random guesses, then iteratively refine all positions in parallel using the target model. In each iteration, some positions may "converge" to their correct values (matching what sequential decoding would produce), at which point they're locked in and the window advances. This is mathematically equivalent to running Jacobi iteration on the autoregressive fixed-point equation, treating the sequential token dependencies as a system to be solved in parallel. It's slower to converge than speculative decoding when a good draft model is available, but it requires zero additional components beyond the target model itself.
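The Jacobi fixed-point idea can be illustrated with a toy deterministic "model" — a made-up arithmetic rule standing in for greedy decoding, not a real LLM: initialise guesses, update every position in parallel from the current guesses, and stop when nothing changes. The fixed point matches what sequential decoding would produce.

```python
def next_token(prefix):
    """Toy deterministic 'model': next token = (sum of prefix) mod 7."""
    return sum(prefix) % 7

def jacobi_decode(prompt, n_tokens):
    guesses = [0] * n_tokens            # arbitrary initial guesses
    iterations = 0
    while True:
        iterations += 1
        seq = list(prompt) + guesses
        # Update every position in parallel from the *current* guesses
        new = [next_token(seq[:len(prompt) + i]) for i in range(n_tokens)]
        if new == guesses:              # fixed point reached
            return guesses, iterations
        guesses = new

tokens, iters = jacobi_decode([1, 2, 3], 5)
print(tokens, iters)  # tokens match sequential decoding with next_token
```

Each iteration the leftmost unconverged position locks in (it depends only on already-correct tokens), so convergence is guaranteed in at most $n+1$ iterations; real lookahead decoding beats sequential decoding by reusing n-grams observed across iterations, which this toy omits.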
These self-speculative approaches avoid the fundamental question of "which draft model do I use?" entirely. In practice, EAGLE and Medusa have become popular choices for production deployment because they integrate seamlessly with existing serving infrastructure: you load one model (with a few extra parameters for the speculation heads) and get 2-4x throughput improvement with no changes to the serving architecture.
Quiz
Test your understanding of speculative decoding and its variants.
Why is verifying $K$ draft tokens in one target model forward pass much cheaper than generating $K$ tokens sequentially?
If the draft model proposes a token with probability $q = 0.6$ and the target model assigns it probability $p = 0.3$, what is the acceptance probability?
What happens in the worst case when speculative decoding rejects the very first draft token?
What advantage does Medusa have over standard two-model speculative decoding?