What If We Could Guess Multiple Tokens at Once?
Every optimisation we've covered so far — KV caching, quantisation, continuous batching — accepts a fundamental constraint: autoregressive decoding generates one token per forward pass of the large model. Each forward pass reads ALL model weights from GPU memory, performs a small matrix-vector multiply for a single token, and writes the result back. As we established in article 1, this makes decode memory-bandwidth-bound: the GPU spends most of its time shuttling weights from HBM rather than doing useful arithmetic. We've reduced the size of those weights (quantisation) and amortised the cost across requests (batching), but we're still generating one token at a time.
But what if we could break that one-token-per-pass constraint? What if a small, fast model could guess the next several tokens, and then the large model could verify all of those guesses in a single forward pass? Verifying multiple tokens at once is cheap — it looks just like prefill (processing a known sequence in one batched pass), which is compute-bound and efficient, unlike decode's memory-bound one-token-at-a-time grind.
That's speculative decoding. The small model drafts, the large model edits. Two independent papers introduced the idea nearly simultaneously: Leviathan et al. called it "speculative decoding" (Leviathan et al., 2023), while Chen et al. framed it as "speculative sampling" (Chen et al., 2023). Both arrived at the same core algorithm and, crucially, the same guarantee: the output distribution is mathematically identical to running the large model alone. There is no quality loss. It's as close to a free lunch as inference optimisation gets.
How Speculative Decoding Works
The algorithm runs in rounds. Each round has three steps: draft, verify, and accept-or-reject. Let's call the large model the target model (the one whose output quality we want) and the small model the draft model (the one that guesses ahead). We choose a speculation length $K$ — the number of tokens the draft model will guess per round.
Step 1 — Draft: the draft model generates $K$ candidate tokens autoregressively, one at a time: $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_K$. Because the draft model is small (perhaps 10-100x fewer parameters than the target), each of these $K$ forward passes is very fast. At each position $i$, the draft model also records the probability it assigned to its chosen token, $q_i(\hat{y}_i)$, from its output distribution $q_i$.
Step 2 — Verify: we append all $K$ draft tokens to the current context and feed the entire extended sequence through the target model in a single forward pass. This is the key insight: the target model processes all $K$ candidate positions in parallel, exactly like prefill. We read the model weights from memory once, and compute output distributions $p_1, p_2, \ldots, p_K$ (plus one extra position $p_{K+1}$ for the token after the last draft token) in that single pass. This one forward pass of the target model replaces what would otherwise be $K$ separate sequential forward passes.
Step 3 — Accept or reject: we scan the $K$ positions from left to right. At each position $i$, we compare the target model's probability $p_i(\hat{y}_i)$ against the draft model's probability $q_i(\hat{y}_i)$ using a rejection-sampling criterion (detailed in the next section). If the token is accepted, we move to position $i+1$. The moment a token is rejected, we stop — we resample that position from the target model's distribution (with a correction to preserve exactness), discard all subsequent draft tokens, and begin a new round from the corrected position.
The cost of one round is therefore:

$$C_{\text{round}} = K \cdot C_{\text{draft}} + C_{\text{target}}$$
where $C_{\text{draft}}$ is the cost of one draft model forward pass and $C_{\text{target}}$ is the cost of one target model forward pass. Since the draft model is much smaller, $C_{\text{draft}} \ll C_{\text{target}}$, so $K \cdot C_{\text{draft}}$ is often negligible compared to $C_{\text{target}}$. In the best case, all $K$ tokens are accepted, plus we get one bonus token from the target model's distribution at position $K+1$. That gives us $K + 1$ tokens for roughly the cost of one target forward pass — a $(K+1)\times$ speedup over standard decoding.
In the worst case, the very first draft token is rejected. We still get one token (resampled from the target model at position 1), which is exactly what standard decoding would have produced. The only overhead is the wasted draft computation $K \cdot C_{\text{draft}}$, which is small. So speculative decoding is never significantly worse than standard decoding and can be up to $(K+1)\times$ better.
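The cost accounting above can be made concrete with a toy sketch. The function name `round_speedup` and the cost ratio $c = C_{\text{draft}} / C_{\text{target}}$ below are illustrative assumptions, not figures from the papers:

```python
# Toy cost model for one speculative-decoding round (illustrative numbers,
# not measurements).
def round_speedup(tokens_per_round, K, c):
    """Tokens per round divided by round cost, where cost is measured in
    target-model forward passes and c = C_draft / C_target."""
    round_cost = K * c + 1.0  # K draft passes + 1 target pass
    return tokens_per_round / round_cost

K = 5
c = 0.05  # assume one draft pass costs 5% of one target pass

print(f"best case:  {round_speedup(K + 1, K, c):.2f}x")  # all K accepted + bonus
print(f"worst case: {round_speedup(1, K, c):.2f}x")      # first draft token rejected
```

With these assumed numbers the best case lands around 4.8x rather than the ideal $(K+1) = 6$x, and the worst case dips slightly below 1x: both gaps are exactly the wasted draft computation $K \cdot C_{\text{draft}}$.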
Here's pseudocode that captures the full loop:
import random
import numpy as np

def speculative_decode(target_model, draft_model, prompt, K, max_tokens):
    """Generate tokens using speculative decoding.

    Distributions are vectors over the vocabulary (numpy arrays);
    `sample(dist)` draws a token id from a distribution.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # ── Step 1: Draft ──────────────────────────────────────────
        draft_tokens = []
        draft_dists = []                  # full draft distributions q_i
        draft_context = list(tokens)
        for _ in range(K):
            q = draft_model.get_distribution(draft_context)
            y_hat = sample(q)
            draft_tokens.append(y_hat)
            draft_dists.append(q)
            draft_context.append(y_hat)
        # ── Step 2: Verify ─────────────────────────────────────────
        # Single forward pass: target processes all K draft tokens at once
        # and returns distributions p_1, ..., p_K, p_{K+1}
        target_dists = target_model.verify(tokens, draft_tokens)
        # ── Step 3: Accept / Reject ────────────────────────────────
        for i in range(K):
            p_i = target_dists[i][draft_tokens[i]]  # target prob of draft token
            q_i = draft_dists[i][draft_tokens[i]]   # draft prob of same token
            # Rejection sampling: accept with probability min(1, p/q)
            if random.random() < min(1.0, p_i / q_i):
                tokens.append(draft_tokens[i])
            else:
                # Resample from the corrected distribution (p - q)^+, renormalised
                corrected = np.maximum(0.0, target_dists[i] - draft_dists[i])
                corrected /= corrected.sum()
                tokens.append(sample(corrected))
                break  # discard remaining draft tokens
        else:
            # All K accepted — bonus token from p_{K+1}
            tokens.append(sample(target_dists[K]))
    return tokens
The Acceptance Criterion
The magic of speculative decoding is that despite using a cheap draft model, the final output distribution is provably identical to what the target model would produce on its own. No approximation, no quality loss. This guarantee comes from the acceptance criterion, which is a form of rejection sampling — a classical technique from computational statistics adapted here for autoregressive generation.
At each draft position $i$, the draft model proposed token $\hat{y}_i$ with probability $q_i(\hat{y}_i)$ from its distribution $q_i$, and the target model assigns probability $p_i(\hat{y}_i)$ from its distribution $p_i$ (computed during the verification pass). The acceptance probability is:

$$a_i = \min\left(1, \frac{p_i(\hat{y}_i)}{q_i(\hat{y}_i)}\right)$$
Let's examine the boundary cases. When $p_i(\hat{y}_i) \geq q_i(\hat{y}_i)$ — the target model assigns equal or higher probability to the drafted token than the draft model did — the ratio $p/q \geq 1$, so $\min(1, p/q) = 1$. The token is always accepted . This happens when the draft model and target model agree, or when the target is even more confident about this token than the draft was. There's no reason to reject a token the target model likes.
When $p_i(\hat{y}_i) < q_i(\hat{y}_i)$ — the target model assigns lower probability than the draft model — the ratio $p/q < 1$, so we accept with probability $p/q$. The draft model was overconfident about this token relative to the target, and we reject proportionally more often. If the draft assigned probability 0.8 but the target assigns 0.2, we accept with probability $0.2/0.8 = 0.25$ and reject 75% of the time. If the target assigns probability 0.0 (this token should never appear), we always reject ($0/0.8 = 0$). Conversely, if $p$ and $q$ are close, say 0.3 and 0.4, acceptance probability is $0.3/0.4 = 0.75$ — fairly high, reflecting mild disagreement.
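The worked examples above are one-liners in code — a trivial sketch of the criterion (`accept_prob` is an illustrative name, not an API from the papers):

```python
def accept_prob(p, q):
    """Probability of accepting a drafted token: min(1, p/q)."""
    return min(1.0, p / q)

print(f"{accept_prob(0.2, 0.8):.2f}")  # 0.25 — draft was overconfident
print(f"{accept_prob(0.0, 0.8):.2f}")  # 0.00 — target forbids the token
print(f"{accept_prob(0.3, 0.4):.2f}")  # 0.75 — mild disagreement
print(f"{accept_prob(0.5, 0.3):.2f}")  # 1.00 — target more confident: always accept
```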
When a token is rejected, we don't just sample from the target distribution $p_i$ directly (that would bias toward tokens the draft model already favoured). Instead, we sample from a corrected distribution that removes the probability mass the acceptance step has already had its chance to cover:

$$p'_i(y) = \frac{\max\left(0,\, p_i(y) - q_i(y)\right)}{\sum_{y'} \max\left(0,\, p_i(y') - q_i(y')\right)}$$
This corrected distribution upweights tokens where the target model assigns more probability than the draft model and completely zeroes out tokens where the draft model was more confident. The $(p - q)^+$ numerator captures the "residual" probability mass that the draft model underweighted, and normalisation turns it into a valid distribution. The mathematical proof that this combination of acceptance criterion plus corrected resampling preserves the exact target distribution can be found in both original papers — the intuition is that it's a standard rejection-sampling scheme where the proposal distribution is $q$ and the target distribution is $p$.
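Here is a minimal sketch of the corrected resampling step over a toy four-token vocabulary (the probability values are made up for illustration):

```python
import numpy as np

def corrected_distribution(p, q):
    """Residual distribution (p - q)^+ renormalised, used after a rejection."""
    residual = np.maximum(0.0, p - q)  # zero wherever the draft was more confident
    return residual / residual.sum()

p = np.array([0.4, 0.3, 0.2, 0.1])  # target distribution (toy numbers)
q = np.array([0.1, 0.1, 0.4, 0.4])  # draft distribution (toy numbers)
r = corrected_distribution(p, q)
print(r)  # ≈ [0.6, 0.4, 0.0, 0.0] — mass only where p > q
```

Note the degenerate case $p = q$ makes the residual sum to zero, but then the acceptance probability is 1 everywhere and a rejection never occurs, so the division is never reached.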
How many tokens do we expect to accept per round? If we denote the per-position acceptance rate as $\alpha_j$ (averaged over the token distribution), the expected number of accepted tokens is:

$$\mathbb{E}[\text{accepted}] = \sum_{i=1}^{K} \prod_{j=1}^{i} \alpha_j$$
This formula reflects the left-to-right sequential structure of acceptance: to accept position $i$, we must have also accepted all positions $1, 2, \ldots, i-1$. The product $\prod_{j=1}^{i} \alpha_j$ is the probability that the first $i$ tokens are all accepted, and summing over $i$ gives the expected count.
Let's check the boundaries. If the draft model is perfect ($q = p$ everywhere), then $\alpha_j = \min(1, p/q) = 1$ for all $j$, every product is 1, and the sum equals $K$. All $K$ tokens are always accepted, yielding a $(K+1)\times$ speedup (including the bonus token). This is the theoretical maximum, achieved when the draft model perfectly mimics the target.
If the draft model is terrible ($q$ very different from $p$, so $\alpha_j \approx 0$), then even the first product $\alpha_1 \approx 0$, the sum approaches 0, and we accept almost nothing. We still get 1 token per round (the resampled correction), which is what standard decoding produces. The only penalty is the wasted draft computation. So the speedup degrades gracefully from $(K+1)\times$ down toward $1\times$ as draft quality decreases — it never makes things significantly worse.
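The expectation formula is easy to evaluate directly. A minimal sketch, with assumed per-position acceptance rates that decay with lookahead distance (the rates are illustrative, not measured):

```python
from itertools import accumulate
from operator import mul

def expected_accepted(alphas):
    """E[accepted] = sum over i of prod_{j<=i} alpha_j."""
    return sum(accumulate(alphas, mul))

# Assumed per-position rates: guesses get harder further ahead
print(f"{expected_accepted([0.9, 0.8, 0.7, 0.6, 0.5]):.2f}")  # 2.58
```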
For a simplified analysis where $\alpha_j = \alpha$ is constant across positions, the expected accepted count becomes the geometric series:

$$\mathbb{E}[\text{accepted}] = \sum_{i=1}^{K} \alpha^i = \frac{\alpha\,(1 - \alpha^K)}{1 - \alpha}$$
At $\alpha = 0.9$ (a well-matched draft/target pair) with $K = 5$, this gives $0.9 + 0.81 + 0.729 + 0.6561 + 0.5905 \approx 3.69$ tokens accepted per round, plus the bonus token from the target model, for roughly 4.69 tokens per round — close to a $5\times$ speedup if we ignore draft overhead. At $\alpha = 0.5$ (mediocre draft) with $K = 5$, we get $0.5 + 0.25 + 0.125 + 0.0625 + 0.03125 \approx 0.97$ tokens, so about 2 tokens per round including the bonus — roughly $2\times$.
import json, js

K = 5  # speculation length

# Sweep alpha from 0 to 1
alphas = [i / 100 for i in range(0, 101)]
expected_accepted = []
speedups = []
for alpha in alphas:
    # E[accepted] = sum of powers alpha^1, alpha^2, ..., alpha^K
    total = sum(alpha**i for i in range(1, K + 1))
    expected_accepted.append(round(total, 3))
    # Total tokens per round = accepted + 1 (bonus token)
    # Speedup = tokens per round vs 1 token in standard decoding
    speedups.append(round(total + 1, 3))

# Sample every 5 points for readability
step = 5
x_labels = [f"{alphas[i]:.2f}" for i in range(0, len(alphas), step)]
y_accepted = [expected_accepted[i] for i in range(0, len(alphas), step)]
y_speedup = [speedups[i] for i in range(0, len(alphas), step)]

plot_data = [
    {
        "title": f"Speculative Decoding Speedup (K={K})",
        "x_label": "Acceptance Rate (alpha)",
        "y_label": "Tokens per Round",
        "x_data": x_labels,
        "lines": [
            {"label": "Expected Accepted Tokens", "data": y_accepted, "color": "#3b82f6"},
            {"label": "Total Tokens (accepted + bonus)", "data": y_speedup, "color": "#22c55e"},
        ],
    }
]
js.window.py_plot_data = json.dumps(plot_data)

print(f"Speculation length K = {K}")
print(f"alpha=0.0: {expected_accepted[0]:.1f} accepted, {speedups[0]:.1f}x total (worst case)")
print(f"alpha=0.5: {expected_accepted[50]:.2f} accepted, {speedups[50]:.2f}x total")
print(f"alpha=0.7: {expected_accepted[70]:.2f} accepted, {speedups[70]:.2f}x total")
print(f"alpha=0.9: {expected_accepted[90]:.2f} accepted, {speedups[90]:.2f}x total")
print(f"alpha=1.0: {expected_accepted[100]:.1f} accepted, {speedups[100]:.1f}x total (best case)")
The plot shows the nonlinear relationship between draft quality and speedup. At $\alpha = 0.7$, we already get roughly $2.9\times$ total tokens per round — a substantial gain. But below $\alpha = 0.5$, returns diminish sharply, and the speedup approaches the $1\times$ baseline. This curve makes the choice of draft model critically important: a small improvement in acceptance rate translates into a large improvement in throughput.
Choosing the Draft Model
The draft model must satisfy two competing requirements: it must be fast enough that $K$ draft forward passes don't dominate the cost of one target forward pass, and it must be accurate enough that a high fraction of its guesses are accepted. If the draft model is too slow, the overhead of drafting erodes the savings from parallel verification. If it's too inaccurate, most tokens get rejected and we fall back to essentially standard decoding.
The most common approach is to use a smaller model from the same family . For example, LLaMA-7B or LLaMA-13B can serve as a draft model for LLaMA-70B. Models from the same family share the same tokenizer and were trained on similar data, so their output distributions tend to agree on high-probability tokens (common continuations, factual completions, syntactic patterns). A smaller model from a completely different family with a different tokenizer won't work — the vocabulary mismatch alone makes comparison meaningless.
A second option is to use a quantised version of the target model itself as the draft model. If the target runs in FP16, a heavily quantised 2-bit or 3-bit version of the same model can serve as the drafter. The distributions are naturally well-aligned because it's the same model (just compressed), and as we covered in the quantisation article, aggressive quantisation dramatically reduces model size and memory bandwidth requirements, making each draft forward pass very cheap.
Other approaches go even simpler. An n-gram model or a retrieval-based lookup table can predict the next few tokens almost instantly (no neural network forward pass at all). These work well in repetitive domains like code generation, where the next several tokens are often predictable from local context (closing brackets, variable names, boilerplate patterns). Their acceptance rate is lower on creative or open-ended text, but the drafting cost is near zero, so even a moderate acceptance rate yields a net speedup.
Several factors affect the acceptance rate beyond just model architecture:
- Domain alignment: a draft model fine-tuned on the same domain as the target model's typical input will have much higher acceptance rates. A code-specialised draft model will agree with a code-specialised target model far more often than a generic one would.
- Temperature: lower sampling temperatures make both distributions more peaked (concentrated on the top token), which increases the probability that draft and target agree. At temperature 0 (greedy decoding), both models pick their argmax — if they agree on the top token, acceptance is certain. At high temperatures, the distributions spread out and disagreements become more frequent.
- Context predictability: some parts of a generation are more predictable than others. Function signatures, closing tags, and formulaic phrases have near-certain continuations that any reasonable draft model will get right. Creative completions, rare factual claims, and reasoning steps are harder to predict.
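The temperature effect can be checked numerically. For speculative sampling, the expected acceptance rate at a position works out to $\sum_y \min(p(y), q(y))$, the overlap between the two distributions (this identity comes from the original papers' analysis). Below is a sketch with made-up four-token logits where both models agree on the top token but rank the runners-up differently:

```python
import math

def softmax(logits, temperature):
    z = [l / temperature for l in logits]
    m = max(z)                             # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def acceptance_rate(p, q):
    """Expected acceptance E_q[min(1, p/q)] = sum_y min(p(y), q(y))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

target_logits = [3.0, 1.0, 2.0, 0.0]  # hypothetical logits, shared argmax
draft_logits  = [3.0, 2.0, 1.0, 0.0]

for T in (0.25, 0.5, 1.0):
    p = softmax(target_logits, T)
    q = softmax(draft_logits, T)
    print(f"T={T}: alpha = {acceptance_rate(p, q):.3f}")
```

One caveat: at very high temperatures both distributions flatten toward uniform, so the overlap rises again; the monotone "lower temperature, higher acceptance" trend holds in the regime where sharpening concentrates both models on their shared top token.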
In practice, the speed vs acceptance tradeoff means there's an optimal draft model size for each target model and workload. A very tiny draft (e.g. a 100M-parameter model drafting for a 70B target) has near-zero drafting cost but a low acceptance rate of perhaps 40-50%. A larger draft (e.g. 7B for 70B) has a higher acceptance rate of perhaps 70-80%, but its forward passes are no longer negligible — $K = 5$ draft passes through a 7B model is a meaningful fraction of one 70B pass. The sweet spot depends on the hardware, the domain, and the target model size. Typical end-to-end speedups reported in the literature range from 2-3x for well-matched draft/target pairs, with some configurations achieving up to 4x on predictable domains.
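This tradeoff can be sketched by combining the constant-$\alpha$ expectation from earlier with the round-cost model. The acceptance rates and cost ratios below are illustrative guesses for hypothetical draft sizes, not benchmarks:

```python
def effective_speedup(alpha, K, c):
    """Tokens per round divided by round cost (in target-pass units).
    alpha: per-position acceptance rate; c = C_draft / C_target."""
    accepted = sum(alpha**i for i in range(1, K + 1))
    return (accepted + 1) / (1 + K * c)  # +1 for the bonus token

K = 5
# Hypothetical draft options for a 70B-class target (made-up numbers)
for name, alpha, c in [("100M draft", 0.45, 0.002),
                       ("1B draft",   0.65, 0.02),
                       ("7B draft",   0.80, 0.10)]:
    print(f"{name}: {effective_speedup(alpha, K, c):.2f}x")
```

With these assumed numbers the 1B and 7B drafts come out nearly tied: the 7B draft accepts more tokens per round, but its higher drafting cost claws most of that back — exactly the sweet-spot behaviour described above.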
Self-Speculative Decoding and Medusa
The biggest practical headache with speculative decoding is maintaining a separate draft model. You need to load both models into GPU memory simultaneously, keep their tokenizers compatible, and figure out which draft model pairs well with your target. What if we could make the target model speculate about its own future tokens, eliminating the need for a separate model entirely?
Medusa (Cai et al., 2024) takes exactly this approach. Instead of using a separate draft model, Medusa adds multiple prediction heads to the target model itself. The standard language model head predicts the next token (position $t+1$). Medusa adds additional lightweight heads — typically small MLPs — that predict tokens at positions $t+2$, $t+3$, $t+4$, and so on, all from the same hidden state at position $t$. These extra heads are trained while the main model weights are either frozen or lightly fine-tuned, so the base model's quality is preserved.
Each Medusa head independently proposes candidate tokens at its respective position. With, say, 3 candidates per head across 4 heads, that creates a tree of candidate sequences rather than a single linear draft. Medusa then uses tree attention to verify all candidate paths in a single forward pass. The attention mask is structured so that each candidate token only attends to its own ancestral path in the tree, not to sibling branches. This lets the model evaluate many possible continuations simultaneously, and the best-matching path is accepted using the same rejection-sampling criterion from standard speculative decoding.
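To make tree attention concrete, here's a minimal sketch of the ancestor mask it relies on. The tree encoding (a flat list of parent indices) and the function name are illustrative assumptions, not Medusa's actual API:

```python
def tree_attention_mask(parents):
    """parents[i] = index of candidate node i's parent, or -1 if it hangs
    directly off the committed context. Returns an NxN boolean mask where
    mask[i][j] is True iff j is i itself or an ancestor of i."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up the ancestral path
            mask[i][j] = True
            j = parents[j]
    return mask

# Two candidates for position t+1 (nodes 0, 1); nodes 2 and 3 extend
# node 0, node 4 extends node 1 — siblings never see each other.
parents = [-1, -1, 0, 0, 1]
for row in tree_attention_mask(parents):
    print("".join("X" if v else "." for v in row))
```

In a real implementation this mask is combined with the ordinary causal mask over the committed context, so every candidate also attends to all previously accepted tokens.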
The advantage of Medusa is significant: no separate draft model to load, no tokenizer compatibility issues, and the prediction heads are naturally well-aligned with the target model (they share the same hidden representations). The downside is that training the extra heads requires some compute, and the heads' predictions are based on a single hidden state rather than the autoregressively refined states a draft model would produce, which can limit accuracy for positions far ahead.
EAGLE (Li et al., 2024) takes a different self-speculative approach. Instead of independent heads per position, EAGLE uses a lightweight feature-based predictor — a small autoregressive module that operates on the target model's hidden features (rather than raw token embeddings) to predict future tokens. Because it autoregressively conditions each predicted token on the features of previous predictions, EAGLE captures inter-token dependencies that Medusa's independent heads miss. EAGLE-2 further improves on this by dynamically adjusting the speculation tree structure based on confidence scores, expanding branches that look promising and pruning ones that don't. The authors report speedups of 3-4x while maintaining the exact target distribution.
A conceptually distinct approach is lookahead decoding, which uses Jacobi iteration to generate multiple tokens in parallel without any draft model or extra heads at all. The idea is to initialise multiple future token positions with random guesses, then iteratively refine all positions in parallel using the target model. In each iteration, some positions may "converge" to their correct values (matching what sequential decoding would produce), at which point they're locked in and the window advances. This is mathematically equivalent to running Jacobi iteration on the autoregressive fixed-point equation, treating the sequential token dependencies as a system to be solved in parallel. It's slower to converge than speculative decoding when a good draft model is available, but it requires zero additional components beyond the target model itself.
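The Jacobi fixed-point idea can be illustrated with a toy deterministic "model" — a made-up arithmetic rule standing in for greedy decoding, not a real LLM: initialise guesses, update every position in parallel from the current guesses, and stop when nothing changes. The fixed point matches what sequential decoding would produce.

```python
def next_token(prefix):
    """Toy deterministic 'model': next token = (sum of prefix) mod 7."""
    return sum(prefix) % 7

def jacobi_decode(prompt, n_tokens):
    guesses = [0] * n_tokens            # arbitrary initial guesses
    iterations = 0
    while True:
        iterations += 1
        seq = list(prompt) + guesses
        # Update every position in parallel from the *current* guesses
        new = [next_token(seq[:len(prompt) + i]) for i in range(n_tokens)]
        if new == guesses:              # fixed point reached
            return guesses, iterations
        guesses = new

tokens, iters = jacobi_decode([1, 2, 3], 5)
print(tokens, iters)  # tokens match sequential decoding with next_token
```

Each iteration the leftmost unconverged position locks in (it depends only on already-correct tokens), so convergence is guaranteed in at most $n+1$ iterations; real lookahead decoding beats sequential decoding by reusing n-grams observed across iterations, which this toy omits.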
These self-speculative approaches avoid the fundamental question of "which draft model do I use?" entirely. In practice, EAGLE and Medusa have become popular choices for production deployment because they integrate seamlessly with existing serving infrastructure: you load one model (with a few extra parameters for the speculation heads) and get 2-4x throughput improvement with no changes to the serving architecture.
Quiz
Test your understanding of speculative decoding and its variants.
Why is verifying $K$ draft tokens in one target model forward pass much cheaper than generating $K$ tokens sequentially?
If the draft model proposes a token with probability $q = 0.6$ and the target model assigns it probability $p = 0.3$, what is the acceptance probability?
What happens in the worst case when speculative decoding rejects the very first draft token?
What advantage does Medusa have over standard two-model speculative decoding?