Why Can't Transformers Just Read More?

Every transformer has a fundamental limit: its context window, the maximum number of tokens it can process in a single forward pass. GPT-2 (2019) had a context window of 1,024 tokens — roughly 750 words, barely enough for a short blog post. GPT-4 (2023) pushed this to 128,000 tokens. Gemini 1.5 Pro (2024) claims 10 million. Each generation pushes the boundary further, but why was it so small in the first place? And what makes extending it so hard?

The answer lies in the core operation of every transformer: self-attention. In standard multi-head attention, every token in the sequence computes an attention score against every other token. If the sequence has $n$ tokens, that means $n \times n$ attention scores per head per layer. The cost is quadratic in sequence length — $O(n^2)$ — and this quadratic scaling is the single biggest obstacle to long-context transformers.

To get a feel for how quickly this blows up, consider three context lengths. For $n = 1{,}024$ (GPT-2), each attention head computes $1{,}024 \times 1{,}024 \approx 1$ million scores per layer. Manageable. For $n = 128{,}000$ (GPT-4 Turbo), that jumps to $128{,}000 \times 128{,}000 \approx 16.4$ billion scores per layer per head. For $n = 1{,}000{,}000$ (roughly Gemini's range), it reaches $10^{12}$ — one trillion scores. Going from 1K to 1M tokens multiplied the attention cost by a factor of roughly one million.

This quadratic scaling hits two separate hardware resources. First, compute (FLOPs): the matrix multiplication $QK^T$ requires $O(n^2 \cdot d)$ floating-point operations per head per layer, where $d$ is the head dimension. Second, memory: the attention score matrix itself is $n \times n$ and must be stored (at least temporarily) during the forward pass. On top of that, the KV cache used during autoregressive generation grows linearly with $n$, but the constant factor is large enough to dominate GPU memory at long sequence lengths. So extending the context window isn't just an algorithmic challenge — it's a hardware budgeting problem.

The Quadratic Attention Cost

Let's put numbers on the cost. Standard scaled dot-product attention computes:

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices of shape $n \times d_k$, and $n$ is the sequence length (for a refresher on how these matrices are built, see the attention scores article). The softmax normalises each row of scores into a probability distribution. The critical step is the matrix product $QK^T$: it multiplies an $n \times d_k$ matrix by a $d_k \times n$ matrix, producing an $n \times n$ attention score matrix. That $n \times n$ matrix is where the quadratic cost lives.
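To make the shapes concrete, here is a minimal NumPy sketch of scaled dot-product attention — a single head, no batching or masking. The point to notice is that the weight matrix is $n \times n$ regardless of $d_k$:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention. Q, K, V: (n, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) — the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

n, d_k = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape)      # (8, 4)
print(weights.shape)  # (8, 8) — n x n, independent of d_k
```

Doubling `n` to 16 would quadruple the size of `weights` while `out` only doubles — the asymmetry the rest of this article is about.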

Let's break down each cost component precisely. For a single attention head in a single layer:

  • Compute for $QK^T$: multiplying $n \times d_k$ by $d_k \times n$ costs $O(n^2 \cdot d_k)$ FLOPs. When $n$ doubles, this quadruples.
  • Compute for $\text{scores} \times V$: multiplying the $n \times n$ score matrix by $V$ (shape $n \times d_k$) costs another $O(n^2 \cdot d_k)$ FLOPs. Also quadratic in $n$.
  • Memory for attention matrix: the full $n \times n$ matrix must be stored — that's $O(n^2)$ elements per head. In practice, FlashAttention (Dao et al., 2022) avoids materialising this entire matrix by computing attention in tiles, reducing peak memory from $O(n^2)$ to $O(n)$. But the FLOPs remain $O(n^2 \cdot d_k)$ — FlashAttention is an IO optimisation, not an algorithmic one.
  • KV cache memory: during autoregressive generation, the model stores all previously computed key and value vectors. For $L$ layers, $H$ heads, and head dimension $d_k$, this costs $O(n \cdot d_k \cdot H \cdot L)$ — linear in $n$, but the constant is substantial. For a 70B-parameter model with 80 layers and grouped-query attention (8 KV heads of dimension 128, as in Llama-2-70B), storing the KV cache for 128K tokens in FP16 requires roughly 40 GB (see the KV cache article).
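As a quick sanity check on that last bullet, a sketch of the KV-cache arithmetic (the grouped-query configuration — 8 KV heads of dimension 128 across 80 layers — is an assumption matching Llama-2-70B-class models; the figures are illustrative):

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_k, bytes_per_elem=2):
    # Leading 2 accounts for storing both K and V; FP16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * d_k * n_tokens * bytes_per_elem

# Assumed 70B-class config: 80 layers, GQA with 8 KV heads, d_k = 128
gb = kv_cache_bytes(131_072, n_layers=80, n_kv_heads=8, d_k=128) / 1e9
print(f"KV cache at 128K tokens: {gb:.1f} GB")  # ~42.9 GB
```

With full multi-head caching (64 KV heads instead of 8) the same arithmetic gives roughly 344 GB — one reason grouped-query attention became standard in large models.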

The practical consequence is stark: doubling the context length quadruples the attention compute and doubles the KV cache memory. The table below makes this concrete by showing how both scale across the context lengths that matter in practice.

import json, js

# Attention compute and KV cache scaling across context lengths
# Using d_k=128, H=32 heads, L=32 layers as a representative config (roughly a 7B model)
d_k = 128
H = 32
L = 32

seq_lengths = [1024, 4096, 16384, 65536, 131072, 1048576]
labels = ["1K", "4K", "16K", "64K", "128K", "1M"]

rows = []
for n, label in zip(seq_lengths, labels):
    # Attention FLOPs per layer (all heads): 2 * H * n^2 * d_k (QK^T + scores*V)
    attn_flops = 2 * H * (n ** 2) * d_k
    # KV cache memory in bytes (FP16 = 2 bytes): 2 (K+V) * L * H * d_k * n * 2 bytes
    kv_bytes = 2 * L * H * d_k * n * 2

    # Format FLOPs
    if attn_flops >= 1e18:
        flops_str = f"{attn_flops / 1e18:.1f} ExaFLOPs"
    elif attn_flops >= 1e15:
        flops_str = f"{attn_flops / 1e15:.1f} PetaFLOPs"
    elif attn_flops >= 1e12:
        flops_str = f"{attn_flops / 1e12:.1f} TeraFLOPs"
    elif attn_flops >= 1e9:
        flops_str = f"{attn_flops / 1e9:.1f} GigaFLOPs"
    else:
        flops_str = f"{attn_flops / 1e6:.1f} MegaFLOPs"

    # Format KV cache
    if kv_bytes >= 1e9:
        kv_str = f"{kv_bytes / 1e9:.1f} GB"
    elif kv_bytes >= 1e6:
        kv_str = f"{kv_bytes / 1e6:.1f} MB"
    else:
        kv_str = f"{kv_bytes / 1e3:.1f} KB"

    # n^2 attention scores
    scores = n * n
    if scores >= 1e12:
        scores_str = f"{scores / 1e12:.1f}T"
    elif scores >= 1e9:
        scores_str = f"{scores / 1e9:.1f}B"
    elif scores >= 1e6:
        scores_str = f"{scores / 1e6:.1f}M"
    else:
        scores_str = f"{scores / 1e3:.1f}K"

    rows.append([label, f"{n:,}", scores_str, flops_str, kv_str])

js.window.py_table_data = json.dumps({
    "headers": ["Context", "Tokens (n)", "Attn Scores (n\u00b2)", "Attn FLOPs/layer", "KV Cache (FP16)"],
    "rows": rows
})

print("Config: d_k=128, H=32 heads, L=32 layers (approx. 7B model)")
print()
print("Key observations:")
print("  1K -> 4K   (4x tokens): attention FLOPs grow 16x, KV cache grows 4x")
print("  1K -> 128K (128x tokens): attention FLOPs grow 16,384x, KV cache grows 128x")
print("  1K -> 1M   (1024x tokens): attention FLOPs grow ~1,000,000x, KV cache grows 1,024x")

💡 The table above shows FLOPs for a single layer. A 32-layer model multiplies these numbers by 32: at 128K context that is roughly 4.5 PetaFLOPs of attention computation alone per forward pass — before counting the feed-forward layers, which add another comparable chunk of compute.

Let's verify the scaling relationship with a quick boundary check. If we go from $n$ to $2n$, the attention scores go from $n^2$ to $(2n)^2 = 4n^2$ — a $4\times$ increase. The KV cache goes from $n \cdot c$ to $2n \cdot c$ where $c = d_k \cdot H \cdot L$ is constant — a $2\times$ increase. So attention compute is quadratic in $n$ (doubling $n$ costs $4\times$), while KV cache is linear (doubling $n$ costs $2\times$). At moderate context lengths, the KV cache dominates memory because FlashAttention avoids materialising the $n^2$ attention matrix. But at very long contexts (1M+), even the linear KV cache term becomes enormous — 512 GB for 1M tokens in the configuration above.
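The doubling argument can be checked mechanically, using the same illustrative configuration as the table ($d_k = 128$, $H = 32$, $L = 32$):

```python
def attn_scores(n):
    return n * n                        # quadratic in n

def kv_cache_bytes(n, d_k=128, H=32, L=32):
    return 2 * L * H * d_k * n * 2      # K and V, FP16 (2 bytes per element)

n = 4096
assert attn_scores(2 * n) == 4 * attn_scores(n)        # doubling n -> 4x scores
assert kv_cache_bytes(2 * n) == 2 * kv_cache_bytes(n)  # doubling n -> 2x cache

print(f"{kv_cache_bytes(2**20) / 2**30:.0f} GiB for 1M tokens")  # 512 GiB
```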

Three Strategies for Long Context

If quadratic attention is the wall, what tools do we have to get past it? Over the rest of this track, we'll explore three fundamentally different strategies, each making a different trade-off.

The first strategy is better position encodings (article 2). Standard transformers learn a fixed set of position embeddings during training — one per position up to the maximum context length. A model trained with 4,096 positions has never seen position 4,097 and has no idea what to do with it. Techniques like RoPE (Rotary Position Embedding), ALiBi (Attention with Linear Biases), and YaRN (Yet another RoPE extensioN) replace or modify the position encoding so the model can generalise to positions it was never trained on. The trade-off is extrapolation quality: the model might handle longer sequences, but attention quality can degrade at positions far beyond the training distribution.
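For a flavour of why rotary encodings generalise, here is a minimal NumPy sketch of RoPE: each consecutive pair of dimensions is rotated by a position-dependent angle, and the key property is that the dot product of a rotated query and rotated key depends only on their relative offset, not on the absolute positions (the frequency scheme below follows the standard base-10000 convention; details of real implementations vary):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding. x: (seq, d) with d even."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d/2,) rotation frequencies
    angles = positions[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # split into dimension pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 1, 64))
# q at position 5 against k at position 3 ...
s1 = rope(q, np.array([5])) @ rope(k, np.array([3])).T
# ... scores identically to positions 105 vs 103: only the offset matters
s2 = rope(q, np.array([105])) @ rope(k, np.array([103])).T
print(np.allclose(s1, s2))  # True
```

That offset-only dependence is what extension methods like YaRN exploit: they rescale the rotation frequencies so offsets larger than anything seen in training still land in a familiar range.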

The second strategy is efficient attention patterns (article 3). Instead of every token attending to every other token ($O(n^2)$), we restrict attention to a subset: a sliding window of $w$ nearby tokens, a fixed set of global tokens, or a sparse pattern determined by hashing or learned routing. This reduces the cost from $O(n^2)$ to $O(n \cdot w)$ where $w \ll n$ — often $O(n \log n)$ or even $O(n)$. Examples include Longformer, BigBird, and Mistral's sliding-window attention. The trade-off is full attention access: tokens can only attend to their local neighbourhood, so distant information must propagate through multiple layers rather than being directly accessible in one attention step.
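A sliding-window pattern is easiest to see as a boolean mask over the $n \times n$ score matrix — a sketch of the causal variant, where token $i$ attends to tokens $i-w+1$ through $i$:

```python
import numpy as np

def sliding_window_mask(n, w):
    """True where attention is allowed: causal, within a window of width w."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (j > i - w)

n, w = 1024, 64
mask = sliding_window_mask(n, w)
print(f"full attention: {n * n:,} scores")        # 1,048,576
print(f"window w={w}:   {mask.sum():,} scores")   # ~n*w instead of n^2
```

With $w = 64$ and $n = 1{,}024$ the mask keeps roughly 6% of the score matrix; at $n = 128$K the same window keeps well under 0.1%, which is where the savings become decisive.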

The third strategy is external memory (article 4). Instead of cramming everything into the attention window, give the model an external memory bank — a fixed-size compressed representation of past context that it can read from and write to. This is the approach taken by Titans (Google, 2025), Infini-Attention (Google, 2024), and Memorizing Transformers (Google, 2022). The trade-off is exact retrieval for compressed storage: the model stores a compressed summary rather than the raw tokens, so it may not retrieve fine-grained details as precisely as full attention would.
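To make the read/write idea concrete, here is a toy memory bank in the spirit of Memorizing Transformers — store past (key, value) pairs, then read back a softmax-weighted mix of the top-k values whose keys best match a query. Everything here (class name, shapes, the top-k readout) is illustrative, not the paper's actual implementation:

```python
import numpy as np

class KNNMemory:
    """Toy external memory: write (key, value) pairs, read by kNN lookup."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def write(self, keys, values):
        self.keys = np.vstack([self.keys, keys])
        self.values = np.vstack([self.values, values])

    def read(self, query, top_k=2):
        sims = self.keys @ query                 # similarity to every stored key
        idx = np.argsort(sims)[-top_k:]          # indices of the top-k matches
        w = np.exp(sims[idx] - sims[idx].max())  # softmax over neighbours only
        return (w / w.sum()) @ self.values[idx]

mem = KNNMemory(d=4)
mem.write(np.eye(4), np.diag([1.0, 2.0, 3.0, 4.0]))  # 4 keys, 4 values
out = mem.read(np.array([10.0, 0.0, 0.0, 0.0]), top_k=1)  # query matches key 0
print(out)  # [1. 0. 0. 0.]
```

The crucial difference from attention is the cost profile: each read touches only `top_k` entries (via an approximate index in real systems), so the memory can grow far beyond any feasible attention window.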

Beyond these three core strategies, article 5 covers scaling tricks — engineering techniques like progressive training, context length warm-up, and continued pre-training that let existing models adapt to longer contexts without training from scratch. Article 6 examines how production models combine these approaches: modern long-context systems typically stack multiple strategies (e.g., RoPE scaling + sliding-window attention + FlashAttention) rather than relying on any single technique.

To summarise the landscape:

  • Position encodings — extend where the model can look (trade extrapolation quality)
  • Sparse attention — reduce how much the model computes (trade full attention access)
  • External memory — expand what the model can remember (trade exact retrieval for compression)

What Happens When Context Gets Long?

Suppose we solve the compute and memory problems — we build a model that can fit a million tokens in its context window without running out of GPU memory or taking hours per forward pass. Does it actually use all that context effectively? The answer, surprisingly, is: not always.

In 2023, Liu et al. published a paper titled "Lost in the Middle: How Language Models Use Long Contexts" that exposed a striking pattern. They gave models a multi-document question-answering task where the relevant document was placed at different positions within a long context. Models performed well when the relevant information appeared at the very beginning or the very end of the context, but performance dropped significantly when it was buried in the middle. This U-shaped performance curve held across multiple models and context lengths.

Why does this happen? The likely explanation involves how attention distributes across positions. In autoregressive transformers, the first few tokens tend to receive disproportionately high attention weights (sometimes called "attention sinks"), and the most recent tokens are naturally salient because they are closest in the causal attention window. Tokens in the middle get squeezed: they're far from the start (so they've lost their primacy advantage) and far from the end (so they lack recency). The softmax normalisation in attention means attention is zero-sum — giving more weight to boundary positions necessarily takes weight away from middle positions.

This finding led to the needle-in-a-haystack evaluation paradigm, which has become the standard stress test for long-context models. The setup is simple: insert a specific fact (the "needle") at a controlled position within a long passage of irrelevant text (the "haystack"), then ask the model to retrieve that fact. By varying both the total context length and the needle's position, we get a 2D heatmap of retrieval accuracy. A perfect long-context model would show uniform green across the entire heatmap — it retrieves the needle regardless of where it's placed or how long the context is.
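The test cases themselves are trivial to construct — a sketch of the setup (the needle text and filler sentences are placeholders):

```python
def build_haystack(needle, depth, n_fillers=200):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)
    within irrelevant filler text."""
    filler = [f"Filler sentence {i} carries no useful information."
              for i in range(n_fillers)]
    cut = int(depth * len(filler))
    return " ".join(filler[:cut] + [needle] + filler[cut:])

needle = "The secret code is 7421."
# Sweep (context length, needle depth) to build the 2D evaluation grid
for n_fillers in (100, 200):
    for depth in (0.0, 0.5, 1.0):
        prompt = build_haystack(needle, depth, n_fillers)
        assert needle in prompt  # ask the model to retrieve the code here
```

Each (length, depth) cell becomes one retrieval query; accuracy per cell is what produces the familiar heatmap.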

In practice, most models show degraded retrieval in the middle positions, especially at longer context lengths. Some newer models (GPT-4o, Claude 3.5, Gemini 1.5 Pro) have substantially improved on this benchmark through training-time interventions, but the general lesson remains: the quality challenge is orthogonal to the compute challenge. Solving the $O(n^2)$ scaling problem gets us a model that can technically process a million tokens. But whether the model actually attends to and retrieves information uniformly across all those tokens is a separate problem that requires training-time fixes (like long-context fine-tuning with diverse needle positions) and architectural innovations (like the efficient attention and memory mechanisms we'll cover in articles 3 and 4).

This distinction — between being able to process long contexts and being able to use them — is what makes the long-context problem so multifaceted. Extending the context window requires solving at least three sub-problems simultaneously: the quadratic compute wall, the linear-but-large memory wall, and the quality degradation that comes from attention's difficulty in distributing focus uniformly over very long sequences. The rest of this track tackles each in turn.

Quiz

Test your understanding of the context length problem and its implications.

If a model's context length increases from 4,096 to 16,384 tokens (a 4$\times$ increase), by what factor does the attention computation ($QK^T$) increase?

FlashAttention reduces the peak memory usage of attention from $O(n^2)$ to $O(n)$. What does it NOT reduce?

In the 'Lost in the Middle' finding (Liu et al., 2023), where in a long context do models perform worst at retrieving relevant information?

Which trade-off is associated with the sparse attention strategy for long context?