How Do Production Models Handle Long Context?

The previous articles in this track each tackled one piece of the long-context puzzle: position encodings that generalise beyond training lengths, efficient attention patterns that break the quadratic wall, external memory mechanisms that compress past context, and retrieval-augmented generation that offloads knowledge to an external corpus. Each technique solves a real bottleneck, but no single one is sufficient on its own. Production models stack four, five, or six of these ideas together — and the combination is what actually delivers 128K or million-token context windows.

This article examines how the major open and proprietary models achieve their context lengths. The pattern that emerges is consistent: every production system combines a position encoding scheme, an attention efficiency strategy, a KV cache management technique, and usually some form of training recipe that progressively extends context. The details differ, but the layered architecture does not.

LLaMA 3: 128K via RoPE Scaling

Meta's LLaMA 3 family (Grattafiori et al., 2024) is one of the best-documented examples of how a production model assembles its long-context stack. The key insight is that context extension is a post-training step, not something baked in from the start. LLaMA 3 was pre-trained with an 8K context window, then extended to 128K through a series of progressive stages.

The recipe is deceptively simple: start with the 8K pre-trained model, then continue training at 16K, then 32K, then 128K — each stage using a longer context length and adjusting the RoPE base frequency to accommodate the new positions. At each stage, the model sees progressively longer documents and learns to attend over greater distances. This progressive approach is far cheaper than training from scratch at 128K, because most of the model's knowledge was already learned during the 8K pre-training phase — the continued training stages only need to teach the model how to use positions it hasn't seen before.
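The staging can be written down as a simple schedule. A minimal sketch: the stage lengths follow the recipe above, but the intermediate RoPE base values here are hypothetical placeholders chosen only to illustrate the monotonic increase, not Meta's documented per-stage settings.

```python
# Illustrative progressive context-extension schedule. Stage lengths follow
# the text (8K -> 16K -> 32K -> 128K); the intermediate rope_base values are
# hypothetical placeholders, not Meta's actual per-stage configuration.
stages = [
    {"context": 8_192,   "rope_base": 10_000},    # original pre-training
    {"context": 16_384,  "rope_base": 50_000},    # hypothetical intermediate
    {"context": 32_768,  "rope_base": 150_000},   # hypothetical intermediate
    {"context": 131_072, "rope_base": 500_000},   # final long-context stage
]

for prev, curr in zip(stages, stages[1:]):
    growth = curr["context"] / prev["context"]
    print(f"{prev['context']:>7} -> {curr['context']:>7} tokens "
          f"({growth:.0f}x), rope base {curr['rope_base']:,}")
```

Each stage only needs to teach the model the new positions, which is why the schedule is a sequence of short continued-training runs rather than one expensive pre-training run.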

Why does this work? Recall from article 2 that RoPE encodes position by rotating query and key vectors, and the attention score between positions $m$ and $n$ depends on the rotation angle $\theta_i$ for each dimension $i$:

$$\theta_i = \frac{1}{\text{base}^{2i/d}}$$

where $\text{base}$ is typically 10,000 and $d$ is the head dimension. The relative position $(m - n)$ determines the rotation difference. When we scale $\text{base}$ to a larger value (e.g., 500,000), all the rotation frequencies decrease — the model rotates more slowly per position, so it can represent larger relative distances without the angles wrapping around chaotically. At the boundary: if $\text{base} \to \infty$, every $\theta_i$ with $i > 0$ tends to 0, meaning essentially no rotation at all (position information vanishes). If $\text{base} \to 0$, $\theta_i \to \infty$, and the rotations become so rapid that adjacent positions look unrelated. The standard value of 10,000 sits in a sweet spot for the original training length, and scaling it upward shifts that sweet spot to accommodate longer sequences.
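The effect of base scaling on the per-dimension frequencies is easy to check numerically. A small sketch, assuming a head dimension of $d = 128$ as in the text:

```python
import math

d = 128  # head dimension (assumed, matching the text)

def theta(i, base):
    # Per-dimension rotation frequency: theta_i = base^(-2i/d)
    return base ** (-2 * i / d)

# Larger base -> slower rotation at every dimension with i > 0.
for i in [0, 16, 32, 63]:
    print(f"i={i:>2}: theta(base=10k)={theta(i, 10_000):.2e}  "
          f"theta(base=500k)={theta(i, 500_000):.2e}")

# The slowest dimension's wavelength (2*pi / theta) bounds the largest
# relative distance representable before angles wrap around.
wavelength_small = 2 * math.pi / theta(63, 10_000)
wavelength_large = 2 * math.pi / theta(63, 500_000)
print(f"slowest wavelength: {wavelength_small:,.0f} vs {wavelength_large:,.0f} positions")
```

Note that $\theta_0 = 1$ regardless of the base: scaling slows down every dimension except the fastest one, stretching the longest wavelength from tens of thousands to millions of positions.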

But RoPE scaling alone doesn't solve the memory problem. For the 70B model, LLaMA 3 uses Grouped-Query Attention (GQA) with 8 KV heads shared across 64 query heads. This reduces the KV cache by a factor of $64/8 = 8\times$ compared to standard multi-head attention. Without GQA, the KV cache for 128K tokens at FP16 on a 70B model would be approximately:

$$\text{KV}_{\text{MHA}} = 2 \times L \times H \times d_k \times n \times 2 \text{ bytes}$$

With $L = 80$ layers, $H = 64$ heads, $d_k = 128$, and $n = 128{,}000$, that's $2 \times 80 \times 64 \times 128 \times 128{,}000 \times 2 \approx 335$ GB — far more than a single GPU can hold. GQA with 8 KV heads reduces this to $2 \times 80 \times 8 \times 128 \times 128{,}000 \times 2 \approx 42$ GB. At the boundaries: if we had just 1 KV head (multi-query attention), the cache drops to $\sim$5 GB but quality may suffer. If we use the full 64 KV heads (standard MHA), we get the best quality but cannot fit the cache in memory. 8 KV heads is the production compromise.
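The same calculation as a short script, sweeping the number of KV heads. This is a sketch using the 70B configuration quoted in the text; real deployments also add quantisation and paging on top.

```python
# KV cache size for a LLaMA-3-70B-like configuration at 128K context, FP16.
# Values from the text: 80 layers, d_k=128, n=128,000 tokens, 2 bytes/param.
L_layers, d_k, n, fp16_bytes = 80, 128, 128_000, 2

def kv_cache_gb(kv_heads):
    # Factor of 2 for the K and V tensors at every layer/head/position.
    return 2 * L_layers * kv_heads * d_k * n * fp16_bytes / 1e9

for name, heads in [("MHA (64 KV heads)", 64),
                    ("GQA  (8 KV heads)", 8),
                    ("MQA  (1 KV head) ", 1)]:
    print(f"{name}: {kv_cache_gb(heads):6.1f} GB")
```

The cache is linear in the number of KV heads, so the three rows reproduce the ~335 GB, ~42 GB, and ~5 GB figures from the boundary analysis above.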

Finally, LLaMA 3 uses FlashAttention during both training and inference. FlashAttention doesn't reduce the $O(n^2)$ FLOPs — it reduces the memory footprint from $O(n^2)$ to $O(n)$ by never materialising the full attention matrix, instead computing attention in tiles that fit in SRAM. This is what makes 128K contexts feasible on hardware that couldn't store a $128{,}000 \times 128{,}000$ attention matrix.
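The core trick, an online softmax computed tile by tile, can be sketched in NumPy. This is an educational single-head, non-causal version of the idea, not the fused GPU kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tile = 64, 16, 16
Q, K, V = rng.normal(size=(3, n, d))

def attention_full(Q, K, V):
    # Reference: materialise the full n x n score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (weights / weights.sum(axis=1, keepdims=True)) @ V

def attention_tiled(Q, K, V, tile):
    # FlashAttention-style pass: process K/V in tiles, keeping only a running
    # row max (m), running normaliser (s), and unnormalised accumulator (acc).
    n, d = Q.shape
    m = np.full(n, -np.inf)
    s = np.zeros(n)
    acc = np.zeros((n, d))
    for j in range(0, n, tile):
        scores = Q @ K[j:j+tile].T / np.sqrt(d)   # (n, tile) slice of QK^T
        m_new = np.maximum(m, scores.max(axis=1))
        rescale = np.exp(m - m_new)               # re-scale earlier tiles' state
        p = np.exp(scores - m_new[:, None])
        s = s * rescale + p.sum(axis=1)
        acc = acc * rescale[:, None] + p @ V[j:j+tile]
        m = m_new
    return acc / s[:, None]

max_err = float(np.abs(attention_tiled(Q, K, V, tile) - attention_full(Q, K, V)).max())
print(f"max difference vs full attention: {max_err:.2e}")
```

The tiled version never holds more than an $n \times \text{tile}$ slice of scores at once, yet produces the same output as the full computation — exactly the memory/FLOPs trade FlashAttention makes.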

💡 LLaMA 3's context extension recipe highlights a general pattern: pre-train at a short, cheap context length, then progressively extend. This is far more efficient than training at the target context length from scratch, because the model already knows the language — it just needs to learn to attend over longer distances.

Mistral and Mixtral: Sliding Window + RoPE

Where LLaMA 3 uses full attention at every layer, Mistral 7B (Jiang et al., 2023) takes a fundamentally different approach: sliding window attention (SWA) with a window size of $w = 4{,}096$. Each token only attends to the $w$ tokens immediately before it, not the entire sequence. This means the attention cost per layer is $O(n \cdot w)$ instead of $O(n^2)$ — linear in $n$ when $w$ is a fixed constant.

But doesn't this mean a token at position 50,000 has no direct access to a token at position 1,000? In a single layer, yes. But here's the key insight: in a multi-layer transformer, information propagates through layers. After layer 1, token 50,000 has attended to tokens 45,904–50,000. After layer 2, those tokens have themselves attended to their own windows, so token 50,000 has indirect access to tokens 41,808–50,000. After $L$ layers, the effective receptive field is:

$$\text{receptive field} = L \times w$$

For Mistral 7B with $L = 32$ layers and $w = 4{,}096$:

$$\text{receptive field} = 32 \times 4{,}096 = 131{,}072 \text{ tokens}$$

So a 7B model with only 4K local attention effectively "sees" 131K tokens through layer stacking. At the boundary: if $L = 1$ (single layer), the receptive field is just $w$ — the model is truly local with no long-range access. As $L$ grows, long-range information can propagate, though with increasing attenuation. The trade-off is that information from distant positions passes through many attention steps and may be lossy compared to direct full attention.
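The receptive-field arithmetic above can be checked in a few lines:

```python
# Effective receptive field of sliding-window attention grows by w per layer.
# Configuration from the text: Mistral 7B, w = 4,096, 32 layers.
w, n_layers = 4_096, 32

for layer in [1, 2, 8, 32]:
    print(f"after layer {layer:>2}: receptive field = {layer * w:,} tokens")

receptive_field = n_layers * w
print(f"total effective receptive field: {receptive_field:,} tokens")
```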

The KV cache implementation is elegantly simple: a rolling buffer (circular buffer) of size $w$. When the cache fills up, new entries overwrite the oldest ones. Position $t$ maps to cache index $t \mod w$. This means the KV cache uses a fixed $O(w \cdot d_k \cdot H \cdot L)$ memory regardless of sequence length — a token at position 1,000,000 uses the same cache memory as a token at position 5,000. Compare this to full attention, where the KV cache grows linearly with $n$:

import json, js

# Rolling buffer KV cache vs full KV cache
w = 4096       # sliding window size
d_k = 128      # head dimension
H = 8          # number of KV heads (Mistral 7B uses GQA with 8 KV heads)
L = 32         # layers
bytes_per_param = 2  # FP16

# Fixed rolling buffer size
rolling_bytes = w * d_k * H * L * 2 * bytes_per_param  # 2 for K+V
rolling_gb = rolling_bytes / 1e9

seq_lengths = [4096, 32768, 131072, 524288, 1048576]
labels = ["4K", "32K", "128K", "512K", "1M"]

rows = []
for n, label in zip(seq_lengths, labels):
    full_bytes = n * d_k * H * L * 2 * bytes_per_param
    full_gb = full_bytes / 1e9
    saving = (1 - rolling_bytes / full_bytes) * 100
    rows.append([
        label,
        f"{full_gb:.1f} GB",
        f"{rolling_gb:.1f} GB",
        f"{saving:.0f}%" if saving > 0 else "0%"
    ])

js.window.py_table_data = json.dumps({
    "headers": ["Seq Length", "Full KV Cache", "Rolling KV Cache", "Memory Saved"],
    "rows": rows
})

print(f"Config: w={w}, d_k={d_k}, H={H}, L={L}, FP16")
print(f"Rolling buffer is always {rolling_gb:.1f} GB regardless of sequence length")
print(f"At 1M tokens, this saves {rows[-1][3]} of KV cache memory")
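The rolling-buffer indexing itself can be sketched with a toy cache. This is a simplified stand-in: a real implementation stores K and V tensors per layer and head, but the $t \bmod w$ slot mapping is the same.

```python
# Toy rolling-buffer KV cache: position t maps to slot t % w, so new entries
# overwrite the oldest once the buffer is full.
class RollingKVCache:
    def __init__(self, w):
        self.w = w
        self.keys = [None] * w   # stand-ins for per-position K/V tensors

    def store(self, t, key):
        self.keys[t % self.w] = key

    def visible_positions(self, t):
        # Positions whose entries are still in the buffer when decoding token t.
        return list(range(max(0, t - self.w + 1), t + 1))

cache = RollingKVCache(w=4)
for t in range(10):
    cache.store(t, f"k{t}")

print(cache.keys)                  # ['k8', 'k9', 'k6', 'k7'] — slots wrap around
print(cache.visible_positions(9))  # [6, 7, 8, 9] — only the last w positions
```

Note that the buffer contents are rotated (position 8 lands in slot 0), which is harmless: attention only needs the set of the last $w$ entries, not their order in memory.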

Mixtral 8x7B (Jiang et al., 2024) extends this approach by combining the same sliding window attention pattern with a Mixture of Experts (MoE) architecture. Each layer has 8 expert feed-forward networks, and a router selects the top 2 for each token. The attention pattern stays local (same sliding window), but the model's total parameter count (47B) and capacity are much larger than its per-token active parameter count (13B). The attention mechanism handles the "where to look" question with sliding windows, while the MoE handles the "how much capacity" question with sparsely activated experts.
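Top-2 routing for a single token can be sketched as follows. Toy random matrices stand in for the trained router and expert networks here; Mixtral's actual experts are full SwiGLU feed-forward blocks.

```python
import numpy as np

# Sketch of Mixtral-style top-2 expert routing for one token: the router
# scores 8 experts, only the top 2 run, and their outputs are blended by
# router weights renormalised over the selected pair.
rng = np.random.default_rng(1)
n_experts, d = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy expert FFNs
router_w = rng.normal(size=(d, n_experts))                     # toy router

x = rng.normal(size=d)                 # one token's hidden state
logits = x @ router_w
top2 = np.argsort(logits)[-2:]         # indices of the 2 highest-scoring experts
gates = np.exp(logits[top2])
gates = gates / gates.sum()            # softmax over the selected experts only

out = sum(g * (x @ experts[i]) for g, i in zip(gates, top2))
print(f"active experts: {sorted(top2.tolist())}, gate weights: {gates.round(3)}")
```

Only 2 of the 8 expert matmuls execute per token, which is exactly why Mixtral's active parameter count (13B) is so much smaller than its total (47B).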

💡 Mistral's key lesson: you don't always need full attention. For many tasks, local attention windows with layer stacking provide sufficient effective receptive field at a fraction of the compute cost. The rolling KV cache is a bonus: fixed memory usage regardless of sequence length.

Gemini: Millions of Tokens

Google's Gemini 1.5 Pro (Gemini Team, 2024) represents the most extreme end of the context length spectrum: up to 10 million tokens of context in research settings, with 1–2 million tokens available in production. This is roughly 7,500 pages of text — an entire codebase, a full book series, or hours of video processed as token sequences.

The architectural details are proprietary, but from the paper and external analysis, several techniques are likely at play. First, Ring Attention or a similar distributed attention scheme likely spreads the computation across Google's TPU pods. Ring Attention partitions the sequence across devices, with each device computing attention for its chunk and passing KV blocks in a ring pattern. For a sequence of $n$ tokens distributed across $P$ devices, each device handles $n/P$ tokens and the per-device memory is $O(n^2/P)$ for the attention scores (or $O(n/P)$ with FlashAttention-style tiling). With $P = 256$ TPUs, a 10M-token sequence becomes 39K tokens per device — well within normal operating range.
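The ring schedule can be simulated directly. This toy models the communication pattern only, no attention math: $P$ devices each start with one KV chunk and rotate chunks one step per round.

```python
# Toy simulation of the Ring Attention communication schedule: P devices each
# hold one KV chunk and pass it to the next device every step. After P - 1
# rotations, every device has processed every chunk, so each can compute full
# attention for its local query chunk.
P = 4
kv_chunk = list(range(P))          # kv_chunk[d] = chunk currently on device d
seen = [{d} for d in range(P)]     # chunks each device has processed so far

for step in range(P - 1):
    # Rotate: device d receives the chunk previously held by device d - 1.
    kv_chunk = [kv_chunk[(d - 1) % P] for d in range(P)]
    for d in range(P):
        seen[d].add(kv_chunk[d])

print(seen)  # every device has now seen all P chunks

# Per-device sequence shard for the 10M-token / 256-TPU example in the text:
tokens_per_device = 10_000_000 // 256
print(f"10M tokens over 256 devices: {tokens_per_device:,} tokens each")
```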

Second, Gemini 1.5 is a Mixture-of-Experts model, which means that while it may have enormous total capacity, only a fraction of parameters are active per token. This reduces the per-token compute cost, freeing more of the hardware budget for the attention computation that scales with sequence length.

Third, the paper demonstrates remarkable performance on the needle-in-a-haystack evaluation. Gemini 1.5 Pro achieves near-perfect retrieval (>99.7% accuracy) of facts placed at arbitrary positions within contexts of up to 10M tokens. This suggests that whatever attention pattern and training recipe Google uses, it has substantially solved the "lost in the middle" problem described in article 1 — at least for simple factual retrieval tasks.

But this capability comes at enormous cost. Processing 10 million tokens through even an efficient transformer requires staggering compute. If we estimate the attention FLOPs for a single layer with $d_k = 128$ and $H = 16$ (a conservative configuration):

$$\text{FLOPs}_{\text{attn}} = 2 \times H \times n^2 \times d_k = 2 \times 16 \times (10^7)^2 \times 128 \approx 4.1 \times 10^{17}$$

That's $4.1 \times 10^{17}$ FLOPs for one layer of attention alone. A model with 64 layers would need $\sim 2.6 \times 10^{19}$ FLOPs just for attention, a budget that demands a large accelerator cluster to serve at acceptable latency. At the boundary: for $n = 1{,}000$ (a normal prompt), this same formula gives $\sim 4.1 \times 10^{9}$ FLOPs per layer — eight orders of magnitude less. The quadratic wall doesn't disappear; it just gets pushed very far with enough hardware and engineering.
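The formula can be evaluated directly, with the configuration quoted above ($H = 16$, $d_k = 128$):

```python
# Attention FLOPs per layer for the QK^T matmul: 2 * H * n^2 * d_k.
H, d_k = 16, 128

def attn_flops(n):
    return 2 * H * n**2 * d_k

long_ctx = attn_flops(10_000_000)   # 10M-token context
short_ctx = attn_flops(1_000)       # ordinary 1K prompt
print(f"10M tokens: {long_ctx:.1e} FLOPs per layer")
print(f" 1K tokens: {short_ctx:.1e} FLOPs per layer")
print(f"ratio: {long_ctx / short_ctx:.0e}")   # (10^7 / 10^3)^2 = 10^8
```

The ratio is the square of the length ratio — the signature of the quadratic wall.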

💡 Gemini's key lesson: with sufficient hardware (hundreds of TPUs) and engineering (Ring Attention, MoE, distributed KV cache), the quadratic attention wall can be pushed to millions of tokens. But "can be pushed" is not the same as "is cheap" — this level of context processing is available only to organisations with massive compute budgets.

The Full Stack

Now we can see how all the techniques from this track compose into a modern long-context system. No single technique solves the problem. Instead, production models layer multiple optimisations, each targeting a different bottleneck. Here is the full stack:

  • 1. Position encoding: RoPE with base frequency scaling or YaRN-style NTK-aware interpolation. This is what lets the model represent positions beyond its training length without collapsing. Nearly universal in modern open models.
  • 2. Attention pattern: full attention with FlashAttention (LLaMA, Gemini) or sliding window attention (Mistral). FlashAttention is an IO optimisation that keeps the FLOPs at $O(n^2)$ but avoids the $O(n^2)$ memory cost. Sliding window reduces FLOPs to $O(n \cdot w)$ but trades off direct long-range attention.
  • 3. KV cache management: GQA (fewer KV heads) + quantised cache (INT8 or INT4 values) + PagedAttention (virtual memory for KV blocks). These three together can reduce KV cache memory by 16–32$\times$ compared to naive full-precision multi-head attention.
  • 4. Memory (optional): compressive memory mechanisms like Infini-Attention or Titans for conversations that exceed even the extended context window. These store a fixed-size compressed state that summarises arbitrarily long history. Still experimental at production scale.
  • 5. Retrieval (optional): RAG for knowledge that lives outside the context entirely — enterprise documents, up-to-date web content, private databases. The context window handles the working memory; retrieval handles the long-term knowledge store.
  • 6. Distribution: tensor parallelism across GPUs for the model weights, and Ring Attention or sequence parallelism for distributing the sequence itself across devices. Essential for very long sequences (1M+ tokens) where even a single layer's KV cache exceeds one device's memory.
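As a quick sanity check on the cache-management numbers in the stack above, the individual factors simply multiply. The GQA ratio comes from the LLaMA 3 section; the INT8/INT4 factors are the standard byte-width ratios relative to FP16.

```python
# Composing KV cache optimisations: the savings factors multiply.
gqa_factor = 64 / 8    # 8x fewer KV heads (64 query heads -> 8 KV heads)
int8_factor = 2        # FP16 (2 bytes) -> INT8 (1 byte)
int4_factor = 4        # FP16 (2 bytes) -> INT4 (0.5 bytes)

print(f"GQA + INT8 cache: {gqa_factor * int8_factor:.0f}x smaller")  # 16x
print(f"GQA + INT4 cache: {gqa_factor * int4_factor:.0f}x smaller")  # 32x
```

These two products are the 16–32$\times$ range quoted in the list; PagedAttention adds utilisation gains on top rather than a fixed multiplicative factor.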

The table below summarises what each major model uses across these six dimensions:

import json, js

rows = [
    ["LLaMA 3 70B", "128K", "RoPE (scaled base)", "Full + FlashAttn", "GQA (8 KV heads)", "Progressive (8K->128K)"],
    ["LLaMA 3 405B", "128K", "RoPE (scaled base)", "Full + FlashAttn", "GQA (8 KV heads)", "Progressive (8K->128K)"],
    ["Mistral 7B", "32K*", "RoPE", "Sliding window (w=4K)", "GQA + Rolling buffer", "Standard training"],
    ["Mixtral 8x7B", "32K*", "RoPE", "Sliding window (w=4K)", "GQA + Rolling buffer", "Standard training"],
    ["Gemini 1.5 Pro", "1-10M", "Likely RoPE variant", "Full + Ring Attention", "Likely GQA + distributed", "Proprietary recipe"],
    ["GPT-4 Turbo", "128K", "Unknown", "Full + FlashAttn (likely)", "Unknown", "Unknown"],
    ["Claude 3.5", "200K", "Unknown", "Full (likely)", "Unknown", "Unknown"],
]

js.window.py_table_data = json.dumps({
    "headers": ["Model", "Context", "Position Enc.", "Attention", "KV Cache", "Training Recipe"],
    "rows": rows
})

print("* Mistral/Mixtral: 32K sequence length, but effective receptive field of ~131K via layer stacking")
print()
print("Key pattern: every model uses at least 3 of the 6 stack components.")
print("Open models (LLaMA, Mistral) document their choices; proprietary models (GPT-4, Claude, Gemini) are inferred.")

The key insight is that no single technique solves long context. Production systems stack 4–6 optimisations, each targeting a different bottleneck. RoPE scaling solves the position encoding problem but doesn't reduce compute. FlashAttention solves the memory problem but doesn't reduce FLOPs. GQA shrinks the KV cache but doesn't help with attention compute. Sliding window attention reduces compute but limits direct long-range access. Ring Attention distributes the workload but requires multi-device setups. Each technique covers one gap; the combination covers them all.

Open Problems

Despite the impressive progress — from 1K tokens in 2019 to millions in 2024 — several hard problems remain unsolved.

Lost in the middle. As discussed in article 1, Liu et al. (2023) showed that models struggle to retrieve information placed in the middle of long contexts. While some models have improved on needle-in-a-haystack benchmarks (Gemini 1.5 reports >99.7% retrieval across all positions), performance on more complex tasks — multi-hop reasoning, synthesis across distant passages, resolving contradictions between early and late context — remains uneven. The "lost in the middle" problem may be largely solved for simple retrieval but is still open for reasoning-heavy tasks that require integrating information across the full context.

Efficiency at scale. Processing 1 million tokens is technically possible, but it is expensive. Even with FlashAttention and GQA, a forward pass over 1M tokens on a 70B model requires hundreds of TeraFLOPs of attention compute per layer and tens of gigabytes of KV cache. For batch serving (many users simultaneously), these costs multiply. Making million-token inference cheap enough for routine use — not just impressive demos — remains a core engineering challenge. Techniques like speculative decoding, KV cache compression, and dynamic context pruning are active research directions.

Memory quality. Compressive memory systems like Infini-Attention (Google, 2024) and Titans (Google, 2025) offer a theoretically elegant path: store a fixed-size compressed state that grows sublinearly with context length. But these systems face a quality-compression trade-off. A compressed memory that stores $c$ numbers to represent $n$ tokens of dimension $d$ necessarily discards information — the compression ratio is $nd/c$, and as this ratio grows, retrieval of specific facts from the compressed state degrades. At the boundary: if $c = n \cdot d$ (ratio 1), there is no compression and we are back to storing the full history. If $c = d$ (ratio $n$), we're cramming the entire history into a single vector. Finding the right operating point on this curve, and training models to use compressed memories effectively, remains an open problem. No compressive memory system has been demonstrated at production scale as of early 2025.

Evaluation. How do we even measure long-context quality? The needle-in-a-haystack test is the current standard, but it's too simple — it tests single-fact retrieval, not the complex multi-document reasoning that real use cases demand. Consider the tasks that actually benefit from long context: understanding an entire codebase to fix a bug, synthesising information across hundreds of legal documents, or following a multi-turn conversation that spans weeks. These require the model to integrate information across multiple passages, track entities over long distances, and prioritise relevant details from a sea of noise. We lack standardised benchmarks that measure these capabilities reliably.

The field is evolving fast. Techniques that seem experimental today — neural memory modules, mixture-of-depths architectures, hardware-aware sparse attention — may be standard within a year. What remains constant is the layered approach: production long-context systems will continue to combine multiple orthogonal techniques, because the problem has multiple orthogonal bottlenecks. The position encoding, the attention pattern, the memory management, the training recipe, and the distribution strategy each address a different constraint, and progress on any one of them raises the ceiling for all the others.

Quiz

Test your understanding of how production models combine long-context techniques.

LLaMA 3 was pre-trained at 8K context and extended to 128K. Why is this progressive approach preferred over training at 128K from the start?

Mistral 7B uses sliding window attention with $w = 4{,}096$ and 32 layers. What is its effective receptive field, and why?

LLaMA 3 70B uses GQA with 8 KV heads instead of the full 64. What is the primary benefit for long-context inference?

Which of the following is NOT an open problem in long-context research as of early 2025?