Long Context vs Retrieval: Do We Still Need RAG?

Context windows have grown from 1,024 tokens (GPT-2, 2019) to over 1 million tokens (Gemini 1.5 Pro, 2024). If a model can ingest an entire codebase or a full book in one shot, why bother with a retrieval pipeline at all? Can't we just dump everything into the context and let the model sort it out?

The case for long context alone is compelling on the surface. The pipeline is simpler: no chunking, no embedding, no index, no retrieval errors. The model sees everything, so it can't miss a relevant passage because a retriever failed to surface it. And there's no information loss from chunking boundaries — the model has the full document with all cross-references intact.

But even 1 million tokens has limits. A million tokens is roughly 750,000 words — a long book, but a tiny fraction of a corporate knowledge base (which may contain millions of documents), a legal archive, or the entire internet. Beyond scale, there are three practical problems. First, cost: processing 1M tokens through a transformer's attention layers is expensive. As we covered in article 1, attention cost scales as $O(n^2)$, so processing 1M tokens costs roughly $1{,}000{,}000^2 = 10^{12}$ attention scores per head per layer. Second, latency: filling a 1M-token context window takes seconds to minutes, while a retrieval lookup takes milliseconds. Third, quality: as we saw with the "Lost in the Middle" phenomenon (article 1), models struggle to attend uniformly across very long contexts. Relevant information buried at position 500,000 may be effectively invisible.

Retrieval solves all three. Instead of processing the entire corpus, we retrieve just the 5-20 most relevant passages (perhaps 2,000-10,000 tokens total) and place them in the context. The model processes a small, targeted context rather than a massive unfocused one. The cost drops by orders of magnitude, latency is minimal, and every token the model sees is likely relevant.

The practical answer, confirmed empirically, is that they complement each other. Xu et al. (2024) studied this directly in "Retrieval meets Long Context Large Language Models" and found that retrieval-augmented generation consistently improves performance even when the model has a long enough context window to fit all the documents. Long context handles working memory — the conversation so far, the document you're analysing, the code you're editing. Retrieval handles the library — the vast corpus of knowledge you might need to consult but would never load entirely into working memory.

💡 Think of it like human cognition. You don't memorise every book you've ever read (that would be stuffing everything into your context window). Instead, you remember enough to know which book to pull off the shelf (retrieval), then read the relevant chapter carefully (long context). The combination is more effective than either alone.

Retrieval-Augmented Attention

Traditional RAG prepends retrieved documents to the prompt: the retriever finds relevant chunks, we concatenate them into the context, and the model generates a response conditioned on that augmented prompt. This works, but it consumes context window tokens. If we retrieve 10 passages of 500 tokens each, that's 5,000 tokens of context budget spent before the user's question even appears. What if we could integrate retrieval directly into the attention mechanism itself?
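
To make the token-budget point concrete, here's a minimal sketch of traditional RAG prompt assembly (the helper name and the crude word-based token estimate are illustrative assumptions, not any particular framework's API):

```python
# Minimal sketch of traditional RAG prompt assembly (hypothetical helper,
# not a specific framework's API). Retrieved passages are concatenated
# ahead of the question and consume context-window budget.

def build_rag_prompt(question, passages, max_context_tokens=8_000):
    """Prepend retrieved passages to the question within a token budget."""
    def est_tokens(text):
        # Crude estimate: ~0.75 words per token
        return int(len(text.split()) / 0.75)

    budget = max_context_tokens - est_tokens(question)
    kept = []
    for p in passages:
        if est_tokens(p) > budget:
            break  # passage would overflow the remaining budget
        kept.append(p)
        budget -= est_tokens(p)

    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(kept))
    return f"Context:\n{context}\n\nQuestion: {question}"

passages = ["RETRO retrieves chunks from an external database."] * 10
prompt = build_rag_prompt("How does RETRO scale?", passages)
print(prompt.splitlines()[0])   # Context:
```

Every retained passage shrinks the budget left for the conversation itself — which is exactly the cost RETRO's design avoids.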

That's the idea behind RETRO (Retrieval-Enhanced Transformer), introduced by Borgeaud et al. (2022). Instead of prepending retrieved text to the input, RETRO retrieves relevant chunks from an external database and injects them into the model via cross-attention layers interleaved every few transformer blocks. The retrieved chunks never enter the main context window — they're attended to through a separate pathway.

Here's how the architecture works. The input sequence is split into chunks of $m$ tokens (typically $m = 64$). For each chunk, a frozen BERT encoder computes a representation, and a $k$-nearest-neighbour lookup retrieves the top-$k$ most similar chunks from a precomputed database. Each retrieved chunk is encoded by a separate encoder stack (the "retrieval encoder"), producing key-value pairs. Then, every few layers in the main transformer, a chunked cross-attention (CCA) layer lets each input chunk attend to its corresponding retrieved neighbours:

$$\text{CCA}(H, E) = \text{softmax}\!\left(\frac{H \, W_Q \;(E \, W_K)^T}{\sqrt{d_k}}\right) E \, W_V$$

where $H \in \mathbb{R}^{m \times d}$ is the hidden state for one input chunk (the $m$ tokens in that chunk) and $E \in \mathbb{R}^{(k \cdot r) \times d}$ contains the encoded representations of all $k$ retrieved neighbours (each of length $r$ tokens). The queries come from the input chunk; the keys and values come from the retrieved chunks. This means each input token attends only to its chunk's retrieved neighbours, not to the entire database.

Let's check the dimensions. $H \, W_Q$ has shape $m \times d_k$, and $(E \, W_K)^T$ has shape $d_k \times (k \cdot r)$. Their product is $m \times (k \cdot r)$ — each of the $m$ input tokens produces a score against each of the $k \cdot r$ retrieved tokens. With $m = 64$, $k = 2$ neighbours, and $r = 64$ tokens per neighbour, the attention matrix is $64 \times 128 = 8{,}192$ entries. Compare that to standard self-attention over a 1M-token sequence: $10^{12}$ entries. The cross-attention cost is negligible.
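
The same dimension check can be run as a shape-level NumPy sketch (random weights, so it illustrates shapes and cost only, not trained behaviour):

```python
import numpy as np

# Shape-level sketch of RETRO's chunked cross-attention (CCA).
# Random weights: this demonstrates dimensions, not a trained model.
rng = np.random.default_rng(0)

m, r, k = 64, 64, 2      # chunk length, neighbour length, neighbours per chunk
d, d_k = 512, 64         # model dim, projection dim

H = rng.standard_normal((m, d))       # hidden states for one input chunk
E = rng.standard_normal((k * r, d))   # encoded retrieved neighbours

W_Q = rng.standard_normal((d, d_k)) / np.sqrt(d)
W_K = rng.standard_normal((d, d_k)) / np.sqrt(d)
W_V = rng.standard_normal((d, d_k)) / np.sqrt(d)

scores = (H @ W_Q) @ (E @ W_K).T / np.sqrt(d_k)    # shape (m, k*r) = (64, 128)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
out = weights @ (E @ W_V)                          # shape (m, d_k) = (64, 64)

print(scores.shape, out.shape)   # (64, 128) (64, 64)
print(scores.size)               # 8192
```

With $m = 64$, $k = 2$, $r = 64$ this reproduces the $64 \times 128 = 8{,}192$-entry attention matrix computed above — negligible next to the $10^{12}$ entries of full self-attention over 1M tokens.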

The key insight is scale. The RETRO database in the original paper contained 1.75 trillion tokens (the MassiveText dataset from DeepMind). That's more than a million times larger than even a 1M-token context window. The model accesses this knowledge at a cost proportional to the number of retrieved chunks, not the database size. A 7.5B parameter RETRO model matched the performance of a 25x larger model (175B parameters) on certain knowledge-intensive benchmarks — it effectively traded parameters for retrieval.

How does this differ from Memorizing Transformers (article 4)? Memorizing Transformers retrieve from the model's own past context — earlier tokens from the same sequence that have fallen out of the local attention window. RETRO retrieves from an external database of pre-indexed documents that may have nothing to do with the current input. Memorizing Transformers extend the model's memory of what it has already seen; RETRO gives the model access to knowledge it has never seen in the current context.

💡 RETRO's chunked cross-attention is reminiscent of how an encoder-decoder model (like the original Transformer for machine translation) uses cross-attention to attend to the source sentence. The difference is that RETRO's "source" is a dynamically retrieved set of database chunks, different for each input chunk.

When to Retrieve vs When to Extend Context

Given that long context and retrieval are complementary, when should you use which? The decision depends on the nature of the knowledge, the task, and the cost constraints. Here's a practical framework.

  • Static knowledge base (company docs, manuals, legal archives) — use retrieval. The corpus is too large for any context window, and the relevant subset changes per query. A dense retrieval pipeline with reranking surfaces the right 5-20 passages per query from millions of documents.
  • Multi-turn conversation — use long context. The conversation history IS the context. You want the model to remember everything said so far, and the history is typically well within context window limits (most conversations are under 10K tokens).
  • Document analysis (read a 100-page PDF and answer questions) — use long context if it fits, retrieval if it doesn't. A 100-page PDF is roughly 40,000 tokens, well within a 128K window. A 10,000-page regulatory archive is not.
  • Real-time information — use retrieval. The model's weights are frozen at training time and its context window starts empty each session. Today's news, live stock prices, or current weather data must be retrieved from external sources.
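
The framework above can be condensed into a simple routing sketch (the window limit and category labels are illustrative assumptions, not a production heuristic):

```python
# Hypothetical routing sketch for the retrieve-vs-extend decision.
# The 128K window limit and the labels are illustrative only.

def choose_strategy(corpus_tokens, needs_fresh_data, window_limit=128_000):
    """Pick a context strategy for one request."""
    if needs_fresh_data:
        return "retrieval"       # real-time facts live outside the weights
    if corpus_tokens <= window_limit:
        return "long_context"    # e.g. a 100-page PDF (~40K tokens)
    return "hybrid"              # retrieve top passages into the long context

print(choose_strategy(40_000, needs_fresh_data=False))      # long_context
print(choose_strategy(5_000_000, needs_fresh_data=False))   # hybrid
print(choose_strategy(40_000, needs_fresh_data=True))       # retrieval
```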

The hybrid approach combines both: retrieve the most relevant chunks from a large corpus, place them in the long context window alongside the user's query and conversation history, and let the model attend to everything. This gives you the precision of retrieval (finding the right needle) with the comprehension of long context (understanding the needle in context). Most production systems today use exactly this pattern.

The cost difference is substantial. Let's compare processing 100,000 tokens of raw context (stuffing a corpus into the window) versus retrieving 5 relevant passages of 500 tokens each (2,500 tokens total).

import json, js

# Compare cost of long-context vs retrieval-augmented approach
# Using d_k=128, H=32 heads as representative config

d_k = 128
H = 32

scenarios = [
    ("Long context (100K tokens)", 100_000),
    ("Retrieved context (5 x 500 tokens)", 2_500),
]

rows = []
for label, n in scenarios:
    # Attention scores per head per layer
    attn_scores = n * n
    # Attention FLOPs per layer (all heads): 2 * H * n^2 * d_k
    attn_flops = 2 * H * (n ** 2) * d_k

    if attn_scores >= 1e12:
        scores_str = f"{attn_scores / 1e12:.1f} T"
    elif attn_scores >= 1e9:
        scores_str = f"{attn_scores / 1e9:.1f} B"
    elif attn_scores >= 1e6:
        scores_str = f"{attn_scores / 1e6:.1f} M"
    else:
        scores_str = f"{attn_scores / 1e3:.1f} K"

    if attn_flops >= 1e15:
        flops_str = f"{attn_flops / 1e15:.1f} PetaFLOPs"
    elif attn_flops >= 1e12:
        flops_str = f"{attn_flops / 1e12:.1f} TeraFLOPs"
    elif attn_flops >= 1e9:
        flops_str = f"{attn_flops / 1e9:.1f} GigaFLOPs"
    else:
        flops_str = f"{attn_flops / 1e6:.1f} MegaFLOPs"

    rows.append([label, f"{n:,}", scores_str, flops_str])

# Compute the ratio
ratio = (100_000 ** 2) / (2_500 ** 2)

js.window.py_table_data = json.dumps({
    "headers": ["Approach", "Tokens (n)", "Attn Scores (n\u00b2)", "Attn FLOPs/layer"],
    "rows": rows
})

print(f"Attention cost ratio: {ratio:,.0f}x")
print(f"The long-context approach computes {ratio:,.0f} times more attention scores")
print(f"per head per layer than the retrieval approach.")
print()
print("The retrieval approach also adds a search step (~10ms for ANN lookup),")
print("but this is negligible compared to the attention savings.")

The numbers are stark: 10 billion attention scores versus 6.25 million — a 1,600x reduction. In practice, the savings are even larger because the retrieval approach also reduces KV cache memory (linear in $n$: $100{,}000 \times c$ vs $2{,}500 \times c$, a 40x reduction), prefill latency, and API costs (most providers charge per token). The trade-off is that retrieval might miss relevant information that a full-context approach would have seen — but as we've discussed, full-context models have their own recall problems ("Lost in the Middle"), so this trade-off is less severe than it first appears.
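
The KV-cache claim is easy to verify with back-of-envelope arithmetic, assuming a representative configuration (32 layers, 32 heads of dimension 128, fp16 keys and values — the exact figures vary by model):

```python
# Back-of-envelope KV-cache memory for the two approaches.
# Config values are representative assumptions, not a specific model.
layers, heads, d_k, bytes_per_value = 32, 32, 128, 2   # fp16

def kv_cache_bytes(n_tokens):
    # K and V tensors (factor 2), per layer and head, each n_tokens x d_k
    return 2 * layers * heads * n_tokens * d_k * bytes_per_value

long_ctx = kv_cache_bytes(100_000)
retrieved = kv_cache_bytes(2_500)

print(f"Long context (100K tokens): {long_ctx / 1e9:.1f} GB")
print(f"Retrieved (2.5K tokens)   : {retrieved / 1e9:.2f} GB")
print(f"Reduction                 : {long_ctx // retrieved}x")  # 40x, linear in n
```

The reduction is linear (40x) rather than quadratic (1,600x) because the cache stores one key and one value per token, not one score per token pair.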

Self-Retrieval: Models That Decide When to Search

Standard RAG always retrieves. Every query, regardless of difficulty, triggers a database search. But sometimes the model already knows the answer — "What is the capital of France?" doesn't need a retrieval step. Unnecessary retrieval adds latency (the search plus encoding time), can introduce noise (irrelevant passages that confuse the model), and wastes compute. What if the model itself could decide when retrieval is needed?

Asai et al. (2024) introduced Self-RAG (Self-Reflective Retrieval-Augmented Generation), a framework where the model learns to generate special reflection tokens that control the retrieval and generation process. The model doesn't just generate text — it generates metadata about whether it needs help.

Self-RAG uses four types of reflection tokens:

  • [Retrieve]: should the model retrieve external knowledge for this segment? Values: yes, no, or continue (keep generating without retrieval).
  • [IsRel]: is the retrieved passage relevant to the query? Values: relevant or irrelevant. If irrelevant, the passage is discarded.
  • [IsSup]: is the generated response supported by the retrieved passage? Values: fully supported, partially supported, or not supported.
  • [IsUse]: is the overall response useful? A final quality check on the complete generation.

The flow works like this. The model begins generating a response. At each segment boundary, it emits a [Retrieve] token. If the value is "yes", the system pauses generation, retrieves relevant passages, and the model evaluates each with [IsRel]. For relevant passages, the model continues generating conditioned on them, then self-evaluates with [IsSup] to check if its output is actually grounded in the evidence. This creates a generate-retrieve-critique loop that runs only when needed.
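
The generate-retrieve-critique loop can be sketched in outline (every helper below is a hypothetical stub standing in for a model or retriever call — this shows the control flow only, not the Self-RAG implementation):

```python
# Control-flow sketch of Self-RAG's generate-retrieve-critique loop.
# All helpers are hypothetical stubs for model/retriever calls.

def generate_segment(query, evidence=None):
    # Stub LLM call: emits text plus a [Retrieve] decision token.
    return {"text": "<segment>", "retrieve": "no" if evidence else "yes"}

def retrieve(query):
    return ["passage A", "passage B"]        # stub retriever

def is_relevant(passage, query):
    return passage == "passage A"            # stub [IsRel] critic

def self_rag_answer(query, max_segments=3):
    segments = []
    for _ in range(max_segments):
        seg = generate_segment(query)
        if seg["retrieve"] == "yes":         # model asked for help
            evidence = [p for p in retrieve(query) if is_relevant(p, query)]
            seg = generate_segment(query, evidence=evidence)  # grounded retry
            # a real system would also check [IsSup] against `evidence` here
        segments.append(seg["text"])
    return " ".join(segments)

answer = self_rag_answer("What were Q3 revenues?")
print(answer)
```

The key structural point: retrieval happens inside the generation loop, gated by the model's own [Retrieve] decision, rather than once up front.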

A related approach is Corrective RAG (CRAG), which takes a slightly different angle: it always retrieves, but then evaluates whether the retrieval was useful before incorporating it. If the retrieved documents are judged irrelevant, CRAG falls back to web search or generates without retrieval context. If they're ambiguous, it refines the query and retrieves again. The key innovation is the retrieval evaluator — a lightweight model that scores retrieved documents for relevance before they ever reach the generator.
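
CRAG's three-way routing can be sketched the same way (the numeric thresholds are made-up placeholders; the actual system uses a trained lightweight evaluator):

```python
# Illustrative sketch of CRAG-style routing on an evaluator's relevance
# score. The 0.7/0.3 thresholds are invented for this example.

def crag_route(relevance_score, high=0.7, low=0.3):
    """Map an evaluator score to one of three corrective actions."""
    if relevance_score >= high:
        return "use_retrieved"       # documents judged relevant
    if relevance_score <= low:
        return "web_search"          # judged irrelevant: fall back
    return "refine_and_retry"        # ambiguous: rewrite query, retrieve again

for score in (0.9, 0.5, 0.1):
    print(score, "->", crag_route(score))
```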

The broader trend here is significant: models are becoming retrieval-aware. Rather than treating retrieval as a fixed preprocessing step (always retrieve, always use what you get), the model itself decides when to retrieve, whether the retrieval is useful, and how much to trust the retrieved evidence versus its own parametric knowledge. This moves RAG from a rigid pipeline to an adaptive system where retrieval is one tool among many that the model can invoke as needed.

import json, js

# Simplified illustration of Self-RAG decision flow
# Shows how different query types trigger different retrieval decisions

queries = [
    ("What is the capital of France?", "no",
     "Model knows this from training — no retrieval needed"),
    ("What were Q3 2025 revenue figures for Acme Corp?", "yes",
     "Specific, time-sensitive fact — retrieval required"),
    ("Explain how gradient descent works", "no",
     "General knowledge well-covered in training data"),
    ("What did the CEO say in yesterday's earnings call?", "yes",
     "Real-time information — must retrieve"),
]

print("Self-RAG [Retrieve] Decision Examples")

table_rows = []
for query, decision, reason in queries:
    action = "RETRIEVE" if decision == "yes" else "GENERATE"
    table_rows.append([query, action, reason])

js.window.py_table_data = json.dumps({
    "headers": ["Query", "Decision", "Reason"],
    "rows": table_rows
})

💡 Self-RAG was trained using a critic model (GPT-4) to generate reflection token labels on training data. The Self-RAG model then learned to produce these tokens itself during inference. On benchmarks like PopQA and biographical generation, Self-RAG outperformed both standard RAG (which always retrieves) and vanilla LLMs (which never retrieve), showing that adaptive retrieval beats both extremes.

The Convergence of RAG and Long Context

We're witnessing a convergence between what were once two separate paradigms. Long context and retrieval are no longer competing approaches — they're merging into unified systems that blur the boundary between "what's in context" and "what's retrieved".

This convergence is happening along several fronts. First, long-context models that also retrieve. Gemini can search the web mid-conversation, seamlessly mixing its existing context (conversation history, uploaded documents) with freshly retrieved information. The user doesn't distinguish between "the model read this in its context" and "the model just looked this up" — and increasingly, the model itself doesn't either.

Second, RAG systems that leverage long-context models. Traditional RAG retrieved 3-5 short passages because older models had limited context windows. With 128K+ windows, RAG systems now retrieve 20-50 passages and let the long-context model process them all, reducing the chance that the retriever's ranking error causes the system to miss the right answer. The retriever no longer needs to be perfect — it just needs to include the right passage somewhere in a generous candidate set, and the model's long-context attention handles the rest.

Third, memory-augmented transformers (article 4) blur the line between context and retrieval entirely. Memorizing Transformers store past key-value pairs in an external memory and retrieve them via $k$-nearest-neighbour lookup during attention. Is that "extending the context window" or "retrieval"? It's both — the mechanism is retrieval (approximate nearest-neighbour search), but what's being retrieved is the model's own prior context (past key-value states), making it feel like a longer context window. RETRO goes further: it retrieves from an external database of 1.75 trillion tokens, making retrieval indistinguishable from having a truly enormous context.

The trajectory suggests a future where the distinction between "context length" and "retrieval" dissolves. The model will seamlessly access information whether it's in the current conversation (short-term context), in a compressed memory buffer (medium-term memory, as in Titans or Infini-Attention), or in an external database (long-term knowledge, as in RETRO or RAG). The user simply asks a question, and the system routes to the right knowledge source — local attention for recent context, memory lookup for earlier context, database retrieval for external knowledge — all within a single forward pass.

This convergence resolves what seemed like a fundamental tension. Long context gives the model breadth (it can see a lot of information at once), while retrieval gives it depth (it can access the right information from an arbitrarily large corpus). Combined, the model gets both — a broad working context augmented by targeted retrieval from a knowledge base that may be orders of magnitude larger than any context window.

The final article in this track examines how production models combine all these approaches — position encodings (article 2), efficient attention patterns (article 3), external memory (article 4), and retrieval-augmented context (this article) — into the long-context systems deployed at scale today.

Quiz

Test your understanding of how retrieval and long context interact.

In the RETRO architecture, how are retrieved chunks integrated into the model?

According to Xu et al. (2024), what happens when retrieval is combined with a long-context model that could already fit all the documents?

What distinguishes Self-RAG from standard RAG?

For a static corporate knowledge base with millions of documents, which approach is most appropriate?