The Knowledge Problem
A language model's knowledge is frozen at training time. Ask one about last week's news and, without retrieval or internet access, it cannot answer: its training data was likely collected months earlier. Ask about your company's internal documentation and it has almost certainly never seen those pages. Everything the model knows is encoded in its weights, and those weights stopped updating when training ended.
This causes two compounding problems. The first is staleness: the world changes faster than models are retrained. The second is hallucination: asked about something outside its training data, a model can generate a confident, fluent, and entirely wrong answer. During pre-training the model was incentivised to predict a plausible continuation for every sentence it saw, but not incentivised enough to say "I don't know" when the plausible answer happens to be wrong.
So how do we give a model access to knowledge that wasn't in its training data, or that has changed since? Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) adds a retrieval step before generation. Instead of relying solely on its weights, the model first retrieves relevant documents from an external corpus, then generates an answer grounded in those documents. The knowledge lives in a corpus that can be swapped or updated without retraining.
How Do We Find the Right Documents?
If we want to look up relevant documents before answering, we need a way to decide which documents in a corpus are actually relevant to a given query. More formally, given a query $q$ and a corpus $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ of $N$ documents, we need a retrieval function $\mathcal{R}(q, \mathcal{D})$ that scores each document and returns the $k$ most relevant (with $k$ a hyperparameter):

$$\mathcal{R}(q, \mathcal{D}) = \operatorname*{top\text{-}k}_{d \in \mathcal{D}} \, \text{score}(q, d)$$
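The brute-force version of this ranking is simple enough to sketch directly: score every document, sort, keep the best $k$. The `word_overlap` scorer below is a toy stand-in for the real scoring functions developed later in the track.

```python
import numpy as np

def retrieve(score, query, docs, k=3):
    """Brute-force retrieval: score every document, return the k best.

    `score` is any function score(q, d) -> float; this loop is the
    reference implementation that later articles replace with faster ones.
    """
    scores = np.array([score(query, d) for d in docs])
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [(docs[i], scores[i]) for i in top]

# Toy scorer: count distinct words shared by query and document.
def word_overlap(q, d):
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = [
    "the cat sat on the mat",
    "retrieval augmented generation",
    "the dog sat on the log",
]
print(retrieve(word_overlap, "where did the cat sit", docs, k=2))
```

Note that "where did the cat sit" ranks the cat document first on shared words ("the", "cat") even though "sit" and "sat" don't match, a weakness the lexical methods in the next article address head-on.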
This entire track is about building better versions of $\text{score}(q, d)$, starting simple and building up:
- Sparse / lexical methods: in article 2, we start by scoring based on word overlap (TF-IDF, BM25). Fast and interpretable, but only matches exact words.
- Dense / semantic methods: in articles 3–4, we move on to encoding queries and documents as vectors with neural networks ($\text{score}(q, d) = E_Q(q)^\top E_D(d)$). Captures paraphrases and meaning, but requires training data and more compute.
- Hybrid: in article 5, we explore combining sparse and dense scores to get the benefits of both.
- Reranking: finally in article 6, we reorder a small candidate set by using a more expensive model with full query-document attention.
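The dense score in the list above is just a dot product between two vectors. Here is a minimal sketch of the mechanics with made-up 4-dimensional embeddings; in a real system these numbers would come from trained encoders $E_Q$ and $E_D$, not be written by hand.

```python
import numpy as np

# Pretend outputs of a query encoder E_Q and a document encoder E_D.
# The values are invented purely to illustrate the scoring mechanics.
q_vec = np.array([0.9, 0.1, 0.0, 0.4])   # E_Q("feline on a rug")
doc_vecs = np.array([
    [0.8, 0.2, 0.1, 0.5],                # E_D("the cat sat on the mat")
    [0.0, 0.9, 0.8, 0.1],                # E_D("quarterly earnings report")
])

scores = doc_vecs @ q_vec                # score(q, d) = E_Q(q)^T E_D(d)
print(scores)                            # one score per document; higher = more relevant
```

The query "feline on a rug" shares zero words with "the cat sat on the mat", yet a good encoder places them near each other in vector space, so the dot product is high. That is precisely what lexical methods cannot do.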
There's a scale problem, though. With $N = 10^7$ documents, we can't compute $\text{score}(q, d)$ for every $d$ at query time. We need ways to narrow down candidates cheaply before applying expensive scoring, which means computing and storing document representations ahead of time (precomputation), organising them for fast lookup (indexing), and using approximate nearest-neighbour algorithms that check only a fraction of the corpus per query, all of which we will cover in article 7.
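To put rough numbers on the scale problem, here is some back-of-the-envelope arithmetic, assuming 768-dimensional float32 embeddings (a common but here purely illustrative choice):

```python
N, dim = 10_000_000, 768            # corpus size, embedding dimension (assumed)
bytes_per_vec = dim * 4             # float32 = 4 bytes per component

index_size_gb = N * bytes_per_vec / 1e9
flops_per_query = N * 2 * dim       # one dot product (mul + add) per document

print(f"raw vectors: {index_size_gb:.1f} GB")         # ~30.7 GB before any index overhead
print(f"brute-force FLOPs per query: {flops_per_query:.2e}")
```

Tens of gigabytes of vectors and tens of billions of floating-point operations per query is why approximate nearest-neighbour indexes, which probe only a small fraction of the corpus, are not optional at this scale.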
Why Not Just Fine-Tune?
If models struggle with factual knowledge, why not just fine-tune them on the right documents? It works in some cases, but runs into practical problems that RAG avoids:
- Catastrophic forgetting: fine-tuning on new documents pulls the weights toward the new examples and away from everything not in the current batch. Without careful replay sampling (mixing old and new data throughout training), the model degrades on previously learned knowledge. At scale, this is expensive to manage.
- Frequency requirements: facts need to appear many times across the training corpus to be reliably encoded into weights. A single source document, however important, rarely sticks.
- Costly iteration: retraining or fine-tuning is slow and expensive. RAG updates knowledge by swapping the corpus, sidestepping a new training run and everything it entails (hyperparameter search, evaluation, redeployment). For fast-moving domains, this difference is decisive.
- Attribution: retrieved documents are explicit context, so we can show users exactly which passage grounded the answer. Purely parametric models, however, have no mechanism for this (there is no source passage to point to).
In practice, RAG tends to work best for facts, while fine-tuning works best for skills. A new output format, a domain-specific reasoning style, or a tool API are behavioural changes that benefit from fine-tuning (habits to instill, not facts to retrieve). The best production systems tend to combine both: a fine-tuned model that knows how to reason, paired with a RAG pipeline that provides what to reason about.
How a RAG Pipeline Fits Together
Putting it all together, a RAG system has two phases that run at very different speeds:
- Offline (index time): chunk the corpus into passages, encode each passage into a vector, and store the result in an index. This happens once (updated as the corpus changes) and can afford to be slow (seconds per document).
- Online (query time): encode the user's query, search the index for the top-$k$ nearest passages via approximate search, assemble the context, and generate. This phase needs to be fast, since it runs on every user request.
This split exists because document encoding is expensive but only needs to happen once, while query encoding must be cheap because it happens on every request. The tradeoff is that if we update documents but don't re-index them, the index goes out of sync (returning stale or missing passages until we rebuild or incrementally update it, a practical challenge covered in article 7).
In code, the two phases look like this:
# ── Offline (run once, update on corpus changes) ──────────────────────
chunks = chunk_documents(corpus) # split docs into passages
vectors = encoder.encode(chunks) # one vector per passage
index.add(vectors, metadata=chunks) # store in ANN index (HNSW etc.)
# ── Online (per query, must be fast) ──────────────────────────────────
q_vec = encoder.encode(query) # encode user query
results = index.search(q_vec, top_k=20) # ANN search
# Optional: rerank, filter, deduplicate
results = reranker.rerank(query, results, top_k=5)
context = format_context(results) # build prompt context
answer = llm.generate(query, context) # grounded generation
Of note, look at the `index.search(q_vec, top_k=20)` call in the online phase above. How does the index decide which 20 passages are closest? That depends entirely on how we scored them. The next article starts with the simplest scoring idea (counting word overlaps) and builds up to TF-IDF and BM25, the formulas behind virtually every search engine before neural retrieval took over.
Quiz
Test your understanding of the retrieval-augmented generation setup.
What is the core limitation that RAG addresses that fine-tuning cannot?
In the retrieval task, what does top-k refer to?
Why does a RAG system split into offline and online phases?
When would you choose fine-tuning over RAG?