The Knowledge Problem
A language model's knowledge is frozen at training time. Ask one about last week's news and, without retrieval or internet access, it cannot answer: its training data was likely collected months earlier. Ask about your company's internal documentation and it has almost certainly never seen those pages. Everything the model knows is encoded in its weights, and those weights stopped updating when training ended.
This causes two compounding problems. The first is staleness: the world changes faster than models are retrained. The second is hallucination: asked about something outside its training data, a model can generate a confident, fluent, and entirely wrong answer. During pre-training the model was incentivised to predict a plausible continuation for every sentence it saw, but not incentivised enough to say "I don't know" when the plausible answer happens to be wrong.
So how do we give a model access to knowledge that wasn't in its training data, or that has changed since? Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) adds a retrieval step before generation. Instead of relying solely on its weights, the model first retrieves relevant documents from an external corpus, then generates an answer grounded in those documents. The knowledge lives in a corpus that can be swapped or updated without retraining.
How Do We Find the Right Documents?
If we want to look up relevant documents before answering, we need a way to decide which documents in a corpus are actually relevant to a given query. More formally, given a query $q$ and a corpus $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ of $N$ documents, we need a retrieval function $\mathcal{R}(q, \mathcal{D})$ that scores each document and returns the $k$ most relevant (with $k$ a hyperparameter):

$$\mathcal{R}(q, \mathcal{D}) = \operatorname*{top\text{-}k}_{d \in \mathcal{D}} \, \text{score}(q, d)$$
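The brute-force version of this ranking is simple enough to sketch directly: score every document, sort, keep the best $k$. The `word_overlap` scorer below is a toy stand-in for the real scoring functions developed later in the track.

```python
import numpy as np

def retrieve(score, query, docs, k=3):
    """Brute-force retrieval: score every document, return the k best.

    `score` is any function score(q, d) -> float; this loop is the
    reference implementation that later articles replace with faster ones.
    """
    scores = np.array([score(query, d) for d in docs])
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    return [(docs[i], scores[i]) for i in top]

# Toy scorer: count distinct words shared by query and document.
def word_overlap(q, d):
    return len(set(q.lower().split()) & set(d.lower().split()))

docs = [
    "the cat sat on the mat",
    "retrieval augmented generation",
    "the dog sat on the log",
]
print(retrieve(word_overlap, "where did the cat sit", docs, k=2))
```

Note that "where did the cat sit" ranks the cat document first on shared words ("the", "cat") even though "sit" and "sat" don't match, a weakness the lexical methods in the next article address head-on.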
This entire track is about building better versions of $\text{score}(q, d)$, starting simple and building up:
- Sparse / lexical methods: in article 2, we start by scoring based on word overlap (TF-IDF, BM25). Fast and interpretable, but only matches exact words.
- Dense / semantic methods: in articles 3–4, we move on to encoding queries and documents as vectors with neural networks ($\text{score}(q, d) = E_Q(q)^\top E_D(d)$). Captures paraphrases and meaning, but requires training data and more compute.
- Hybrid: in article 5, we explore combining sparse and dense scores to get the benefits of both.
- Reranking: finally in article 6, we reorder a small candidate set by using a more expensive model with full query-document attention.
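The dense score in the list above is just a dot product between two vectors. Here is a minimal sketch of the mechanics with made-up 4-dimensional embeddings; in a real system these numbers would come from trained encoders $E_Q$ and $E_D$, not be written by hand.

```python
import numpy as np

# Pretend outputs of a query encoder E_Q and a document encoder E_D.
# The values are invented purely to illustrate the scoring mechanics.
q_vec = np.array([0.9, 0.1, 0.0, 0.4])   # E_Q("feline on a rug")
doc_vecs = np.array([
    [0.8, 0.2, 0.1, 0.5],                # E_D("the cat sat on the mat")
    [0.0, 0.9, 0.8, 0.1],                # E_D("quarterly earnings report")
])

scores = doc_vecs @ q_vec                # score(q, d) = E_Q(q)^T E_D(d)
print(scores)                            # one score per document; higher = more relevant
```

The query "feline on a rug" shares zero words with "the cat sat on the mat", yet a good encoder places them near each other in vector space, so the dot product is high. That is precisely what lexical methods cannot do.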
There's a scale problem, though. With $N = 10^7$ documents, we can't compute $\text{score}(q, d)$ for every $d$ at query time. We need ways to narrow down candidates cheaply before applying expensive scoring, which means computing and storing document representations ahead of time (precomputation), organising them for fast lookup (indexing), and using approximate nearest-neighbour algorithms that check only a fraction of the corpus per query, all of which we will cover in article 7.
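To put rough numbers on the scale problem, here is some back-of-the-envelope arithmetic, assuming 768-dimensional float32 embeddings (a common but here purely illustrative choice):

```python
N, dim = 10_000_000, 768            # corpus size, embedding dimension (assumed)
bytes_per_vec = dim * 4             # float32 = 4 bytes per component

index_size_gb = N * bytes_per_vec / 1e9
flops_per_query = N * 2 * dim       # one dot product (mul + add) per document

print(f"raw vectors: {index_size_gb:.1f} GB")         # ~30.7 GB before any index overhead
print(f"brute-force FLOPs per query: {flops_per_query:.2e}")
```

Tens of gigabytes of vectors and tens of billions of floating-point operations per query is why approximate nearest-neighbour indexes, which probe only a small fraction of the corpus, are not optional at this scale.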
Why Not Just Fine-Tune?
If models struggle with factual knowledge, why not just fine-tune them on the right documents? It works in some cases, but runs into practical problems that RAG avoids:
- Catastrophic forgetting: fine-tuning on new documents pulls the weights toward the new examples and away from everything not in the current batch. Without careful replay sampling (mixing old and new data throughout training), the model degrades on previously learned knowledge. At scale, this is expensive to manage.
- Frequency requirements: facts need to appear many times across the training corpus to be reliably encoded into weights. A single source document, however important, rarely sticks.
- Costly iteration: retraining or fine-tuning is slow and expensive. RAG updates knowledge by swapping the corpus, sidestepping a new training run and everything it entails (hyperparameter search, evaluation, redeployment). For fast-moving domains, this difference is decisive.
- Attribution: retrieved documents are explicit context, so we can show users exactly which passage grounded the answer. Purely parametric models, however, have no mechanism for this (there is no source passage to point to).
In practice, RAG tends to work best for facts, while fine-tuning works best for skills. A new output format, a domain-specific reasoning style, or a tool API are behavioural changes that benefit from fine-tuning (habits to instill, not facts to retrieve). The best production systems tend to combine both: a fine-tuned model that knows how to reason, paired with a RAG pipeline that provides what to reason about.
How a RAG Pipeline Fits Together
Putting it all together, a RAG system has two phases that run at very different speeds:
- Offline (index time): chunk the corpus into passages, encode each passage into a vector, and store the result in an index. This happens once (updated as the corpus changes) and can afford to be slow (seconds per document).
- Online (query time): encode the user's query, search the index for the top-$k$ nearest passages via approximate search, assemble the context, and generate. This phase needs to be fast, since it runs on every user request.
This split exists because document encoding is expensive but only needs to happen once, while query encoding must be cheap because it happens on every request. The tradeoff is that if we update documents but don't re-index them, the index goes out of sync (returning stale or missing passages until we rebuild or incrementally update it, a practical challenge covered in article 7).
In code, the two phases look like this:
# ── Offline (run once, update on corpus changes) ──────────────────────
chunks = chunk_documents(corpus) # split docs into passages
vectors = encoder.encode(chunks) # one vector per passage
index.add(vectors, metadata=chunks) # store in ANN index (HNSW etc.)
# ── Online (per query, must be fast) ──────────────────────────────────
q_vec = encoder.encode(query) # encode user query
results = index.search(q_vec, top_k=20) # ANN search
# Optional: rerank, filter, deduplicate
results = reranker.rerank(query, results, top_k=5)
context = format_context(results) # build prompt context
answer = llm.generate(query, context) # grounded generation
Of note, look at the `index.search(q_vec, top_k=20)` call in the online phase above. How does the index decide which 20 passages are closest? That depends entirely on how we scored them. The next article starts with the simplest scoring idea (counting word overlaps) and builds up to TF-IDF and BM25, the formulas behind virtually every search engine before neural retrieval took over.
Quiz
Test your understanding of the retrieval-augmented generation setup.
What is the core limitation that RAG addresses that fine-tuning cannot?
In the retrieval task, what does top-k refer to?
Why does a RAG system split into offline and online phases?
When would you choose fine-tuning over RAG?