The Knowledge Problem
A language model's knowledge is frozen at training time. Ask one about last week's news and, without retrieval or internet access, it most likely can't answer: its training data was probably collected months ago. Ask about your company's internal documentation and it has almost certainly never seen those pages. Everything the model knows is encoded in its weights, and those weights stopped updating when training ended.
This causes two compounding problems: staleness (the world changes faster than models are retrained) and hallucination (asked about something outside its training data, a model can generate a confident, fluent, yet entirely wrong answer). The hallucination problem is rooted in pre-training: the model is incentivised to predict a plausible continuation for every sentence it sees, but never strongly incentivised to say "I don't know" when the plausible answer happens to be wrong.
So how do we give a model access to knowledge that wasn't in its training data, or that has changed since? Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) adds a retrieval step before generation. Instead of relying solely on its weights, the model first retrieves relevant documents from an external corpus, then generates an answer grounded in those documents. The knowledge lives in a corpus that can be swapped or updated without retraining.
How Do We Find the Right Documents?
If we want to look up relevant documents before answering, we need a way to figure out which documents in a corpus are actually relevant to a given query. More formally, given a query $q$ and a corpus $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$ of $N$ documents, we need a scoring function

$$\text{score}(q, d) \in \mathbb{R}$$

that assigns higher values to more relevant documents, so that ranking $\mathcal{D}$ by $\text{score}(q, d_i)$ puts the best answers first.
This entire track is about building better versions of $\text{score}(q, d)$, starting simple and building up:
- Sparse / lexical (article 2): score based on word overlap (TF-IDF, BM25). Fast and interpretable, but only matches exact words.
- Dense / semantic (articles 3–4): encode queries and documents as vectors with neural networks ($\text{score}(q, d) = E_Q(q)^\top E_D(d)$). Captures paraphrases and meaning, but requires training data and more compute.
- Hybrid (article 5): combine sparse and dense scores to get the benefits of both.
- Reranking (article 6): use a more expensive model to rescore a small candidate set with full query-document attention.
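The dense-scoring idea from the list above can be sketched in a few lines of plain Python. The vectors below are made up by hand purely for illustration; in a real system they would come from trained neural encoders $E_Q$ and $E_D$ (articles 3–4). Ranking is then just one dot product per document:

```python
# Toy illustration of dense scoring: score(q, d) = E_Q(q)^T E_D(d).
# All vectors here are hand-picked for illustration, not real embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q_vec = [0.8, 0.1, 0.3, 0.0]  # imagined E_Q("how does dense retrieval work")

doc_vecs = {
    "dense retrieval doc": [0.9, 0.0, 0.2, 0.1],  # similar direction -> high score
    "BM25 doc":            [0.1, 0.9, 0.0, 0.2],  # different topic   -> low score
    "transformers doc":    [0.3, 0.2, 0.8, 0.0],
}

ranked = sorted(doc_vecs.items(), key=lambda kv: dot(q_vec, kv[1]), reverse=True)
for name, vec in ranked:
    print(f"{name:20s} score={dot(q_vec, vec):.2f}")
# dense retrieval doc ranks first (score 0.78)
```

The point of the geometric view: documents whose vectors point in a similar direction to the query vector score highly, whether or not they share any exact words with it.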
There's a scale problem, though. With $N = 10^7$ documents, we can't compute $\text{score}(q, d)$ for every $d$ at query time. We need ways to narrow down candidates cheaply before applying expensive scoring, which means computing and storing document representations ahead of time (precomputation), organising them for fast lookup (indexing), and using approximate nearest-neighbour algorithms that check only a fraction of the corpus per query (all covered in article 7).
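A quick back-of-envelope calculation shows why brute force fails at this scale. Assuming (for illustration) 768-dimensional embeddings, which is a common size, exhaustive dot-product scoring costs:

```python
# Back-of-envelope: cost of brute-force dense scoring over a large corpus.
# dim = 768 is an assumption (a common embedding size); adjust for your model.
N = 10**7    # documents in the corpus
dim = 768    # embedding dimensionality (assumed)

mults_per_query = N * dim        # one dot product per document
bytes_fp32 = N * dim * 4         # storing every vector as float32

print(f"{mults_per_query:.1e} multiply-adds per query")   # 7.7e+09
print(f"{bytes_fp32 / 1e9:.1f} GB of float32 vectors")    # 30.7 GB
```

Billions of multiply-adds per query is far too slow for interactive search, which is exactly why the precomputation, indexing, and approximate-search machinery exists.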
Let's see what this looks like concretely. The code below scores a five-sentence corpus against a query using bag-of-words TF-IDF. By article 9 we'll be training the neural encoder that replaces it, but for now this is our starting point.
```python
import math
from collections import Counter

corpus = [
    "BERT is a bidirectional transformer pre-trained on masked language modelling",
    "BM25 is a probabilistic term-weighting retrieval model",
    "Dense retrieval encodes queries and documents as vectors in a shared space",
    "Transformers use attention mechanisms to model token dependencies",
    "Retrieval-Augmented Generation grounds language model outputs in external documents",
]
query = "how does dense retrieval work"

def tokenize(text):
    return text.lower().split()

def tfidf_score(query_tokens, doc_tokens, all_docs_tokens):
    N = len(all_docs_tokens)
    tf = Counter(doc_tokens)
    score = 0.0
    for t in set(query_tokens):
        df = sum(1 for d in all_docs_tokens if t in d)
        if df == 0:
            continue  # query term appears in no document
        idf = math.log((N + 1) / (df + 0.5))
        score += (tf[t] / len(doc_tokens)) * idf
    return score

all_tok = [tokenize(d) for d in corpus]
q_tok = tokenize(query)
scores = [(tfidf_score(q_tok, dt, all_tok), corpus[i]) for i, dt in enumerate(all_tok)]
scores.sort(reverse=True)

print(f"Query: '{query}'\n")
for rank, (s, doc) in enumerate(scores, 1):
    snippet = doc[:57].ljust(57)
    print(f"  #{rank}  score={s:.4f} | {snippet}...")
```
Why Not Just Fine-Tune?
If models struggle with factual knowledge, why not just fine-tune them on the right documents? It works in some cases, but runs into practical problems that RAG avoids:
- Catastrophic forgetting: fine-tuning on new documents pulls the weights toward the new examples and away from everything not in the current batch. Without careful replay sampling (mixing old and new data throughout training), the model degrades on previously learned knowledge. At scale, this is expensive to manage.
- Frequency requirements: facts need to appear many times across the training corpus to be reliably encoded into weights. A single source document, however important, rarely sticks. Internet-scale models recall popular facts precisely because those facts appeared in millions of web pages.
- Costly iteration: retraining or fine-tuning is slow and expensive. RAG lets us update knowledge by swapping the corpus (no training run, no hyperparameter tuning, no evaluation cycle). For fast-moving domains, this difference is decisive.
- Attribution: retrieved documents are explicit context, so we can show users exactly which passage grounded the answer. Purely parametric models have no mechanism for this (there is no source passage to point to).
In practice, RAG tends to work best for facts, while fine-tuning works best for skills. Teaching a model a new output format, a domain-specific reasoning style, or a tool API are behavioural changes that benefit from fine-tuning (habits to instil, not facts to retrieve). The best production systems tend to combine both: a fine-tuned model that knows *how* to reason, paired with a RAG pipeline that provides the *what*.
How a RAG Pipeline Fits Together
A RAG system has two phases that run at very different speeds:
- Offline (index time): chunk the corpus into passages, encode each passage into a vector, and store the result in an index. This happens once (updated as the corpus changes) and can afford to be slow (seconds per document).
- Online (query time): encode the user's query (milliseconds), search the index for the top-$k$ nearest passages (milliseconds via approximate search), assemble the context, and generate. The whole retrieval step typically needs to finish in well under 100 ms to keep end-to-end latency acceptable.
This split exists because document encoding is expensive but only needs to happen once, while query encoding must be cheap because it happens on every request. The tradeoff is that if we update documents but don't re-index them, the index goes out of sync (returning stale or missing passages until we rebuild or incrementally update it, a practical challenge covered in article 7).
Putting this together in code, the two phases look like this:
```python
# ── Offline (run once, update on corpus changes) ──────────────────────
chunks = chunk_documents(corpus)            # split docs into passages
vectors = encoder.encode(chunks)            # one vector per passage
index.add(vectors, metadata=chunks)         # store in ANN index (HNSW etc.)

# ── Online (per query, must be fast) ──────────────────────────────────
q_vec = encoder.encode(query)               # encode user query
results = index.search(q_vec, top_k=20)     # ANN search
# Optional: rerank, filter, deduplicate
results = reranker.rerank(query, results, top_k=5)
context = format_context(results)           # build prompt context
answer = llm.generate(query, context)       # grounded generation
```
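The schematic above uses placeholder names (`encoder`, `index`, `reranker`, and `llm` are hypothetical objects, not a real library's API). A minimal runnable sketch of the same two phases, with a toy word-count encoder standing in for a trained neural encoder and a brute-force scan standing in for a real ANN index, might look like this:

```python
# A minimal, runnable sketch of the offline/online split. Everything here
# is a toy stand-in: the "encoder" just counts vocabulary words (a real
# system uses a trained neural encoder), and the "index" is a brute-force
# scan (a real system uses an ANN index such as HNSW).

corpus = [
    "Dense retrieval encodes queries and documents as vectors",
    "BM25 is a probabilistic term-weighting retrieval model",
    "Transformers use attention to model token dependencies",
]

# -- Offline phase: build the encoder's vocabulary, encode, and index --
vocab = sorted({w for passage in corpus for w in passage.lower().split()})

def toy_encode(text):
    toks = text.lower().split()
    return [float(toks.count(w)) for w in vocab]

class BruteForceIndex:
    def __init__(self):
        self.vectors, self.passages = [], []

    def add(self, vectors, metadata):
        self.vectors.extend(vectors)
        self.passages.extend(metadata)

    def search(self, q_vec, top_k):
        def dot(u, v):
            return sum(a * b for a, b in zip(u, v))
        scored = sorted(zip((dot(q_vec, v) for v in self.vectors), self.passages),
                        reverse=True)
        return [p for _, p in scored[:top_k]]

index = BruteForceIndex()
index.add([toy_encode(p) for p in corpus], metadata=corpus)

# -- Online phase: encode the query and search -------------------------
query = "dense retrieval with vectors"
results = index.search(toy_encode(query), top_k=2)
print(results[0])  # Dense retrieval encodes queries and documents as vectors
```

Swapping the toy parts for real ones (a neural encoder, an ANN index, a reranker, a generator) changes the components but not the shape of the pipeline.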
Look at the `index.search(q_vec, top_k=20)` call in the schematic above. How does the index decide which 20 passages are closest? That depends entirely on how we scored them. The next article starts with the simplest scoring idea (counting word overlaps) and builds up to TF-IDF and BM25, the formulas behind virtually every search engine before neural retrieval took over.
Quiz
Test your understanding of the retrieval-augmented generation setup.
What is the core limitation that RAG addresses that fine-tuning cannot?
In the retrieval task, what does top-k refer to?
Why does a RAG system split into offline and online phases?
When would you choose fine-tuning over RAG?