Why Does the Simple Version Break?
At this point we have all the pieces (chunking, indexing, retrieval, and reranking), and the simplest way to connect them is to embed the user query, retrieve the top-$k$ chunks, concatenate them into the LLM context, and generate. This naive pipeline works well as a baseline, but it fails in predictable ways that become obvious once we look at real queries.
First, the query itself can be ambiguous or poorly phrased. "Tell me about the model" could mean a machine learning model, a fashion model, or a scale model, so the retriever returns a scattered mix of chunks and the LLM hallucinates a coherent answer from incoherent context. Second, the answer sometimes requires synthesis across multiple documents. A query like "Compare the performance of BM25 and ColBERT on BEIR" needs information from different papers combined together, and a single top-$k$ retrieval step may return only one side of the comparison. Third, the retrieved context can be irrelevant or contradictory, yet the LLM uses it anyway and produces a confident wrong answer.
Production RAG systems add steps before, during, and after generation to address each of these failures. The rest of this article walks through those steps.
Can We Fix the Query Before Searching?
The first failure (ambiguous or poorly phrased queries) suggests an obvious fix: transform the raw query before it ever reaches the retriever. There are several ways to do this, each trading a bit of latency for better recall.
Query rewriting uses an LLM to rephrase the query into a form that better matches the style of documents in the corpus. When users ask conversational questions but the documents are formal, a rewriter bridges the gap (for example, turning "what's the deal with attention?" into "How does the self-attention mechanism work in transformer models?").
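As a concrete sketch, a rewriter can be a single LLM call with a fixed instruction. The `call_llm` function below is a stand-in for whatever completion API the system uses, not a specific library call:

```python
# Hypothetical query-rewriting step; call_llm is a placeholder for an LLM API.
REWRITE_PROMPT = (
    "Rewrite the following user question so it matches the formal style of "
    "technical documentation. Return only the rewritten question.\n\n"
    "Question: {query}"
)

def rewrite_query(query, call_llm):
    """Return a corpus-style rephrasing of a conversational query."""
    return call_llm(REWRITE_PROMPT.format(query=query)).strip()

# Demo with a stub LLM that returns a canned rewrite:
stub_llm = lambda prompt: "How does the self-attention mechanism work in transformer models?"
print(rewrite_query("what's the deal with attention?", stub_llm))
```

In production the stub would be replaced by a real model call, ideally with a low temperature so the rewrite stays faithful to the user's intent.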
Query expansion takes a different angle. Instead of rewriting one query, we generate $n$ query variants and union or fuse the retrieval results, which tends to improve recall at the cost of $n\times$ retrieval latency. A related technique is Step-Back Prompting (Zheng et al., 2023), which asks an LLM to generate a more general "step-back" question alongside the specific one and retrieves for both.
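One common way to fuse the $n$ result lists is reciprocal rank fusion (RRF), which rewards documents that rank well in several lists. A minimal sketch (the document IDs and the choice of $k=60$ are illustrative):

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc IDs into one ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Each appearance contributes 1 / (k + rank); the constant k damps
            # the advantage of a single very high rank in one list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Results from two query variants; d1 and d2 appear in both lists,
# so they end up at the top of the fused ranking.
fused = rrf_fuse([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
print(fused)
```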
Hypothetical Document Embeddings (HyDE) (Gao et al., 2022) addresses query-document mismatch from yet another direction. Rather than embedding the query directly, we ask the LLM to generate a hypothetical document that would answer the query, then embed that document and use it as the query vector. The reason this helps is that a hypothetical answer document tends to be closer in embedding space to real answer documents than the original question is, since it shares vocabulary and structure with the corpus.
HyDE tends to help most when queries are very short (single keywords) or stylistically far from the documents in the corpus. The tradeoff is one extra LLM call before retrieval, which can add a few hundred milliseconds of latency, so it is worth measuring whether the recall gain justifies the cost for a given use case.
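The HyDE control flow is simple enough to sketch in a few lines. The `generate_doc`, `embed`, and `index_search` callables below are stand-ins for the LLM, the embedding model, and the vector index, not real library APIs:

```python
def hyde_search(query, generate_doc, embed, index_search, top_k=5):
    # 1. Ask the LLM for a short hypothetical answer document.
    hypothetical = generate_doc(f"Write a short passage that answers: {query}")
    # 2. Embed the hypothetical document instead of the raw query.
    vector = embed(hypothetical)
    # 3. Search the index with the document-style embedding.
    return index_search(vector, top_k)

# Demo with toy stand-ins:
fake_generate = lambda prompt: "Self-attention computes weighted sums of token representations."
fake_embed = lambda text: [float(len(text))]  # placeholder embedding
fake_search = lambda vec, k: ["doc-1", "doc-7"][:k]
print(hyde_search("How does attention work?", fake_generate, fake_embed, fake_search, top_k=2))
```

The only change from the naive pipeline is step 2: the query vector comes from a generated document rather than from the query text itself.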
What if One Retriever Is Not Enough?
Not all queries need the same retrieval strategy. A production system might have multiple data sources (a vector store for semantic search over documentation, a SQL database for structured product data, a BM25 index over customer support tickets), and a router implemented as an LLM classifier or a small fine-tuned model can direct each query to the appropriate source.
For complex queries, a single source often is not enough, so the system retrieves from multiple sources in parallel and fuses the results. The router may not choose one source exclusively but instead assign weights or instruct the fusion layer to draw from specific sources.
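A minimal router can be a single classification call with a safe default when the label is unrecognised. The source names below are illustrative, and `classify` stands in for an LLM or small fine-tuned classifier:

```python
# Hypothetical routing step; source names and classifier are illustrative.
SOURCES = {"docs_vector", "product_sql", "tickets_bm25"}

def route(query, classify):
    """Send the query to one source; fall back to the vector store."""
    label = classify(query)
    return label if label in SOURCES else "docs_vector"

# Demo with a keyword-based stub classifier:
stub_classify = lambda q: "product_sql" if "price" in q else "docs_vector"
print(route("What is the price of the Pro plan?", stub_classify))
print(route("Explain the retry policy.", stub_classify))
```

A weighted or multi-source router would return a distribution over `SOURCES` instead of a single label, but the validation-plus-default shape stays the same.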
Multi-step retrieval tackles the synthesis problem we identified earlier (queries that need information from several documents). The LLM breaks the query into sub-questions, retrieves for each, reads the results, and then decides whether to retrieve more or to generate. This is the core pattern behind ReAct (Reason + Act) agents (Yao et al., 2022), where the model interleaves reasoning steps with retrieval actions to build up context iteratively.
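The loop behind this pattern can be sketched as follows. Here `decompose`, `retrieve`, and `needs_more` are stand-ins for LLM and index calls, with `needs_more` returning follow-up sub-questions or nothing when the context is sufficient:

```python
def multi_step_retrieve(query, decompose, retrieve, needs_more, max_steps=3):
    """Iteratively retrieve until the LLM decides it has enough context."""
    context = []
    sub_questions = decompose(query)  # break the query into sub-questions
    for _ in range(max_steps):
        for sq in sub_questions:
            context.extend(retrieve(sq))  # retrieve for each sub-question
        follow_up = needs_more(query, context)  # LLM: anything still missing?
        if not follow_up:
            break
        sub_questions = follow_up  # pursue the follow-up questions next round
    return context

# Demo with stubs for the comparison query from earlier:
decompose = lambda q: ["BM25 results on BEIR", "ColBERT results on BEIR"]
retrieve = lambda sq: [sq + " passage"]
needs_more = lambda q, ctx: None  # pretend one round is enough
print(multi_step_retrieve("Compare BM25 and ColBERT on BEIR",
                          decompose, retrieve, needs_more))
```

The `max_steps` cap matters in practice: without it, a model that keeps asking for more context can loop indefinitely.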
How Should We Assemble the Context?
Once we have the retrieved chunks, we need to assemble them into a prompt, and several decisions here can make or break answer quality.
- Chunk ordering matters because LLMs tend to be less influenced by information in the middle of a long context than by information at the beginning or end (Liu et al., 2023). Placing the most relevant chunks first helps mitigate this "lost in the middle" effect.
- Overlapping chunks from nearby passages can add nearly identical text to the context, so we should deduplicate by embedding similarity before inserting them into the prompt.
- Including source metadata (document title, section, URL) alongside each chunk helps the LLM attribute claims and lets users trace answers back to their sources.
- Even with 128K-token context windows, long contexts tend to slow generation and dilute attention. In practice, many teams find that around 5 to 10 chunks of roughly 500 tokens each is a reasonable starting point, though the optimal number depends on the task.
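The ordering and deduplication points above can be combined into one small assembly step. In this sketch, `similarity` is a stand-in for cosine similarity between chunk embeddings, and the 0.95 threshold and cap of 8 chunks are illustrative defaults:

```python
def assemble_context(chunks, similarity, sim_threshold=0.95, max_chunks=8):
    """chunks: list of (text, relevance_score) pairs, in any order."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    kept = []
    for text, score in ranked:
        # Drop chunks that nearly duplicate something already kept.
        if all(similarity(text, prev) < sim_threshold for prev, _ in kept):
            kept.append((text, score))
        if len(kept) == max_chunks:
            break
    return [text for text, _ in kept]  # most relevant chunks first

# Demo with an exact-match "similarity" for clarity:
sim = lambda a, b: 1.0 if a == b else 0.0
print(assemble_context([("alpha", 0.9), ("alpha", 0.8), ("beta", 0.7)], sim))
```

Because the list stays sorted by relevance, the strongest evidence lands at the start of the prompt, where the "lost in the middle" effect is weakest.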
The system prompt should instruct the LLM to answer based on the provided context, to say "I don't know" when the context does not contain the answer, and to cite sources. Without these instructions, the LLM tends to fall back on its parametric knowledge to fill retrieval gaps, which can produce confident but wrong answers.
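A template along these lines (the exact wording and the bracketed citation format are illustrative, not canonical) might look like:

```python
SYSTEM_PROMPT = """\
Answer the question using only the context below.
- If the context does not contain the answer, say "I don't know."
- Cite the source ID in brackets after each claim, e.g. [doc-3].

Context:
{context}

Question: {question}
"""

def build_prompt(chunks, question):
    """chunks: list of dicts with 'id' and 'text' keys."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return SYSTEM_PROMPT.format(context=context, question=question)

print(build_prompt([{"id": "doc-1", "text": "RRF merges ranked lists."}],
                   "How are results fused?"))
```

Tagging each chunk with its source ID in the prompt is what makes the citation instruction enforceable: the model can only cite IDs it was actually shown.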
What if the Answer Is Wrong?
Even after careful query optimisation and context assembly, the generated answer can still be wrong. A self-reflection step addresses this by asking the LLM (or a separate critic model) to evaluate its own output. Is this answer grounded in the retrieved context? Is the context actually relevant? If not, should we retrieve again with a different query?
SELF-RAG (Asai et al., 2023) formalises this idea by training a model with special reflection tokens. During generation, the model emits tokens like [Retrieve], [IsRel], [IsSup], and [IsUse] to signal when it needs to retrieve, whether the retrieved document is relevant, whether a claim is supported, and whether the generation is useful. Because the model is trained end-to-end to emit these tokens, the reflection is baked into generation rather than bolted on as a separate step.
Corrective RAG (CRAG) (Yan et al., 2024) takes a lighter-weight approach by adding an evaluator that scores the relevance of retrieved documents. If all retrieved documents score below a threshold, CRAG triggers a web search to fetch fresh documents; if only some are relevant, it decomposes and filters the retrieved set before generation. This works well in practice because retrieval quality varies from query to query, and detecting low-quality retrieval is much cheaper than always falling back to web search.
The following pseudocode shows how these ideas fit together in a retrieve-read-reflect loop. On each iteration the system retrieves, filters for relevance, generates, and then checks whether the answer is grounded before deciding to return it or try again with a rewritten query.
```python
# Skeleton of a self-reflective RAG pipeline
# (pseudocode — illustrates the control flow)
def rag_with_reflection(query, retriever, llm, max_iterations=3):
    context = []
    answer = None
    for iteration in range(max_iterations):
        # Optionally rewrite the query on subsequent iterations
        effective_query = llm.rewrite(query, context) if iteration > 0 else query

        # Retrieve candidates
        candidates = retriever.retrieve(effective_query, top_k=10)

        # Keep only the candidates the LLM scores as relevant
        relevant = [c for c in candidates if llm.is_relevant(query, c) > 0.5]
        if not relevant:
            # No relevant docs found — try web search or return "I don't know"
            if iteration == max_iterations - 1:
                return "I could not find relevant information to answer this question."
            continue  # retry with a rewritten query

        context = relevant[:5]
        answer = llm.generate(query, context)

        # Check whether the answer is grounded in the context
        if llm.is_grounded(answer, context):
            return answer
        # Not grounded: loop again with a rewritten query

    return answer  # Return the last attempt as a best effort
```
Quiz
Test your understanding of the end-to-end RAG pipeline.
How does HyDE improve retrieval?
What does the 'lost in the middle' problem refer to?
Why does the SELF-RAG model emit special tokens like [Retrieve] and [IsRel]?
Why is it important to include 'say I don't know when context doesn't contain the answer' in the system prompt?
What is the core control flow pattern of ReAct agents in multi-step retrieval?