Why Does the Simple Version Break?
At this point we have all the pieces (chunking, indexing, retrieval, and reranking), and the simplest way to connect them is to embed the user query, retrieve the top-$k$ chunks, concatenate them into the LLM context, and generate. This naive pipeline works well as a baseline, but it fails in predictable ways that become obvious once we look at real queries.
First, the query itself can be ambiguous or poorly phrased. "Tell me about the model" could mean a machine learning model, a fashion model, or a scale model, so the retriever returns a scattered mix of chunks and the LLM hallucinates a coherent answer from incoherent context. Second, the answer sometimes requires synthesis across multiple documents. A query like "Compare the performance of BM25 and ColBERT on BEIR" needs information from different papers combined together, and a single top-$k$ retrieval step may return only one side of the comparison. Third, the retrieved context can be irrelevant or contradictory, yet the LLM uses it anyway and produces a confident wrong answer.
Production RAG systems add steps before, during, and after generation to address each of these failures. The rest of this article walks through those steps.
Can We Fix the Query Before Searching?
The first failure (ambiguous or poorly phrased queries) suggests an obvious fix: transform the raw query before it ever reaches the retriever. There are several ways to do this, each trading a bit of latency for better recall.
Query rewriting uses an LLM to rephrase the query into a form that better matches the style of documents in the corpus. When users ask conversational questions but the documents are formal, a rewriter bridges the gap (for example, turning "what's the deal with attention?" into "How does the self-attention mechanism work in transformer models?").
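As a concrete sketch, a rewriter can be a single LLM call with a fixed instruction. The `call_llm` function below is a stand-in for whatever completion API the system uses, not a specific library call:

```python
# Hypothetical query-rewriting step; call_llm is a placeholder for an LLM API.
REWRITE_PROMPT = (
    "Rewrite the following user question so it matches the formal style of "
    "technical documentation. Return only the rewritten question.\n\n"
    "Question: {query}"
)

def rewrite_query(query, call_llm):
    """Return a corpus-style rephrasing of a conversational query."""
    return call_llm(REWRITE_PROMPT.format(query=query)).strip()

# Demo with a stub LLM that returns a canned rewrite:
stub_llm = lambda prompt: "How does the self-attention mechanism work in transformer models?"
print(rewrite_query("what's the deal with attention?", stub_llm))
```

In production the stub would be replaced by a real model call, ideally with a low temperature so the rewrite stays faithful to the user's intent.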
Query expansion takes a different angle. Instead of rewriting one query, we generate $n$ query variants and union or fuse the retrieval results, which tends to improve recall at the cost of $n\times$ retrieval latency. A related technique is Step-Back Prompting (Zheng et al., 2023), which asks an LLM to generate a more general "step-back" question alongside the specific one and retrieves for both.
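One common way to fuse the $n$ result lists is reciprocal rank fusion (RRF), which rewards documents that rank well in several lists. A minimal sketch (the document IDs and the choice of $k=60$ are illustrative):

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc IDs into one ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Each appearance contributes 1 / (k + rank); the constant k damps
            # the advantage of a single very high rank in one list.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Results from two query variants; d1 and d2 appear in both lists,
# so they end up at the top of the fused ranking.
fused = rrf_fuse([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
print(fused)
```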
Hypothetical Document Embeddings (HyDE) (Gao et al., 2022) addresses query-document mismatch from yet another direction. Rather than embedding the query directly, we ask the LLM to generate a hypothetical document that would answer the query, then embed that document and use it as the query vector. The reason this helps is that a hypothetical answer document tends to be closer in embedding space to real answer documents than the original question is, since it shares vocabulary and structure with the corpus.
HyDE tends to help most when queries are very short (single keywords) or stylistically far from the documents in the corpus. The tradeoff is one extra LLM call before retrieval, which can add a few hundred milliseconds of latency, so it is worth measuring whether the recall gain justifies the cost for a given use case.
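The HyDE control flow is simple enough to sketch in a few lines. The `generate_doc`, `embed`, and `index_search` callables below are stand-ins for the LLM, the embedding model, and the vector index, not real library APIs:

```python
def hyde_search(query, generate_doc, embed, index_search, top_k=5):
    # 1. Ask the LLM for a short hypothetical answer document.
    hypothetical = generate_doc(f"Write a short passage that answers: {query}")
    # 2. Embed the hypothetical document instead of the raw query.
    vector = embed(hypothetical)
    # 3. Search the index with the document-style embedding.
    return index_search(vector, top_k)

# Demo with toy stand-ins:
fake_generate = lambda prompt: "Self-attention computes weighted sums of token representations."
fake_embed = lambda text: [float(len(text))]  # placeholder embedding
fake_search = lambda vec, k: ["doc-1", "doc-7"][:k]
print(hyde_search("How does attention work?", fake_generate, fake_embed, fake_search, top_k=2))
```

The only change from the naive pipeline is step 2: the query vector comes from a generated document rather than from the query text itself.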
What if One Retriever Is Not Enough?
Not all queries need the same retrieval strategy. A production system might have multiple data sources (a vector store for semantic search over documentation, a SQL database for structured product data, a BM25 index over customer support tickets), and a router implemented as an LLM classifier or a small fine-tuned model can direct each query to the appropriate source.
For complex queries, a single source often is not enough, so the system retrieves from multiple sources in parallel and fuses the results. The router may not choose one source exclusively but instead assign weights or instruct the fusion layer to draw from specific sources.
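A minimal router can be a single classification call with a safe default when the label is unrecognised. The source names below are illustrative, and `classify` stands in for an LLM or small fine-tuned classifier:

```python
# Hypothetical routing step; source names and classifier are illustrative.
SOURCES = {"docs_vector", "product_sql", "tickets_bm25"}

def route(query, classify):
    """Send the query to one source; fall back to the vector store."""
    label = classify(query)
    return label if label in SOURCES else "docs_vector"

# Demo with a keyword-based stub classifier:
stub_classify = lambda q: "product_sql" if "price" in q else "docs_vector"
print(route("What is the price of the Pro plan?", stub_classify))
print(route("Explain the retry policy.", stub_classify))
```

A weighted or multi-source router would return a distribution over `SOURCES` instead of a single label, but the validation-plus-default shape stays the same.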
Multi-step retrieval tackles the synthesis problem we identified earlier (queries that need information from several documents). The LLM breaks the query into sub-questions, retrieves for each, reads the results, and then decides whether to retrieve more or to generate. This is the core pattern behind ReAct (Reason + Act) agents (Yao et al., 2022), where the model interleaves reasoning steps with retrieval actions to build up context iteratively.
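The loop behind this pattern can be sketched as follows. Here `decompose`, `retrieve`, and `needs_more` are stand-ins for LLM and index calls, with `needs_more` returning follow-up sub-questions or nothing when the context is sufficient:

```python
def multi_step_retrieve(query, decompose, retrieve, needs_more, max_steps=3):
    """Iteratively retrieve until the LLM decides it has enough context."""
    context = []
    sub_questions = decompose(query)  # break the query into sub-questions
    for _ in range(max_steps):
        for sq in sub_questions:
            context.extend(retrieve(sq))  # retrieve for each sub-question
        follow_up = needs_more(query, context)  # LLM: anything still missing?
        if not follow_up:
            break
        sub_questions = follow_up  # pursue the follow-up questions next round
    return context

# Demo with stubs for the comparison query from earlier:
decompose = lambda q: ["BM25 results on BEIR", "ColBERT results on BEIR"]
retrieve = lambda sq: [sq + " passage"]
needs_more = lambda q, ctx: None  # pretend one round is enough
print(multi_step_retrieve("Compare BM25 and ColBERT on BEIR",
                          decompose, retrieve, needs_more))
```

The `max_steps` cap matters in practice: without it, a model that keeps asking for more context can loop indefinitely.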
How Should We Assemble the Context?
Once we have the retrieved chunks, we need to assemble them into a prompt, and several decisions here can make or break answer quality.
- Chunk ordering matters because LLMs tend to be less influenced by information in the middle of a long context than by information at the beginning or end (Liu et al., 2023). Placing the most relevant chunks first helps mitigate this "lost in the middle" effect.
- Overlapping chunks from nearby passages can add nearly identical text to the context, so we should deduplicate by embedding similarity before inserting them into the prompt.
- Including source metadata (document title, section, URL) alongside each chunk helps the LLM attribute claims and lets users trace answers back to their sources.
- Even with 128K-token context windows, long contexts tend to slow generation and dilute attention. In practice, many teams find that around 5 to 10 chunks of roughly 500 tokens each is a reasonable starting point, though the optimal number depends on the task.
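The ordering and deduplication points above can be combined into one small assembly step. In this sketch, `similarity` is a stand-in for cosine similarity between chunk embeddings, and the 0.95 threshold and cap of 8 chunks are illustrative defaults:

```python
def assemble_context(chunks, similarity, sim_threshold=0.95, max_chunks=8):
    """chunks: list of (text, relevance_score) pairs, in any order."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    kept = []
    for text, score in ranked:
        # Drop chunks that nearly duplicate something already kept.
        if all(similarity(text, prev) < sim_threshold for prev, _ in kept):
            kept.append((text, score))
        if len(kept) == max_chunks:
            break
    return [text for text, _ in kept]  # most relevant chunks first

# Demo with an exact-match "similarity" for clarity:
sim = lambda a, b: 1.0 if a == b else 0.0
print(assemble_context([("alpha", 0.9), ("alpha", 0.8), ("beta", 0.7)], sim))
```

Because the list stays sorted by relevance, the strongest evidence lands at the start of the prompt, where the "lost in the middle" effect is weakest.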
The system prompt should instruct the LLM to answer based on the provided context, to say "I don't know" when the context does not contain the answer, and to cite sources. Without these instructions, the LLM tends to fall back on its parametric knowledge to fill retrieval gaps, which can produce confident but wrong answers.
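A template along these lines (the exact wording and the bracketed citation format are illustrative, not canonical) might look like:

```python
SYSTEM_PROMPT = """\
Answer the question using only the context below.
- If the context does not contain the answer, say "I don't know."
- Cite the source ID in brackets after each claim, e.g. [doc-3].

Context:
{context}

Question: {question}
"""

def build_prompt(chunks, question):
    """chunks: list of dicts with 'id' and 'text' keys."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return SYSTEM_PROMPT.format(context=context, question=question)

print(build_prompt([{"id": "doc-1", "text": "RRF merges ranked lists."}],
                   "How are results fused?"))
```

Tagging each chunk with its source ID in the prompt is what makes the citation instruction enforceable: the model can only cite IDs it was actually shown.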
What if the Answer Is Wrong?
Even after careful query optimisation and context assembly, the generated answer can still be wrong. A self-reflection step addresses this by asking the LLM (or a separate critic model) to evaluate its own output. Is this answer grounded in the retrieved context? Is the context actually relevant? If not, should we retrieve again with a different query?
SELF-RAG (Asai et al., 2023) formalises this idea by training a model with special reflection tokens. During generation, the model emits tokens like [Retrieve], [IsRel], [IsSup], and [IsUse] to signal when it needs to retrieve, whether the retrieved document is relevant, whether a claim is supported, and whether the generation is useful. Because the model is trained end-to-end to emit these tokens, the reflection is baked into generation rather than bolted on as a separate step.
Corrective RAG (CRAG) (Yan et al., 2024) takes a lighter-weight approach by adding an evaluator that scores the relevance of retrieved documents. If all retrieved documents score below a threshold, CRAG triggers a web search to fetch fresh documents; if only some are relevant, it decomposes and filters the retrieved set before generation. This works well in practice because retrieval quality varies from query to query, and detecting low-quality retrieval is much cheaper than always falling back to web search.
The following pseudocode shows how these ideas fit together in a retrieve-read-reflect loop. On each iteration the system retrieves, filters for relevance, generates, and then checks whether the answer is grounded before deciding to return it or try again with a rewritten query.
```python
# Skeleton of a self-reflective RAG pipeline
# (pseudocode — illustrates the control flow)
def rag_with_reflection(query, retriever, llm, max_iterations=3):
    context = []
    answer = None
    for iteration in range(max_iterations):
        # Optionally rewrite the query on subsequent iterations
        effective_query = llm.rewrite(query, context) if iteration > 0 else query

        # Retrieve candidates
        candidates = retriever.retrieve(effective_query, top_k=10)

        # Keep only the candidates the LLM scores as relevant
        relevant = [c for c in candidates if llm.is_relevant(query, c) > 0.5]
        if not relevant:
            # No relevant docs found — try web search or return "I don't know"
            if iteration == max_iterations - 1:
                return "I could not find relevant information to answer this question."
            continue  # retry with a rewritten query

        context = relevant[:5]
        answer = llm.generate(query, context)

        # Check whether the answer is grounded in the context
        if llm.is_grounded(answer, context):
            return answer
        # Not grounded: loop again with a rewritten query

    return answer  # Return the last attempt as a best effort
```
Quiz
Test your understanding of the end-to-end RAG pipeline.
How does HyDE improve retrieval?
What does the 'lost in the middle' problem refer to?
Why does the SELF-RAG model emit special tokens like [Retrieve] and [IsRel]?
Why is it important to include 'say I don't know when context doesn't contain the answer' in the system prompt?
What is the core control flow pattern of ReAct agents in multi-step retrieval?