Why the Top 5 Need Special Treatment

Everything we have built so far (BM25, bi-encoders, ColBERT, hybrid fusion) trades scoring depth for speed, returning 100 to 1000 candidates on the assumption that the relevant documents land somewhere in that set. That assumption is usually correct, but a RAG pipeline does not feed all 1000 candidates to the LLM. It feeds the top 5, maybe 10, and if the truly relevant documents sit at positions 40 and 73, they never reach the model.

Once we have a manageable candidate set from the first phase, though, we can afford to run a much more expensive scoring function on each one. A cross-encoder, for instance, reads query and document jointly through every transformer layer and outputs a single relevance score, which tends to be far more accurate than anything a bi-encoder can produce (because bi-encoders never let query and document tokens attend to each other). Scoring 100 candidates this way takes a fraction of a second on a GPU, whereas scoring millions would take hours. This two-phase structure (cheap recall first, expensive precision second) is what the information retrieval community calls reranking.

In practice, the first-stage retriever returns $K$ candidates (often $K = 100$), the reranker rescores all $K$ against the query, and only the top $k$ (often $k = 5$) go to the downstream application.
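This two-phase flow can be sketched end to end with toy scorers. Both `cheap_score` and `expensive_score` below are made-up stand-ins based on token overlap, not a real BM25 or cross-encoder; the point is the shape of the pipeline, not the scoring functions themselves.

```python
# Toy sketch of the retrieve-then-rerank pipeline. The two scorers
# are invented stand-ins for a first-stage retriever and a reranker.

def cheap_score(query, doc):
    # Phase-1 stand-in: fast lexical overlap between query and document
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def expensive_score(query, doc):
    # Phase-2 stand-in: pretend this is a cross-encoder forward pass
    q, d = query.lower().split(), doc.lower().split()
    return sum(d.count(t) for t in q) / (len(d) or 1)

def retrieve_then_rerank(query, corpus, K=100, k=5):
    # Phase 1: cheap recall over the whole corpus, keep top K
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:K]
    # Phase 2: expensive precision over only K candidates, keep top k
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:k]

corpus = [
    "rerankers score query document pairs jointly",
    "bm25 is a lexical retrieval function",
    "cats are popular pets",
    "cross encoders rerank retrieved candidates",
]
print(retrieve_then_rerank("rerank candidates", corpus, K=3, k=2))
```

Only the `K` candidates surviving phase 1 ever see the expensive scorer, which is what keeps the second phase affordable.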

How Cross-Encoders Score Relevance

A cross-encoder concatenates query and document into a single sequence ([CLS] query [SEP] document [SEP]) and passes it through a transformer. Because every query token attends to every document token through all layers, the model captures fine-grained interactions that independent encoding misses entirely. A classification or regression head on top of the [CLS] representation then outputs a scalar relevance score.

Training these models typically relies on pairwise losses over relevance-labelled data. Given a query, a relevant document, and a non-relevant document, we want the model to assign a higher score to the relevant one. The RankNet loss expresses this as binary cross-entropy on the score difference.

$$\mathcal{L}_{\text{pair}} = -\log \sigma(s_{d^+} - s_{d^-})$$

Here $s_{d^+}$ and $s_{d^-}$ are the reranker scores for the relevant and non-relevant documents, and $\sigma$ is the sigmoid. Minimising this loss pushes the model to widen the margin between positive and negative scores.
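The loss is simple enough to verify by hand; the scores below are illustrative numbers, not real model outputs.

```python
import math

def ranknet_loss(s_pos, s_neg):
    # -log sigmoid(s+ - s-): near zero when the positive wins by a wide
    # margin, large when the negative outscores the positive
    return -math.log(1.0 / (1.0 + math.exp(-(s_pos - s_neg))))

print(round(ranknet_loss(4.0, -2.0), 4))   # wide correct margin -> tiny loss
print(round(ranknet_loss(-1.0, 3.0), 4))   # inverted ordering -> large loss
print(round(ranknet_loss(0.0, 0.0), 4))    # tied scores -> log 2
```

Note that the loss depends only on the score difference, so the model is free to place scores on any absolute scale as long as positives sit above negatives.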

MS MARCO provides convenient training data for this setup, since each query has one annotated positive passage while every other passage serves as an implicit negative. Hard negatives drawn from BM25 top results are particularly valuable because they are "look-alike" documents that rank well under a cheap retriever but are not actually relevant, so training against them forces the model to learn the distinctions that matter most at the boundary between relevant and irrelevant.
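A minimal sketch of how such training triples might be assembled. The query, document IDs, and ranking below are invented, and `bm25_ranking` stands in for real first-stage retriever output.

```python
# Hedged sketch: build (query, positive, hard negative) triples,
# MS MARCO style, by pairing the labelled positive with the
# highest-ranked non-positive documents from a cheap retriever.

def make_triples(query, positive_id, bm25_ranking, n_negatives=2):
    # Hard negatives: top-ranked documents that are NOT the positive --
    # exactly the "look-alikes" the reranker must learn to reject
    hard_negatives = [d for d in bm25_ranking if d != positive_id][:n_negatives]
    return [(query, positive_id, neg) for neg in hard_negatives]

ranking = ["d17", "d4", "d31", "d9"]   # BM25 order, best first (invented)
triples = make_triples("what is bm25", "d31", ranking)
print(triples)
```

Each triple then feeds the pairwise loss above: score the positive and the negative, and penalise any ordering where the negative wins.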

💡 Mixedbread's mxbai-rerank and Cohere's Rerank API are cross-encoder rerankers available as hosted services. They simplify the pipeline to a single API call after retrieval, at the cost of a network round-trip and per-query fees.


What if We Frame Reranking as Text Generation?

Cross-encoders work well, but they require training a classification head and a dedicated fine-tuning loop. Nogueira et al. (2020) noticed that sequence-to-sequence models already know how to answer yes/no questions, so why not just ask one? Their model, monoT5, takes a prompt of the form "Query: <query> Document: <document> Relevant:" and generates a single token, either "true" or "false". The relevance score is simply the log-probability of generating "true".

$$s(q, d) = \log P_{T5}(\text{``true''} \mid \text{``Query: } q \text{ Document: } d \text{ Relevant:''})$$

Because T5 is pretrained on diverse sequence-to-sequence tasks, it already has strong language understanding, and the true/false framing slots naturally into what it already knows how to do. An appealing side effect is that the output is a calibrated probability: a document with $\log P(\text{true}) = -0.1$ is far more confidently relevant than one at $\log P(\text{true}) = -3.5$, which makes it straightforward to set thresholds for minimum relevance.
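A quick illustration of that thresholding, with made-up log-probabilities rather than real monoT5 outputs:

```python
import math

# Illustrative only: these log P("true") values are invented numbers.
# The point is that the score is an interpretable probability, so a
# minimum-relevance cutoff is a one-liner.
candidates = {"doc_a": -0.1, "doc_b": -1.2, "doc_c": -3.5}

threshold = 0.5  # keep docs the model deems more likely relevant than not
kept = {d: math.exp(lp) for d, lp in candidates.items()
        if math.exp(lp) >= threshold}
print(kept)  # only doc_a clears the bar (exp(-0.1) ~ 0.90)
```

Uncalibrated scores, by contrast, would force us to pick an arbitrary cutoff per model or per corpus.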

Fine-tuning follows the same data recipe as cross-encoders (MS MARCO, positive passages mapped to "true", negatives to "false"). Despite the simplicity of this setup, monoT5-3B achieved state-of-the-art results on MS MARCO passage ranking when it was released, and smaller variants like monoT5-220M remain competitive baselines.

Can an LLM Rerank a Whole List at Once?

Both cross-encoders and monoT5 score documents one at a time (pointwise), which means they never compare candidates against each other. Sun et al. (2023) proposed RankGPT, which takes a different approach: hand the LLM the query and a numbered list of 20 passages, then ask it to output a permutation representing the relevance order. A prompt along these lines does the job.

"I will provide you with 20 passages. Rank them by relevance to the query. Output only the passage numbers from most to least relevant, separated by commas. Query: [query] Passages: [1] [text1] [2] [text2] ... Ranking:"

The LLM outputs something like "3, 7, 1, 12, ..." and we have a full reranking. Because the model sees all candidates simultaneously, it can resolve ties and near-ties that pointwise scorers miss. The cost, however, is steep: fitting 20 full-text documents into context can easily consume tens of thousands of tokens per query, and running this at scale adds up quickly.

A sliding window variant reduces this cost somewhat. Instead of ranking all $K$ candidates at once, we slide a window of size $w$ (say 20) over the list from bottom to top, reranking within each window and letting the most relevant documents bubble upward with each pass.
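The mechanics can be sketched with a stand-in for the LLM call: `rerank_window` below just sorts by a hidden toy score, which lets the bubbling behaviour run end to end. The window size, stride, and scores are all invented for illustration.

```python
# Sketch of sliding-window listwise reranking. `rerank_window` is a
# placeholder for one listwise LLM call over a single window.

def rerank_window(window):
    # Stand-in for the LLM permutation: sort by toy score, best first
    return sorted(window, key=lambda doc: doc["score"], reverse=True)

def sliding_window_rerank(docs, window=4, stride=2):
    # Slide from the bottom of the list to the top; overlapping windows
    # let strong documents bubble upward across successive passes.
    docs = list(docs)
    start = max(0, len(docs) - window)
    while True:
        docs[start:start + window] = rerank_window(docs[start:start + window])
        if start == 0:
            break
        start = max(0, start - stride)
    return docs

# Ten documents with hidden relevance scores, initially out of order
items = [{"id": i, "score": s} for i, s in enumerate([2, 9, 1, 7, 3, 8, 0, 6, 5, 4])]
reranked = sliding_window_rerank(items, window=4, stride=2)
print([d["score"] for d in reranked[:3]])
```

With overlapping windows (stride smaller than window), each pass lifts a strong document by up to `window - stride` positions, so a few passes suffice to surface the best candidates even when they start deep in the list.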

There is also a middle ground between pointwise and listwise. We can score each candidate individually by asking the LLM "Is this document relevant to the query? Answer yes or no" and using the log-probability of "yes" as the score (essentially monoT5 with an LLM backbone). This is cheaper than listwise reranking but loses the ability to compare candidates against each other.

The following simulation illustrates why reranking matters at all, showing how NDCG@k improves as we move from a cheap first-stage retriever (BM25) through a bi-encoder to a reranker that concentrates relevant documents near the very top of the list.

import math, json
import js

# Simulate cross-encoder vs first-stage retriever quality
# at different positions in the ranked list

def ndcg_at_k(relevances, k):
    """
    relevances: list of 0/1 labels in rank order (1=relevant)
    k: cutoff
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Simulate ranked lists from different systems for 30 queries
# We'll use synthetic relevance patterns

import random
random.seed(42)

def simulate_rankings(n_relevant=5, list_size=20, precision_boost=0):
    """
    Returns a ranked list of 0/1 relevance labels.
    Higher precision_boost = more relevant docs near the top.
    """
    # Place n_relevant relevant docs; higher precision_boost -> lower average rank
    all_docs = [0] * list_size
    positions = sorted(random.sample(range(list_size), n_relevant))
    # Apply boost: shift relevant docs toward top (with a little noise).
    # Shifted positions can collide, so a list may carry slightly fewer
    # than n_relevant relevant docs -- acceptable for this illustration.
    boosted = [max(0, p - precision_boost + random.randint(-1, 1)) for p in positions]
    boosted = [min(p, list_size - 1) for p in boosted]
    for p in boosted:
        all_docs[p] = 1
    return all_docs

n_queries = 30
k_values = [1, 3, 5, 10]

bm25_ndcg    = {k: [] for k in k_values}
bienc_ndcg   = {k: [] for k in k_values}
rerank_ndcg  = {k: [] for k in k_values}

for _ in range(n_queries):
    bm25_rels   = simulate_rankings(n_relevant=5, list_size=20, precision_boost=0)
    bienc_rels  = simulate_rankings(n_relevant=5, list_size=20, precision_boost=3)
    rerank_rels = simulate_rankings(n_relevant=5, list_size=20, precision_boost=8)
    for k in k_values:
        bm25_ndcg[k].append(ndcg_at_k(bm25_rels, k))
        bienc_ndcg[k].append(ndcg_at_k(bienc_rels, k))
        rerank_ndcg[k].append(ndcg_at_k(rerank_rels, k))

mean_bm25   = [round(sum(bm25_ndcg[k])/n_queries, 3)   for k in k_values]
mean_bienc  = [round(sum(bienc_ndcg[k])/n_queries, 3)  for k in k_values]
mean_rerank = [round(sum(rerank_ndcg[k])/n_queries, 3) for k in k_values]

plot_data = [
    {
        "title": "NDCG@k: BM25 vs Bi-Encoder vs Reranker (Simulated)",
        "x_label": "k (cutoff)",
        "y_label": "Mean NDCG@k",
        "x_data": [str(k) for k in k_values],
        "lines": [
            {"label": "BM25",       "data": mean_bm25,   "color": "#f59e0b"},
            {"label": "Bi-Encoder", "data": mean_bienc,  "color": "#3b82f6"},
            {"label": "Reranker",   "data": mean_rerank, "color": "#10b981"},
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

Quiz

Test your understanding of retrieval reranking.

Why is a cross-encoder used as a reranker rather than a first-stage retriever?

In the pairwise RankNet loss $-\log \sigma(s_{d^+} - s_{d^-})$, what does the model learn?

monoT5 computes relevance scores by:

What is the main practical limitation of RankGPT's listwise reranking approach?

In a typical two-stage RAG pipeline with $K=100$ first-stage candidates and $k=5$ final results, where does the reranker operate?