Why Combine Retrievers?

By now we have two retrieval approaches that are good at very different things. Sparse retrieval (BM25, SPLADE) excels at exact keyword matching and rare technical terms, while dense retrieval (bi-encoders) excels at semantic paraphrase and synonym handling. Neither is strictly better, so why not use both?

An empirical observation bears this out. When we compute the overlap between the top-10 results from BM25 and the top-10 from a bi-encoder on the same query, the overlap tends to be moderate (roughly 40-60% on standard IR benchmarks, though the exact figure varies considerably by dataset and domain). Each system finds relevant documents the other misses, and a combined system can capture both sets.
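This overlap is easy to measure. The sketch below does so on two made-up top-10 lists (the document IDs and the resulting 50% figure are invented for illustration, not benchmark results):

```python
def topk_overlap(list_a, list_b, k=10):
    """Fraction of the top-k results shared by two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

# Hypothetical top-10 lists from BM25 and a bi-encoder for the same query
bm25_top10  = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
dense_top10 = ["d3", "d1", "d11", "d5", "d12", "d2", "d13", "d9", "d14", "d15"]

print(topk_overlap(bm25_top10, dense_top10))  # 0.5 for these lists
```

Five of the dense retriever's top-10 documents never appear in the BM25 list, which is exactly the headroom a combined system can exploit.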

The challenge is figuring out how to combine two ranked lists that come from different scoring functions with different scales. BM25 produces unbounded non-negative scores, while dense retrieval typically produces cosine similarities in $[-1, 1]$. We cannot simply add them because a BM25 score of 12 and a cosine score of 0.8 are not comparable. So how do we merge these lists fairly?

Reciprocal Rank Fusion

Reciprocal Rank Fusion (RRF; Cormack et al., 2009) sidesteps the score calibration problem entirely by working only with ranks, not scores. For a document $d$ that appears at rank $r$ in ranked list $\ell$, its RRF contribution is $\frac{1}{k + r}$. Summed across all lists:

$$\text{RRF}(d) = \sum_{\ell \in L} \frac{1}{k + r_\ell(d)}$$

The constant $k$ (typically 60) acts as a floor that prevents a single top-ranked result from dominating the fused score. Even the top-ranked document from one system contributes at most $\frac{1}{61}$, and documents not in a particular list are assigned $r = \infty$, contributing 0 from that list.

To see why this weighting matters, consider two documents. Document A is ranked 2nd by BM25 and 3rd by the dense retriever, while document B is ranked 1st by BM25 but only 50th by the dense retriever. Without fusion we might prefer B (it has a rank-1 result), but RRF tells a different story. Document A scores $\frac{1}{62} + \frac{1}{63} \approx 0.0320$, while document B scores $\frac{1}{61} + \frac{1}{110} \approx 0.0255$. The document that both systems agree on wins, even though neither system ranked it first.
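The arithmetic can be checked directly:

```python
# Worked example: document A at ranks 2 and 3, document B at ranks 1 and 50
k = 60
score_a = 1 / (k + 2) + 1 / (k + 3)   # ranked 2nd by BM25, 3rd by dense
score_b = 1 / (k + 1) + 1 / (k + 50)  # ranked 1st by BM25, 50th by dense

print(round(score_a, 4), round(score_b, 4))  # 0.032 0.0255
print(score_a > score_b)                     # True: consensus wins
```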

RRF also tends to be robust to the choice of retrieval systems and hyperparameters. The original paper showed that $k=60$ worked well across many retrieval benchmarks without tuning, and more recent work confirms it often outperforms weighted linear combination of normalised scores because score normalisation is highly sensitive to the score distribution of each retrieval system.

The following implementation shows RRF in action on two simulated retrieval lists with partial overlap, along with a plot breaking down each document's contribution from each system.

import math, json
import js

def rrf_fuse(ranked_lists, k=60):
    """
    ranked_lists: list of lists of doc_ids, ordered by relevance (best first)
    Returns: sorted list of (doc_id, rrf_score) tuples
    """
    scores = {}
    for lst in ranked_lists:
        for rank, doc_id in enumerate(lst, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Simulate two retrieval systems with partial overlap
# Suppose docs 1-5 are relevant; the two systems disagree on their order
bm25_results   = ["doc3", "doc7", "doc1", "doc9", "doc5", "doc2", "doc11", "doc4", "doc8", "doc6"]
dense_results  = ["doc1", "doc5", "doc3", "doc12", "doc2", "doc8", "doc6", "doc10", "doc4", "doc7"]

fused = rrf_fuse([bm25_results, dense_results], k=60)
fused_docs   = [d for d, _ in fused[:10]]
fused_scores = [round(s, 4) for _, s in fused[:10]]

# Rank of each doc in the BM25 and dense lists (None if absent; mapped to 0 below)
def get_rank(lst, doc):
    return lst.index(doc) + 1 if doc in lst else None

all_docs_in_fused = fused_docs[:8]
bm25_ranks  = [get_rank(bm25_results,  d) or 0 for d in all_docs_in_fused]
dense_ranks = [get_rank(dense_results, d) or 0 for d in all_docs_in_fused]

# We'll plot the RRF scores of top-8 fused docs
plot_data = [
    {
        "title": "RRF Scores for Top-8 Fused Results",
        "x_label": "Document",
        "y_label": "RRF Score",
        "x_data": all_docs_in_fused,
        "lines": [
            {"label": "RRF Score", "data": [round(s, 4) for s in fused_scores[:8]], "color": "#10b981"},
            {"label": "BM25 contribution", "data": [round(1/(60+r), 4) if r > 0 else 0 for r in bm25_ranks], "color": "#3b82f6"},
            {"label": "Dense contribution", "data": [round(1/(60+r), 4) if r > 0 else 0 for r in dense_ranks], "color": "#f59e0b"},
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

Score Normalisation Approaches

If RRF throws away the actual scores, a natural question is whether we could instead normalise both score distributions to the same scale and add them. That was the standard approach before RRF, and it is worth understanding why it turned out to be fragile. The two most common normalisations are min-max and z-score.

  • Min-max normalisation rescales each score as $s' = \frac{s - s_{\min}}{s_{\max} - s_{\min}}$. It is simple but sensitive to outliers (a single very high BM25 score compresses all other scores near zero).
  • Z-score normalisation centres and scales by standard deviation, $s' = \frac{s - \mu}{\sigma}$. It is more robust to outliers, but the resulting scores can be negative and usually need a sigmoid transformation to map back to [0, 1].
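A minimal sketch of both normalisations on a hypothetical set of BM25 scores makes the outlier sensitivity concrete (the scores here are invented for illustration):

```python
import statistics

def min_max(scores):
    """Rescale scores to [0, 1] using the observed min and max."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def z_score(scores):
    """Centre scores on the mean and scale by standard deviation."""
    mu, sigma = statistics.mean(scores), statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

# Hypothetical BM25 scores with one outlier at the top
bm25_scores = [24.0, 8.1, 7.9, 7.4, 6.8]

print([round(s, 3) for s in min_max(bm25_scores)])
print([round(s, 3) for s in z_score(bm25_scores)])
```

Under min-max, the single outlier (24.0) maps to 1.0 and squashes every other document below 0.1, even though scores 8.1 and 6.8 may represent very similar relevance; z-scores spread the non-outliers more evenly but are no longer in [0, 1].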

The fundamental problem with both approaches is that the normalisation is computed over the returned candidates, which change from query to query. A BM25 score of 12 might be low in one query context and high in another, so score-based fusion requires the scores to be calibrated (the same absolute score should represent the same level of relevance across queries). Achieving that kind of calibration requires careful model training and is rarely guaranteed in practice.

RRF avoids all of this because ranks are inherently calibrated. Rank 1 always means "best result from this retriever for this query", regardless of the raw score behind it.

Hybrid Retrieval in Production

Most vector databases (Weaviate, Qdrant, Milvus, among others) now support hybrid search as a built-in feature. A single query triggers both a BM25 (or TF-IDF) search and an ANN (approximate nearest neighbour) search, and the results are fused with RRF before being returned to the caller. OpenSearch exposes a similar capability through a `hybrid` query type, Elasticsearch supports reciprocal rank fusion in its search API, and Azure AI Search bundles BM25 retrieval, vector retrieval, and an optional BERT reranker into one API call.

For most RAG systems, the practical path tends to look like this.

  • Start with BM25 alone for the simplest possible baseline.
  • Add a dense retriever when semantic paraphrase failures appear (users searching differently from how documents are written).
  • Fuse with RRF using $k=60$, and only tune $k$ if we have labelled evaluation data and the tuning improves a measured metric (NDCG@10, Recall@10).
  • Add a reranker on top of the fused top-K candidates if precision on the top result matters.
💡 Learned fusion (training a small model to weight the contributions of each retriever) can outperform RRF but requires labelled data and adds deployment complexity. RRF's parameter-free simplicity usually wins in practice absent a large evaluation dataset.
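Putting the steps above together, the full pipeline can be sketched as below. The retriever and reranker functions are hypothetical stand-ins returning fixed lists; in a real system they would call a BM25 index, a bi-encoder ANN index, and a cross-encoder respectively:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc_ids with reciprocal rank fusion."""
    scores = {}
    for lst in ranked_lists:
        for rank, doc_id in enumerate(lst, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [d for d, _ in sorted(scores.items(), key=lambda x: -x[1])]

def bm25_search(query, top_k=10):        # placeholder sparse retriever
    return ["doc3", "doc7", "doc1", "doc9", "doc5"][:top_k]

def dense_search(query, top_k=10):       # placeholder dense retriever
    return ["doc1", "doc5", "doc3", "doc12", "doc2"][:top_k]

def rerank(query, candidates, top_n=3):  # placeholder reranker
    return candidates[:top_n]            # a real one would re-score pairs

def hybrid_search(query, fuse_k=60, rerank_n=3):
    fused = rrf_fuse([bm25_search(query), dense_search(query)], k=fuse_k)
    return rerank(query, fused, top_n=rerank_n)

print(hybrid_search("example query"))
```

With these stub lists, the documents retrieved by both systems (doc3, doc1, doc5) rise to the top of the fused ranking, mirroring the consensus effect from the worked example earlier.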

Quiz

Test your understanding of hybrid search and rank fusion.

Why can't you simply add BM25 and dense retrieval scores to combine two retrievers?

In RRF with $k=60$, what does the constant $k$ accomplish?

Document A is ranked 1st by BM25 and not retrieved by dense search. Document B is ranked 5th by both. With RRF ($k=60$), which scores higher?

What is the main weakness of min-max normalisation for score-based fusion?

When is it worth training a learned fusion model instead of using RRF?