TF-IDF: Weighing What Matters
Let's start with one of the simplest retrieval ideas: count how many times each query word appears in each document, then return the document with the highest match. But what counts as "highest"? Do we pick the document that matches the most distinct query words (say, 5 out of 7), or the one where fewer words match but "transformer" appears 50 times? And what about low-value words like "the" or "of" — should those count at all? TF-IDF answers each of these questions by combining two signals.
The first signal handles repetition. If "transformer" appears 10 times in one document and once in another, the first is probably more about transformers. That's Term Frequency (TF). Raw counts work in principle, but a document that repeats a word 50 times isn't 50× more relevant than one that says it once, so most implementations log-scale the count to flatten out extreme values:

$$\text{tf}(t, d) = \begin{cases} 1 + \log\bigl(1 + \text{count}(t, d)\bigr) & \text{if } \text{count}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$
But frequency alone doesn't solve the "the" problem. Words like "the", "is", and "of" appear in virtually every document, so matching on them tells us nothing about what makes a document special. What we really want is to boost words that are rare across the corpus (words that distinguish one document from the rest), and that's what Inverse Document Frequency (IDF) does. Given $N$ total documents and $\text{df}(t)$ counting how many contain term $t$:

$$\text{IDF}(t) = \log\frac{N}{1 + \text{df}(t)}$$
Let's walk through what this formula actually does at the extremes. If a word appears in every document, $\text{df}(t) = N$, so we get $\log\frac{N}{N+1}$, which is slightly negative (e.g. $\log\frac{100}{101} \approx -0.01$). That means extremely common words like "the" contribute almost nothing (or even slightly reduce the score) when multiplied by TF. If a word appears in just one document, $\text{df}(t) = 1$, so we get $\log\frac{N}{2}$, which for a 100-document corpus is $\log 50 \approx 3.9$. Rare words get a large positive weight. The $1+$ in the denominator prevents division by zero if $\text{df}(t) = 0$ (a term not present in any document), and the $\log$ keeps the values bounded (without it, a rare term in a million-document corpus would score $\sim\!500{,}000$ vs a common term's $\sim\!1$, completely dominating everything else).
When we multiply TF by IDF, we get a weight that's high when a term appears often in this particular document but rarely across the corpus. "The" gets crushed (IDF near zero) while "transformer" in a machine-learning corpus gets boosted (small fraction of documents). That's exactly the signal we want.
To actually score a query against a document, represent both as vectors over the full vocabulary — one weight per word, zero for words not present — and take their dot product:

$$\text{score}(q, d) = \sum_{t \in q \cap d} \text{tf}(t, d) \cdot \text{IDF}(t)$$
To make this concrete, the code below computes TF, IDF, and the final TF-IDF score for each word in a small corpus. Notice how "the" (present in every document) gets a slightly negative IDF, while "attention" (present in only one document) gets a high IDF and dominates the final score.
import math
from collections import Counter

corpus = [
    "the transformer model uses the attention mechanism",
    "the neural network is trained on the data",
    "transformer architectures revolutionised NLP",
]
query = "transformer attention"

def tokenize(text):
    return text.lower().split()

N = len(corpus)
all_tokens = [tokenize(d) for d in corpus]

# Compute df for each term
df = {}
for doc_tokens in all_tokens:
    for t in set(doc_tokens):
        df[t] = df.get(t, 0) + 1

# Show TF, IDF, TF-IDF for query terms in Doc 1, plus "the" for comparison
doc_tokens = all_tokens[0]
tf_counts = Counter(doc_tokens)
print(f"Corpus size N = {N}")
print(f"Query: '{query}'")
print(f"Doc 1: '{corpus[0]}'")
print()
print(f"{'Term':<15} {'count':>5} {'TF':>8} {'df':>4} {'IDF':>8} {'TF*IDF':>8}")
print("-" * 52)
for t in tokenize(query) + ["the"]:
    count = tf_counts.get(t, 0)
    tf_val = 1 + math.log(1 + count) if count > 0 else 0
    df_val = df.get(t, 0)
    idf_val = math.log(N / (1 + df_val))
    tfidf = tf_val * idf_val
    print(f"{t:<15} {count:>5} {tf_val:>8.3f} {df_val:>4} {idf_val:>8.3f} {tfidf:>8.3f}")
print()
print("'the' appears twice in Doc 1 but in all 3 docs => negative IDF crushes its weight")
print("'attention' appears once and only in Doc 1 => high IDF, highest final score")
Notice the sum only runs over words that appear in both the query and the document. Most words appear in neither, so these vectors are extremely sparse (that sparsity is what makes TF-IDF fast). An inverted index maps each word to the list of documents containing it, so scoring a query only touches the posting lists for its terms, not every document in the corpus.
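The inverted-index idea can be sketched in a few lines (a minimal illustration, not a production index; the posting-list layout and the `score` helper are choices made for this example, reusing the small corpus from earlier):

```python
import math
from collections import defaultdict

corpus = [
    "the transformer model uses the attention mechanism",
    "the neural network is trained on the data",
    "transformer architectures revolutionised NLP",
]

# Inverted index: term -> list of (doc_id, term_count) postings
index = defaultdict(list)
for doc_id, doc in enumerate(corpus):
    counts = {}
    for t in doc.lower().split():
        counts[t] = counts.get(t, 0) + 1
    for t, c in counts.items():
        index[t].append((doc_id, c))

N = len(corpus)

def score(query):
    # Only the posting lists for the query's terms are touched, so documents
    # sharing no terms with the query are never visited at all.
    scores = defaultdict(float)
    for t in query.lower().split():
        postings = index.get(t, [])
        idf = math.log(N / (1 + len(postings)))  # df = length of posting list
        for doc_id, count in postings:
            scores[doc_id] += (1 + math.log(1 + count)) * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score("transformer attention"))  # Doc 0 ranks first
```

Note that the posting list doubles as the document-frequency count: `len(index[t])` is exactly $\text{df}(t)$, so IDF comes for free.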
BM25: Saturation and Length Normalisation
TF-IDF works, but try it on a real corpus and two problems become obvious. First, term frequency grows without limit: a document mentioning "neural" 50 times scores much higher than one mentioning it once, even though it's probably not that much more relevant. The log in TF dampens this, but not enough. Second, long documents accumulate more words and tend to score higher just because they're longer — a 10,000-word legal filing will outscore a 200-word abstract on almost any query, even when the abstract is more on-topic.
BM25 (Best Match 25) (Robertson et al., 1994) fixes both problems in one formula:

$$\text{score}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{count}(t, d) \cdot (k_1 + 1)}{\text{count}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

(BM25 implementations also typically use a smoothed IDF variant, $\log\bigl(\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5} + 1\bigr)$, which stays non-negative even for very common terms.)
Two parameters control how aggressively BM25 applies these corrections:
- $k_1$ caps term frequency. As count$(t,d)$ grows, the score approaches a finite ceiling of $(k_1 + 1) \cdot \text{IDF}(t)$ instead of climbing forever. Saying "neural" 50 times barely helps more than saying it 10 times. That's the saturation we were missing.
- $b$ penalises length. The denominator scales with the ratio of document length $|d|$ to the corpus average $\text{avgdl}$. With $b=0.75$ (the standard default), a document twice the average length needs proportionally stronger term matches to score as well as a shorter one. With $b=0$, length is ignored entirely.
Consider a 10,000-word legal filing and a 200-word abstract, both mentioning the query term twice. Without length normalisation, TF-IDF treats them equally (same raw count). With $b=0.75$, BM25 shrinks the legal document's score because two mentions in 10,000 words is far less concentrated than two mentions in 200.
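Saturation is easy to see numerically. Here is a quick sketch of a single term's BM25 contribution as its count grows (the IDF weight of 2.0 is arbitrary and illustrative; $b=0$ so length plays no role):

```python
def bm25_term(count, idf=2.0, k1=1.5, b=0.0, dl=100, avgdl=100):
    # Per-term BM25 contribution; with b=0, document length is ignored.
    return idf * (count * (k1 + 1)) / (count + k1 * (1 - b + b * dl / avgdl))

for c in [1, 2, 10, 50]:
    print(c, round(bm25_term(c), 3))
# 1  -> 2.0
# 2  -> 2.857
# 10 -> 4.348
# 50 -> 4.854
# The contribution approaches the ceiling (k1 + 1) * idf = 5.0 but never reaches it.
```

Going from 1 to 2 mentions adds about 0.86; going from 10 to 50 adds only about 0.5. That diminishing return is exactly the saturation behaviour described above.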
BM25 remains the dominant sparse baseline (Elasticsearch and OpenSearch use it by default, and when people say "keyword search" they almost always mean BM25). The code below compares BM25 and TF-IDF scores on a small corpus so we can see how saturation and length normalisation change the ranking.
import math, json
import js

# ---- BM25 implementation ----
def tokenize(text):
    return text.lower().split()

def build_idf(corpus):
    N = len(corpus)
    df = {}
    for doc in corpus:
        for t in set(tokenize(doc)):
            df[t] = df.get(t, 0) + 1
    return {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

def bm25_score(query, doc, idf, avgdl, k1=1.5, b=0.75):
    tokens = tokenize(doc)
    dl = len(tokens)
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for t in set(tokenize(query)):
        if t not in idf:
            continue
        f = tf.get(t, 0)
        score += idf[t] * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

def tfidf_score(query, doc, idf):
    tokens = tokenize(doc)
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    # Log-normalised TF
    score = 0.0
    for t in set(tokenize(query)):
        if t not in idf or t not in tf:
            continue
        score += (1 + math.log(1 + tf[t])) * idf[t]
    return score

corpus = [
    "BM25 uses term frequency saturation and length normalisation",
    "TF-IDF weighs terms by how rare they are across the corpus",
    "neural networks learn dense vector representations for retrieval",
    "the inverted index enables fast sparse retrieval over large corpora",
    "BM25 is the standard baseline for sparse retrieval in information retrieval",
    "length normalisation in BM25 prevents long documents from dominating",
    "TF-IDF and BM25 both rely on an inverted index for efficiency",
    "sparse retrieval methods match exact keywords in query and document",
]
query = "BM25 sparse retrieval length normalisation"

idf = build_idf(corpus)
avgdl = sum(len(tokenize(d)) for d in corpus) / len(corpus)
bm25_scores = [bm25_score(query, d, idf, avgdl) for d in corpus]
tfidf_scores = [tfidf_score(query, d, idf) for d in corpus]

labels = [f"Doc {i+1}" for i in range(len(corpus))]
plot_data = [
    {
        "title": "BM25 vs TF-IDF Scores",
        "x_label": "Document",
        "y_label": "Score",
        "x_data": labels,
        "lines": [
            {"label": "BM25", "data": [round(s, 3) for s in bm25_scores], "color": "#3b82f6"},
            {"label": "TF-IDF", "data": [round(s, 3) for s in tfidf_scores], "color": "#f59e0b"},
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)
The Vocabulary Mismatch Problem
Every method we've built so far shares one blind spot: they only match documents that use the exact same words as the query. Search for "cardiac arrest" and BM25 returns nothing if every relevant document says "heart attack" instead. The scoring formula could be perfect and it wouldn't matter — zero shared terms means zero score.
This vocabulary mismatch problem takes several forms:
- "automobile" vs "car", "begin" vs "start" — same concept, different surface form.
- "How does a transformer work?" misses documents titled "Self-attention mechanism explained".
- Without stemming, "running" misses "runs", and irregular forms like "ran" escape even a stemmer.
- Medical abbreviations, product codes, brand names — terms that have equivalents in everyday language but don't share any characters.
There are manual fixes: add synonym dictionaries, apply stemming ("running" → "run"), or use pseudo-relevance feedback (take the top results, extract their key terms, re-run the query with those terms added). These help, but they're brittle — synonym lists go stale, stemming misfires on technical vocabulary ("transformer" the model vs "transformer" the electrical device), and feedback loops can amplify noise from bad initial results.
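The synonym-dictionary approach can be sketched in a few lines, which also makes its brittleness visible (the `SYNONYMS` table is a hypothetical, hand-written example; maintaining it is exactly the burden described above):

```python
# Hand-maintained synonym table (hypothetical entries; goes stale immediately)
SYNONYMS = {
    "automobile": ["car"],
    "cardiac": ["heart"],
    "arrest": ["attack"],  # risky: "arrest" also has an unrelated legal sense
}

def expand(query):
    # Append dictionary synonyms to the original query terms
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand("cardiac arrest"))  # -> ['cardiac', 'arrest', 'heart', 'attack']
```

The `"arrest" -> "attack"` entry illustrates the core problem: the mapping is context-blind, so a query about a police arrest would get "attack" injected too.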
SPLADE: Learning Sparse Representations
What if, instead of building synonym lists by hand, we trained a model to expand the vocabulary automatically? Given the query "cardiac arrest", could a neural network learn to also activate "heart", "attack", "coronary", "myocardial" — adding synonyms it learned from data, without anyone writing a dictionary?
That's what SPLADE does (Formal et al., 2021). It repurposes BERT's masked language model (MLM) head — the part of BERT that predicts which word should fill a [MASK] slot. For each token position $i$ in the input, the MLM head produces a score $h_{ij}$ for every word $j$ in the vocabulary. SPLADE takes the maximum across all positions and applies log-saturation:

$$w_j = \max_{i} \, \log\bigl(1 + \text{ReLU}(h_{ij})\bigr)$$
The max over positions picks the strongest signal for each vocabulary term. ReLU ensures no negative weights. And the $\log(1 + \cdot)$ applies the same saturation idea we saw in BM25: diminishing returns for very strong activations, so no single term dominates the score.
The result is a vector over the vocabulary where related terms light up even if they never appeared in the original text. For "cardiac arrest", the model might assign high weights to "heart", "attack", "coronary", and "myocardial" — all learned from training data, no dictionary required.
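The pooling step can be sketched on a toy logits matrix (the numbers and the tiny 4-word vocabulary are made up for illustration; a real SPLADE model gets $h_{ij}$ from BERT's MLM head over a vocabulary of roughly 30k terms):

```python
import math

vocab = ["heart", "attack", "cardiac", "the"]
# Toy MLM scores h[i][j]: one row per input token position, one column per vocab term
h = [
    [2.1, -0.5, 3.0, 0.1],  # position 0 (e.g. the token "cardiac")
    [1.4,  2.8, 0.2, 0.0],  # position 1 (e.g. the token "arrest")
]

def splade_weights(h):
    # w_j = max over positions i of log(1 + relu(h_ij))
    n_vocab = len(h[0])
    return [
        max(math.log(1 + max(h[i][j], 0.0)) for i in range(len(h)))
        for j in range(n_vocab)
    ]

for term, weight in zip(vocab, splade_weights(h)):
    print(f"{term:<8} {weight:.3f}")
```

"attack" gets a substantial weight even though only position 1 activates it; ReLU zeroes its negative score at position 0, and the max keeps the strongest signal. "the" barely registers.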
There's a catch, though. Without constraints, the model tends to activate nearly every vocabulary term to some degree — the vectors stop being sparse. And we need sparsity, because sparse vectors are what let us use the same inverted-index infrastructure that makes BM25 fast. SPLADE enforces sparsity with a FLOPS regularisation term in the training loss (weighted by $\lambda$):

$$\ell_{\text{FLOPS}} = \sum_{j \in V} \Bigl( \frac{1}{B} \sum_{i=1}^{B} w_j^{(d_i)} \Bigr)^2$$

where $V$ is the vocabulary, $B$ is the batch size, and $w_j^{(d_i)}$ is term $j$'s weight in document $d_i$.
This penalises vocabulary terms that activate across many documents in a training batch. The model learns to activate a term only when it genuinely matters for that document, keeping the vectors sparse enough for efficient retrieval while still expanding the vocabulary where it helps. The hyperparameter $\lambda$ controls this trade-off: higher $\lambda$ means sparser vectors (faster retrieval, but potentially less vocabulary expansion).
The name "FLOPS" comes from the fact that the number of floating-point operations during index lookup is proportional to how many non-zero terms overlap between query and document vectors. Sparser vectors intersect over fewer terms, so reducing average activation directly cuts lookup cost.
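A sketch of the penalty, computed as the sum over vocabulary terms of the squared mean activation across a batch (the batch values are toy numbers chosen for illustration):

```python
# Toy batch of document vectors over a 4-term vocabulary (rows = documents)
batch = [
    [0.9, 0.0, 0.0, 0.4],
    [0.8, 0.0, 0.5, 0.3],
    [0.7, 0.6, 0.0, 0.5],
]

def flops_penalty(batch):
    # Sum over vocab terms of (mean activation across the batch) squared.
    # Terms that fire in many documents (columns 0 and 3 here) dominate the
    # penalty, so training pushes them toward zero unless they earn their keep.
    B = len(batch)
    n_vocab = len(batch[0])
    penalty = 0.0
    for j in range(n_vocab):
        mean_j = sum(doc[j] for doc in batch) / B
        penalty += mean_j ** 2
    return penalty

print(round(flops_penalty(batch), 4))
```

Squaring the mean (rather than summing raw activations) is what makes the penalty fall hardest on terms that activate broadly: one strong activation in one document costs far less than moderate activations everywhere.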
At retrieval time, SPLADE works exactly like BM25: sparse vectors stored in an inverted index, scored with a dot product over shared terms. The difference is that "shared terms" now includes words that neither the original query nor the document contained — the model added them.
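Scoring is then the familiar sparse dot product over shared terms; a sketch with dict-based sparse vectors (the weights are illustrative, with "heart" and "attack" standing in for expansion terms the model added):

```python
# Toy sparse vectors: term -> weight. "heart" and "attack" were added by the
# model, not by the original query text (illustrative weights).
query_vec = {"cardiac": 1.4, "arrest": 1.2, "heart": 0.9, "attack": 0.7}
doc_vec = {"heart": 1.1, "attack": 1.0, "myocardial": 0.8}

def sparse_dot(q, d):
    # Iterate over the smaller vector; only shared terms contribute.
    if len(d) < len(q):
        q, d = d, q
    return sum(w * d[t] for t, w in q.items() if t in d)

print(sparse_dot(query_vec, doc_vec))  # nonzero despite the texts sharing no literal words
```

The document never says "cardiac" or "arrest", yet it scores because the expanded query and document overlap on "heart" and "attack".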
Quiz
Test your understanding of sparse retrieval methods.
In BM25, what does the parameter $k_1$ control?
Why does BM25 outperform raw TF-IDF on long documents?
A query for "automobile fuel efficiency" returns no results because the documents use the term "car mpg". This is an example of:
What does the FLOPS regularisation term in SPLADE penalise?
What key property allows SPLADE vectors to be used with a standard inverted index, just like BM25?