TF-IDF: Weighing What Matters
Let's start with one of the simplest retrieval ideas: count how many times each query word appears in each document, then return the document with the highest match. But what counts as "highest"? Do we pick the document that matches the most distinct query words (say, 5 out of 7), or the one where fewer words match but "transformer" appears 50 times? And what about low-value words like "the" or "of" — should those count at all? TF-IDF answers each of these questions by combining two signals.
The first signal handles repetition. If "transformer" appears 10 times in one document and once in another, the first is probably more about transformers. That's Term Frequency (TF). Raw counts work in principle, but a document that repeats a word 50 times isn't 50× more relevant than one that says it once, so most implementations log-scale the count to flatten out extreme values:

$$\text{tf}(t, d) = \begin{cases} 1 + \log\bigl(1 + \text{count}(t, d)\bigr) & \text{if } \text{count}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$
But frequency alone doesn't solve the "the" problem. Words like "the", "is", and "of" appear in virtually every document, so matching on them tells us nothing about what makes a document special. What we really want is to boost words that are rare across the corpus (words that distinguish one document from the rest), and that's what Inverse Document Frequency (IDF) does. Given $N$ total documents and $\text{df}(t)$ counting how many contain term $t$:

$$\text{IDF}(t) = \log\frac{N}{1 + \text{df}(t)}$$
Let's walk through what this formula actually does at the extremes. If a word appears in every document, $\text{df}(t) = N$, so we get $\log\frac{N}{N+1}$, which is slightly negative (e.g. $\log\frac{100}{101} \approx -0.01$). That means extremely common words like "the" contribute almost nothing (or even slightly reduce the score) when multiplied by TF. If a word appears in just one document, $\text{df}(t) = 1$, so we get $\log\frac{N}{2}$, which for a 100-document corpus is $\log 50 \approx 3.9$. Rare words get a large positive weight. The $1+$ in the denominator prevents division by zero if $\text{df}(t) = 0$ (a term not present in any document), and the $\log$ keeps the values bounded (without it, a rare term in a million-document corpus would score $\sim\!500{,}000$ vs a common term's $\sim\!1$, completely dominating everything else).
When we multiply TF by IDF, we get a weight that's high when a term appears often in this particular document but rarely across the corpus. "The" gets crushed (IDF near zero) while "transformer" in a machine-learning corpus gets boosted (small fraction of documents). That's exactly the signal we want.
To actually score a query against a document, represent both as vectors over the full vocabulary — one weight per word, zero for words not present — and take their dot product:

$$\text{score}(q, d) = \sum_{t \in q \cap d} \text{tf}(t, d) \cdot \text{IDF}(t)$$
To make this concrete, the code below computes TF, IDF, and the final TF-IDF score for each word in a small corpus. Notice how "the" (present in every document) gets a slightly negative IDF, while "attention" (present in only one document) gets a high IDF and dominates the final score.
import math
from collections import Counter

corpus = [
    "the transformer model uses the attention mechanism",
    "the neural network is trained on the data",
    "transformer architectures revolutionised NLP",
]
query = "transformer attention"

def tokenize(text):
    return text.lower().split()

N = len(corpus)
all_tokens = [tokenize(d) for d in corpus]

# Compute df for each term
df = {}
for doc_tokens in all_tokens:
    for t in set(doc_tokens):
        df[t] = df.get(t, 0) + 1

# Show TF, IDF, TF-IDF for query terms in Doc 1, plus "the" for comparison
doc_tokens = all_tokens[0]
tf_counts = Counter(doc_tokens)
print(f"Corpus size N = {N}")
print(f"Query: '{query}'")
print(f"Doc 1: '{corpus[0]}'")
print()
print(f"{'Term':<15} {'count':>5} {'TF':>8} {'df':>4} {'IDF':>8} {'TF*IDF':>8}")
print("-" * 52)
for t in tokenize(query) + ["the"]:
    count = tf_counts.get(t, 0)
    tf_val = 1 + math.log(1 + count) if count > 0 else 0
    df_val = df.get(t, 0)
    idf_val = math.log(N / (1 + df_val))
    tfidf = tf_val * idf_val
    print(f"{t:<15} {count:>5} {tf_val:>8.3f} {df_val:>4} {idf_val:>8.3f} {tfidf:>8.3f}")
print()
print("'the' appears twice in Doc 1 but in all 3 docs => negative IDF crushes its weight")
print("'attention' appears once and only in Doc 1 => high IDF, highest final score")
Notice the sum only runs over words that appear in both the query and the document. Most words appear in neither, so these vectors are extremely sparse (that sparsity is what makes TF-IDF fast). An inverted index maps each word to the list of documents containing it, so scoring a query only touches the posting lists for its terms, not every document in the corpus.
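The inverted-index idea can be sketched in a few lines (a minimal illustration, not a production index; the posting-list layout and the `score` helper are choices made for this example, reusing the small corpus from earlier):

```python
import math
from collections import defaultdict

corpus = [
    "the transformer model uses the attention mechanism",
    "the neural network is trained on the data",
    "transformer architectures revolutionised NLP",
]

# Inverted index: term -> list of (doc_id, term_count) postings
index = defaultdict(list)
for doc_id, doc in enumerate(corpus):
    counts = {}
    for t in doc.lower().split():
        counts[t] = counts.get(t, 0) + 1
    for t, c in counts.items():
        index[t].append((doc_id, c))

N = len(corpus)

def score(query):
    # Only the posting lists for the query's terms are touched, so documents
    # sharing no terms with the query are never visited at all.
    scores = defaultdict(float)
    for t in query.lower().split():
        postings = index.get(t, [])
        idf = math.log(N / (1 + len(postings)))  # df = length of posting list
        for doc_id, count in postings:
            scores[doc_id] += (1 + math.log(1 + count)) * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score("transformer attention"))  # Doc 0 ranks first
```

Note that the posting list doubles as the document-frequency count: `len(index[t])` is exactly $\text{df}(t)$, so IDF comes for free.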
BM25: Saturation and Length Normalisation
TF-IDF works, but try it on a real corpus and two problems become obvious. First, term frequency grows without limit: a document mentioning "neural" 50 times scores much higher than one mentioning it once, even though it's probably not that much more relevant. The log in TF dampens this, but not enough. Second, long documents accumulate more words and tend to score higher just because they're longer — a 10,000-word legal filing will outscore a 200-word abstract on almost any query, even when the abstract is more on-topic.
BM25 (Best Match 25) (Robertson et al., 1994) fixes both problems in one formula:

$$\text{score}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\text{count}(t, d) \cdot (k_1 + 1)}{\text{count}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

(BM25 implementations also typically use a smoothed IDF variant, $\log\bigl(\frac{N - \text{df}(t) + 0.5}{\text{df}(t) + 0.5} + 1\bigr)$, which stays non-negative even for very common terms.)
Two parameters control how aggressively BM25 applies these corrections:
- $k_1$ caps term frequency. As count$(t,d)$ grows, the score approaches a finite ceiling of $(k_1 + 1) \cdot \text{IDF}(t)$ instead of climbing forever. Saying "neural" 50 times barely helps more than saying it 10 times. That's the saturation we were missing.
- $b$ penalises length. The denominator scales with the ratio of document length $|d|$ to the corpus average $\text{avgdl}$. With $b=0.75$ (the standard default), a document twice the average length needs proportionally stronger term matches to score as well as a shorter one. With $b=0$, length is ignored entirely.
Consider a 10,000-word legal filing and a 200-word abstract, both mentioning the query term twice. Without length normalisation, TF-IDF treats them equally (same raw count). With $b=0.75$, BM25 shrinks the legal document's score because two mentions in 10,000 words is far less concentrated than two mentions in 200.
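Saturation is easy to see numerically. Here is a quick sketch of a single term's BM25 contribution as its count grows (the IDF weight of 2.0 is arbitrary and illustrative; $b=0$ so length plays no role):

```python
def bm25_term(count, idf=2.0, k1=1.5, b=0.0, dl=100, avgdl=100):
    # Per-term BM25 contribution; with b=0, document length is ignored.
    return idf * (count * (k1 + 1)) / (count + k1 * (1 - b + b * dl / avgdl))

for c in [1, 2, 10, 50]:
    print(c, round(bm25_term(c), 3))
# 1  -> 2.0
# 2  -> 2.857
# 10 -> 4.348
# 50 -> 4.854
# The contribution approaches the ceiling (k1 + 1) * idf = 5.0 but never reaches it.
```

Going from 1 to 2 mentions adds about 0.86; going from 10 to 50 adds only about 0.5. That diminishing return is exactly the saturation behaviour described above.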
BM25 remains the dominant sparse baseline (Elasticsearch and OpenSearch use it by default, and when people say "keyword search" they almost always mean BM25). The code below compares BM25 and TF-IDF scores on a small corpus so we can see how saturation and length normalisation change the ranking.
import math, json
import js

# ---- BM25 implementation ----
def tokenize(text):
    return text.lower().split()

def build_idf(corpus):
    N = len(corpus)
    df = {}
    for doc in corpus:
        for t in set(tokenize(doc)):
            df[t] = df.get(t, 0) + 1
    return {t: math.log((N - n + 0.5) / (n + 0.5) + 1) for t, n in df.items()}

def bm25_score(query, doc, idf, avgdl, k1=1.5, b=0.75):
    tokens = tokenize(doc)
    dl = len(tokens)
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for t in set(tokenize(query)):
        if t not in idf:
            continue
        f = tf.get(t, 0)
        score += idf[t] * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

def tfidf_score(query, doc, idf):
    tokens = tokenize(doc)
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    # Log-normalised TF
    score = 0.0
    for t in set(tokenize(query)):
        if t not in idf or t not in tf:
            continue
        score += (1 + math.log(1 + tf[t])) * idf[t]
    return score

corpus = [
    "BM25 uses term frequency saturation and length normalisation",
    "TF-IDF weighs terms by how rare they are across the corpus",
    "neural networks learn dense vector representations for retrieval",
    "the inverted index enables fast sparse retrieval over large corpora",
    "BM25 is the standard baseline for sparse retrieval in information retrieval",
    "length normalisation in BM25 prevents long documents from dominating",
    "TF-IDF and BM25 both rely on an inverted index for efficiency",
    "sparse retrieval methods match exact keywords in query and document",
]
query = "BM25 sparse retrieval length normalisation"

idf = build_idf(corpus)
avgdl = sum(len(tokenize(d)) for d in corpus) / len(corpus)
bm25_scores = [bm25_score(query, d, idf, avgdl) for d in corpus]
tfidf_scores = [tfidf_score(query, d, idf) for d in corpus]

labels = [f"Doc {i+1}" for i in range(len(corpus))]
plot_data = [
    {
        "title": "BM25 vs TF-IDF Scores",
        "x_label": "Document",
        "y_label": "Score",
        "x_data": labels,
        "lines": [
            {"label": "BM25", "data": [round(s, 3) for s in bm25_scores], "color": "#3b82f6"},
            {"label": "TF-IDF", "data": [round(s, 3) for s in tfidf_scores], "color": "#f59e0b"},
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)
The Vocabulary Mismatch Problem
Every method we've built so far shares one blind spot: they only match documents that use the exact same words as the query. Search for "cardiac arrest" and BM25 returns nothing if every relevant document says "heart attack" instead. The scoring formula could be perfect and it wouldn't matter — zero shared terms means zero score.
This vocabulary mismatch problem takes several forms:
- "automobile" vs "car", "begin" vs "start" — same concept, different surface form.
- "How does a transformer work?" misses documents titled "Self-attention mechanism explained".
- Without stemming, "running" misses "runs", and irregular forms like "ran" escape even a stemmer.
- Medical abbreviations, product codes, brand names — terms that have equivalents in everyday language but don't share any characters.
There are manual fixes: add synonym dictionaries, apply stemming ("running" → "run"), or use pseudo-relevance feedback (take the top results, extract their key terms, re-run the query with those terms added). These help, but they're brittle — synonym lists go stale, stemming misfires on technical vocabulary ("transformer" the model vs "transformer" the electrical device), and feedback loops can amplify noise from bad initial results.
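The synonym-dictionary approach can be sketched in a few lines, which also makes its brittleness visible (the `SYNONYMS` table is a hypothetical, hand-written example; maintaining it is exactly the burden described above):

```python
# Hand-maintained synonym table (hypothetical entries; goes stale immediately)
SYNONYMS = {
    "automobile": ["car"],
    "cardiac": ["heart"],
    "arrest": ["attack"],  # risky: "arrest" also has an unrelated legal sense
}

def expand(query):
    # Append dictionary synonyms to the original query terms
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

print(expand("cardiac arrest"))  # -> ['cardiac', 'arrest', 'heart', 'attack']
```

The `"arrest" -> "attack"` entry illustrates the core problem: the mapping is context-blind, so a query about a police arrest would get "attack" injected too.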
SPLADE: Learning Sparse Representations
What if, instead of building synonym lists by hand, we trained a model to expand the vocabulary automatically? Given the query "cardiac arrest", could a neural network learn to also activate "heart", "attack", "coronary", "myocardial" — adding synonyms it learned from data, without anyone writing a dictionary?
That's what SPLADE does (Formal et al., 2021). It repurposes BERT's masked language model (MLM) head — the part of BERT that predicts which word should fill a [MASK] slot. For each token position $i$ in the input, the MLM head produces a score $h_{ij}$ for every word $j$ in the vocabulary. SPLADE takes the maximum across all positions and applies log-saturation:

$$w_j = \max_{i} \, \log\bigl(1 + \text{ReLU}(h_{ij})\bigr)$$
The max over positions picks the strongest signal for each vocabulary term. ReLU ensures no negative weights. And the $\log(1 + \cdot)$ applies the same saturation idea we saw in BM25: diminishing returns for very strong activations, so no single term dominates the score.
The result is a vector over the vocabulary where related terms light up even if they never appeared in the original text. For "cardiac arrest", the model might assign high weights to "heart", "attack", "coronary", and "myocardial" — all learned from training data, no dictionary required.
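The pooling step can be sketched on a toy logits matrix (the numbers and the tiny 4-word vocabulary are made up for illustration; a real SPLADE model gets $h_{ij}$ from BERT's MLM head over a vocabulary of roughly 30k terms):

```python
import math

vocab = ["heart", "attack", "cardiac", "the"]
# Toy MLM scores h[i][j]: one row per input token position, one column per vocab term
h = [
    [2.1, -0.5, 3.0, 0.1],  # position 0 (e.g. the token "cardiac")
    [1.4,  2.8, 0.2, 0.0],  # position 1 (e.g. the token "arrest")
]

def splade_weights(h):
    # w_j = max over positions i of log(1 + relu(h_ij))
    n_vocab = len(h[0])
    return [
        max(math.log(1 + max(h[i][j], 0.0)) for i in range(len(h)))
        for j in range(n_vocab)
    ]

for term, weight in zip(vocab, splade_weights(h)):
    print(f"{term:<8} {weight:.3f}")
```

"attack" gets a substantial weight even though only position 1 activates it; ReLU zeroes its negative score at position 0, and the max keeps the strongest signal. "the" barely registers.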
There's a catch, though. Without constraints, the model tends to activate nearly every vocabulary term to some degree — the vectors stop being sparse. And we need sparsity, because sparse vectors are what let us use the same inverted-index infrastructure that makes BM25 fast. SPLADE enforces sparsity with a FLOPS regularisation term in the training loss (weighted by $\lambda$):

$$\ell_{\text{FLOPS}} = \sum_{j \in V} \Bigl( \frac{1}{B} \sum_{i=1}^{B} w_j^{(d_i)} \Bigr)^2$$

where $V$ is the vocabulary, $B$ is the batch size, and $w_j^{(d_i)}$ is term $j$'s weight in document $d_i$.
This penalises vocabulary terms that activate across many documents in a training batch. The model learns to activate a term only when it genuinely matters for that document, keeping the vectors sparse enough for efficient retrieval while still expanding the vocabulary where it helps. The hyperparameter $\lambda$ controls this trade-off: higher $\lambda$ means sparser vectors (faster retrieval, but potentially less vocabulary expansion).
The name "FLOPS" comes from the fact that the number of floating-point operations during index lookup is proportional to how many non-zero terms overlap between query and document vectors. Sparser vectors intersect over fewer terms, so reducing average activation directly cuts lookup cost.
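A sketch of the penalty, computed as the sum over vocabulary terms of the squared mean activation across a batch (the batch values are toy numbers chosen for illustration):

```python
# Toy batch of document vectors over a 4-term vocabulary (rows = documents)
batch = [
    [0.9, 0.0, 0.0, 0.4],
    [0.8, 0.0, 0.5, 0.3],
    [0.7, 0.6, 0.0, 0.5],
]

def flops_penalty(batch):
    # Sum over vocab terms of (mean activation across the batch) squared.
    # Terms that fire in many documents (columns 0 and 3 here) dominate the
    # penalty, so training pushes them toward zero unless they earn their keep.
    B = len(batch)
    n_vocab = len(batch[0])
    penalty = 0.0
    for j in range(n_vocab):
        mean_j = sum(doc[j] for doc in batch) / B
        penalty += mean_j ** 2
    return penalty

print(round(flops_penalty(batch), 4))
```

Squaring the mean (rather than summing raw activations) is what makes the penalty fall hardest on terms that activate broadly: one strong activation in one document costs far less than moderate activations everywhere.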
At retrieval time, SPLADE works exactly like BM25: sparse vectors stored in an inverted index, scored with a dot product over shared terms. The difference is that "shared terms" now includes words that neither the original query nor the document contained — the model added them.
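Scoring is then the familiar sparse dot product over shared terms; a sketch with dict-based sparse vectors (the weights are illustrative, with "heart" and "attack" standing in for expansion terms the model added):

```python
# Toy sparse vectors: term -> weight. "heart" and "attack" were added by the
# model, not by the original query text (illustrative weights).
query_vec = {"cardiac": 1.4, "arrest": 1.2, "heart": 0.9, "attack": 0.7}
doc_vec = {"heart": 1.1, "attack": 1.0, "myocardial": 0.8}

def sparse_dot(q, d):
    # Iterate over the smaller vector; only shared terms contribute.
    if len(d) < len(q):
        q, d = d, q
    return sum(w * d[t] for t, w in q.items() if t in d)

print(sparse_dot(query_vec, doc_vec))  # nonzero despite the texts sharing no literal words
```

The document never says "cardiac" or "arrest", yet it scores because the expanded query and document overlap on "heart" and "attack".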
Quiz
Test your understanding of sparse retrieval methods.
In BM25, what does the parameter $k_1$ control?
Why does BM25 outperform raw TF-IDF on long documents?
A query for "automobile fuel efficiency" returns no results because the documents use the term "car mpg". This is an example of:
What does the FLOPS regularisation term in SPLADE penalise?
What key property allows SPLADE vectors to be used with a standard inverted index, just like BM25?