What Are Queries, Keys, and Values?

The previous article ended with a question: if we remove the RNN and rely on attention alone, how does each position decide which other positions matter? Bahdanau attention used the decoder's hidden state to query the encoder, but in a transformer there are no hidden states built through recurrence. Instead, every position in the sequence gets three different representations — a query, a key, and a value — each produced by a separate learned linear projection of the same input embedding.

Given an input matrix $X \in \mathbb{R}^{T \times d_{\text{model}}}$ (where $T$ is the sequence length and $d_{\text{model}}$ is the embedding dimension), we compute:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$. Each row of $Q$ is a query vector for one position, each row of $K$ is a key vector, and each row of $V$ is a value vector.

The intuition behind this separation is worth sitting with, because it's easy to gloss over. A query encodes what a position is looking for (it represents the question "which other positions are relevant to me?"). A key encodes what a position contains (it represents the answer "here's what I have to offer."). A value encodes what information a position actually carries (it's the payload that gets passed along once relevance is established). The query and key interact to determine how much attention to pay, and the value determines what information gets transferred.

Why three separate projections instead of just using the raw embeddings? Because the same token might need to be found by very different queries depending on context. The word "bank" should present a key that matches location-oriented queries in "river bank" but finance-oriented queries in "bank account". Having separate learned projections lets the model build specialised representations for each role: the query space, the key space, and the value space can each capture different aspects of meaning that serve different purposes in the attention computation.

💡 Think of it like a library. The query is the search term we type in. The key is the metadata tag on each book (title, subject, keywords). The value is the actual content of the book. We match our search term against metadata to find relevant books, then read the content of the ones that matched. Separating the "what to match on" from "what to retrieve" gives the system flexibility that a single representation can't provide.
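The library analogy can be made concrete with a toy "soft lookup" in NumPy. The books, tags, and numbers below are invented for illustration; the point is that matching happens against keys while retrieval pulls from values:

```python
import numpy as np

# Hypothetical toy "library": three books, each with a key (metadata tag)
# and a value (the book's content, here just a 2-number payload).
keys = np.array([
    [1.0, 0.0],   # book tagged "rivers"
    [0.0, 1.0],   # book tagged "finance"
    [0.7, 0.7],   # book tagged with a bit of both
])
values = np.array([
    [10.0, 0.0],
    [0.0, 20.0],
    [5.0, 5.0],
])

query = np.array([0.9, 0.1])  # a search mostly about rivers

scores = keys @ query                             # how well each tag matches the search
weights = np.exp(scores) / np.exp(scores).sum()   # softmax turns scores into proportions
retrieved = weights @ values                      # blend of contents, weighted by relevance

print(np.round(weights, 3))    # most weight lands on the "rivers" book
print(np.round(retrieved, 3))  # retrieved content leans toward that book's payload
```

Unlike a hard database lookup, nothing is retrieved exclusively: every book contributes in proportion to how well its tag matched.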

Why Dot Products, and Why Scale Them?

Once we have queries and keys, we need a way to measure how well they match. The simplest choice is a dot product: for query $q_i$ (a row of $Q$) and key $k_j$ (a row of $K$), the raw attention score is $q_i \cdot k_j = \sum_{m=1}^{d_k} q_{im} k_{jm}$. Two vectors pointing in similar directions yield a large positive dot product; orthogonal vectors yield zero; opposing vectors yield a large negative value. This is fast, parameter-free, and captures the similarity signal we need.
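A quick numerical check of the three cases, with vectors chosen by hand for illustration:

```python
import numpy as np

q = np.array([1.0, 2.0, -1.0])

k_similar = np.array([0.9, 2.1, -0.8])     # roughly the same direction as q
k_orthogonal = np.array([2.0, -1.0, 0.0])  # perpendicular to q
k_opposing = np.array([-1.0, -2.0, 1.0])   # the opposite direction

print(q @ k_similar)     # 5.9  (large positive)
print(q @ k_orthogonal)  # 0.0  (no similarity signal)
print(q @ k_opposing)    # -6.0 (large negative)
```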

Computing all pairwise dot products at once gives us the full attention score matrix. For $T$ tokens with $d_k$-dimensional queries and keys:

$$S = QK^\top \in \mathbb{R}^{T \times T}$$

Entry $S_{ij}$ tells us how much position $i$'s query matches position $j$'s key. But these raw scores have a problem that becomes severe as $d_k$ grows. If the entries of $Q$ and $K$ are drawn roughly from a standard normal distribution (mean 0, variance 1), each dot product is a sum of $d_k$ independent products, so the variance of $S_{ij}$ scales linearly with $d_k$:

$$\text{Var}(q_i \cdot k_j) = \text{Var}\!\left(\sum_{m=1}^{d_k} q_{im} k_{jm}\right) = d_k$$

With $d_k = 64$ (typical for a single attention head), dot products have a standard deviation of $\sqrt{64} = 8$. With $d_k = 512$, the standard deviation jumps to about 22.6. Larger dot products (in absolute value) mean that the softmax in the next step pushes almost all of its probability mass onto one or two entries, producing attention weights that are nearly one-hot. Why is that bad? Because the gradients of softmax in the saturated regime are extremely small (close to zero for nearly all entries). Training slows to a crawl or stalls entirely, because the model can't adjust which positions to attend to.
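The linear growth of the variance is easy to verify empirically. This sketch samples random queries and keys with standard-normal entries and measures the spread of their dot products:

```python
import numpy as np

rng = np.random.default_rng(0)

# Var(q . k) should scale linearly with d_k, so the std dev scales with sqrt(d_k).
for d_k in [16, 64, 256]:
    q = rng.standard_normal((20_000, d_k))  # 20,000 random queries
    k = rng.standard_normal((20_000, d_k))  # 20,000 random keys
    dots = (q * k).sum(axis=1)              # one dot product per row
    print(d_k, round(dots.std(), 1))        # std ~ sqrt(d_k): about 4, 8, 16
```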

The fix from Vaswani et al. (2017) is to divide by $\sqrt{d_k}$ before applying softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Let's examine what each piece of this formula does by considering what would break if we removed it.

Without the $\frac{1}{\sqrt{d_k}}$ scaling: as we just discussed, the dot products grow with dimension, softmax saturates, and gradients vanish. The model effectively makes hard attention decisions from the start of training, before it has learned anything useful, and then can't update those decisions because the gradients are too small.

Without the softmax: the attention weights wouldn't sum to 1, so the output would be a linear combination with arbitrary (possibly negative) coefficients. Softmax ensures the weights form a valid probability distribution, which means the output is a convex combination of value vectors, which lies within the convex hull of the values, keeping the output scale bounded and interpretable as "how much to attend to each position."

Without the multiplication by $V$: we'd have attention weights but nothing to attend to. The weights tell us which positions matter, but the values carry the actual information. Without $V$, the output of the attention layer would be the weight matrix itself (a $T \times T$ matrix of scalars), not a sequence of $d_v$-dimensional vectors that can feed into the next layer.

💡 The $\sqrt{d_k}$ factor normalises the variance of the dot products back to approximately 1, regardless of $d_k$. After dividing, $\text{Var}(S_{ij} / \sqrt{d_k}) = d_k / d_k = 1$. This keeps the softmax inputs in a moderate range where the distribution has entropy (it can spread probability across multiple positions) and gradients flow.

To see this concretely, consider the extreme case where $d_k = 1$. Then each "dot product" is just the product of two scalars. Variance is 1, softmax inputs are moderate, and everything works fine, so we don't need scaling. Now increase to $d_k = 512$: without scaling, a typical dot product might be $\pm 22$, and $\text{softmax}([22, -3, 1, -5])$ gives nearly $[1, 0, 0, 0]$ (all mass on one entry, no useful gradient for the others). Dividing by $\sqrt{512} \approx 22.6$ brings us back to moderate values around $\pm 1$, where softmax produces a smooth distribution and learning can proceed.
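The saturation is easy to reproduce. The score vector below is the hypothetical example from the text, compared with and without the $\sqrt{512}$ scaling:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

raw = np.array([22.0, -3.0, 1.0, -5.0])  # typical unscaled scores at d_k = 512
scaled = raw / np.sqrt(512)              # divide by sqrt(d_k) ~ 22.6

print(np.round(softmax(raw), 6))     # essentially one-hot: [1, 0, 0, 0]
print(np.round(softmax(scaled), 3))  # a smooth distribution over all four entries
```

In the unscaled case, the three losing entries receive probabilities around $10^{-10}$ or smaller, so their gradients are effectively zero; after scaling, every entry keeps a meaningful share of the probability mass.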

How the Weighted Sum Produces the Output

After softmax, we have a matrix $A = \text{softmax}(QK^\top / \sqrt{d_k}) \in \mathbb{R}^{T \times T}$, where each row $A_i$ is a probability distribution over all $T$ positions. The final step is to multiply this by the value matrix $V \in \mathbb{R}^{T \times d_v}$:

$$\text{Output}_i = \sum_{j=1}^{T} A_{ij} \, V_j$$

Each output position $i$ is a weighted average of all value vectors, where the weights come from how well position $i$'s query matched each position's key. If position $i$ attends strongly to position 3 ($A_{i3}$ is large) and weakly to everything else, then $\text{Output}_i \approx V_3$ (the output at position $i$ is approximately the value vector of position 3). If the attention is spread uniformly, the output is the mean of all value vectors, which tends to produce a blurred, less useful representation.
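Both regimes can be sketched in a couple of lines, using made-up attention weights and random value vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.standard_normal((4, 3))  # 4 value vectors, each 3-dimensional

peaked = np.array([0.01, 0.01, 0.97, 0.01])  # nearly all weight on one position
uniform = np.full(4, 0.25)                   # attention spread evenly

print(np.round(peaked @ V - V[2], 3))            # small residual: output ~ that position's value
print(np.allclose(uniform @ V, V.mean(axis=0)))  # True: output is the mean of all values
```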

This is the complete mechanism, and we can implement it in a few lines of Python. The code below computes single-head scaled dot-product attention from scratch using only NumPy, so we can see exactly what happens at each step.

import numpy as np

np.random.seed(42)

# Dimensions
T = 4       # sequence length (4 tokens)
d_model = 8 # embedding dimension
d_k = 4     # query/key dimension
d_v = 4     # value dimension

# Random input embeddings (T tokens, each d_model-dimensional)
X = np.random.randn(T, d_model)

# Learned projection matrices (in practice, these are trained);
# scaled by 1/sqrt(d_model) so Q, K, V entries have roughly unit variance
W_Q = np.random.randn(d_model, d_k) / np.sqrt(d_model)
W_K = np.random.randn(d_model, d_k) / np.sqrt(d_model)
W_V = np.random.randn(d_model, d_v) / np.sqrt(d_model)

# Project inputs to queries, keys, values
Q = X @ W_Q  # (T, d_k)
K = X @ W_K  # (T, d_k)
V = X @ W_V  # (T, d_v)

# Raw attention scores
scores = Q @ K.T  # (T, T)
print("Raw scores (before scaling):")
print(np.round(scores, 3))
print(f"\nScore std dev: {scores.std():.3f} (expected ~sqrt(d_k)={np.sqrt(d_k):.3f})")

# Scaled scores
scaled_scores = scores / np.sqrt(d_k)
print(f"\nScaled scores std dev: {scaled_scores.std():.3f} (expected ~1.0)")

# Softmax (row-wise)
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

attn_weights = softmax(scaled_scores)
print("\nAttention weights (each row sums to 1):")
print(np.round(attn_weights, 3))
print("Row sums:", np.round(attn_weights.sum(axis=1), 6))

# Weighted sum of values
output = attn_weights @ V  # (T, d_v)
print("\nOutput (T x d_v):")
print(np.round(output, 3))

Notice a few things in the output. The raw scores have a standard deviation on the order of $\sqrt{d_k} = 2.0$, and this spread would grow with larger $d_k$. After dividing by $\sqrt{d_k}$, the standard deviation drops back toward 1.0. Each row of the attention weights sums to exactly 1 (a valid probability distribution), and the final output has the same shape as $V$ (one $d_v$-dimensional vector per position).

💡 In the code above, the weight matrices $W^Q$, $W^K$, $W^V$ are initialised with small random values. In practice, careful initialisation (such as Xavier/Glorot, which scales by roughly $1/\sqrt{d_{\text{model}}}$) ensures that the projections produce outputs with unit variance from the start, which matters for stable training.

From One Head to Many

A single attention head computes one set of attention weights (one way for each position to look at all other positions). But a token might need to attend to different positions for different reasons: one attention pattern might capture syntactic dependencies ("which noun does this adjective modify?") while another captures semantic relationships ("which earlier token is this pronoun referring to?"). A single head must compress all these needs into one set of weights.

Multi-head attention (Vaswani et al., 2017) solves this by running $h$ attention heads in parallel, each with its own projection matrices $W_i^Q, W_i^K, W_i^V$, then concatenating and projecting the results:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O$$
$$\text{where } \text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

Each head uses $d_k = d_{\text{model}} / h$, so the total computation is roughly the same as a single head with the full $d_{\text{model}}$ dimension. With 8 heads and $d_{\text{model}} = 512$, each head operates in a 64-dimensional subspace. The output projection $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ mixes the heads' outputs back into the model's representation space.
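As a sketch (simplified: self-attention only, no masking or biases, and the per-head projections stored as combined $d_{\text{model}} \times d_{\text{model}}$ matrices that get split into heads), multi-head attention fits in a few lines of NumPy:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Minimal multi-head self-attention sketch."""
    T, d_model = X.shape
    d_k = d_model // h                             # each head gets d_model / h dims
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each (T, d_model)
    # Split the last dimension into h heads: (h, T, d_k)
    split = lambda M: M.reshape(T, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, T, T), one pattern per head
    heads = A @ Vh                                          # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)   # concatenate heads: (T, h * d_k)
    return concat @ W_O                                     # mix heads back together

rng = np.random.default_rng(0)
T, d_model, h = 4, 16, 4
X = rng.standard_normal((T, d_model))
W_Q, W_K, W_V, W_O = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (4, 16): same shape as the input
```

Note that the heads are computed with one batched matrix multiply each rather than a Python loop; this is why the total cost is roughly the same as a single full-width head.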

What would happen at the extremes? If $h = 1$, we recover single-head attention: one set of Q, K, V projections and one attention pattern per layer. The model can still learn, but each layer can only compute one weighted average of values, forcing it to compress everything into a single attention pattern. If $h = d_{\text{model}}$, each head operates in a 1-dimensional subspace, so every dot product $q^\top k$ is just a scalar times a scalar. The heads become too narrow to capture meaningful relationships. The standard choice of $h = 8$ or $h = 16$ sits between these extremes, giving each head enough dimensions to learn useful patterns while still providing enough heads for specialisation.

In practice, different heads often learn to attend to different things. Clark et al. (2019) found that in BERT, some heads specialise in attending to the previous or next token, others track syntactic dependencies like subject-verb agreement, and some attend broadly to the whole sentence. This emergent specialisation is part of what makes multi-head attention so effective: it learns a diverse set of attention patterns without being told what to look for.

We now have the complete picture of how attention scores work: project to Q, K, V, compute scaled dot products, apply softmax, take a weighted sum of values, and do this multiple times in parallel with separate heads. The next article tackles a hard constraint: what happens when the model is generating text and must not look at future tokens? That's where causal masking enters the picture, and the attention matrix takes on a very specific structure.

Quiz

Test your understanding of scaled dot-product attention.

Why do we divide the dot product scores by √d_k before applying softmax?

What role does the Value matrix V play in the attention computation?

If d_k = 256 and query/key entries are roughly standard normal, what is the approximate standard deviation of the raw (unscaled) dot products?

Why does multi-head attention use h separate heads with d_k = d_model / h instead of one head with d_k = d_model?