Why Build It Ourselves?

Over the last eight articles we've examined every component of the transformer architecture in isolation: attention scores, Q/K/V projections, causal masking, multi-head attention, positional encoding, residual connections, feed-forward networks, and how these pieces combine into encoders, decoders, and encoder-decoder models. Each concept made sense on its own, but there's a gap between understanding each part and seeing how they connect into a single, runnable model. The goal of this article is to close that gap by writing a minimal decoder-only transformer in PyTorch (roughly 130 lines of code), prioritising clarity over performance.

We won't use flash attention, KV caching, fused kernels, or any other optimisation. Every line maps directly to a concept from a previous article, and we'll call those connections out as we go. By the end, we'll have a model that trains on a toy task and produces correct outputs, which is a useful sanity check that our understanding of the theory actually holds up in practice.

💡 The full implementation builds up incrementally: token embeddings → single-head attention → multi-head attention → transformer block → stacked blocks → language model head. Each class is self-contained and testable.

How Do Tokens Become Vectors?

A transformer operates on continuous vectors, not discrete token IDs, so the first step is to convert each token index into a dense vector and inject positional information. We covered why position matters in article 5: self-attention is permutation-equivariant, meaning it treats the input as a set unless we explicitly encode order. The standard approach from Vaswani et al. (2017) adds sinusoidal positional encodings to the token embeddings, though learned positional embeddings (as used in GPT-2) tend to work equally well for fixed context lengths. We'll use learned embeddings here because the code is simpler.

The embedding layer takes a tensor of token IDs with shape $(B, T)$ (batch size $B$, sequence length $T$) and produces a tensor of shape $(B, T, d_{\text{model}})$. The positional embedding does the same for position indices $[0, 1, \ldots, T-1]$, and we add the two together elementwise. The following class handles both.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class TokenEmbedding(nn.Module):
    """Token + positional embeddings (article 5)."""
    def __init__(self, vocab_size, d_model, max_seq_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)

    def forward(self, x):
        B, T = x.shape
        tok = self.token_emb(x)                          # (B, T, d_model)
        pos = self.pos_emb(torch.arange(T, device=x.device))  # (T, d_model)
        return tok + pos                                  # (B, T, d_model)

One thing to notice is that pos_emb broadcasts across the batch dimension. Every sequence in the batch gets the same positional encoding, which makes sense because position 3 means the same thing regardless of which sentence it belongs to. The output is a $(B, T, d_{\text{model}})$ tensor that carries both content and position information, ready for the attention layers.
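As a quick sanity check (toy shapes chosen arbitrarily for illustration), the broadcast behaviour is easy to verify:

```python
import torch

# Toy shapes: batch of 2 sequences, length 4, embedding dim 8
tok = torch.randn(2, 4, 8)   # (B, T, d_model) token embeddings
pos = torch.randn(4, 8)      # (T, d_model) positional embeddings

out = tok + pos              # pos broadcasts across the batch dimension
print(out.shape)             # torch.Size([2, 4, 8])

# Both sequences receive the identical positional offset at each position
assert torch.allclose(out[0] - tok[0], out[1] - tok[1])
```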

From Single-Head to Multi-Head Attention

With embedded inputs in hand, we need the mechanism that lets tokens communicate with each other. We built up the intuition for this in articles 2 through 4: each token projects into a query, key, and value, the queries and keys produce attention scores via a scaled dot product, we mask out future positions to enforce causality, and the softmax-weighted values become the output. Let's start with a single attention head and then extend to multiple heads.

Recall the scaled dot-product attention formula from article 2:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

The $\sqrt{d_k}$ scaling (where $d_k$ is the dimension of each head) prevents the dot products from growing large in magnitude as the dimension increases, which would push the softmax into saturated regions where gradients vanish. Without it, training tends to be unstable for $d_k$ values above about 32. The causal mask from article 3 sets the upper-triangular entries to $-\infty$ before the softmax, ensuring that position $i$ can only attend to positions $j \leq i$.
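A quick empirical illustration of why the scaling matters (random unit-variance vectors standing in for real queries and keys):

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
q = torch.randn(1000, d_k)   # 1000 random query vectors
k = torch.randn(1000, d_k)   # 1000 random key vectors

raw = (q * k).sum(-1)            # unscaled dot products
scaled = raw / math.sqrt(d_k)    # scaled as in the attention formula

# For unit-variance components, the dot product has std ~ sqrt(d_k) = 8,
# which would saturate the softmax; scaling brings the std back to ~1.
print(f"unscaled std: {raw.std():.2f}")    # ~8
print(f"scaled std:   {scaled.std():.2f}")  # ~1
```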

A single head captures one kind of relationship (maybe syntactic adjacency, maybe coreference). Article 4 argued that we want multiple heads running in parallel, each with its own Q/K/V projections operating on a slice of the embedding dimension, so the model can attend to different relationship types simultaneously. If we have $h$ heads and embedding dimension $d_{\text{model}}$, each head operates on $d_k = d_{\text{model}} / h$ dimensions. In practice, we implement this by projecting Q, K, and V to the full $d_{\text{model}}$ dimensions with a single linear layer, then reshaping into $h$ heads. The following class implements the complete multi-head causal self-attention.

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention (articles 2-4)."""
    def __init__(self, d_model, n_heads, max_seq_len):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.qkv_proj = nn.Linear(d_model, 3 * d_model)  # single projection for Q, K, V
        self.out_proj = nn.Linear(d_model, d_model)

        # Precompute causal mask (article 3): lower-triangular = 1, upper = 0
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
        self.register_buffer("mask", mask.unsqueeze(0).unsqueeze(0))  # (1, 1, T, T)

    def forward(self, x):
        B, T, C = x.shape
        # Project to Q, K, V and split into heads
        qkv = self.qkv_proj(x)                           # (B, T, 3*d_model)
        q, k, v = qkv.chunk(3, dim=-1)                   # each (B, T, d_model)

        # Reshape: (B, T, d_model) -> (B, n_heads, T, d_k)
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention (article 2)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)  # (B, h, T, T)

        # Apply causal mask (article 3): set future positions to -inf
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))

        weights = F.softmax(scores, dim=-1)               # (B, h, T, T)
        out = weights @ v                                  # (B, h, T, d_k)

        # Concatenate heads and project (article 4)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)                          # (B, T, d_model)

There are a few details worth highlighting. We use a single linear layer qkv_proj to compute Q, K, and V in one matrix multiply, then split with chunk, which is mathematically identical to three separate projections but faster because we issue one GEMM instead of three. The causal mask is registered as a buffer (not a parameter) so it moves to the right device automatically and isn't updated by the optimiser. And the final out_proj is the learned linear layer that recombines the concatenated head outputs, as described in article 4.
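The equivalence between the fused projection and three separate ones can be checked directly, since chunking the output corresponds to slicing rows of the fused weight matrix (toy dimensions, for illustration only):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
x = torch.randn(2, 5, d_model)

fused = nn.Linear(d_model, 3 * d_model)
q, k, v = fused(x).chunk(3, dim=-1)

# The first d_model rows of the fused weight (and bias) are exactly
# the standalone Q projection; the same holds for K and V.
q_sep = x @ fused.weight[:d_model].T + fused.bias[:d_model]
assert torch.allclose(q, q_sep, atol=1e-6)
```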

Assembling the Transformer Block and Stacking It

With attention and embeddings implemented, we need two more ingredients from articles 6 and 7: residual connections and the position-wise feed-forward network (FFN). A transformer block applies attention, adds the result back to the input through a residual connection, normalises with layer norm, then passes through a two-layer FFN with another residual and another layer norm. This is the pre-norm variant (layer norm before each sub-layer), which tends to train more stably than the original post-norm layout, and it's what GPT-2 and most modern decoder models use.

The FFN from article 7 is a simple two-layer MLP that expands the dimension by a factor of 4, applies a non-linearity, and projects back down. We'll use GELU as the activation, following GPT-2.

class FeedForward(nn.Module):
    """Position-wise feed-forward network (article 7)."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)


class TransformerBlock(nn.Module):
    """Pre-norm transformer block: LN -> attention -> residual -> LN -> FFN -> residual (article 6)."""
    def __init__(self, d_model, n_heads, max_seq_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads, max_seq_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # residual around attention
        x = x + self.ffn(self.ln2(x))    # residual around FFN
        return x

Each residual connection (the x = x + ... pattern) serves the purpose we discussed in article 6: it lets gradients flow directly through the addition, bypassing the sub-layer's parameters entirely if needed, which makes it much easier to train deep stacks. Without residuals, a 6-layer transformer often fails to converge at all.
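The gradient path through the addition is visible in a tiny autograd example (the zero sub-layer here is deliberately pathological, purely to isolate the identity path):

```python
import torch

x = torch.randn(4, requires_grad=True)

# A sub-layer that contributes nothing: without the residual, no gradient
# would flow back to x through it.
sublayer_out = 0.0 * x
y = x + sublayer_out   # residual connection
y.sum().backward()

# The identity branch alone carries a gradient of 1 to every element
print(x.grad)          # tensor([1., 1., 1., 1.])
```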

Now we can stack $N$ of these blocks to form the full model. The language model adds a final layer norm after the last block, then a linear projection from $d_{\text{model}}$ to the vocabulary size, producing one logit per token in the vocabulary at each sequence position. During training, we compute cross-entropy loss between these logits and the shifted target tokens (the token at position $i+1$ is the label for position $i$, because we're doing next-token prediction as described in article 8).

class DecoderTransformer(nn.Module):
    """Minimal decoder-only transformer language model."""
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_seq_len):
        super().__init__()
        self.embedding = TokenEmbedding(vocab_size, d_model, max_seq_len)
        self.blocks = nn.Sequential(
            *[TransformerBlock(d_model, n_heads, max_seq_len) for _ in range(n_layers)]
        )
        self.ln_final = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        x = self.embedding(idx)           # (B, T, d_model)
        x = self.blocks(x)                # (B, T, d_model)
        x = self.ln_final(x)              # (B, T, d_model)
        logits = self.head(x)             # (B, T, vocab_size)

        loss = None
        if targets is not None:
            # Flatten for cross-entropy: predictions vs next tokens
            B, T, V = logits.shape
            loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
        return logits, loss

That's the entire architecture. Let's count: TokenEmbedding is roughly 10 lines, CausalSelfAttention is about 30, FeedForward and TransformerBlock together are about 20, and DecoderTransformer is another 20. In roughly 80 lines of model code, we've implemented every concept from articles 1–8.

Before training, it's useful to verify that the shapes are correct by running a forward pass with dummy data. The following snippet instantiates a small model and checks that the output has the expected dimensions.

# Shape verification with dummy data
vocab_size = 16
d_model = 64
n_heads = 4
n_layers = 2
max_seq_len = 32
batch_size = 2
seq_len = 10

model = DecoderTransformer(vocab_size, d_model, n_heads, n_layers, max_seq_len)
idx = torch.randint(0, vocab_size, (batch_size, seq_len))      # (2, 10)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))   # (2, 10)

logits, loss = model(idx, targets)
print(f"Logits shape: {logits.shape}")   # expected: (2, 10, 16)
print(f"Loss: {loss.item():.4f}")        # expected: ~2.77 (≈ -ln(1/16))
# Logits shape: torch.Size([2, 10, 16])
# Loss: ~2.77 (≈ ln(16), since the untrained model is near-uniform; exact value varies with initialisation)

The initial loss should be close to $\ln(V)$ where $V$ is the vocabulary size, because an untrained model assigns roughly uniform probability to all tokens. If we see a loss of about $\ln(16) \approx 2.77$ for our 16-token vocabulary, the shapes and the loss computation are correct and we can proceed to training.
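This expectation can be confirmed in isolation: cross-entropy against all-equal logits is exactly $\ln V$ (using a hypothetical 16-token vocabulary to match the example above):

```python
import math
import torch
import torch.nn.functional as F

V = 16
logits = torch.zeros(1, V)       # all-equal logits = uniform distribution
target = torch.tensor([3])       # any target token gives the same loss

loss = F.cross_entropy(logits, target)
print(f"{loss.item():.4f} vs ln({V}) = {math.log(V):.4f}")  # both 2.7726
```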

Training on a Toy Task

A model this small (on the order of a hundred thousand parameters) can't learn natural language, but it can learn simple algorithmic tasks that verify the attention mechanism works. We'll train it to reverse short sequences: given an input like [5, 3, 8, 0, SEP, ?, ?, ?, ?] , the model should learn to produce [0, 8, 3, 5] after the separator token. This task forces the model to attend to specific earlier positions (output position $i$ after the separator must look back to input position $L - 1 - i$), making it a good stress test for causal attention.

We'll generate training data on the fly: random sequences of length $L$, followed by a separator token, followed by the same sequence in reverse. The model sees the full concatenation as one sequence and is trained with the standard next-token prediction objective. We only compute loss on the output portion (after the separator), since the input portion has no predictable target.

# --- Toy task: learn to reverse a sequence ---
import torch

# Tokens 0..9 are data tokens, 10 = SEP, 11 = PAD (unused here)
VOCAB_SIZE = 12
SEP_TOKEN = 10
SEQ_LEN = 4       # length of sequence to reverse
TOTAL_LEN = 2 * SEQ_LEN + 1  # input + SEP + reversed output

def make_batch(batch_size):
    """Generate (input, target) pairs for the reversal task."""
    data = torch.randint(0, 10, (batch_size, SEQ_LEN))
    sep = torch.full((batch_size, 1), SEP_TOKEN)
    reversed_data = data.flip(1)
    full_seq = torch.cat([data, sep, reversed_data], dim=1)  # (B, 2*L+1)

    # Input is all tokens except the last; target is all tokens except the first
    x = full_seq[:, :-1]
    y = full_seq[:, 1:].clone()

    # Only the output portion (after the separator) has a predictable target.
    # Setting the earlier targets to -100, F.cross_entropy's default
    # ignore_index, excludes them from the loss.
    y[:, :SEQ_LEN] = -100
    return x, y

# Hyperparameters
D_MODEL = 64
N_HEADS = 4
N_LAYERS = 2
MAX_SEQ_LEN = TOTAL_LEN
LR = 3e-4
STEPS = 2000
BATCH_SIZE = 64

model = DecoderTransformer(VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS, MAX_SEQ_LEN)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

# Training loop
for step in range(STEPS):
    x, y = make_batch(BATCH_SIZE)
    logits, loss = model(x, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 400 == 0:
        print(f"Step {step:4d} | Loss: {loss.item():.4f}")

# --- Evaluate ---
model.eval()
with torch.no_grad():
    x_test, y_test = make_batch(5)
    logits, _ = model(x_test)
    preds = logits.argmax(dim=-1)  # greedy decoding

    for i in range(5):
        inp = x_test[i, :SEQ_LEN].tolist()
        expected = list(reversed(inp))
        got = preds[i, SEQ_LEN:].tolist()
        status = "PASS" if got == expected else "FAIL"
        print(f"Input: {inp} | Expected: {expected} | Got: {got} | {status}")

With 2,000 training steps and a batch size of 64, this model typically converges to near-zero loss on the reversal task within a few minutes on a CPU. The key signal that training is working is the loss curve: it should start near $\ln(12) \approx 2.48$ (uniform over 12 tokens), drop rapidly during the first few hundred steps as the model learns the structure of the task, and flatten near zero once it has learned to reverse perfectly.

If training fails (the loss stays high or oscillates), the most common causes are a learning rate that's too high (try lowering to 1e-4), too few layers or heads for the model to route information correctly, or a bug in the causal mask (which would let the model cheat by looking at future tokens during training, then fail at test time). These failure modes are instructive in themselves, because they map directly to the concepts we've been building up. The causal mask enforces the autoregressive property from article 3, multiple heads enable the parallel attention patterns from article 4, and the residual connections from article 6 make it possible for gradients to flow through multiple stacked blocks.
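One way to catch a masking bug early is to inspect the attention weights directly: with a correct causal mask they must be lower-triangular, with each row summing to 1. A minimal standalone check (random Q/K standing in for real activations):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 6, 8
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)

scores = (q @ k.T) / math.sqrt(d_k)
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)

# Masked entries become exp(-inf) = 0: weights are strictly lower-triangular
assert torch.allclose(weights, torch.tril(weights))
# and each row is still a valid probability distribution
assert torch.allclose(weights.sum(-1), torch.ones(T))
```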

💡 This implementation is intentionally unoptimised. A production transformer would use flash attention (Dao et al., 2022) to avoid materialising the full $T \times T$ attention matrix, KV caching to avoid recomputing past keys and values during generation, and fused kernels to reduce memory round-trips. Those are engineering improvements that don't change the underlying math (which is exactly what we've been studying).

We now have a working transformer, built from the ground up in about 130 lines. Every class corresponds to a concept from this track: TokenEmbedding is article 5, CausalSelfAttention is articles 2–4, TransformerBlock is articles 6–7, and DecoderTransformer ties it all together as in article 8. The next article moves from architecture to training: how do we go from this tiny toy model to a large language model that actually understands language? The answer involves pre-training at scale and fine-tuning with instructions.

Quiz

Test your understanding of the transformer implementation.

Why do we divide the attention scores by $\sqrt{d_k}$ before applying softmax?

What should the initial loss be for an untrained model with a vocabulary of 16 tokens?

In the pre-norm transformer block, where is layer normalisation applied?

Why is the causal mask registered as a buffer rather than a parameter?