What Is the NanoGPT Speedrun?

The NanoGPT Speedrun is a community effort, built on Andrej Karpathy's nanoGPT, to train GPT-2 (124M parameters) to a target validation loss on FineWeb in as little wall-clock time as possible on fixed hardware, so that progress reflects algorithmic gains rather than faster GPUs.

💡 The target metric is a validation loss on FineWeb of ~3.28 (cross-entropy, in nats per token), matching the quality of the original GPT-2 (124M) checkpoint. Each entry in the speedrun must train from scratch.
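For intuition, assuming the target is cross-entropy in nats per token (the usual convention for this loss), it corresponds to a perplexity of roughly 26.6. This back-of-the-envelope conversion is for illustration only, not part of the official metric:

```python
import math

# Cross-entropy (nats per token) -> perplexity: ppl = e^loss
target_loss = 3.28
perplexity = math.exp(target_loss)
print(f"{perplexity:.1f}")  # ~26.6
```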

The baseline is a faithful PyTorch re-implementation of GPT-2, trained with AdamW and a cosine learning-rate schedule. It serves as the reference point against which every subsequent improvement is measured.

Architecture

GPT-2 (124M) is a decoder-only Transformer with the following hyperparameters:

  • Layers: 12 Transformer blocks
  • Heads: 12 attention heads
  • Embedding dimension: 768
  • Context length: 1024 tokens
  • Vocabulary size: 50,257 (GPT-2 BPE tokenizer)

Each block applies pre-layer-norm, multi-head causal self-attention, and an MLP with GELU activations. The embedding and unembedding matrices share weights (weight tying).
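As a sanity check, the hyperparameters above can be turned into a back-of-the-envelope parameter count. This is a sketch using standard GPT-2 shapes (fused QKV projection, 4x MLP expansion, tied unembedding), not code from the speedrun repo:

```python
# Parameter count for GPT-2 (124M) from the hyperparameters above.
n_layer, d, ctx, vocab = 12, 768, 1024, 50257

wte = vocab * d              # token embedding (tied with the unembedding)
wpe = ctx * d                # learned position embedding
attn = d * 3 * d + 3 * d     # fused QKV projection (weight + bias)
attn += d * d + d            # attention output projection
mlp = d * 4 * d + 4 * d      # MLP up-projection to 4d
mlp += 4 * d * d + d         # MLP down-projection back to d
ln = 2 * (2 * d)             # two LayerNorms per block (weight + bias each)
block = attn + mlp + ln

total = wte + wpe + n_layer * block + 2 * d  # + final LayerNorm
print(f"{total:,}")  # 124,439,808 -> the "124M" model
```

Because the unembedding shares the token-embedding matrix, it contributes no extra parameters; without weight tying the count would grow by another ~38.6M.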

The Attention Mechanism

Causal self-attention prevents each token from attending to future positions. For a sequence of length $T$ and embedding dimension $d$, the scaled dot-product attention is:

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head dimension and the causal mask sets the pre-softmax scores to $-\infty$ for positions $j > i$, so that each attention weight on a future token is exactly zero.
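The masked softmax above can be sketched directly. This is a minimal single-head NumPy version for illustration; the real model uses batched multi-head attention in PyTorch:

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask (single head)."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T) raw scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores[mask] = -np.inf                            # block attention to the future
    # numerically stable softmax over each row
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_k = 4, 8
out, w = causal_attention(rng.normal(size=(T, d_k)),
                          rng.normal(size=(T, d_k)),
                          rng.normal(size=(T, d_k)))
# every weight on a future position is zero; each row still sums to 1
assert np.allclose(np.triu(w, k=1), 0.0)
assert np.allclose(w.sum(axis=-1), 1.0)
```

Setting masked scores to $-\infty$ before the softmax (rather than zeroing afterwards) keeps each row a valid probability distribution over the visible positions.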

Training Setup

The baseline training loop is straightforward:

import math

import torch
from torch.optim import AdamW

# Cosine LR schedule with linear warmup
def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    decay = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay))
    return min_lr + coeff * (max_lr - min_lr)

# Note: the actual nanoGPT setup splits parameters into groups so that weight
# decay applies only to 2-D tensors; model.parameters() here is a simplification.
optimizer = AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(max_steps):
    lr = get_lr(step, warmup_steps=715, max_steps=19073, max_lr=6e-4, min_lr=6e-5)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # x, y are input and target token batches from the data loader;
    # the model's forward pass returns logits and the cross-entropy loss
    logits, loss = model(x, targets=y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
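A quick way to sanity-check the schedule is to evaluate it at its endpoints: with the hyperparameters above it should warm up from 0 to 6e-4 over 715 steps, then decay to 6e-5 by step 19,073. The snippet below repeats `get_lr` so it runs standalone:

```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    # same schedule as above: linear warmup, then cosine decay to min_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    decay = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay))
    return min_lr + coeff * (max_lr - min_lr)

args = dict(warmup_steps=715, max_steps=19073, max_lr=6e-4, min_lr=6e-5)
assert get_lr(0, **args) == 0.0                   # warmup starts at zero
assert abs(get_lr(715, **args) - 6e-4) < 1e-12    # peak LR at end of warmup
assert abs(get_lr(19073, **args) - 6e-5) < 1e-12  # fully decayed to min LR
```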

Key training details: an effective batch size of 524,288 tokens per optimizer step (assembled via gradient accumulation), gradient clipping at a max norm of 1.0, and weight decay of 0.1 applied only to parameters with two or more dimensions (weight matrices and embeddings), not to biases or LayerNorm parameters.
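The 524,288-token batch is far too large for a single forward pass, so it is built up from micro-batches whose gradients are accumulated before each `optimizer.step()`. A sketch of the arithmetic, where the micro-batch size of 16 sequences per GPU is an illustrative assumption (the actual split depends on GPU memory):

```python
total_batch_tokens = 524_288   # 2**19 tokens per optimizer step
seq_len = 1024                 # context length
micro_batch_seqs = 16          # sequences per GPU per forward pass (assumed)
n_gpus = 8

tokens_per_micro_step = micro_batch_seqs * seq_len * n_gpus
# the accumulation count must divide the batch evenly
assert total_batch_tokens % tokens_per_micro_step == 0
grad_accum_steps = total_batch_tokens // tokens_per_micro_step
print(grad_accum_steps)  # 4 micro-steps per optimizer update
```

Note that 524,288 = 2^19, chosen so the token budget divides cleanly across powers-of-two micro-batch configurations.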

Baseline Results

On a single 8×H100 node, the baseline reaches the target validation loss in roughly an hour. Subsequent speedrun entries aim to match the same loss in less time through purely algorithmic improvements.

Which parameter controls how far each token can 'look back' during self-attention in GPT-2?