What Is the NanoGPT Speedrun?
Andrej Karpathy's NanoGPT Speedrun is a community effort to reproduce GPT-2 (124M parameters) on OpenWebText in as little wall-clock time as possible. Progress is tracked only through hardware-neutral improvements: algorithmic gains, not faster GPUs.
The baseline is a faithful PyTorch re-implementation of GPT-2, trained with AdamW and a cosine learning-rate schedule. It serves as the reference point against which every subsequent improvement is measured.
Architecture
GPT-2 (124M) is a decoder-only Transformer with the following hyperparameters:
- Layers: 12 Transformer blocks
- Heads: 12 attention heads
- Embedding dimension: 768
- Context length: 1024 tokens
- Vocabulary size: 50,257 (GPT-2 BPE tokenizer)
Each block applies pre-layer-norm, multi-head causal self-attention, and an MLP with GELU activations. The embedding and unembedding matrices share weights (weight tying).
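These hyperparameters pin down the parameter count, so they can be sanity-checked with a few lines of arithmetic. A minimal sketch in pure Python, assuming GPT-2's standard layout (biases in every linear layer, learned positional embeddings, and weight tying so the unembedding adds no parameters):

```python
n_layer, d, T, vocab = 12, 768, 1024, 50257

# token embeddings + learned positional embeddings
emb = vocab * d + T * d

# per block: fused QKV projection, attention output projection,
# a 4x-wide GELU MLP, and two LayerNorms (gain + bias each)
attn = (d * 3 * d + 3 * d) + (d * d + d)
mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
ln = 2 * 2 * d
block = attn + mlp + ln

# final LayerNorm; the unembedding matrix is tied to the token embedding
total = emb + n_layer * block + 2 * d
print(f"{total:,}")  # ≈ 124M
```

The total lands at roughly 124.4M, matching the model's "124M" name.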
The Attention Mechanism
Causal self-attention prevents each token from attending to future positions. For a sequence of length $T$, each head computes queries, keys, and values $Q, K, V \in \mathbb{R}^{T \times d_k}$ (with per-head dimension $d_k = 768/12 = 64$), and the masked scaled dot-product attention is:

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$$

where the additive mask drives the post-softmax weight on every future position to zero, ensuring $\text{Attn}_{ij} = 0$ for all $j > i$.
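The masking mechanics can be sketched for a single head in a few lines of NumPy (a toy illustration, not the model's batched multi-head implementation):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal (lower-triangular) mask."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) similarity scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)  # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
out, w = causal_attention(Q, K, V)
```

After the softmax, every entry of `w` above the diagonal is exactly zero (the $-\infty$ scores exponentiate to 0), and each row still sums to 1.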
Training Setup
The baseline training loop is straightforward:
```python
import math

import torch
from torch.optim import AdamW

# Cosine LR schedule with linear warmup
def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    decay = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay))
    return min_lr + coeff * (max_lr - min_lr)

optimizer = AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)

for step in range(max_steps):
    lr = get_lr(step, warmup_steps=715, max_steps=19073, max_lr=6e-4, min_lr=6e-5)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # x, y: batches of input and target token ids
    loss = model(x, targets=y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at max norm 1.0
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```
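The schedule's three phases can be verified in isolation; `get_lr` is re-stated here, identical to the listing above, so the check runs standalone:

```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    # linear warmup from 0 to max_lr, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    if step > max_steps:
        return min_lr
    decay = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay))
    return min_lr + coeff * (max_lr - min_lr)

args = dict(warmup_steps=715, max_steps=19073, max_lr=6e-4, min_lr=6e-5)
# start of warmup, end of warmup (peak LR), end of cosine decay
lrs = [get_lr(s, **args) for s in (0, 715, 19073)]
```

At step 0 the learning rate is 0, it peaks at 6e-4 exactly when warmup ends, and it decays to the 6e-5 floor at the final step.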
Key training details: a batch size of 524,288 tokens per optimizer step (reached via gradient accumulation), gradient clipping at a max norm of 1.0, and weight decay of 0.1 applied only to the 2D parameters (weight matrices and embeddings), leaving biases and LayerNorm parameters undecayed.
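The arithmetic behind gradient accumulation is worth making explicit. A sketch assuming a hypothetical per-GPU micro-batch of 16 sequences on 8 GPUs (the micro-batch size is an assumption, not stated above):

```python
total_batch_tokens = 524_288  # target tokens per optimizer step (2**19)
micro_batch = 16              # sequences per GPU per forward pass (assumed)
seq_len = 1_024               # context length
n_gpu = 8

tokens_per_micro_step = micro_batch * seq_len * n_gpu
# backward passes accumulated before each optimizer.step()
grad_accum_steps = total_batch_tokens // tokens_per_micro_step
```

Each of the `grad_accum_steps` backward passes accumulates gradients; the loss is typically divided by `grad_accum_steps` so the summed gradient matches a single large batch.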
Baseline Results
On a single 8×H100 node, the baseline reaches the target validation loss in roughly one hour. Subsequent speedrun entries aim to match the same loss in less wall-clock time through purely algorithmic improvements.
Which parameter controls how far each token can 'look back' during self-attention in GPT-2?