RoPE, QK-Norm, and Logit Soft-Capping
Several architectural changes from post-GPT-2 research compound to give significant speedrun gains:
- Rotary Positional Embeddings (RoPE): replace learned absolute positional embeddings by rotating query and key vectors through position-dependent angles, so attention logits depend only on relative position; this improves length generalisation.
- QK-Norm: normalises query and key vectors before the dot product, stabilising attention logits and allowing larger learning rates.
- Logit soft-capping: applies $\tanh(x / c) \cdot c$ to attention logits before softmax, bounding them to $(-c, c)$ and preventing the logit growth that drives entropy collapse in long sequences.
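The three changes above can be sketched in a single attention-logit pipeline. This is a minimal NumPy illustration, not any particular codebase's implementation: it assumes a single head, an interleaved-halves RoPE layout, plain L2 normalisation for QK-Norm (RMSNorm variants are also common), and an illustrative cap value.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate pairs of channels by position-dependent angles (RoPE)."""
    seq, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # A 2-D rotation per (x1_i, x2_i) pair; preserves vector norms.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def qk_norm(x, eps=1e-6):
    """Normalise each query/key vector to unit L2 norm (one QK-Norm variant)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def attention_logits(q, k, cap=30.0):
    """RoPE -> QK-Norm -> scaled dot product -> tanh soft-cap.

    `cap` is illustrative; bounded logits land in (-cap, cap).
    """
    q, k = qk_norm(rope(q)), qk_norm(rope(k))
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return cap * np.tanh(logits / cap)
```

Because RoPE is a pure rotation it preserves vector norms, and soft-capping guarantees no logit can exceed `cap` in magnitude, whatever the sequence length.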