RoPE, QK-Norm, and Logit Soft-Capping

Several architectural changes from post-GPT-2 research compound to give significant speedrun gains:

  • Rotary Positional Embeddings (RoPE): replace learned absolute positional embeddings with position-dependent rotations of the query and key vectors, so attention scores depend only on relative offsets, improving length generalisation.
  • QK-Norm: normalises query and key vectors before the dot product, stabilising attention logits and allowing larger learning rates.
  • Logit soft-capping: applies $c \cdot \tanh(x / c)$ to attention logits before the softmax, smoothly bounding them to $(-c, c)$ so they cannot grow without limit, which helps prevent entropy collapse in long sequences.