RoPE, QK-Norm, and Logit Soft-Capping
Several architectural changes from post-GPT-2 research compound to give significant speedrun gains:
- Rotary Positional Embeddings (RoPE): replace learned absolute positional embeddings by rotating query and key vectors through position-dependent angles, so attention logits depend only on relative position; this improves length generalisation.
- QK-Norm: normalises query and key vectors before the dot product, stabilising attention logits and allowing larger learning rates.
- Logit soft-capping: applies $\tanh(x / c) \cdot c$ to attention logits before softmax, bounding them to $(-c, c)$ and preventing the logit growth that drives entropy collapse in long sequences.
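The three changes above can be sketched in a single attention-logit pipeline. This is a minimal NumPy illustration, not any particular codebase's implementation: it assumes a single head, an interleaved-halves RoPE layout, plain L2 normalisation for QK-Norm (RMSNorm variants are also common), and an illustrative cap value.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate pairs of channels by position-dependent angles (RoPE)."""
    seq, d = x.shape
    half = d // 2
    inv_freq = base ** (-np.arange(half) / half)          # (half,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # A 2-D rotation per (x1_i, x2_i) pair; preserves vector norms.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def qk_norm(x, eps=1e-6):
    """Normalise each query/key vector to unit L2 norm (one QK-Norm variant)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def attention_logits(q, k, cap=30.0):
    """RoPE -> QK-Norm -> scaled dot product -> tanh soft-cap.

    `cap` is illustrative; bounded logits land in (-cap, cap).
    """
    q, k = qk_norm(rope(q)), qk_norm(rope(k))
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return cap * np.tanh(logits / cap)
```

Because RoPE is a pure rotation it preserves vector norms, and soft-capping guarantees no logit can exceed `cap` in magnitude, whatever the sequence length.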