Orthogonalizing Gradient Updates

Muon (MomentUm Orthogonalized by Newton-Schulz) replaces AdamW for hidden-layer weight matrices. It applies Nesterov momentum to the gradient and then orthogonalizes the resulting update via a few Newton-Schulz iterations, replacing it with an approximately semi-orthogonal matrix: all singular values are pushed toward 1, giving orthonormal rows or columns depending on the matrix's shape.
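
To make this concrete, here is a minimal PyTorch sketch. The cubic iteration `X <- 1.5*X - 0.5*X @ X.T @ X` is the textbook Newton-Schulz orthogonalization scheme (the Muon reference implementation uses a tuned quintic variant for faster convergence), and the `muon_step` helper with its learning-rate and momentum values is an illustrative assumption, not the exact speedrun code.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix G, pushing all of its
    singular values toward 1, via the cubic Newton-Schulz iteration."""
    X = G.clone()
    if X.size(0) > X.size(1):        # work in the wide orientation so X @ X.T is small
        X = X.T
    X = X / (X.norm() + 1e-7)        # scale so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

def muon_step(weight, grad, buf, lr=0.02, momentum=0.95):
    """One illustrative Muon-style update for a hidden weight matrix:
    Nesterov momentum, then orthogonalization, then the parameter step."""
    buf.mul_(momentum).add_(grad)             # update momentum buffer
    update = grad.add(buf, alpha=momentum)    # Nesterov lookahead
    update = newton_schulz(update)            # orthogonalize the update
    weight.add_(update, alpha=-lr)
```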

Because every singular direction of the orthogonalized update has equal magnitude, no single direction dominates the step, which improves the effective learning signal per step. This is one of the largest single speedrun gains, reducing training time by roughly 25–30% versus the AdamW baseline.

📌 Muon is applied only to hidden weight matrices. Embeddings and biases still use AdamW.
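
A parameter split along those lines might look like the following sketch. The `Muon` optimizer class, the `"embed"` name filter, and the learning rates are assumptions for illustration; biases fall out of the Muon group automatically because they are 1-D.

```python
import torch

# Hypothetical partitioning: 2-D hidden weight matrices go to Muon,
# embeddings and all 1-D parameters (e.g. biases) stay on AdamW.
hidden = [p for n, p in model.named_parameters()
          if p.ndim == 2 and "embed" not in n]
other  = [p for n, p in model.named_parameters()
          if p.ndim != 2 or "embed" in n]

optimizers = [
    Muon(hidden, lr=0.02, momentum=0.95),                 # assumed Muon class
    torch.optim.AdamW(other, lr=3e-4, betas=(0.9, 0.95)), # illustrative settings
]
```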