Compute-Optimal Training

Chinchilla (Hoffmann et al., 2022) showed that for a given compute budget $C$, the optimal model size $N$ (parameters) and training token count $D$ should be scaled in equal proportion, working out to roughly $D \approx 20N$ tokens per parameter. This implied that most LLMs at the time were too large for their data budgets, i.e., undertrained rather than too small.

$$L(N, D) \approx \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

where $E$ is the irreducible loss set by the entropy of the data, and $A$, $B$, $\alpha$, $\beta$ are empirically fitted constants. Minimizing this loss subject to the compute constraint $C \approx 6ND$ yields the compute-optimal split between parameters and tokens.
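As a sketch of how the optimal allocation falls out of this loss, the snippet below minimizes $L(N, D)$ subject to $C = 6ND$ in closed form and cross-checks the result with a brute-force sweep. The constants are the values reported by Hoffmann et al. (2022), used here purely for illustration; the function names and the choice of budget are my own.

```python
import math

# Fitted constants reported in Hoffmann et al. (2022); illustrative only.
A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Parametric loss L(N, D) = A/N^alpha + B/D^beta + E."""
    return A / N**alpha + B / D**beta + E

def optimal_allocation(C):
    """Minimize L(N, D) subject to C = 6*N*D.

    Substituting D = C/(6N) and setting dL/dN = 0 gives
        N_opt = G * (C/6)^(beta/(alpha+beta)),  D_opt = (C/6) / N_opt,
    with G = (alpha*A / (beta*B))^(1/(alpha+beta)).
    """
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_opt = G * (C / 6) ** (beta / (alpha + beta))
    D_opt = (C / 6) / N_opt
    return N_opt, D_opt

# Sanity check: a log-spaced sweep over N (one decade either side of the
# closed-form answer) should not find anything better.
C = 6e23  # a Chinchilla-scale compute budget in FLOPs (assumed for the demo)
N_opt, D_opt = optimal_allocation(C)
grid = [N_opt * 10 ** (k / 200) for k in range(-200, 201)]
best_N = min(grid, key=lambda N: loss(N, C / (6 * N)))
```

Note that because $\alpha > \beta$ in these fits, the optimal $N$ grows slightly slower than $\sqrt{C}$ and $D$ slightly faster, which is why larger budgets tilt toward more tokens per parameter.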