Compute-Optimal Training
Chinchilla (Hoffmann et al., 2022) showed that for a given compute budget $C$, the optimal model size $N$ and token count $D$ should be scaled in roughly equal proportion ($N \propto C^{0.5}$, $D \propto C^{0.5}$, working out to roughly 20 tokens per parameter), implying most LLMs at the time were oversized and undertrained rather than too small. The paper fits a parametric loss:
$$L(N, D) \approx \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$
where $E$ is irreducible loss from data entropy, and $A$, $B$, $\alpha$, $\beta$ are fitted constants.
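Minimizing this loss subject to the standard compute approximation $C \approx 6ND$ gives a closed-form optimal allocation via a Lagrange multiplier: $N_{opt} \propto C^{\beta/(\alpha+\beta)}$ and $D_{opt} \propto C^{\alpha/(\alpha+\beta)}$. A minimal sketch below, using the fitted constants reported in the Chinchilla paper ($E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$); the specific budget in the usage example is illustrative.

```python
# Compute-optimal allocation under the parametric loss
#   L(N, D) = A / N**alpha + B / D**beta + E
# with the compute constraint C ~= 6 * N * D.
# Constants are the fitted values reported by Hoffmann et al. (2022).

A, B = 406.4, 410.7
alpha, beta = 0.34, 0.28
E = 1.69  # irreducible loss; does not affect the argmin

def optimal_allocation(C):
    """Minimize L(N, D) subject to 6*N*D = C (closed form)."""
    a = beta / (alpha + beta)   # exponent for N_opt ~ C^a
    b = alpha / (alpha + beta)  # exponent for D_opt ~ C^b
    # Prefactor from setting the marginal losses equal at the optimum.
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6) ** a
    D_opt = (1.0 / G) * (C / 6) ** b
    return N_opt, D_opt

def loss(N, D):
    return A / N**alpha + B / D**beta + E

# Illustrative budget of 5.76e23 FLOPs.
N, D = optimal_allocation(5.76e23)
```

Note that with these constants $a = \beta/(\alpha+\beta) \approx 0.45$ and $b \approx 0.55$, close to the equal-scaling exponents the paper estimates empirically.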