Masked Attention in Decoders
In a decoder-only model, tokens may attend only to earlier positions in the sequence. This is enforced by adding a causal mask before the softmax: an additive matrix that is zero on and below the diagonal and $-\infty$ strictly above it, so the allowed attention pattern is lower-triangular and every future position receives zero weight after softmax.
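A minimal sketch of this in NumPy, assuming single-head attention over a `(seq_len, d)` batch of queries, keys, and values (`causal_attention` is an illustrative name, not a library function):

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d). Positions strictly above the
    diagonal of the score matrix are set to -inf, so after softmax each
    token attends only to itself and earlier tokens.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # (seq_len, seq_len) logits
    # Boolean mask of the strictly upper triangle = future positions.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Note that the first row of the masked score matrix has only one finite entry, so the first output token is exactly `v[0]` — a quick sanity check that the mask is working.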
This property makes decoder models well-suited to autoregressive generation: at each step, the model conditions on all previously generated tokens and predicts the next one.
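The generation loop itself can be sketched as follows; `logits_fn` stands in for a hypothetical model forward pass that maps the token sequence so far to next-token logits (greedy decoding is used here for simplicity, though sampling is common in practice):

```python
import numpy as np

def generate(logits_fn, prompt, max_new_tokens):
    """Greedy autoregressive decoding sketch.

    logits_fn: callable taking the full token list so far and returning
    a 1-D array of next-token logits. Because of the causal mask, each
    step conditions only on previously generated tokens.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens)             # condition on all tokens so far
        tokens.append(int(np.argmax(logits)))  # pick the most likely next token
    return tokens
```
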