Masked Attention in Decoders

In a decoder-only model, tokens may only attend to earlier positions in the sequence. This is enforced with a causal mask: attention logits for future positions (the strictly upper triangle of the score matrix) are set to $-\infty$ before the softmax, so their attention weights become exactly zero and the allowed attention pattern is lower-triangular.
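A minimal NumPy sketch of this masking step (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to raw attention scores, then softmax row-wise.

    `scores` has shape (seq_len, seq_len); scores[i, j] is the logit for
    query position i attending to key position j.
    """
    seq_len = scores.shape[-1]
    # Positions j > i are future tokens: set their logits to -inf so the
    # softmax assigns them exactly zero weight.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key dimension.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.normal(size=(4, 4)))
# Each row sums to 1, and all weight above the diagonal is zero:
# row 0 can only attend to position 0, row 1 to positions 0-1, and so on.
```

Note that the mask is applied to the logits, not the post-softmax weights: zeroing weights after the softmax would leave the remaining weights summing to less than one.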

This property makes decoder models well-suited to autoregressive generation: at each step, the model conditions on all previously generated tokens and predicts the next one.
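The generation loop described above can be sketched as follows. The `toy_next_token_logits` function here is a hypothetical stand-in for a real decoder forward pass, which would run masked self-attention over the prefix:

```python
import numpy as np

def toy_next_token_logits(tokens: list[int], vocab_size: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a decoder forward pass: returns next-token
    logits given the full prefix. Deterministic so the demo is repeatable."""
    rng = np.random.default_rng(sum(tokens))
    return rng.normal(size=vocab_size)

def generate(prompt: list[int], steps: int) -> list[int]:
    """Greedy autoregressive decoding: at each step, condition on every
    token produced so far and append the argmax of the predicted logits."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_next_token_logits(tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens

out = generate([1, 2], steps=3)  # prompt of length 2, three generated tokens
```

In practice, production decoders avoid recomputing attention over the whole prefix at every step by caching per-layer keys and values, but the conditioning structure is the same as in this loop.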