Masked Language Modelling

BERT (Devlin et al., 2018) is trained on two objectives: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). MLM randomly selects 15% of input tokens and asks the model to predict them from the surrounding context, so the model attends to both left and right context simultaneously, something a causal decoder cannot do.
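The masking procedure from the BERT paper has a wrinkle worth seeing in code: of the 15% of positions selected for prediction, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. A minimal sketch of that corruption step (the `MASK_ID` and `VOCAB_SIZE` constants below are illustrative values for a BERT-style WordPiece vocabulary, not taken from the text):

```python
import random

MASK_ID = 103       # assumed [MASK] token id (bert-base-uncased convention)
VOCAB_SIZE = 30522  # assumed vocabulary size for the random-token branch

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption.

    Selects ~mask_prob of positions; of those, 80% become [MASK],
    10% become a random token, 10% stay unchanged. Returns
    (corrupted_ids, labels), with labels set to -100 at unselected
    positions so the loss ignores them.
    """
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID                    # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token
    return corrupted, labels
```

The 10% random / 10% unchanged branches keep the model from only learning representations for the artificial `[MASK]` token, which never appears at inference time.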

📌 RoBERTa (Liu et al., 2019) later showed that NSP hurts rather than helps; modern encoder models train with MLM only.