Predicting the Order of Upcoming Tokens Improves Language Modeling
- #Token Prediction
- #Machine Learning
- #Language Modeling
- Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but its improvements have been inconsistent.
- Token Order Prediction (TOP) is introduced as an alternative that trains models to rank upcoming tokens by their proximity, using a learning-to-rank loss.
- TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers.
- Models of 340M, 1.8B, and 7B parameters were pretrained using NTP, MTP, and TOP objectives.
- Results on eight standard NLP benchmarks show that TOP outperforms both NTP and MTP overall, even at scale.
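The ranking objective described above can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: the target construction (scoring each vocabulary token by how soon it next appears within a window) and the ListNet-style softmax cross-entropy loss are assumptions about how a proximity-based learning-to-rank objective might look; the function names `top_targets` and `top_loss` are invented for the example.

```python
import math

def top_targets(tokens, vocab_size, window=4):
    """Hypothetical TOP target construction: at each position t, a
    vocabulary token whose next occurrence is at offset d
    (1 <= d <= window) receives score window - d + 1, so closer
    tokens rank higher; tokens absent from the window score 0."""
    T = len(tokens)
    targets = [[0.0] * vocab_size for _ in range(T)]
    for t in range(T):
        for d in range(1, window + 1):
            # keep only the closest occurrence of each token
            if t + d < T and targets[t][tokens[t + d]] == 0.0:
                targets[t][tokens[t + d]] = float(window - d + 1)
    return targets

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_loss(scores, targets):
    """ListNet-style listwise ranking loss: cross-entropy between the
    softmax of the proximity targets and the softmax of the scores
    produced by the extra unembedding head."""
    total = 0.0
    for s_row, t_row in zip(scores, targets):
        p = softmax(s_row)   # predicted ranking distribution
        q = softmax(t_row)   # target distribution from proximity scores
        total += -sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return total / len(scores)

# Toy usage: sequence over a 5-token vocabulary, ranking window of 2.
tokens = [1, 2, 3, 2, 1]
targets = top_targets(tokens, vocab_size=5, window=2)
loss = top_loss(targets, targets)  # scoring with the targets themselves
```

Because TOP only needs these per-position score vectors, a single extra unembedding layer (one `hidden_dim x vocab_size` projection) suffices, whereas MTP adds a transformer layer per predicted future token.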