Hasty Briefs

Predicting the Order of Upcoming Tokens Improves Language Modeling

  • #Token Prediction
  • #Machine Learning
  • #Language Modeling
  • Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but it has shown inconsistent improvements in practice.
  • Token Order Prediction (TOP) is introduced as an alternative that trains models to rank upcoming tokens by how soon they will appear, using a learning-to-rank loss (see the sketch after this list).
  • TOP adds only a single extra unembedding layer, whereas MTP requires multiple additional transformer layers.
  • Models of 340M, 1.8B, and 7B parameters were pretrained using NTP, MTP, and TOP objectives.
  • Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP, and the advantage holds even at the larger scales.
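
To make the objective concrete, here is a minimal PyTorch-style sketch of a TOP-like auxiliary loss. The function name `top_loss`, the window size, the graded-relevance scheme (a token k steps ahead gets relevance window − k + 1), and the ListNet-style softmax cross-entropy are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def top_loss(hidden, top_head, input_ids, window=4):
    """Illustrative Token Order Prediction (TOP) auxiliary loss.

    hidden:    (batch, seq, d_model) backbone hidden states
    top_head:  the single extra unembedding layer TOP adds,
               e.g. torch.nn.Linear(d_model, vocab_size, bias=False)
    input_ids: (batch, seq) token ids
    window:    number of upcoming tokens to rank by proximity
    """
    batch, seq = input_ids.shape
    scores = top_head(hidden)                # (batch, seq, vocab)

    # Graded relevance targets: a token k steps ahead receives
    # relevance window - k + 1, so nearer tokens rank higher.
    # Writing farthest-first lets the nearest occurrence of a
    # repeated token overwrite farther ones.
    relevance = torch.zeros_like(scores)
    for k in range(window, 0, -1):
        idx = input_ids[:, k:].unsqueeze(-1)  # (batch, seq - k, 1)
        relevance[:, : seq - k].scatter_(2, idx, float(window - k + 1))

    # ListNet-style loss: softmax cross-entropy between the target
    # ranking distribution (mass only on in-window tokens) and the
    # predicted score distribution.
    neg_inf = torch.finfo(scores.dtype).min
    target = F.softmax(relevance.masked_fill(relevance == 0, neg_inf), -1)
    per_pos = -(target * F.log_softmax(scores, -1)).sum(-1)

    # The final position has no upcoming tokens, so mask it out.
    valid = relevance.sum(-1) > 0
    return per_pos[valid].mean()
```

In training, this term would typically be added to the standard NTP cross-entropy (e.g. `loss = ntp_loss + top_loss(hidden, top_head, input_ids)`), and the extra unembedding head can be discarded at inference, leaving decoding cost unchanged.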