Predicting the Order of Upcoming Tokens Improves Language Modeling
- #Token Prediction
- #Machine Learning
- #Language Modeling
- Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training, but its improvements have been inconsistent.
- Token Order Prediction (TOP) is introduced as an alternative that trains models to rank upcoming tokens by their proximity, using a learning-to-rank loss.
- TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers.
- Models of 340M, 1.8B, and 7B parameters were pretrained using NTP, MTP, and TOP objectives.
- Results on eight standard NLP benchmarks show that TOP outperforms both NTP and MTP overall, even at scale.
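The ranking objective described above can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: the target construction (scoring each vocabulary token by how soon it next appears within a window) and the ListNet-style softmax cross-entropy loss are assumptions about how a proximity-based learning-to-rank objective might look; the function names `top_targets` and `top_loss` are invented for the example.

```python
import math

def top_targets(tokens, vocab_size, window=4):
    """Hypothetical TOP target construction: at each position t, a
    vocabulary token whose next occurrence is at offset d
    (1 <= d <= window) receives score window - d + 1, so closer
    tokens rank higher; tokens absent from the window score 0."""
    T = len(tokens)
    targets = [[0.0] * vocab_size for _ in range(T)]
    for t in range(T):
        for d in range(1, window + 1):
            # keep only the closest occurrence of each token
            if t + d < T and targets[t][tokens[t + d]] == 0.0:
                targets[t][tokens[t + d]] = float(window - d + 1)
    return targets

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_loss(scores, targets):
    """ListNet-style listwise ranking loss: cross-entropy between the
    softmax of the proximity targets and the softmax of the scores
    produced by the extra unembedding head."""
    total = 0.0
    for s_row, t_row in zip(scores, targets):
        p = softmax(s_row)   # predicted ranking distribution
        q = softmax(t_row)   # target distribution from proximity scores
        total += -sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return total / len(scores)

# Toy usage: sequence over a 5-token vocabulary, ranking window of 2.
tokens = [1, 2, 3, 2, 1]
targets = top_targets(tokens, vocab_size=5, window=2)
loss = top_loss(targets, targets)  # scoring with the targets themselves
```

Because TOP only needs these per-position score vectors, a single extra unembedding layer (one `hidden_dim x vocab_size` projection) suffices, whereas MTP adds a transformer layer per predicted future token.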