The bitter lesson is coming for tokenization
- #tokenization
- #machine-learning
- #transformers
- The Bitter Lesson (Rich Sutton's essay) argues that general-purpose methods which leverage compute and data ultimately win over hand-crafted, domain-specific methods.
- Tokenization, particularly Byte-Pair Encoding (BPE), is a hand-engineered bottleneck in transformer pipelines, causing inefficiencies and downstream failures such as 'glitch tokens' (see the BPE demo after this list).
- Tokenization's job is to compress the raw byte sequence so the model attends over fewer positions, but a fixed vocabulary rarely achieves a good trade-off between compression and granularity.
- Pure byte-level models such as ByT5 and MambaByte show that tokenization can be removed outright, but the much longer byte sequences inflate compute and training time (see the sequence-length example below).
- Recent architectures such as the Byte Latent Transformer (BLT) aim to learn tokenization end-to-end, improving performance and scaling curves while reducing inference FLOPs.
- BLT patches bytes dynamically using entropy thresholds from a small byte-level LM, allocating compute adaptively and handling out-of-distribution data more gracefully (a minimal patcher sketch follows the list).
- BLT outperforms subword-level models in compute-controlled settings, especially on character-level tasks, and shows better scaling trends.
- Future directions include integrating the patcher end-to-end, extending BLT to multi-modal tasks, and addressing challenges like dynamic patch boundaries in large contexts.
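To make the compression/granularity trade-off concrete, here is a small demo using the tiktoken library (an assumption here; any BPE tokenizer would show the same effect). Common English compresses into few tokens, while rare strings fragment; ' SolidGoldMagikarp' is the best-known glitch token from GPT-2-era vocabularies.

```python
import tiktoken  # pip install tiktoken; any BPE tokenizer shows the same effect

enc = tiktoken.get_encoding("r50k_base")  # GPT-2-era BPE vocabulary

samples = [
    "the quick brown fox",  # common English: few tokens, good compression
    " SolidGoldMagikarp",   # the best-known 'glitch token' in this vocabulary
    "µ-opioid receptor",    # rare, non-ASCII text: fragments into many tokens
]
for text in samples:
    tokens = enc.encode(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {n_bytes} bytes -> {len(tokens)} tokens "
          f"({n_bytes / len(tokens):.2f} bytes/token)")
```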
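The flip side is why pure byte-level models are expensive: dropping the tokenizer multiplies sequence length, and self-attention cost grows quadratically with it. A quick illustration in plain Python (the ~4x figure is a rough, typical bytes-per-token ratio, not a measured constant):

```python
text = "Tokenization is a bottleneck."  # 29 characters, all ASCII
byte_ids = list(text.encode("utf-8"))   # a byte-level model sees one position per byte
print(len(byte_ids), "byte positions")  # 29, vs roughly 6-7 BPE tokens for this text

# With O(n^2) self-attention, a ~4x longer sequence costs ~16x the attention
# FLOPs -- roughly the overhead ByT5-style models pay for skipping tokenization.
```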
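Finally, a minimal sketch of BLT-style entropy patching, assuming a small byte-level LM that emits a next-byte distribution at every position; the array shape, function names, and the 1.5 threshold are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def shannon_entropy(probs: np.ndarray) -> float:
    """Entropy (nats) of a single next-byte distribution over 256 values."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def patch_boundaries(next_byte_probs: np.ndarray, threshold: float = 1.5) -> list[int]:
    """Global-threshold patching: open a new patch wherever the small
    byte LM is uncertain about the next byte.

    next_byte_probs: (seq_len, 256) array whose row t is the LM's predicted
    distribution for byte t given bytes < t. The threshold is a tunable
    hyperparameter (1.5 here is an illustrative default, not the paper's).
    """
    starts = [0]  # the first patch always opens at position 0
    for t in range(1, next_byte_probs.shape[0]):
        if shannon_entropy(next_byte_probs[t]) > threshold:
            starts.append(t)
    return starts

# High-entropy positions (e.g. the start of a rare word) get short patches and
# hence more transformer steps; predictable stretches are folded into long
# patches, which is how compute gets allocated adaptively.
```

The BLT paper also describes an approximately-monotonic variant that thresholds the entropy increase between adjacent bytes rather than its absolute value; the global threshold above is the simpler of the two.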