The bitter lesson is coming for tokenization
- #tokenization
- #machine-learning
- #transformers
- The Bitter Lesson (Rich Sutton's essay) argues that general-purpose methods which leverage compute and data ultimately win over hand-crafted, domain-specific methods.
- Tokenization, particularly Byte-Pair Encoding (BPE), is a hand-engineered bottleneck in transformer pipelines, causing inefficiencies and downstream failures such as 'glitch tokens' (see the BPE demo after this list).
- Tokenization's job is to compress the raw byte sequence so the model attends over fewer positions, but a fixed vocabulary rarely achieves a good trade-off between compression and granularity.
- Pure byte-level models such as ByT5 and MambaByte show that tokenization can be removed outright, but the much longer byte sequences inflate compute and training time (see the sequence-length example below).
- Recent architectures such as the Byte Latent Transformer (BLT) aim to learn tokenization end-to-end, improving performance and scaling curves while reducing inference FLOPs.
- BLT patches bytes dynamically using entropy thresholds from a small byte-level LM, allocating compute adaptively and handling out-of-distribution data more gracefully (a minimal patcher sketch follows the list).
- BLT outperforms subword-level models in compute-controlled settings, especially on character-level tasks, and shows better scaling trends.
- Future directions include integrating the patcher end-to-end, extending BLT to multi-modal tasks, and addressing challenges like dynamic patch boundaries in large contexts.
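To make the compression/granularity trade-off concrete, here is a small demo using the tiktoken library (an assumption here; any BPE tokenizer would show the same effect). Common English compresses into few tokens, while rare strings fragment; ' SolidGoldMagikarp' is the best-known glitch token from GPT-2-era vocabularies.

```python
import tiktoken  # pip install tiktoken; any BPE tokenizer shows the same effect

enc = tiktoken.get_encoding("r50k_base")  # GPT-2-era BPE vocabulary

samples = [
    "the quick brown fox",  # common English: few tokens, good compression
    " SolidGoldMagikarp",   # the best-known 'glitch token' in this vocabulary
    "µ-opioid receptor",    # rare, non-ASCII text: fragments into many tokens
]
for text in samples:
    tokens = enc.encode(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{text!r}: {n_bytes} bytes -> {len(tokens)} tokens "
          f"({n_bytes / len(tokens):.2f} bytes/token)")
```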
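The flip side is why pure byte-level models are expensive: dropping the tokenizer multiplies sequence length, and self-attention cost grows quadratically with it. A quick illustration in plain Python (the ~4x figure is a rough, typical bytes-per-token ratio, not a measured constant):

```python
text = "Tokenization is a bottleneck."  # 29 characters, all ASCII
byte_ids = list(text.encode("utf-8"))   # a byte-level model sees one position per byte
print(len(byte_ids), "byte positions")  # 29, vs roughly 6-7 BPE tokens for this text

# With O(n^2) self-attention, a ~4x longer sequence costs ~16x the attention
# FLOPs -- roughly the overhead ByT5-style models pay for skipping tokenization.
```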
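Finally, a minimal sketch of BLT-style entropy patching, assuming a small byte-level LM that emits a next-byte distribution at every position; the array shape, function names, and the 1.5 threshold are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def shannon_entropy(probs: np.ndarray) -> float:
    """Entropy (nats) of a single next-byte distribution over 256 values."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def patch_boundaries(next_byte_probs: np.ndarray, threshold: float = 1.5) -> list[int]:
    """Global-threshold patching: open a new patch wherever the small
    byte LM is uncertain about the next byte.

    next_byte_probs: (seq_len, 256) array whose row t is the LM's predicted
    distribution for byte t given bytes < t. The threshold is a tunable
    hyperparameter (1.5 here is an illustrative default, not the paper's).
    """
    starts = [0]  # the first patch always opens at position 0
    for t in range(1, next_byte_probs.shape[0]):
        if shannon_entropy(next_byte_probs[t]) > threshold:
            starts.append(t)
    return starts

# High-entropy positions (e.g. the start of a rare word) get short patches and
# hence more transformer steps; predictable stretches are folded into long
# patches, which is how compute gets allocated adaptively.
```

The BLT paper also describes an approximately-monotonic variant that thresholds the entropy increase between adjacent bytes rather than its absolute value; the global threshold above is the simpler of the two.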