Byte Latent Transformer: Patches Scale Better Than Tokens
- #natural language processing
- #transformer models
- #machine learning
- Introduces the Byte Latent Transformer (BLT), a byte-level LLM architecture that matches tokenization-based LLM performance.
- BLT groups bytes into dynamically sized patches segmented by next-byte entropy, improving efficiency and robustness (a minimal sketch of the patching rule follows this list).
- Presents a FLOP-controlled scaling study up to 8B parameters and 4T training bytes, showing the feasibility of byte-level models at scale.
- Demonstrates improved training and inference efficiency by dynamically selecting long patches for predictable data.
- Shows qualitative improvements in reasoning and long-tail generalization compared to tokenization-based models.
- For a fixed inference cost, BLT scales better than tokenization-based models by growing both patch size and model size together.
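
The patching rule is simple enough to sketch. Below is a minimal Python illustration of global-threshold entropy patching as the paper describes it: a small byte-level LM supplies a next-byte entropy at each position, and a new patch starts wherever that entropy crosses a global threshold. The function names (`next_byte_entropy`, `segment_patches`), the toy entropy values, and the threshold of 2.0 are illustrative assumptions, not the paper's code.

```python
import math

def next_byte_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-byte distribution over 256 values."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def segment_patches(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Global-threshold entropy patching: open a new patch wherever the
    small LM's next-byte entropy exceeds `threshold`, so predictable runs
    merge into long patches and hard-to-predict bytes start new ones."""
    assert len(data) == len(entropies)
    patches: list[bytes] = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # high uncertainty => patch boundary
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Sanity check: a uniform next-byte distribution has entropy log(256) ≈ 5.55 nats.
assert abs(next_byte_entropy([1 / 256] * 256) - math.log(256)) < 1e-9

# Toy entropies standing in for a small byte-level LM: uncertainty spikes
# at word starts and stays low inside predictable words.
text = b"the cat sat"
ents = [3.0 if i == 0 or text[i - 1] == ord(" ") else 0.4 for i in range(len(text))]
print(segment_patches(text, ents, threshold=2.0))
# [b'the ', b'cat ', b'sat']
```

Because patch boundaries depend on entropy rather than a fixed vocabulary, long stretches of predictable bytes collapse into a single patch, which is where the training and inference savings described above come from.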