Byte Latent Transformer: Patches Scale Better Than Tokens
- #natural language processing
- #transformer models
- #machine learning
- Introduces the Byte Latent Transformer (BLT), a byte-level LLM architecture that matches tokenization-based LLM performance.
- BLT groups bytes into dynamically sized patches segmented by next-byte entropy, improving efficiency and robustness (a minimal sketch of the patching rule follows this list).
- Presents a FLOP-controlled scaling study up to 8B parameters and 4T training bytes, showing the feasibility of byte-level models at scale.
- Demonstrates improved training and inference efficiency by dynamically selecting long patches for predictable data.
- Shows qualitative improvements in reasoning and long-tail generalization compared to tokenization-based models.
- For a fixed inference cost, BLT scales better than tokenization-based models by growing both patch size and model size together.
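
The patching rule is simple enough to sketch. Below is a minimal Python illustration of global-threshold entropy patching as the paper describes it: a small byte-level LM supplies a next-byte entropy at each position, and a new patch starts wherever that entropy crosses a global threshold. The function names (`next_byte_entropy`, `segment_patches`), the toy entropy values, and the threshold of 2.0 are illustrative assumptions, not the paper's code.

```python
import math

def next_byte_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-byte distribution over 256 values."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def segment_patches(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Global-threshold entropy patching: open a new patch wherever the
    small LM's next-byte entropy exceeds `threshold`, so predictable runs
    merge into long patches and hard-to-predict bytes start new ones."""
    assert len(data) == len(entropies)
    patches: list[bytes] = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # high uncertainty => patch boundary
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Sanity check: a uniform next-byte distribution has entropy log(256) ≈ 5.55 nats.
assert abs(next_byte_entropy([1 / 256] * 256) - math.log(256)) < 1e-9

# Toy entropies standing in for a small byte-level LM: uncertainty spikes
# at word starts and stays low inside predictable words.
text = b"the cat sat"
ents = [3.0 if i == 0 or text[i - 1] == ord(" ") else 0.4 for i in range(len(text))]
print(segment_patches(text, ents, threshold=2.0))
# [b'the ', b'cat ', b'sat']
```

Because patch boundaries depend on entropy rather than a fixed vocabulary, long stretches of predictable bytes collapse into a single patch, which is where the training and inference savings described above come from.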