Finding Optimal Tokenizers
10 hours ago
- #integer-linear-programming
- #tokenization
- #optimal-algorithms
- Algorithm computes optimal tokenizer in some settings, similar to solving TSP with cutting-plane techniques.
- Optimal tokenization is theoretically intractable but solvable in practice, though caveats include minimal improvement over existing methods, potential generalization issues, and efficiency trade-offs.
- Tokenizers map tokens to bytes (vocabulary) using byte-pair encoding (BPE) for compression.
- Tokenization can be formulated as an integer linear programming (ILP) problem with color and edge variables.
- Continuous LP solutions may not be integral, leading to rounding for suboptimal solutions.
- Cutting planes, inspired by TSP solvers, can be applied to the tokenization ILP to find optimal solutions by adding constraints.
- Codex assisted in discovering effective cut families, such as cycle constraints, through brute force and analysis.
- Experiments used HiGHS solver on limited hardware, focusing on single eBooks like Pride and Prejudice with vocab size 512.
- Optimal tokenizer found for Pride and Prejudice with 512 vocab size, but scaling to 1024 required additional cut families and faced LP solve time bottlenecks.
- Future work includes improving LP solve times, scaling to larger corpora, and removing pretokenizer constraints.