Hasty Briefsbeta

Bilingual

Finding Optimal Tokenizers

11 hours ago
  • #integer-linear-programming
  • #tokenization
  • #optimal-algorithms
  • Algorithm computes optimal tokenizer in some settings, similar to solving TSP with cutting-plane techniques.
  • Optimal tokenization is theoretically intractable but solvable in practice, though caveats include minimal improvement over existing methods, potential generalization issues, and efficiency trade-offs.
  • Tokenizers map tokens to bytes (vocabulary) using byte-pair encoding (BPE) for compression.
  • Tokenization can be formulated as an integer linear programming (ILP) problem with color and edge variables.
  • Continuous LP solutions may not be integral, leading to rounding for suboptimal solutions.
  • Cutting planes, inspired by TSP solvers, can be applied to the tokenization ILP to find optimal solutions by adding constraints.
  • Codex assisted in discovering effective cut families, such as cycle constraints, through brute force and analysis.
  • Experiments used HiGHS solver on limited hardware, focusing on single eBooks like Pride and Prejudice with vocab size 512.
  • Optimal tokenizer found for Pride and Prejudice with 512 vocab size, but scaling to 1024 required additional cut families and faced LP solve time bottlenecks.
  • Future work includes improving LP solve times, scaling to larger corpora, and removing pretokenizer constraints.