Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken
10 months ago
- #Tokenization
- #Performance
- #OpenAI
- TokenDagger is a fast implementation of OpenAI's TikToken for large-scale text processing.
- It offers 2x throughput and is 4x faster on code sample tokenization.
- Optimized PCRE2 regex engine for efficient token pattern matching.
- Full compatibility with OpenAI's TikToken tokenizer as a drop-in replacement.
- Simplified BPE algorithm to reduce performance impact of large special token vocabulary.
- Performance benchmarks show TokenDagger is 4.02x faster on code tokenization.
- Installation and setup instructions provided for easy adoption.
- Requires libpcre2-dev and Python 3 development tools for setup.