Full Unicode Search at 50× ICU Speed with AVX‑512
4 days ago
- #UTF-8
- #Performance
- #Unicode
- StringZilla is a high-performance open-source library for Unicode and UTF-8 handling, focusing on speed and correctness.
- It leverages AVX-512 on Intel and AMD CPUs to accelerate common operations like tokenizing text, case-folding, and substring search.
- StringZilla is reported to be significantly faster than alternatives such as ICU and PCRE2, with speedups ranging from 10× to 20,000× in some cases.
- The library is tested against the latest Unicode specifications and real-world data to ensure correctness.
- UTF-8 is the dominant text encoding on the internet, covering 98% of content as of 2024, with legacy encodings making up the remaining 2%.
- StringZilla provides APIs for multiple programming languages, including C/C++, Python, Rust, Swift, Node.js, and Go.
- Future plans include optimizing for more scripts (e.g., Georgian, Korean) and porting to Arm architectures.
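
To make the case-folding and substring-search points above concrete, here is a minimal sketch of a case-insensitive search using only Python's standard library. It is not StringZilla's API or implementation. The helper names `casefold_find` and `lowercase_find` are hypothetical and exist only for this illustration. The sketch shows why full Unicode case folding differs from plain lowercasing, and why folding can change the UTF-8 byte length of the text.

```python
def lowercase_find(haystack: str, needle: str) -> int:
    """Naive case-insensitive search using simple lowercasing.

    Hypothetical helper for illustration; returns -1 when not found.
    """
    return haystack.lower().find(needle.lower())


def casefold_find(haystack: str, needle: str) -> int:
    """Case-insensitive search using full Unicode case folding.

    Hypothetical helper for illustration. The returned index refers to
    the *folded* haystack, which may differ in length from the original.
    """
    return haystack.casefold().find(needle.casefold())


text = "Die STRASSE ist lang"

# Lowercasing keeps "ß" as-is, so "straße" never matches "strasse".
print(lowercase_find(text, "straße"))

# Case folding expands "ß" to "ss", so the two spellings compare equal.
print(casefold_find(text, "straße"))

# Folding can change the encoded size: the ligature U+FB01 occupies
# 3 bytes in UTF-8, while its folded form "fi" occupies 2.
ligature = "\ufb01"
print(len(ligature.encode("utf-8")), len(ligature.casefold().encode("utf-8")))
```

Because folding can lengthen or shorten the text, an index into the folded string does not map directly back to the original. A production search routine has to track that mapping, which is part of what makes fast, correct Unicode search non-trivial.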