Hasty Briefs (beta)

Full Unicode Search at 50× ICU Speed with AVX‑512

4 days ago
  • #UTF-8
  • #Performance
  • #Unicode
  • StringZilla is a high-performance open-source library for Unicode and UTF-8 handling, focusing on speed and correctness.
  • It leverages AVX-512 on Intel and AMD CPUs to accelerate common operations like tokenizing text, case-folding, and substring search.
  • StringZilla is significantly faster than alternatives such as ICU and PCRE2, with reported speedups of 10× to 20,000× in some cases.
  • The library is tested against the latest Unicode specifications and real-world data to ensure correctness.
  • UTF-8 is the dominant text encoding on the internet, covering 98% of content as of 2024, with legacy encodings making up the remaining 2%.
  • StringZilla provides APIs for multiple programming languages, including C/C++, Python, Rust, Swift, Node.js, and Go.
  • Future plans include optimizing for more scripts (e.g., Georgian, Korean) and porting to Arm architectures.
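
To make the case-folding and substring-search operations mentioned above concrete, here is a minimal scalar sketch in Python. It uses only the standard library's `str.casefold()` and `str.find()`; it is not StringZilla's API and says nothing about how the library is implemented. The function name `casefold_find` is hypothetical. The sketch shows why case-insensitive search over Unicode is harder than over ASCII: folding can change a string's length (for example, "ß" folds to "ss"), so match positions have to be mapped back to the original text.

```python
def casefold_find(haystack: str, needle: str) -> int:
    """Return the code-point index in `haystack` where the first
    case-insensitive match of `needle` begins, or -1 if there is none.

    Hypothetical helper for illustration only; a plain scalar baseline,
    not a SIMD implementation.
    """
    folded_needle = needle.casefold()
    if not folded_needle:
        return 0  # an empty needle matches at the start

    folded_chars = []  # folded haystack, one code point per entry
    origins = []       # origins[k] = index in haystack that produced folded_chars[k]
    for i, ch in enumerate(haystack):
        # A single character may fold to several code points ("ß" -> "ss").
        for folded in ch.casefold():
            folded_chars.append(folded)
            origins.append(i)

    pos = "".join(folded_chars).find(folded_needle)
    # Note: a match may begin inside an expanded character; this sketch
    # simply reports the original character that contains the match start.
    return origins[pos] if pos >= 0 else -1


if __name__ == "__main__":
    print(casefold_find("Die STRASSE ist lang", "straße"))
    print(casefold_find("Große Straße", "STRASSE"))
    print(casefold_find("plain ascii", "missing"))
```

The `origins` list is the key design choice here: searching the folded text alone would return offsets that no longer line up with the original string whenever a character expands during folding.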