Fast(er) regular expression engines in Ruby

a year ago

SerpApi faces challenges with data extraction from modern websites, sometimes resorting to regular expressions despite their potential latency issues.
Ruby's default regex engine, Onigmo, has weaknesses in scan time, prompting exploration of alternatives like re2, rust/regex, and pcre2.
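Scan time with the built-in engine can be measured using only the standard library. The sketch below is a minimal harness, not the benchmark behind the results summarized here: the corpus, the pattern, and the helper name time_scan are assumptions chosen for illustration.

```ruby
require "benchmark"

# Hypothetical corpora; the original benchmarks used different inputs.
ASCII_TEXT   = ("lorem ipsum 12345 dolor sit amet " * 2_000).freeze
UNICODE_TEXT = ("лорем ипсум 12345 долор сит амет " * 2_000).freeze

# Unicode-aware "word" matcher: one or more letters in any script.
WORD = /\p{L}+/

# Runs String#scan several times and returns [match_count, best_time_in_seconds].
def time_scan(text, pattern, runs: 5)
  best = Float::INFINITY
  count = 0
  runs.times do
    elapsed = Benchmark.realtime { count = text.scan(pattern).size }
    best = elapsed if elapsed < best
  end
  [count, best]
end

{ "ascii" => ASCII_TEXT, "unicode" => UNICODE_TEXT }.each do |label, text|
  count, seconds = time_scan(text, WORD)
  puts format("%-8s %6d matches, best of 5: %.4fs", label, count, seconds)
end
```

Taking the best of several runs reduces noise from garbage collection and warm-up; absolute numbers depend on the Ruby version and the machine.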
re2, developed by Google, is resistant to ReDoS attacks and has well-maintained Ruby bindings, but struggles with Unicode text.
rust/regex is the fastest overall, especially with Unicode text, but lacks ready-to-use Ruby bindings and has limitations with regex sets.
pcre2, widely used in many products, has outdated Ruby bindings that do not support its JIT compiler, making it less viable for comparison.
Benchmarks show rust/regex outperforming re2 and Ruby in most scenarios, particularly with Unicode text and complex patterns.
re2 performs well with ASCII text and bounded repeats but falters with Unicode-aware matchers and high max bounds.
The regex set feature of rust/regex, which matches several patterns in a single pass, can be slower than running the same regexps sequentially unless carefully optimized, so it needs thorough testing before use.
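The same "combined versus sequential" question can be tested in plain Ruby before reaching for another engine. The sketch below uses Regexp.union as a stand-in for a regex set; it is not the rust/regex set API, and the patterns, input text, and method names are hypothetical.

```ruby
require "benchmark"

# Hypothetical patterns and input, chosen only for illustration.
PATTERNS = [
  /\bprice\b/,
  /\d{1,3}(?:,\d{3})*/,
  /\p{Alpha}+@\p{Alpha}+\.com/
].freeze
UNION = Regexp.union(PATTERNS)
TEXT  = ("résumé price 1,234 contact: alice@example.com " * 200).freeze

# Tries each pattern in turn and stops at the first one that matches.
def any_match_sequential?(text)
  PATTERNS.any? { |re| re.match?(text) }
end

# Tries all patterns at once through a single alternation.
def any_match_union?(text)
  UNION.match?(text)
end

# Times both strategies on the same input and returns the elapsed seconds.
def compare(text, iterations: 200)
  {
    sequential: Benchmark.realtime { iterations.times { any_match_sequential?(text) } },
    union:      Benchmark.realtime { iterations.times { any_match_union?(text) } }
  }
end

compare(TEXT).each do |strategy, seconds|
  puts format("%-10s %.5fs", strategy, seconds)
end
```

Which strategy wins depends on the patterns and the input, so the two should first be checked for identical results and then timed on representative data.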
Both re2 and rust/regex can handle invalid UTF-8 byte sequences, unlike Ruby, whose Regexp raises an ArgumentError when matched against a string with invalid encoding.
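The Ruby side of this behavior is easy to reproduce with the built-in engine. The following is a minimal sketch; the sample bytes and the helper name first_number are assumptions made for illustration, and the two workarounds shown (scrubbing and matching as raw bytes) are common options rather than anything prescribed here.

```ruby
# A byte string that is not valid UTF-8: \xE9 is Latin-1 "é", and here it is
# followed by a space instead of the continuation bytes UTF-8 requires.
raw = "caf\xE9 price: 42"

# Returns the first run of digits, or nil if Ruby refuses to match.
def first_number(text)
  text[/\d+/]
rescue ArgumentError
  # Raised when the string's encoding is broken.
  nil
end

strict   = first_number(raw)             # matching raises, so this is nil
scrubbed = first_number(raw.scrub("?"))  # invalid bytes replaced before matching
binary   = first_number(raw.b)           # string reinterpreted as raw bytes

puts raw.valid_encoding?  # false
p strict                  # nil
p scrubbed                # "42"
p binary                  # "42"
```

Scrubbing alters the input, while matching on a binary copy keeps the bytes intact but gives up Unicode-aware character classes, so the right choice depends on what the pattern needs.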
Conclusions favor rust/regex for overall performance and Unicode support, while re2 is better for ASCII text and ReDoS resistance.
