Fast(er) regular expression engines in Ruby

a year ago
  • #benchmark
  • #regex
  • #performance
  • SerpApi faces challenges with data extraction from modern websites, sometimes resorting to regular expressions despite their potential latency issues.
  • Ruby's default regex engine, Onigmo, can be slow at scanning text, which prompted an evaluation of three alternatives: re2, rust/regex, and pcre2.
  • re2, developed by Google, is resistant to ReDoS attacks and has well-maintained Ruby bindings, but struggles with Unicode text.
  • rust/regex is the fastest overall, especially with Unicode text, but lacks ready-to-use Ruby bindings and has limitations with regex sets.
  • pcre2 is widely used in many products, but its Ruby bindings are outdated and lack JIT support, which made it a less viable candidate for the comparison.
  • Benchmarks show rust/regex outperforming re2 and Ruby in most scenarios, particularly with Unicode text and complex patterns.
  • re2 performs well with ASCII text and bounded repeats but slows down with Unicode-aware matchers and bounded repeats that have a high maximum.
  • rust/regex's regex set functionality can be slower than running the regexes sequentially unless carefully optimized, so it needs thorough testing before use.
  • Both re2 and rust/regex can handle invalid UTF-8 byte sequences, whereas Ruby's built-in engine raises an error on such input.
  • The conclusions favor rust/regex for overall performance and Unicode support, while re2 is the better choice for ASCII text and ReDoS resistance.
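
The scan-time comparison summarized above can be approximated in outline with Ruby's standard library alone. The sketch below times `String#scan` with the built-in engine (Onigmo) on ASCII and Unicode input. The helper name `time_scan`, the sample text, and the patterns are illustrative assumptions, not the benchmark from the original article. Timing another engine would require that engine's own bindings and API, which are not shown here.

```ruby
require "benchmark"

# Hypothetical helper: scan `text` for every match of `pattern` using the
# built-in engine and return [match_count, elapsed_seconds].
def time_scan(text, pattern)
  count = 0
  seconds = Benchmark.realtime { count = text.scan(pattern).size }
  [count, seconds]
end

# Illustrative inputs: the same shape of text in ASCII and in Cyrillic.
ascii_text   = "price: 100 USD, price: 250 USD\n" * 10_000
unicode_text = "цена: 100 руб, цена: 250 руб\n" * 10_000

ascii_count, ascii_time     = time_scan(ascii_text, /\d+ USD/)
unicode_count, unicode_time = time_scan(unicode_text, /\d+ \p{Cyrillic}+/)

puts format("ASCII:   %d matches in %.4fs", ascii_count, ascii_time)
puts format("Unicode: %d matches in %.4fs", unicode_count, unicode_time)
```

Absolute timings depend on the machine and Ruby version, so only relative differences between runs on the same setup are meaningful.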
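
The point about invalid UTF-8 can be illustrated with a small sketch. When staying on the built-in engine, one possible workaround is to replace invalid bytes before matching. The helper name `scan_tolerant` and the sample string are assumptions for illustration, and scrubbing changes the input, so it is not equivalent to how re2 or rust/regex treat such bytes.

```ruby
# A UTF-8 string containing a stray 0xFF byte, which is never valid in UTF-8.
broken = "caf\xFF latte"

# Scanning the raw string with the built-in engine is expected to raise
# ArgumentError ("invalid byte sequence in UTF-8"); capture it for display.
error = begin
  broken.scan(/\w+/)
  nil
rescue ArgumentError => e
  e
end
puts "raw scan failed: #{error.message}" if error

# Hypothetical helper: replace invalid bytes (with U+FFFD by default for
# UTF-8 strings) before scanning, so the built-in engine accepts the input.
def scan_tolerant(text, pattern)
  clean = text.valid_encoding? ? text : text.scrub
  clean.scan(pattern)
end

words = scan_tolerant(broken, /\w+/)
puts words.inspect
```

`String#scrub` is part of core Ruby; whether replacing bytes is acceptable depends on whether the invalid sequences carry information that the extraction needs.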