Fast(er) regular expression engines in Ruby
a year ago
- #benchmark
- #regex
- #performance
- SerpApi faces challenges with data extraction from modern websites, sometimes resorting to regular expressions despite their potential latency issues.
- Ruby's default regex engine, Onigmo, has weaknesses in scan time, prompting exploration of alternatives like re2, rust/regex, and pcre2.
- re2, developed by Google, is resistant to ReDoS attacks and has well-maintained Ruby bindings, but struggles with Unicode text.
- rust/regex is the fastest overall, especially with Unicode text, but lacks ready-to-use Ruby bindings and has limitations with regex sets.
- pcre2 is widely used in many products, but its Ruby bindings are outdated and do not expose JIT support, making it less viable for comparison.
- Benchmarks show rust/regex outperforming re2 and Ruby in most scenarios, particularly with Unicode text and complex patterns.
- re2 performs well with ASCII text and bounded repeats, but slows down with Unicode-aware matchers and with repeats that have a high upper bound.
- The regex set feature of rust/regex can be slower than running the regexps sequentially unless it is carefully optimized, which highlights the need for thorough testing.
- Both re2 and rust/regex can handle invalid UTF-8 byte sequences, unlike Ruby's built-in Regexp, which raises an error on such input.
- Conclusions favor rust/regex for overall performance and Unicode support, while re2 is better for ASCII text and ReDoS resistance.
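
To make the ASCII versus Unicode comparison concrete, here is a minimal timing sketch that uses only Ruby's built-in Regexp (Onigmo) and the standard `benchmark` library. The sample texts, the `WORD` pattern, and the `time_scan` helper are hypothetical illustrations, not the data or code behind the benchmarks summarized above. re2 or rust/regex would need their own bindings and are not shown here.

```ruby
require "benchmark"

# Hypothetical sample inputs; the benchmarks summarized above used other data.
ASCII_TEXT   = ("The quick brown fox jumps over the lazy dog. " * 2_000).freeze
UNICODE_TEXT = ("Швидка руда лисиця стрибає через ледачого пса. " * 2_000).freeze

# Unicode-aware letter matcher combined with a bounded repeat.
WORD = /\p{L}{3,10}/

# Runs String#scan several times and returns
# [match_count, best_wall_time_in_seconds].
def time_scan(pattern, text, runs: 5)
  count = 0
  best = Float::INFINITY
  runs.times do
    elapsed = Benchmark.realtime { count = text.scan(pattern).size }
    best = elapsed if elapsed < best
  end
  [count, best]
end

{ "ascii" => ASCII_TEXT, "unicode" => UNICODE_TEXT }.each do |label, text|
  count, seconds = time_scan(WORD, text)
  puts format("%-8s %6d matches  %8.2f ms", label, count, seconds * 1000)
end
```

Taking the best of several runs reduces noise from garbage collection and warm-up. Absolute timings depend on the Ruby version and the machine, so they are only meaningful relative to each other.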
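
The point about invalid UTF-8 can be reproduced with plain Ruby. The sketch below shows two possible workarounds on the Ruby side: scrubbing the string, or matching on a binary copy of the bytes. The input string and the `first_number` helper are made up for illustration, and this is not a description of how re2 or rust/regex handle such input internally.

```ruby
# 0xFF can never appear in well-formed UTF-8, so this string is invalid.
raw = "price: 42\xFF USD"
puts raw.valid_encoding? # false

# Hypothetical helper: try a normal match first, and fall back to a
# scrubbed copy if Ruby rejects the input with ArgumentError.
def first_number(text)
  text[/\d+/]
rescue ArgumentError
  # String#scrub replaces invalid byte sequences with U+FFFD.
  text.scrub[/\d+/]
end

puts first_number(raw)

# Alternative: treat the input as raw bytes. A binary (ASCII-8BIT) copy
# always has a valid encoding, so an ASCII-only pattern can match it.
bytes = raw.b
puts bytes.valid_encoding? # true
puts bytes[/\d+/]
```

Scrubbing changes the text, and byte-level matching gives up Unicode-aware classes such as `\p{L}`, so which workaround fits depends on the pattern.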