Yes, all longest regex matches in linear time is possible
9 hours ago
- #algorithms
- #regex
- #performance
- Regex engines like RE2, Go, and Rust advertise linear-time matching for a single match, but finding all matches can be quadratic.
- The problem arises from overlapping match candidates and the need to restart the search from each position.
- Earliest match semantics can avoid quadratic time but changes the results, providing shorter matches than leftmost-longest semantics.
- RE# introduces a hardened mode that guarantees linear time for all matches without altering semantics, using a two-pass algorithm.
- Hardened mode is opt-in due to a performance trade-off, as most real-world patterns don't trigger the quadratic case.
- RE# includes CPU optimizations like AVX2 and NEON for skip acceleration, improving performance on literal-heavy patterns.
- Streaming semantics are limited by leftmost-longest matching, which may require unbounded buffering, but chunked approaches can help.
- RE# does not support capture groups or lazy quantifiers, focusing on leftmost-longest (POSIX) semantics and linear-time guarantees.