The quadratic problem nobody fixed
4 days ago
- #algorithm
- #regex
- #performance
- Regex engines have an O(n²) complexity problem when finding all matches, a longstanding issue since the 1970s.
- Linear-time matching promises by engines like RE2, Go's regexp, and Rust's regex crate only apply to single matches, not all matches.
- The quadratic complexity arises from patterns like .*a|b on inputs like 'bbbb...', causing a triangular sum of work.
- Aho-Corasick (1975) solves the problem for fixed strings with linear time, but not for regexes.
- Hyperscan/Vectorscan uses 'earliest match' semantics for linear time but alters match results.
- RE# introduces a two-pass algorithm to find all matches in linear time without changing semantics.
- RE# offers a 'hardened' mode guaranteeing linear time even on adversarial input, though with a performance tradeoff.
- RE# outperforms Aho-Corasick on large dictionaries due to more cache-efficient DFA handling.
- RE# lacks support for capture groups and lazy quantifiers, focusing on POSIX semantics and boolean operations.
- A grep tool called 're' leverages RE#'s boolean operators for advanced search capabilities.