Hasty Briefsbeta

Bilingual

The quadratic problem nobody fixed

4 days ago
  • #algorithm
  • #regex
  • #performance
  • Regex engines have an O(n²) complexity problem when finding all matches, a longstanding issue since the 1970s.
  • Linear-time matching promises by engines like RE2, Go's regexp, and Rust's regex crate only apply to single matches, not all matches.
  • The quadratic complexity arises from patterns like .*a|b on inputs like 'bbbb...', causing a triangular sum of work.
  • Aho-Corasick (1975) solves the problem for fixed strings with linear time, but not for regexes.
  • Hyperscan/Vectorscan uses 'earliest match' semantics for linear time but alters match results.
  • RE# introduces a two-pass algorithm to find all matches in linear time without changing semantics.
  • RE# offers a 'hardened' mode guaranteeing linear time even on adversarial input, though with a performance tradeoff.
  • RE# outperforms Aho-Corasick on large dictionaries due to more cache-efficient DFA handling.
  • RE# lacks support for capture groups and lazy quantifiers, focusing on POSIX semantics and boolean operations.
  • A grep tool called 're' leverages RE#'s boolean operators for advanced search capabilities.