A Ruby Regular Expression Engine
11 days ago
- #Regular Expressions
- #Ruby
- #Unicode
- The exreg gem is a pure-Ruby implementation of a Unicode regular expression engine.
- It uses a Thompson-style NFA virtual machine, making it immune to ReDoS caused by catastrophic backtracking.
- Traditional backtracking engines can take exponential time on certain inputs, while Thompson-style engines run in linear time.
- Exreg consists of a parser, compiler, and virtual machine, with support for Unicode properties and case folding.
- It uses a custom binary format for efficient storage and loading of Unicode data.
- USet objects efficiently represent sets of Unicode codepoints and support various set operations.
- ByteSet objects are used in bytecode for fast membership testing using bitwise operations.
- Exreg only supports Unicode encodings, assuming UTF-8 by default, and treats strings as byte arrays.
- Future improvements could include adding a backtracking engine for features like backreferences and lookaround assertions.
- A JIT compiler could be introduced to speed up the currently interpreted VM.