SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
12 hours ago
- #search algorithm
- #natural language processing
- #big data
- Introduces SoftMatcha 2, an ultra-fast and flexible search algorithm for trillion-scale natural language corpora.
- Achieves search times under 0.3 seconds while handling semantic variations like substitution, insertion, and deletion.
- Uses suffix-array-based string matching so that search scales with corpus size.
- Key algorithmic ideas include fast exact lookup via disk-aware design and dynamic corpus-aware pruning.
- Shows theoretically that leveraging the statistical properties of natural language suppresses exponential growth of the search space.
- Outperforms existing methods (infini-gram, infini-gram mini, SoftMatcha) in search latency on FineWeb-Edu (1.4T tokens).
- Demonstrates a practical application: identifying benchmark contamination in training corpora that other approaches miss.
- Online demo available for fast, soft search across corpora in seven languages.
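
To make the suffix-array and pruning ideas in the summary more concrete, here is a minimal toy sketch in Python. It is not the SoftMatcha 2 implementation: the helper names (`build_suffix_array`, `count_occurrences`, `soft_search`) and the hand-written `similar` table are assumptions made for illustration, the corpus lives in memory rather than on disk, and only substitutions are handled (no insertions or deletions). The sketch shows exact lookup by binary search over a suffix array, and a soft search that extends the pattern one token at a time while discarding any candidate prefix that never occurs in the corpus.

```python
def build_suffix_array(tokens):
    # Naive construction by sorting all suffixes; acceptable for a toy corpus only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    # Binary search over suffixes, comparing only their first len(pattern) tokens.
    # upper=False returns the first suffix whose prefix is >= pattern,
    # upper=True returns the first suffix whose prefix is > pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, sa, pattern):
    # Number of exact occurrences of the token sequence `pattern`.
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def soft_search(tokens, sa, pattern, similar):
    # `similar` is a hypothetical stand-in for a real similarity model:
    # it maps a token to a list of tokens accepted as substitutes.
    prefixes = [[]]
    for tok in pattern:
        candidates = [tok] + similar.get(tok, [])
        extended = []
        for prefix in prefixes:
            for cand in candidates:
                attempt = prefix + [cand]
                # Corpus-aware pruning: keep only prefixes that actually occur,
                # so dead branches are never extended further.
                if count_occurrences(tokens, sa, attempt) > 0:
                    extended.append(attempt)
        prefixes = extended
    return {" ".join(p): count_occurrences(tokens, sa, p) for p in prefixes}


corpus = "the quick brown fox jumps over the lazy dog while the fast brown fox sleeps".split()
sa = build_suffix_array(corpus)
similar = {"quick": ["fast", "rapid"], "fox": ["dog"]}

print(soft_search(corpus, sa, ["the", "quick", "brown", "fox"], similar))
# → {'the quick brown fox': 1, 'the fast brown fox': 1}
```

Without pruning, the number of candidate patterns grows as the product of the substitute counts at each position. In this toy run, "the rapid" is dropped after two tokens, so none of its extensions are ever looked up.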