SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora


Introduces SoftMatcha 2, an ultra-fast and flexible search algorithm for trillion-scale natural language corpora.
Achieves search times under 0.3 seconds while handling semantic variations like substitution, insertion, and deletion.
Uses suffix-array-based string matching so that search scales with corpus size.
Key algorithmic ideas include fast exact lookup via disk-aware design and dynamic corpus-aware pruning.
Shows theoretically that leveraging the statistical properties of natural language suppresses exponential growth of the search space.
Outperforms existing methods (infini-gram, infini-gram mini, SoftMatcha) on FineWeb-Edu (1.4T tokens) in search latency.
Applied in practice to identify benchmark contamination in training corpora that other approaches miss.
An online demo offers fast, soft search across corpora in seven languages.
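
To make the suffix-array point above concrete, here is a minimal sketch of exact phrase lookup over a tokenized corpus. It is not SoftMatcha 2's implementation: the function names (build_suffix_array, find_range, occurrences) are hypothetical, the index is built naively in memory, and the disk-aware layout, soft matching, and pruning described in the brief are left out. It only illustrates why lookup cost grows with the logarithm of the corpus size once the index exists.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Naive construction: sort suffix start positions by the suffix itself.

    Fine for a toy corpus; a real index would use a linear-time builder
    and live on disk (assumption: not shown here).
    """
    return sorted(range(len(tokens)), key=lambda i: tuple(tokens[i:]))


def find_range(tokens: Sequence[str], sa: List[int],
               pattern: Sequence[str]) -> Tuple[int, int]:
    """Return the half-open range [start, end) of suffix-array slots whose
    suffixes begin with `pattern`, using two binary searches."""
    m = len(pattern)
    pat = tuple(pattern)

    def prefix(slot: int) -> Tuple[str, ...]:
        # First m tokens of the suffix stored at this slot.
        return tuple(tokens[sa[slot]:sa[slot] + m])

    # Lower bound: first slot whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first slot whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def occurrences(tokens: Sequence[str], sa: List[int],
                pattern: Sequence[str]) -> List[int]:
    """Corpus positions where `pattern` occurs, in ascending order."""
    start, end = find_range(tokens, sa, pattern)
    return sorted(sa[start:end])


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat slept".split()
    index = build_suffix_array(corpus)
    print(occurrences(corpus, index, ["the", "cat"]))  # [0, 7]
```

Because all suffixes sharing a prefix sit next to each other in the sorted array, the number of matches is simply the width of the returned range, so counting a phrase never requires scanning the corpus.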
