SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

12 hours ago
  • #search algorithm
  • #natural language processing
  • #big data
  • Introduces SoftMatcha 2, an ultra-fast, flexible search algorithm for trillion-scale natural language corpora.
  • Achieves search times under 0.3 seconds while handling semantic variations like substitution, insertion, and deletion.
  • Uses suffix-array-based string matching so that search scales with corpus size.
  • Key algorithmic ideas include fast exact lookup via disk-aware design and dynamic corpus-aware pruning.
  • Shows theoretically that exploiting the statistical properties of natural language suppresses the exponential growth of the search space.
  • Achieves lower search latency than existing methods (infini-gram, infini-gram mini, SoftMatcha) on FineWeb-Edu (1.4T tokens).
  • Applied in practice to identify benchmark contamination in training corpora that other approaches miss.
  • Online demo available for fast, soft search across corpora in seven languages.
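
To make the suffix-array and pruning ideas in the summary concrete, here is a minimal in-memory sketch in Python. It is not SoftMatcha 2's implementation: the function names, the toy corpus, and the `similar` table (a hypothetical stand-in for whatever similarity measure the real system uses) are all assumptions for illustration, only substitution is covered (not insertion or deletion), and nothing here is disk-aware. It shows how exact phrase counts come from two binary searches over a sorted suffix array, and how a candidate phrase can be abandoned as soon as its prefix has no occurrence in the corpus.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes, sorted lexicographically.
    Naive construction for illustration; real systems use far faster builds."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    """Binary search over the suffix array.
    upper=False: first index whose length-m prefix is >= pattern.
    upper=True:  first index whose length-m prefix is >  pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, sa, pattern):
    """Exact-match count of a token pattern via two binary searches."""
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def soft_search(tokens, sa, query, similar):
    """Enumerate substitution variants of `query` that occur in the corpus.
    `similar` maps a token to hypothetical acceptable substitutes.
    A branch is pruned as soon as its prefix has zero corpus occurrences."""
    results = {}

    def extend(prefix, pos):
        if prefix:
            hits = count_occurrences(tokens, sa, prefix)
            if hits == 0:
                return  # corpus-aware pruning: no extension can match
            if pos == len(query):
                results[tuple(prefix)] = hits
                return
        elif pos == len(query):
            return  # empty query: nothing to report
        for candidate in [query[pos]] + similar.get(query[pos], []):
            extend(prefix + [candidate], pos + 1)

    extend([], 0)
    return results


# Toy corpus and a hypothetical similarity table (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()
sa = build_suffix_array(corpus)
similar = {"cat": ["dog", "bird"]}

print(count_occurrences(corpus, sa, ["sat", "on", "the"]))  # 2
print(soft_search(corpus, sa, ["the", "cat", "sat"], similar))
# {('the', 'cat', 'sat'): 1, ('the', 'dog', 'sat'): 1}
```

In this toy run the variant starting with "the bird" is dropped after a single lookup, so its longer extensions are never tried. That is the intuition behind pruning a search space that would otherwise grow exponentially with query length.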