BM25

7 hours ago
  • #BM25
  • #search-algorithms
  • #information-retrieval
  • BM25 is a long-standing ranking algorithm in information retrieval, rooted in probabilistic retrieval research of the 1970s-1980s and adopted by Lucene, Elasticsearch, and Solr.
  • BM25 improves on TF-IDF in two ways: it replaces TF-IDF's linearly growing term frequency (TF) with a saturating one, and it adds document length normalization, which TF-IDF lacks.
  • The algorithm uses a saturation function for TF to prevent over-scoring documents with repeated terms and normalizes scores based on document length relative to the corpus average.
  • BM25 is interpretable, so it can be debugged and tuned through two parameters: k1 (how quickly term frequency saturates) and b (how strongly document length is normalized).
  • BM25 is a bag-of-words model, so it does not handle synonyms, word order, or semantic intent.
  • BM25 is ideal for keyword-heavy searches but should be combined with dense retrieval methods for semantic queries.
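  • In Elasticsearch, BM25 scores for the same document can vary by shard because term statistics are computed per shard by default, and document lengths are stored as approximations, which affects length normalization.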
  • The algorithm's debuggability is a key advantage, as scores can be traced to specific term statistics.
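
The points above can be made concrete with a small scoring sketch. This is a minimal illustration of the common textbook form of BM25, not the exact code used by Lucene or Elasticsearch. The function name `bm25_scores`, the pre-tokenized input format, and the defaults k1=1.2 and b=0.75 are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    Hypothetical helper for illustration: `query` is a list of terms and
    `docs` is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: b=0 disables it, b=1 applies it fully.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating TF: approaches (k1 + 1) as tf grows, never exceeds it.
            saturated_tf = tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
            score += idf * saturated_tf
        scores.append(score)
    return scores
```

Because each term's contribution is just `idf * saturated_tf`, a score can be traced back to three statistics per term (term frequency, document frequency, and document length relative to the average), which is what makes the ranking easy to debug. Raising k1 lets repeated terms keep adding to the score for longer, and lowering b reduces the penalty on long documents.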