BM25
- #BM25
- #search-algorithms
- #information-retrieval
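- BM25 is a long-standing ranking function in information retrieval. It grew out of the probabilistic retrieval work of the 1970s-1980s (the BM25 formula itself was published in the 1990s) and is the default scoring function in Lucene, and therefore in Elasticsearch and Solr.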
- BM25 improves on basic TF-IDF by fixing two weaknesses: a term frequency (TF) component that keeps growing linearly as a term is repeated, and the absence of document length normalization.
- The algorithm uses a saturation function for TF to prevent over-scoring documents with repeated terms and normalizes scores based on document length relative to the corpus average.
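- BM25 is interpretable, so it can be debugged and tuned through two parameters: k1 (how quickly term frequency saturates) and b (how strongly document length is normalized, from 0 for none to 1 for full). Lucene and Elasticsearch default to k1 = 1.2 and b = 0.75.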
- Limitations include being a bag-of-words model, meaning it doesn't handle synonyms, word order, or semantic intent.
- BM25 is ideal for keyword-heavy searches but should be combined with dense retrieval methods for semantic queries.
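- In Elasticsearch, the BM25 score for the same document can differ between shards, because term statistics (document frequency, average field length) are computed per shard by default; `search_type=dfs_query_then_fetch` gathers global statistics first, at extra cost. Document lengths are also stored in a lossy, compressed form, so length normalization is approximate.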
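- Debuggability is a key advantage: every score can be traced back to specific term statistics (term frequency, document frequency, field length), for example with `"explain": true` on an Elasticsearch search request.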
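
The saturation and length-normalization behavior described above can be sketched in a few lines of Python. This is a minimal illustration of one common BM25 formulation (with a Lucene-style IDF), not the exact code any search engine runs. The function names `tf_component`, `idf`, and `bm25_scores` are hypothetical, and documents are assumed to be pre-tokenized lists of strings.

```python
import math
from collections import Counter


def tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating term-frequency part of BM25.

    Grows with tf but never reaches k1 + 1; longer-than-average
    documents are penalized in proportion to b.
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


def idf(n_docs, doc_freq):
    """Lucene-style IDF (assumed variant); always non-negative."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a tokenized query."""
    n_docs = len(docs)
    avg_doc_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        counts = Counter(d)
        score = 0.0
        for term in query:
            if counts[term] == 0:
                continue  # bag-of-words: only exact term matches count
            score += idf(n_docs, doc_freq[term]) * tf_component(
                counts[term], len(d), avg_doc_len, k1, b
            )
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat".split(),
        "cat cat cat cat cat cat".split(),
        "dogs chase birds".split(),
    ]
    for doc, score in zip(docs, bm25_scores(["cat"], docs)):
        print(f"{score:.3f}  {' '.join(doc)}")
```

Because each score is a sum of per-term `idf * tf_component` products, a surprising ranking can be traced to the specific statistic that caused it. Setting `b=0` switches length normalization off, and a smaller `k1` makes repeated terms saturate sooner.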