BM25

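BM25 is a long-standing algorithm in information retrieval. It grew out of probabilistic retrieval research in the 1970s and 1980s, took its current form in the 1990s, and has been adopted by Lucene and the search engines built on it, Elasticsearch and Solr.
BM25 improves upon basic TF-IDF by addressing two weaknesses: a score that grows linearly with term frequency (TF), and the lack of document length normalization.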
The algorithm uses a saturation function for TF to prevent over-scoring documents with repeated terms and normalizes scores based on document length relative to the corpus average.
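
For reference, a commonly cited form of the scoring function is sketched below. The symbols are introduced here for illustration and do not come from the summarized article: f(q_i, D) is the number of times query term q_i occurs in document D, |D| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, n(q_i) is the number of documents containing q_i, and k_1 and b are tuning parameters. The IDF shown is the variant with 1 added inside the logarithm, as used by Lucene; other implementations may differ.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(1 + \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}\right)
```

The fraction approaches k_1 + 1 as f(q_i, D) grows, which is the saturation effect, and the |D| / avgdl term in the denominator lowers the score of documents that are longer than average.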
BM25's interpretability allows for debugging and tuning via parameters like k1 (saturation speed) and b (length normalization strength).
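
To make the roles of k1 and b concrete, here is a minimal Python sketch of the score contribution of a single query term. The function name `bm25_term_score` and the sample statistics are hypothetical and chosen for illustration; the defaults k1=1.2 and b=0.75 match the defaults used by Lucene and Elasticsearch, and the IDF variant is an assumption, since implementations differ.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Contribution of one query term to one document's BM25 score."""
    # Rarer terms (low doc_freq) get a higher IDF weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly documents longer than average are penalized.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: the score rises with tf but levels off below idf * (k1 + 1).
scores = [bm25_term_score(tf, 100, 100, 1000, 50) for tf in (1, 10, 100)]
print(scores)

# Length normalization: with b=0 the document length is ignored entirely.
short_doc = bm25_term_score(3, 50, 100, 1000, 50, b=0.0)
long_doc = bm25_term_score(3, 500, 100, 1000, 50, b=0.0)
print(short_doc == long_doc)
```

Raising k1 lets repeated terms keep adding score for longer before the curve flattens, and raising b toward 1 penalizes long documents more heavily.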
BM25 is a bag-of-words model, so it does not handle synonyms, word order, or semantic intent.
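
The bag-of-words limitation can be illustrated with a small Python sketch. The whitespace tokenizer and the sample phrases are simplifying assumptions made for this example; real analyzers also apply stemming, stop-word removal, and similar steps.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce text to term counts, discarding word order."""
    return Counter(text.lower().split())

# Word order is lost: these two phrases become the same bag of terms.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")
print(same_bag)  # True

# Synonyms do not match: the two phrases share no terms at all.
overlap = set(bag_of_words("cheap car")) & set(bag_of_words("inexpensive automobile"))
print(overlap)  # set()
```

Any scoring function built only on these term counts, BM25 included, will treat the first pair as identical and the second pair as unrelated.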
BM25 is ideal for keyword-heavy searches but should be combined with dense retrieval methods for semantic queries.
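
The summary does not say how the two kinds of retrieval should be combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the article's method. The function name, the document IDs, and the two input rankings are hypothetical; k=60 is a conventional constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]    # hypothetical keyword (BM25) results
dense_ranking = ["d3", "d1", "d4"]   # hypothetical dense retrieval results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and embedding similarities are on different scales.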
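In Elasticsearch, term statistics are computed per shard by default, so the same document can score differently depending on which shard holds it. Document lengths are also stored in a lossy compressed form, which makes length normalization approximate.
Debuggability is a key advantage: because each score is a sum of per-term contributions, a ranking can be traced back to specific term statistics such as term frequency, document frequency, and document length.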
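
The per-shard effect can be seen with a small Python sketch that computes IDF from shard-local statistics. The shard sizes and document frequencies are made-up numbers for illustration, and the IDF formula is the Lucene-style variant, assumed here.

```python
import math

def idf(n_docs, doc_freq):
    """IDF computed from whatever statistics the scorer can see."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical statistics for the same term on two shards of one index.
shard_a = {"n_docs": 1000, "doc_freq": 10}    # term is rare on this shard
shard_b = {"n_docs": 1000, "doc_freq": 200}   # term is common on this shard

idf_a = idf(**shard_a)
idf_b = idf(**shard_b)
# IDF if the statistics were pooled across both shards.
global_idf = idf(2000, 210)

print(idf_a, idf_b, global_idf)
```

Two otherwise identical documents would receive different scores on shard A and shard B. Elasticsearch offers the `dfs_query_then_fetch` search type, which gathers global term statistics before scoring, at the cost of an extra round trip.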
