The Evolution of 'More Like This'

7 hours ago

More Like This (MLT) finds documents similar to one the user has already selected, which is useful when reading articles, browsing products, or investigating support tickets.
Traditional MLT was lexical: it matched a document's important words, weighted by techniques such as TF-IDF or BM25. This remains effective for exact matches such as error codes, SKUs, or legal wording.
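As a rough illustration of lexical MLT, the sketch below weights terms with a smoothed TF-IDF and ranks the other documents by cosine similarity to the seed document. The tokenizer, the smoothing formula, the function names, and the sample tickets are all assumptions made for this example; production engines use proper analyzers and usually BM25.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace split; a real engine would use an analyzer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document (docs: id -> text)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        vectors[doc_id] = {
            # Smoothed IDF (an assumption of this sketch) keeps weights positive.
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def more_like_this(docs, seed_id, k=3):
    """Return the ids of the k documents most similar to the seed document."""
    vectors = tfidf_vectors(docs)
    seed = vectors[seed_id]
    scored = [(cosine(seed, vec), doc_id)
              for doc_id, vec in vectors.items() if doc_id != seed_id]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical support tickets, invented for illustration.
DOCS = {
    "t1": "payment failed with error E1042 during checkout",
    "t2": "checkout error E1042 after card payment",
    "t3": "how to reset my account password",
    "t4": "password reset email never arrives",
}
print(more_like_this(DOCS, "t1", k=1))
```

The error code `E1042` only helps here because both tickets contain the identical token, which is the strength and the limit of the lexical approach.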
Embeddings shift MLT to semantic search: documents are compared by their vector representations, which captures meaning even when the phrasing differs and supports cross-lingual and conceptual similarity.
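A minimal sketch of the vector comparison, assuming the embeddings have already been produced by some model. The three-dimensional vectors below are hand-written placeholders, not real model output, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder embeddings (assumption): a real model emits hundreds of dimensions.
EMBEDDINGS = {
    "cannot log in to my account": [0.9, 0.1, 0.0],
    "sign-in keeps failing": [0.8, 0.2, 0.1],
    "invoice shows the wrong amount": [0.1, 0.9, 0.2],
}


def most_similar(seed_text, embeddings):
    """Return the text whose vector is closest to the seed text's vector."""
    seed = embeddings[seed_text]
    scored = [(cosine_similarity(seed, vec), text)
              for text, vec in embeddings.items() if text != seed_text]
    return max(scored)[1]


print(most_similar("cannot log in to my account", EMBEDDINGS))
```

The two login-related texts share almost no words, yet their placeholder vectors point in nearly the same direction, which is the behavior a real embedding model is expected to provide.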
Hybrid search combines the two approaches, using lexical matching for precise terms and vector search for semantic relationships; reranking and filters then refine the results.
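One common way to merge a lexical ranking with a vector ranking is reciprocal rank fusion (RRF). The summary does not say which fusion method is used, so the sketch below is only one plausible option; the constant `k=60` is a conventional default, and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list is ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank) to the fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a lexical query and a vector query.
lexical_hits = ["doc_sku", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_sku"]

fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against vector distances. Documents found by both retrievers rise to the top, and a reranker or filter can then be applied to the fused list.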
Modern implementations build MLT into the search engine itself (Manticore is one example), so a KNN/ANN search can be issued directly from a document ID, which reduces complexity and improves performance in production systems.
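To show what "search by document ID" means in practice, here is a toy in-memory index that looks up the stored vector for an id and runs an exact nearest-neighbor scan. This is not Manticore's API; the class name `TinyVectorIndex` and its methods are hypothetical, and a real engine would use an ANN structure rather than a full scan.

```python
import heapq
import math


class TinyVectorIndex:
    """Toy exact-KNN index keyed by document id (illustrative only)."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, vector):
        self._vectors[doc_id] = list(vector)

    def knn_by_id(self, doc_id, k=3):
        """Return ids of the k nearest documents to the one stored under doc_id."""
        # The caller passes only an id; the index supplies the stored vector,
        # so the application never has to re-embed the seed document.
        query = self._vectors[doc_id]
        candidates = (
            (math.dist(query, vec), other_id)
            for other_id, vec in self._vectors.items()
            if other_id != doc_id
        )
        return [other_id for _, other_id in heapq.nsmallest(k, candidates)]


index = TinyVectorIndex()
index.add(1, [0.0, 0.0])
index.add(2, [0.1, 0.0])
index.add(3, [1.0, 1.0])
index.add(4, [0.2, 0.1])
print(index.knn_by_id(1, k=2))
```

Keeping the lookup and the search in one place removes a round trip: the application does not fetch the vector, ship it back, and query again.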
MLT has evolved from the lexical methods of the 2000s, through 2010s embeddings such as Word2Vec, to recent work with ANN libraries and RAG, where it supports context expansion and personalized retrieval.
Key considerations for production include exact-match requirements, embedding model management, access controls, hybrid search tuning, reranking, and monitoring of search quality metrics.
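The summary does not name specific quality metrics. As one possible starting point for monitoring, precision@k and recall@k can be computed from a small set of human relevance judgments. The function names and the judged ids below are assumptions made for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical MLT output and hand-labeled relevant documents for one seed.
retrieved = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

print(precision_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=3))
```

Tracking these numbers over a fixed set of seed documents makes it easier to notice when a change to the embedding model or the hybrid weighting degrades results.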
