Dense Retrievers Know More Than They Can Express
- #retrieval models
- #sparse autoencoders
- #latent vocabulary
- Retrieval models learn richer representations than their scoring operators allow them to express.
- Single-vector retrieval is bottlenecked by cosine-similarity scoring, whereas multi-vector models such as ColBERT score with MaxSim, a more expressive operator.
- Sparse autoencoders (SAEs) extract a "latent vocabulary" from neural retrievers. Its terms follow a Zipfian distribution, much like words in natural language.
- The Latent Terms extracted by SAEs span lexical, narrow semantic, and broad topical features, and can be used with lexical retrieval methods such as BM25.
- SAE-extracted Latent Terms improve retrieval performance: they outperform single-vector models and are competitive with methods such as SPLADE on benchmarks such as LIMIT.
- This latent structure emerges from retrieval-focused training rather than from language-model pretraining alone, which indicates that retrievers learn signals their scoring operators leave untapped.
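
The points above about scoring operators and Latent Terms can be made concrete with a small sketch. It is illustrative only and not the method from the summarized work: the mean pooling, the top-k ReLU encoder, the `latent_<i>` naming, and all weights and vectors are assumptions chosen so the example runs on the Python standard library alone. It shows two documents that a pooled single-vector cosine score cannot tell apart while MaxSim can, and how a sparse encoder could turn one dense embedding into a handful of weighted term ids.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def mean_pool(vecs):
    # Assumption: collapse token vectors into one vector by averaging.
    # Real single-vector retrievers may pool differently (e.g. a CLS token).
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def maxsim(query_vecs, doc_vecs):
    # ColBERT-style late interaction: every query token keeps only its
    # best-matching document token, and the per-token maxima are summed.
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

def latent_terms(embedding, encoder_rows, bias, k):
    # Hypothetical top-k ReLU sparse encoder: activation_i = relu(w_i . x + b_i).
    # Returns at most k strictly positive activations keyed by a made-up term id.
    acts = [max(0.0, dot(w, embedding) + b) for w, b in zip(encoder_rows, bias)]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return {f"latent_{i}": acts[i] for i in top if acts[i] > 0.0}

# Toy token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[1.0, 1.0], [1.0, 1.0]]   # only a blurred mixture of both tokens

# After pooling, both documents point in the same direction as the query,
# so the cosine score is (up to rounding) identical for doc_a and doc_b.
pooled_a = cosine(mean_pool(query), mean_pool(doc_a))
pooled_b = cosine(mean_pool(query), mean_pool(doc_b))

# MaxSim still separates them: 2.0 for doc_a versus about 1.414 for doc_b.
late_a = maxsim(query, doc_a)
late_b = maxsim(query, doc_b)

# Toy encoder with three latent units; weights are invented, not learned.
encoder_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
terms = latent_terms([0.2, 0.9], encoder_rows, bias, k=2)
# terms == {"latent_1": 0.9, "latent_0": 0.2}
```

Once documents and queries are expressed as sparse sets of term ids like these, they can in principle be stored in an ordinary inverted index and ranked with a lexical scorer such as BM25. How activations should be mapped to term weights or frequencies is not specified in the summary and is left open here.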