Dense Retrievers Know More Than They Can Express


Dense retrieval models learn richer representations than their scoring operators allow them to express.
Single-vector retrieval reduces each query-document comparison to a single cosine similarity, while multi-vector models like ColBERT use MaxSim, which matches every query token against its best-matching document token and is therefore more expressive.
Sparse autoencoders (SAEs) extract a 'latent vocabulary' from neural retrievers whose term frequencies follow a Zipfian distribution, much like word frequencies in natural language.
Latent Terms from SAEs include lexical, narrow semantic, and broad topical features, usable with lexical retrieval methods like BM25.
SAE-extracted Latent Terms improve retrieval performance, outperforming single-vector models and competing with methods like SPLADE on benchmarks such as LIMIT.
This latent structure emerges from retrieval-focused training, not from pretrained language models alone, indicating that retrievers learn signals their scoring operators leave untapped.
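
To make the contrast between scoring operators concrete, here is a minimal NumPy sketch, not the article's implementation. It assumes that a single-vector model can be approximated by mean-pooling token embeddings, that MaxSim is the usual ColBERT-style sum of per-query-token maxima over cosine similarities, and that an SAE encoder is a ReLU over a linear map followed by top-k selection. The names `latent_terms`, `W_enc`, `b_enc` and the `lt_` token prefix are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_score(q_vec, d_vec):
    """Single-vector scoring: one cosine similarity per query-document pair."""
    return float(q_vec @ d_vec / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))


def maxsim_score(q_tokens, d_tokens):
    """Late interaction: best document token per query token, summed."""
    q = q_tokens / np.linalg.norm(q_tokens, axis=1, keepdims=True)
    d = d_tokens / np.linalg.norm(d_tokens, axis=1, keepdims=True)
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


def latent_terms(embedding, W_enc, b_enc, k=4):
    """Hypothetical SAE encoder: ReLU(W_enc @ x + b_enc), keep the top-k
    active features, and name each one as a pseudo-word."""
    acts = np.maximum(W_enc @ embedding + b_enc, 0.0)
    top = np.argsort(-acts)[:k]        # indices of the largest activations
    return [f"lt_{i}" for i in top if acts[i] > 0]


# Toy token embeddings (assumed, two dimensions for readability).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_exact = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # has both query tokens
doc_blend = np.array([[1.0, 1.0]])                          # only a blend of them

# Mean pooling (assumed stand-in for a single-vector encoder) cannot
# separate the two documents: both pooled vectors point along [1, 1].
pooled_exact = cosine_score(query.mean(axis=0), doc_exact.mean(axis=0))
pooled_blend = cosine_score(query.mean(axis=0), doc_blend.mean(axis=0))

# MaxSim separates them: about 2.0 for doc_exact, about 1.414 for doc_blend.
late_exact = maxsim_score(query, doc_exact)
late_blend = maxsim_score(query, doc_blend)

# Latent terms from a toy 3-feature encoder (identity weights, zero bias).
terms = latent_terms(np.array([0.2, 0.0, 0.9]), np.eye(3), np.zeros(3), k=2)
```

The pseudo-words returned by `latent_terms` are plain string tokens, so under these assumptions they could be handed to any lexical index that scores with BM25 in the same way ordinary words are.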
