Hasty Briefs (beta)

Dense Retrievers Know More Than They Can Express

5 hours ago
  • #retrieval models
  • #sparse autoencoders
  • #latent vocabulary
  • Retrieval models learn richer representations than their scoring operators allow them to express.
  • Single-vector retrieval is limited by cosine similarity scoring, while multi-vector models like ColBERT use MaxSim for finer expressiveness.
  • Sparse autoencoders (SAEs) extract a 'latent vocabulary' from neural retrievers; the frequencies of its terms follow a Zipfian distribution, as in natural language.
  • Latent Terms from SAEs include lexical, narrow semantic, and broad topical features, usable with lexical retrieval methods like BM25.
  • SAE-extracted Latent Terms improve retrieval performance, outperforming single-vector models and competing with methods like SPLADE on benchmarks such as LIMIT.
  • This latent structure emerges from retrieval-focused training, not from pretrained language models alone, indicating that retrievers learn signals their scoring operators leave untapped.
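
The contrast between single-vector cosine scoring and MaxSim can be made concrete with a toy example. The sketch below is illustrative only: the two-dimensional token vectors are hand-picked, and mean pooling is an assumed stand-in for however a given single-vector model collapses its token representations. It shows two documents that tie under pooled cosine similarity but are separated by MaxSim.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def mean_pool(vectors):
    # Assumption: the single-vector model pools token vectors by averaging.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_vector_score(query_tokens, doc_tokens):
    # One vector per side, compared with cosine similarity.
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))

def maxsim_score(query_tokens, doc_tokens):
    # ColBERT-style late interaction: each query token takes its best
    # matching document token, and the per-token maxima are summed.
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Hand-picked toy token embeddings (hypothetical, not from any real model).
query = [[1.0, 0.0], [0.0, 1.0]]   # two distinct query concepts
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each concept exactly
doc_b = [[1.0, 1.0]]               # one blended token, matches neither exactly

single_a = single_vector_score(query, doc_a)
single_b = single_vector_score(query, doc_b)
maxsim_a = maxsim_score(query, doc_a)
maxsim_b = maxsim_score(query, doc_b)

print("pooled cosine:", single_a, single_b)
print("MaxSim:       ", maxsim_a, maxsim_b)
```

After pooling, both documents point in the same direction as the query, so cosine similarity cannot tell them apart; MaxSim still rewards `doc_a` for matching each query token individually.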
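
The idea of treating SAE features as a vocabulary for lexical retrieval can also be sketched in a few lines. Everything below is an assumption made for illustration: the encoder weights are hand-picked rather than trained, the encoder is a plain ReLU layer with top-k sparsification, and the rule for turning an activation into a term count (rounding up) is a hypothetical choice, since the summary does not say how Latent Terms are weighted. The BM25 formula is the standard one with the common defaults `k1=1.2` and `b=0.75`.

```python
import math
from collections import Counter

def sae_encode(x, W_enc, b_enc, k):
    # Toy SAE encoder: ReLU(W x + b), then keep only the k largest activations.
    acts = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return {i: acts[i] for i in top if acts[i] > 0.0}

def to_latent_terms(sparse_code):
    # Each active feature i becomes a pseudo-token "f{i}".
    # Assumption: activation strength maps to a term count by rounding up.
    return Counter({f"f{i}": max(1, math.ceil(a)) for i, a in sparse_code.items()})

def bm25_score(query_bag, doc_bag, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    doc_len = sum(doc_bag.values())
    score = 0.0
    for term in query_bag:
        tf = doc_bag.get(term, 0)
        if tf == 0:
            continue
        df = doc_freq[term]
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score

# Hypothetical encoder: features 0-2 are narrow, feature 3 is broader
# (it fires on either of the first two input dimensions).
W_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b_enc = [0.0, 0.0, 0.0, 0.0]

# Hand-picked dense embeddings standing in for retriever outputs.
doc_vecs = [[2.0, 0.0, 0.1], [0.0, 2.0, 0.1], [0.0, 0.1, 2.0]]
query_vec = [1.5, 0.0, 0.0]

doc_bags = [to_latent_terms(sae_encode(v, W_enc, b_enc, k=2)) for v in doc_vecs]
query_bag = to_latent_terms(sae_encode(query_vec, W_enc, b_enc, k=2))

# Collection statistics needed by BM25.
doc_freq = Counter(term for bag in doc_bags for term in bag)
avg_len = sum(sum(bag.values()) for bag in doc_bags) / len(doc_bags)

scores = [bm25_score(query_bag, bag, doc_freq, len(doc_bags), avg_len)
          for bag in doc_bags]
print(scores)
```

In this toy collection the first document shares both a narrow and a broad feature with the query, the second shares only the broad one, and the third shares none, so BM25 over the pseudo-tokens ranks them in that order.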