Hasty Briefs beta

Dispersion loss counteracts embedding condensation in small language models

6 hours ago
Tags: #Dispersion Loss, #Embedding Condensation, #Model Geometry

  • Embedding condensation is a geometric phenomenon in which token embeddings in Transformers become increasingly similar to one another and end up confined to a narrow cone, as indicated by high pairwise cosine similarity.
  • Condensation is more severe in smaller models than in larger ones. It is reproducible under controlled settings, emerges at initialization, and is not resolved by knowledge distillation.
  • The paper proposes a dispersion loss, a training objective that counteracts embedding condensation, with the aim of improving the expressivity and performance of smaller language models without increasing their parameter count.
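
To make the geometry in the points above concrete, here is a minimal sketch in plain Python. It measures condensation as the mean pairwise cosine similarity of a set of embeddings and computes one possible dispersion-style penalty. The summary does not state the exact form of the paper's loss, so `dispersion_penalty`, its `temperature` parameter, and the toy vectors are assumptions chosen for illustration, not the authors' formulation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs.

    Values close to 1.0 mean the vectors sit in a narrow cone (condensed).
    """
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return sum(sims) / len(sims)


def dispersion_penalty(embeddings, temperature=0.5):
    """Hypothetical penalty (an assumption, not the paper's definition).

    Log of the mean of exp(cos / temperature) over distinct pairs.
    It decreases as the embeddings spread apart in direction.
    """
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return math.log(sum(math.exp(s / temperature) for s in sims) / len(sims))


# Toy 2-D data: three nearly parallel vectors vs. three spread 120 degrees apart.
condensed = [[1.0, 0.0], [0.99, 0.1], [0.98, -0.1]]
dispersed = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]

for name, vectors in (("condensed", condensed), ("dispersed", dispersed)):
    print(name, mean_pairwise_cosine(vectors), dispersion_penalty(vectors))
```

In a training loop, a penalty of this kind would typically be added to the language-modeling loss with a small weight, so that minimizing the total also pushes embeddings apart. The weighting and the choice of which layers' embeddings to penalize are left open here because the summary does not specify them.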