Dispersion loss counteracts embedding condensation in small language models

Embedding condensation is a geometric phenomenon in which token embeddings in Transformers become increasingly similar to one another and are confined to a narrow cone, as measured by cosine similarity.
This condensation is more severe in smaller models than in larger ones, is reproducible under controlled settings, emerges at initialization, and is not resolved by knowledge distillation.
The paper proposes a dispersion loss as a training objective to counteract embedding condensation, with the aim of improving the expressivity and performance of smaller language models without increasing their parameter count.
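
To make the idea concrete, here is a minimal sketch of how condensation could be quantified and penalized. It is not the paper's formulation, which this summary does not give. It assumes that condensation is measured as the mean pairwise cosine similarity of a batch of embeddings, and that a dispersion term simply adds a weighted copy of that value to the training loss. The names `mean_pairwise_cosine`, `dispersion_penalty`, and the `weight` parameter are hypothetical.

```python
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows.

    Values near 1 mean the rows point in almost the same direction
    (a narrow cone); values near 0 mean they are spread out.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    sim = unit @ unit.T                              # (n, n) cosine matrix
    n = sim.shape[0]
    off_diagonal_sum = sim.sum() - np.trace(sim)     # drop self-similarity
    return float(off_diagonal_sum / (n * (n - 1)))


def dispersion_penalty(embeddings: np.ndarray, weight: float = 0.1) -> float:
    """Hypothetical dispersion term: grows as embeddings condense.

    Assumption: the penalty is a weighted mean pairwise cosine similarity;
    the paper's actual objective may differ.
    """
    return weight * mean_pairwise_cosine(embeddings)


rng = np.random.default_rng(0)
spread = rng.normal(size=(64, 32))        # roughly isotropic embeddings
condensed = spread + 10.0 * np.ones(32)   # shared offset squeezes them into a cone

print("spread   :", mean_pairwise_cosine(spread))
print("condensed:", mean_pairwise_cosine(condensed))
```

In a real training loop the same quantity would be computed with a differentiable framework and added to the language-modeling loss, so that gradients push embeddings apart rather than only reporting how condensed they are.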
