Dispersion loss counteracts embedding condensation in small language models
4 hours ago
- #Dispersion Loss
- #Embedding Condensation
- #Model Geometry
- Embedding condensation is a geometric phenomenon in which token embeddings in Transformers become increasingly similar to one another and end up confined to a narrow cone, as measured by cosine similarity.
- Condensation is more severe in smaller models than in larger ones, is reproducible under controlled settings, emerges at initialization, and is not resolved by knowledge distillation.
- The paper proposes a dispersion loss training objective to counteract embedding condensation, aiming to improve the expressivity and performance of smaller language models without increasing their parameter count.
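
To make the geometry concrete, here is a minimal sketch of how condensation could be measured and penalized. The summary does not give the paper's actual loss formulation, so everything here is an assumption for illustration: `mean_pairwise_cosine` is one plausible condensation metric, and `total_loss` with its `weight` parameter is a hypothetical penalty term, not the authors' objective.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings.

    Values close to 1 mean the vectors sit in a narrow cone (condensed);
    lower values mean they are spread out over more directions.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(embeddings[i], embeddings[j])
    return total / (n * (n - 1) / 2)

def total_loss(task_loss, embeddings, weight=0.1):
    """Hypothetical combined objective (an assumption, not the paper's formula):
    the task loss plus a penalty that grows as embeddings become more similar."""
    return task_loss + weight * mean_pairwise_cosine(embeddings)

# Toy 2-D embeddings: one set packed into a narrow cone, one spread out.
condensed = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
dispersed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_cosine(condensed))  # close to 1
print(mean_pairwise_cosine(dispersed))  # well below the condensed value
```

With the same task loss, the condensed set incurs a larger total loss than the dispersed set, so minimizing this kind of objective pushes embeddings apart without adding parameters. A real training setup would compute the penalty on batched tensors in an autodiff framework rather than on Python lists.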