From Multi-Head to Latent Attention: The Evolution of Attention Mechanisms
- #Transformer Models
- #Natural Language Processing
- #Attention Mechanisms
- Attention mechanisms allow models to selectively focus on the most relevant parts of the input context.
- The key components of attention are the Query (Q), Key (K), and Value (V) vectors, together with the attention scores computed between queries and keys (a minimal sketch follows this list).
- Multi-Head Attention (MHA) runs many attention heads in parallel, but keeping separate Key and Value vectors for every head makes it expensive in compute and memory.
- Multi-Query Attention (MQA) reduces this overhead by sharing a single set of Key and Value vectors across all query heads.
- Grouped Query Attention (GQA) strikes a balance between MHA and MQA by splitting the query heads into groups, with each group sharing one Key-Value head (see the grouped-query sketch below, which covers all three variants).
- Multi-Head Latent Attention (MHLA) compresses the Key and Value vectors into a smaller latent space, shrinking the per-token state that must be kept around (see the latent-attention sketch below).
- KV caching stores the Key and Value vectors of already-processed tokens so they are not recomputed at every decoding step, speeding up inference (see the caching sketch below).
- Attention mechanisms are evolving to improve scalability, speed, and memory efficiency.
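To make the Q/K/V and attention-score terms concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name, tensor shapes, and optional mask argument are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Minimal sketch: q, k, v are (batch, seq_len, d_k) tensors (shapes illustrative)."""
    d_k = q.size(-1)
    # Attention scores: similarity between each query and every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                 # normalize over the keys
    return weights @ v                                  # weighted sum of values

# Toy usage with random tensors.
q = torch.randn(1, 4, 64)
k = torch.randn(1, 4, 64)
v = torch.randn(1, 4, 64)
out = scaled_dot_product_attention(q, k, v)             # (1, 4, 64)
```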
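The MHA/MQA/GQA trade-off can be seen in a single module where the number of Key/Value heads is the only knob: `num_kv_heads == num_heads` recovers MHA, `num_kv_heads == 1` recovers MQA, and anything in between is GQA. This is a simplified sketch (no masking, dropout, or caching); all names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    """Sketch: num_kv_heads == num_heads -> MHA, num_kv_heads == 1 -> MQA, otherwise GQA."""
    def __init__(self, d_model, num_heads, num_kv_heads):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)  # fewer K heads
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)  # fewer V heads
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one Key/Value head.
        group = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group, dim=1)            # (b, num_heads, t, head_dim)
        v = v.repeat_interleave(group, dim=1)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```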
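A simplified sketch of the latent-compression idea behind MHLA: each token's hidden state is down-projected to a small latent vector, which is the only thing that would need to be cached; Keys and Values are re-expanded from it on the fly. This omits details of real implementations (e.g. how positional information is handled), and the projection names and latent dimension are assumptions.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Sketch: K and V are reconstructed from a small shared latent per token,
    so only the latent needs to be cached (names/dims are illustrative)."""
    def __init__(self, d_model, num_heads, latent_dim):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, latent_dim)    # compress hidden state to latent
        self.k_up = nn.Linear(latent_dim, d_model)       # expand latent -> Keys
        self.v_up = nn.Linear(latent_dim, d_model)       # expand latent -> Values
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                                   # (b, t, latent_dim)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)      # reuse cached latents
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # no causal mask in this sketch
        out = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent                            # return latent for caching
```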
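Finally, a minimal sketch of KV caching at decode time: each step's new Key/Value tensors are appended to a growing cache so earlier tokens are never recomputed. The cache layout and helper name are illustrative.

```python
import torch

def decode_step(k_new, v_new, cache):
    """Append this step's K/V to the cache; tensors are (batch, heads, seq, head_dim)."""
    if cache["k"] is None:
        cache["k"], cache["v"] = k_new, v_new
    else:
        cache["k"] = torch.cat([cache["k"], k_new], dim=2)
        cache["v"] = torch.cat([cache["v"], v_new], dim=2)
    return cache["k"], cache["v"]

cache = {"k": None, "v": None}
for _ in range(3):                        # three decode steps of one new token each
    k_new = torch.randn(1, 8, 1, 64)
    v_new = torch.randn(1, 8, 1, 64)
    k, v = decode_step(k_new, v_new, cache)
print(k.shape)                            # torch.Size([1, 8, 3, 64])
```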