KV Sharing, MHC, and Compressed Attention

a day ago

New LLM architectures focus on long-context efficiency, targeting the costs of KV-cache size, memory traffic, and attention compute.
Gemma 4 introduces KV sharing and per-layer embeddings (PLE): KV sharing reduces cache size by reusing KV tensors across layers, while PLE adds capacity via embedding tables without scaling the transformer stack.
Laguna XS.2 employs layer-wise attention budgeting: it varies the number of query heads per layer, giving more heads to sliding-window layers and fewer to global layers to allocate attention capacity where it is cheapest.
ZAYA1-8B uses Compressed Convolutional Attention (CCA): CCA performs attention in a compressed latent space, reducing both KV cache size and attention FLOPs.
DeepSeek V4 incorporates manifold-constrained hyper-connections (mHC) and compressed attention (CSA/HCA): mHC widens the residual stream with constrained mappings, while CSA/HCA compress the sequence dimension to lower long-context costs.
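
To make the cache-size argument concrete, here is a rough back-of-the-envelope sketch of how KV sharing shrinks the KV cache. The function name, its parameters, and all model dimensions below are hypothetical and chosen only for illustration; they are not taken from Gemma 4 or any other model mentioned above. The sketch assumes a layer that reuses another layer's KV tensors stores nothing of its own.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, n_shared_layers=0):
    """Estimate KV-cache size in bytes for a single sequence.

    n_shared_layers is the (hypothetical) number of layers that reuse
    another layer's K/V tensors instead of caching their own.
    """
    if not 0 <= n_shared_layers <= n_layers:
        raise ValueError("n_shared_layers must be between 0 and n_layers")
    layers_with_own_kv = n_layers - n_shared_layers
    # Factor 2: each caching layer stores one K and one V tensor.
    return (2 * layers_with_own_kv * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)


# Illustrative dimensions only (assumed, not from any real model).
config = dict(seq_len=131_072, n_layers=32, n_kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(**config)
shared = kv_cache_bytes(**config, n_shared_layers=16)

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"with KV sharing: {shared / 2**30:.1f} GiB")
```

Under these assumed numbers, letting half of the layers reuse KV tensors halves the cache. The same formula suggests why the other techniques help: compressing the latent dimension lowers the per-token term, and compressing the sequence dimension lowers the effective `seq_len`.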
