KV Sharing, mHC, and Compressed Attention
a day ago
- #KV Cache Optimization
- #LLM Architecture
- #Long-context Efficiency
- New LLM architectures focus on long-context efficiency, aiming to cut the costs driven by KV-cache size, memory traffic, and attention compute.
- Gemma 4 introduces KV sharing and per-layer embeddings (PLE): KV sharing reduces cache size by reusing KV tensors across layers, while PLE adds capacity via embedding tables without scaling the transformer stack.
- Laguna XS.2 employs layer-wise attention budgeting: It varies query head counts per layer, giving more heads to sliding-window layers and fewer to global layers to optimize attention capacity.
- ZAYA1-8B uses Compressed Convolutional Attention (CCA): CCA performs attention in a compressed latent space, reducing both KV cache size and attention FLOPs.
- DeepSeek V4 incorporates manifold-constrained hyper-connections (mHC) and compressed attention (CSA/HCA): mHC widens the residual stream with constrained mappings, while CSA and HCA compress the sequence dimension to lower long-context costs.
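
To make the KV-cache argument concrete, here is a rough back-of-the-envelope sketch of how cross-layer KV sharing shrinks the cache. It assumes the usual estimate of two tensors (K and V) per storing layer, each of shape `n_kv_heads x head_dim x seq_len`. The function name, the `n_shared_layers` parameter, and every number below are illustrative assumptions, not the configuration of Gemma 4 or any other model named above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, n_shared_layers=0):
    """Approximate KV-cache size in bytes for a single sequence.

    n_shared_layers is a hypothetical knob: the number of layers that
    reuse another layer's K/V tensors and therefore store nothing.
    """
    if not 0 <= n_shared_layers < n_layers:
        raise ValueError("n_shared_layers must be in [0, n_layers)")
    layers_storing_kv = n_layers - n_shared_layers
    # Factor of 2 covers the separate K and V tensors.
    return 2 * layers_storing_kv * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 128K context, 16-bit cache.
baseline = kv_cache_bytes(32, 8, 128, 131072)
shared = kv_cache_bytes(32, 8, 128, 131072, n_shared_layers=16)

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"with half the layers sharing KV: {shared / 2**30:.1f} GiB")
```

Under these assumed numbers, the cache scales linearly with the number of layers that actually store K/V, so letting half the layers reuse existing tensors halves the footprint. Approaches that compress the latent or sequence dimension would instead shrink the `head_dim` or `seq_len` factors in the same product.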