Hasty Briefs

KV Sharing, MHC, and Compressed Attention

a day ago
  • #KV Cache Optimization
  • #LLM Architecture
  • #Long-context Efficiency
  • Recent LLM architectures target long-context efficiency, cutting the costs that grow with context length: KV-cache size, memory traffic, and attention compute.
  • Gemma 4 introduces KV sharing and per-layer embeddings (PLE): KV sharing reduces cache size by reusing KV tensors across layers, while PLE adds capacity through embedding tables rather than by scaling the transformer stack.
  • Laguna XS.2 employs layer-wise attention budgeting: it varies the query-head count per layer, giving more heads to sliding-window layers and fewer to global layers to optimize attention capacity.
  • ZAYA1-8B uses Compressed Convolutional Attention (CCA): CCA performs attention in a compressed latent space, reducing both KV cache size and attention FLOPs.
  • DeepSeek V4 incorporates manifold-constrained hyper-connections (mHC) and compressed attention (CSA/HCA): mHC widens the residual stream with constrained mappings; CSA/HCA compresses the sequence dimension to lower long-context costs.
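
To make the cache-size arguments concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from any of the models above: the function name `kv_cache_bytes`, its parameters, and the example configuration (32 layers, 8 KV heads, head dimension 128, 128K tokens, 2-byte elements) are all hypothetical. It models KV sharing as some layers storing no cache of their own, and sequence compression as a uniform reduction factor, which is a simplification of schemes such as CSA/HCA.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, layers_sharing_kv=0, seq_compression=1):
    """Approximate KV-cache size in bytes for a single sequence.

    layers_sharing_kv: layers that reuse another layer's K/V tensors
                       and therefore store nothing (assumed model of KV sharing).
    seq_compression:   factor by which the cached sequence length shrinks
                       (assumed uniform; real schemes may differ per layer).
    """
    layers_with_cache = num_layers - layers_sharing_kv
    cached_positions = -(-seq_len // seq_compression)  # ceiling division
    # The leading 2 accounts for storing both keys and values.
    return (2 * layers_with_cache * num_kv_heads * head_dim
            * cached_positions * bytes_per_elem)


# Hypothetical configuration, chosen only to give round numbers.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

baseline = kv_cache_bytes(**config)
shared = kv_cache_bytes(**config, layers_sharing_kv=16)
compressed = kv_cache_bytes(**config, seq_compression=4)

for name, size in [("baseline", baseline),
                   ("half the layers share KV", shared),
                   ("4x sequence compression", compressed)]:
    print(f"{name}: {size / 2**30:.1f} GiB")
```

Under these assumptions the baseline cache is 16 GiB per sequence; letting half the layers reuse KV tensors halves it, and a 4x reduction in cached positions cuts it to a quarter. The two levers are independent, so they multiply when combined.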