HySparse: A Hybrid Sparse Attention Architecture

11 hours ago
  • #attention mechanisms
  • #efficiency
  • #machine learning
  • HySparse is a hybrid sparse attention architecture combining full and sparse attention layers.
  • It uses full attention layers as oracles for token selection, eliminating the need for additional proxies.
  • HySparse enables sparse layers to reuse the KV caches of the full attention layers, reducing computation and memory (a minimal sketch of the idea follows this list).
  • Evaluated on 7B dense and 80B MoE models, HySparse outperforms full attention and hybrid sliding-window attention (SWA) baselines.
  • In an 80B MoE model, HySparse reduces KV cache storage by nearly 10x while maintaining performance.
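Below is a minimal PyTorch sketch of the two mechanisms summarized above, not the authors' implementation: a full attention layer whose attention weights double as the token-selection oracle, and a sparse layer that attends only to the oracle-selected tokens while gathering keys and values from the full layer's KV cache rather than keeping its own. The function names, single-head attention, and the simple per-query top-k selection rule are illustrative assumptions.

```python
# A minimal sketch of the idea, NOT the authors' implementation: single-head
# attention, made-up function names, and a simple per-query top-k selection
# rule are all illustrative assumptions.
import torch


def full_attention_oracle(q, k, v, top_k):
    """Full attention over all tokens. The attention weights double as the
    oracle: the top-k most-attended token indices are returned for reuse."""
    scale = q.shape[-1] ** -0.5
    weights = ((q @ k.transpose(-2, -1)) * scale).softmax(dim=-1)  # [B, Tq, Tkv]
    out = weights @ v                                              # [B, Tq, D]
    topk_idx = weights.topk(top_k, dim=-1).indices                 # [B, Tq, k]
    return out, topk_idx


def sparse_attention_reusing_cache(q, k_cache, v_cache, topk_idx):
    """Sparse attention over only the oracle-selected tokens, gathering K/V
    from the full layer's cache instead of maintaining a separate KV cache."""
    scale = q.shape[-1] ** -0.5
    d = k_cache.shape[-1]
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, d)             # [B, Tq, k, D]
    k_sel = k_cache.unsqueeze(1).expand(-1, q.shape[1], -1, -1).gather(2, idx)
    v_sel = v_cache.unsqueeze(1).expand(-1, q.shape[1], -1, -1).gather(2, idx)
    scores = (q.unsqueeze(2) @ k_sel.transpose(-2, -1)) * scale    # [B, Tq, 1, k]
    weights = scores.softmax(dim=-1)
    return (weights @ v_sel).squeeze(2)                            # [B, Tq, D]


if __name__ == "__main__":
    B, T, D, TOP_K = 2, 128, 64, 16
    q, k, v = (torch.randn(B, T, D) for _ in range(3))
    full_out, oracle_idx = full_attention_oracle(q, k, v, TOP_K)      # full layer
    sparse_out = sparse_attention_reusing_cache(q, k, v, oracle_idx)  # sparse layer
    print(full_out.shape, sparse_out.shape)  # both torch.Size([2, 128, 64])
```

Because only the full attention layers need to persist a KV cache in this scheme, a stack dominated by sparse layers would store far fewer caches, which is consistent with the reported near-10x reduction in KV cache storage.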