HySparse: A Hybrid Sparse Attention Architecture
- #attention mechanisms
- #efficiency
- #machine learning
- HySparse is a hybrid sparse attention architecture combining full and sparse attention layers.
- It uses full attention layers as oracles for token selection, eliminating the need for additional proxies.
- HySparse lets sparse layers reuse the KV caches produced by the full attention layers, reducing both computation and memory (a minimal sketch of the idea follows this list).
- Evaluated on a 7B dense model and an 80B MoE model, HySparse outperforms full-attention and hybrid sliding-window attention (SWA) baselines.
- In an 80B MoE model, HySparse reduces KV cache storage by nearly 10x while maintaining performance.
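Below is a minimal, hypothetical sketch of the mechanism described above, not the paper's implementation: a full attention layer produces an importance score per token (here, simply how much attention each key position receives), and a subsequent sparse layer reuses the same KV cache while attending only to the top-k tokens under that score. Single-head, non-causal attention and the specific selection rule are assumptions made for brevity.

```python
# Hypothetical sketch of the hybrid full/sparse attention idea (not HySparse's actual code).
# Assumptions: single-head attention, no causal mask, and "attention received per key
# position" as the oracle signal that drives top-k token selection in sparse layers.
import torch


def full_attention_oracle(q, k, v):
    """Standard full attention; also returns a per-token importance score
    (attention mass each key position receives, summed over queries)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5          # (B, T, T)
    attn = scores.softmax(dim=-1)
    out = attn @ v                                      # (B, T, D)
    importance = attn.sum(dim=-2)                       # (B, T): oracle signal
    return out, importance


def sparse_attention_reusing_kv(q, k, v, importance, top_k):
    """Sparse attention that reuses the full layer's KV cache and restricts
    queries to the top_k tokens ranked by the oracle's importance scores."""
    idx = importance.topk(top_k, dim=-1).indices        # (B, top_k)
    k_sel = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, k.size(-1)))
    v_sel = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, v.size(-1)))
    d = q.size(-1)
    scores = q @ k_sel.transpose(-2, -1) / d**0.5       # (B, T, top_k)
    out = scores.softmax(dim=-1) @ v_sel                # (B, T, D)
    return out


if __name__ == "__main__":
    B, T, D, TOP_K = 2, 128, 64, 16
    q = k = v = torch.randn(B, T, D)
    full_out, importance = full_attention_oracle(q, k, v)
    # A following sparse layer reuses (k, v) and the oracle's token ranking,
    # so it stores no KV cache of its own and attends to only TOP_K tokens.
    sparse_out = sparse_attention_reusing_kv(q, k, v, importance, TOP_K)
    print(full_out.shape, sparse_out.shape)
```

Under this reading, the full layers play two roles at once: they provide exact attention where it is kept, and their attention patterns double as the token-selection oracle, which is why no separate proxy or predictor network is needed and why the sparse layers add essentially no KV-cache storage of their own.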