Hasty Briefsbeta

Bilingual

DashAttention: Differentiable and Adaptable Sparse Hierarchical Attention

a day ago
  • #attention mechanism
  • #long context modeling
  • #sparse attention
  • DashAttention is a new hierarchical attention method that is fully differentiable and adaptive, using α-entmax to select variable numbers of relevant KV blocks per query.
  • It addresses limitations of existing methods like NSA and InfLLMv2, which use top-k selection with fixed token counts and block gradient flow between sparse and dense stages.
  • DashAttention is non-dispersive, enhancing long-context modeling ability compared to other hierarchical attention methods.
  • Experiments show it achieves accuracy comparable to full attention with 75% sparsity and outperforms NSA and InfLLMv2, especially in high-sparsity regimes.
  • An efficient GPU-aware implementation in Triton provides up to a certain speedup over FlashAttention-3 at inference time, offering a cost-effective strategy for long contexts.