DeepSeek won the best paper award at ACL 2025

  • #Efficient Computing
  • #Machine Learning
  • #Natural Language Processing
  • NSA (Native Sparse Attention) is introduced for efficient long-context modeling in language models.
  • It combines a dynamic hierarchical sparse strategy with hardware-aligned optimizations for speed and efficiency (see the sketch after this list).
  • Key innovations include arithmetic intensity-balanced algorithm design and end-to-end trainability.
  • NSA matches or exceeds Full Attention performance across general benchmarks, long-context tasks, and instruction-based reasoning.
  • Substantial speedups over Full Attention are achieved on 64k-length sequences across decoding, forward propagation, and backward propagation.
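
For intuition, below is a minimal, single-head, single-decoding-step sketch of the three-branch design the paper describes: attention over compressed (pooled) key/value blocks, attention over a few selected full-resolution blocks, and attention over a local sliding window, mixed by gates. The block size, selection count, window length, mean-pool compressor, and uniform gates are illustrative assumptions, not DeepSeek's implementation, which learns the compressor and gates end to end and runs on hardware-aligned kernels.

```python
# Sketch of NSA-style hierarchical sparse attention for one decoding step.
# Hyperparameters and the mean-pool compressor are illustrative, not NSA's.
import torch
import torch.nn.functional as F

def nsa_sketch(q, k, v, block=64, n_select=4, window=256):
    # q: (1, d) query for the current step; k, v: (T, d) cached keys/values.
    T, d = k.shape

    # Branch 1: compression -- mean-pool each block of keys/values and
    # attend over the coarse block summaries.
    n_blocks = T // block
    kc = k[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    vc = v[: n_blocks * block].reshape(n_blocks, block, d).mean(dim=1)
    out_cmp = F.scaled_dot_product_attention(q[None], kc[None], vc[None]).squeeze(0)

    # Branch 2: selection -- score blocks against the compressed keys,
    # then attend over the top-scoring blocks at full resolution.
    scores = (q @ kc.T).squeeze(0)                      # (n_blocks,)
    top = scores.topk(min(n_select, n_blocks)).indices  # chosen block ids
    idx = (top[:, None] * block + torch.arange(block)[None, :]).reshape(-1)
    out_sel = F.scaled_dot_product_attention(q[None], k[idx][None], v[idx][None]).squeeze(0)

    # Branch 3: sliding window -- attend over the most recent tokens only.
    out_win = F.scaled_dot_product_attention(q[None], k[-window:][None], v[-window:][None]).squeeze(0)

    # Gated combination; uniform gates here, learned per branch in NSA.
    g = torch.full((3,), 1.0 / 3.0)
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win

# One decoding step against a 4096-token cache.
torch.manual_seed(0)
k, v = torch.randn(4096, 64), torch.randn(4096, 64)
q = torch.randn(1, 64)
print(nsa_sketch(q, k, v).shape)  # torch.Size([1, 64])
```

The point of the structure is that the query touches far fewer than T key/value pairs (coarse summaries, a handful of full blocks, and a local window), which is what makes 64k-length processing cheap while keeping the whole pipeline differentiable and trainable end to end.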