Hasty Briefs (beta)


DeepSeek V4's indexer OOMs at 65K context. We got it to 1M in 6 GB

20 hours ago
  • #Compressed Sparse Attention
  • #Top-k Selection
  • #Memory Optimization
  • StreamIndex is a Triton implementation of Compressed Sparse Attention (CSA) that uses a chunked partition-merge top-k driver to avoid materializing the full intermediate score tensor.
  • It extends the memory-bounded regime by 32x, handling sequences up to 1,048,576 tokens using only 6.21 GB of HBM, compared to the materialize path which OOMs at 65,536 tokens.
  • Set-overlap recall against the materialize ground truth is near-perfect (mean 1.0000, min 0.9980 across design sweeps), indicating the chunked driver selects essentially the same top-k set as the full computation.
  • StreamIndex composes with TileLang's pipelined attention kernel, enabling execution at S=262,144 in 1.97 s while the materialize version OOMs.
  • The contribution focuses on the indexer step, not attention kernel speed or end-to-end real-checkpoint behavior.
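The core trick described above — a chunked partition-merge top-k that never materializes the full score tensor — can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not StreamIndex's Triton kernel; the function names, the `score_fn` callback, and the chunk size are all illustrative assumptions.

```python
import numpy as np

def chunked_topk(score_fn, seq_len, k, chunk=4096):
    """Partition-merge top-k: score the sequence chunk by chunk and keep
    only a running top-k, so the full (seq_len,) score tensor is never
    resident at once. `score_fn(lo, hi)` is a hypothetical stand-in for
    the indexer's per-position scoring over positions [lo, hi)."""
    best_vals = np.full(k, -np.inf)
    best_idx = np.full(k, -1, dtype=np.int64)
    for lo in range(0, seq_len, chunk):
        hi = min(lo + chunk, seq_len)
        vals = score_fn(lo, hi)            # scores for this chunk only
        idx = np.arange(lo, hi)
        # merge chunk candidates with the running top-k, then re-select k
        merged_vals = np.concatenate([best_vals, vals])
        merged_idx = np.concatenate([best_idx, idx])
        keep = np.argpartition(-merged_vals, k - 1)[:k]
        best_vals, best_idx = merged_vals[keep], merged_idx[keep]
    # return indices sorted by descending score
    return best_idx[np.argsort(-best_vals)]
```

Peak memory is O(k + chunk) per query rather than O(seq_len), which is why the memory-bounded regime extends to much longer sequences: the merge buffer stays constant-sized no matter how long the context grows.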