DeepSeek V4's indexer OOMs at 65K context. We got it to 1M in 6 GB
- #Compressed Sparse Attention
- #Top-k Selection
- #Memory Optimization
- StreamIndex is a Triton implementation of Compressed Sparse Attention (CSA) that uses a chunked partition-merge top-k driver to avoid materializing the full intermediate score tensor.
- It extends the memory-bounded regime by 32x, handling sequences up to 1,048,576 tokens in only 6.21 GB of HBM, whereas the materialize path OOMs at 65,536 tokens.
- Set-overlap recall against the materialize ground truth is near-perfect (mean 1.0000, min 0.9980 across design sweeps), so the chunked driver selects essentially the same token set as the exact path.
- StreamIndex composes with TileLang's pipelined attention kernel, enabling execution at S=262,144 in 1.97 s while the materialize version OOMs.
- The contribution focuses on the indexer step, not attention kernel speed or end-to-end real-checkpoint behavior.
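The core idea above, a partition-merge top-k that never materializes the full score row, can be sketched in plain NumPy. This is an illustrative reimplementation of the concept, not StreamIndex's actual Triton kernel; the function name `chunked_topk` and all parameters here are hypothetical. Scores for one query are computed one key chunk at a time, each chunk's local top-k is merged into a running candidate buffer of size k, and only that buffer survives between chunks:

```python
import numpy as np

def chunked_topk(q, K, k, chunk=1024):
    """Streaming top-k over the scores q @ K.T without materializing
    the full (num_keys,) score row.

    Keeps a running buffer of the k best (value, index) pairs seen so
    far; each chunk contributes its local top-k candidates, and a merge
    step (another top-k over buffer + candidates) keeps the global best.
    Memory is O(chunk + k) instead of O(num_keys)."""
    best_vals = np.full(k, -np.inf)               # running top-k values
    best_idx = np.full(k, -1, dtype=np.int64)     # running top-k indices
    for start in range(0, K.shape[0], chunk):
        block = K[start:start + chunk]
        scores = block @ q                        # scores for this chunk only
        m = min(k, scores.shape[0])
        part = np.argpartition(-scores, m - 1)[:m]  # chunk-local top-m
        # Merge chunk candidates with the running buffer, keep global top-k.
        cand_vals = np.concatenate([best_vals, scores[part]])
        cand_idx = np.concatenate([best_idx, part + start])
        keep = np.argpartition(-cand_vals, k - 1)[:k]
        best_vals, best_idx = cand_vals[keep], cand_idx[keep]
    return best_idx
```

The merge step is why the intermediate tensor never grows past `k + chunk` entries, which is the property that moves the memory bound from O(S) per query to O(1) in sequence length. The selected index set matches an exact full-row top-k (up to ties), which is what the recall metric above measures.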