DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
10 hours ago
- #Miles
- #DeepSeek-V4
- #SGLang
- SGLang and Miles provide Day-0 support for DeepSeek-V4 inference and RL training, optimized for its hybrid architecture.
- Key inference features include ShadowRadix prefix caching, HiSparse CPU-extended KV, in-graph speculative decoding, and fast kernel integrations like FlashMLA and FlashInfer.
- Optimizations like Flash Compressor and Lightning TopK reduce HBM round-trips and latency for sparse attention.
- Parallelism strategies (data, tensor, sequence, expert, pipeline, and context parallelism: DP, TP, SP, EP, PP, CP) combine with hierarchical multi-stream overlap to improve throughput and scalability.
- RL training with Miles supports full parallelism, FP8 training, and stability features like R3 and indexer replay.
- Benchmarks show SGLang maintains near-flat decode throughput from 4K to 900K context on B200 and H200 GPUs.
- Future work is tracked in the SGLang and Miles repositories; the post closes with acknowledgments to collaborators and contributors.
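
To make the prefix caching item in the summary above more concrete, here is a minimal sketch of the general idea behind radix-style prefix caches: requests that share a leading run of tokens can reuse the KV entries already computed for that run. This is not ShadowRadix or SGLang's implementation. It uses a plain per-token trie rather than a compressed radix tree, and the names `PrefixCache`, `Node`, and `kv_slot` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    # Maps a token id to the child node reached by that token.
    children: Dict[int, "Node"] = field(default_factory=dict)
    # Hypothetical handle to the cached KV entry for this token position.
    kv_slot: Optional[int] = None


class PrefixCache:
    """Toy token-level trie; an assumption-laden stand-in for a real KV prefix cache."""

    def __init__(self) -> None:
        self.root = Node()
        self._next_slot = 0

    def insert(self, tokens: List[int]) -> None:
        """Record a token sequence, allocating a slot for each new position."""
        node = self.root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(kv_slot=self._next_slot)
                self._next_slot += 1
            node = node.children[tok]

    def match_prefix(self, tokens: List[int]) -> List[int]:
        """Return the slots for the longest cached prefix of `tokens`."""
        node = self.root
        slots: List[int] = []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break  # the remaining tokens must be computed from scratch
            slots.append(child.kv_slot)
            node = child
        return slots


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                    # first request fills the cache
reused = cache.match_prefix([1, 2, 3, 9, 9])  # second request shares 3 tokens
print(len(reused))                            # prints 3
```

In this toy version only the last two tokens of the second request would need fresh prefill work. A production cache would also need eviction, reference counting, and block-level storage, none of which are modeled here.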