DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles

a month ago

SGLang and Miles provide Day-0 support for DeepSeek-V4 inference and RL training, optimized for its hybrid architecture.
Key inference features include ShadowRadix prefix caching, HiSparse CPU-extended KV, in-graph speculative decoding, and fast kernel integrations like FlashMLA and FlashInfer.
Optimizations like Flash Compressor and Lightning TopK reduce HBM round-trips and latency for sparse attention.
Parallelism strategies spanning data (DP), tensor (TP), sequence (SP), expert (EP), pipeline (PP), and context (CP) parallelism, together with hierarchical multi-stream overlap, improve throughput and scalability.
RL training with Miles supports full parallelism, FP8 training, and stability features like R3 and indexer replay.
Benchmarks show SGLang maintaining near-flat decode throughput as context length grows from 4K to 900K tokens on B200 and H200 GPUs.
Future work is tracked in the SGLang and Miles repositories, with acknowledgments to collaborators and contributors.
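
To make the prefix-caching idea mentioned among the inference features more concrete, here is a minimal sketch of generic token-level prefix matching. It is not ShadowRadix or SGLang's actual implementation: the names `PrefixNode`, `PrefixCache`, `insert`, and `match_len` are hypothetical, and a production radix tree would also compress single-child chains and attach KV-cache handles to nodes. The sketch only shows how a new request can find how many leading tokens it shares with earlier requests, which is the portion whose KV entries could in principle be reused.

```python
from typing import Dict, List


class PrefixNode:
    """One node per token; children are keyed by the next token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Toy token trie (hypothetical; not the SGLang or ShadowRadix API)."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        # Record a finished request's token sequence so later requests can match it.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_len(self, tokens: List[int]) -> int:
        # Walk the trie and count how many leading tokens are already cached.
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                    # tokens of an earlier request
reused = cache.match_len([1, 2, 3, 9, 9])     # new request sharing a 3-token prefix
print(reused)
```

In this toy setup, only the tokens after the matched prefix would need fresh prefill work; how the real system maps matched prefixes to KV memory is outside what this sketch models.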
