DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark
5 hours ago
- #DeepSeek V4
- #Multi-Node Setup
- #AI Deployment
- Recipe for DeepSeek V4 Flash running on 2-node Spark successfully implemented.
- Hardware setup includes 2x DGX Spark nodes with 128GB unified memory and direct QSFP56 cable via RoCE.
- Used TP=2 with distributed-executor-backend mp, no Ray, and built on specific vLLM fork commits.
- Observed performance: ~44 token/s decode, 2s TTFT warm, 6 min cold start, up to 200k context with limitations.
- Challenges: Pin NCCL commit critical for cross-node init, issues with image copy, and hardware link troubleshooting.