DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark

5 hours ago

A working recipe for running DeepSeek-V4-Flash across two DGX Spark nodes.
Hardware: 2x DGX Spark nodes, each with 128GB unified memory, connected by a direct QSFP56 cable running RoCE.
Serving: tensor parallelism across both nodes (TP=2) with --distributed-executor-backend mp (no Ray), built from specific vLLM fork commits.
Observed performance: ~44 tokens/s decode, 2 s time to first token (TTFT) when warm, 6 min cold start, and up to 200k context with limitations.
Challenges: pinning the NCCL commit was critical for cross-node initialization; other problems involved copying the image and troubleshooting the hardware link.
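
The serving settings listed above (TP=2, the mp executor backend, a long context window) might translate into a launch command along these lines. This is only a sketch: it uses the standard vLLM flags --tensor-parallel-size, --distributed-executor-backend and --max-model-len, while the model path is a placeholder and the helper name build_serve_args is made up for illustration. The fork-specific commits, the pinned NCCL build, and the cross-node networking options the recipe depends on are not given in this summary, so they are left out.

```python
import shlex
from typing import List, Optional

# Placeholder: the summary does not state the exact model path or ID.
MODEL = "path/to/DeepSeek-V4-Flash"


def build_serve_args(
    model: str,
    tp_size: int = 2,
    max_model_len: Optional[int] = None,
) -> List[str]:
    """Assemble a `vllm serve` argument list (hypothetical helper).

    Only options named in the summary are covered: tensor parallelism
    and the multiprocessing ("mp") executor backend instead of Ray.
    """
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--distributed-executor-backend", "mp",
    ]
    if max_model_len is not None:
        # The summary reports up to 200k context "with limitations".
        args += ["--max-model-len", str(max_model_len)]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_serve_args(MODEL, max_model_len=200_000)))
```

Printing the command rather than executing it keeps the sketch harmless on a machine without vLLM installed; the actual multi-node launch would need the extra per-node settings from the original recipe.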
