DeepSeek-V4-Flash (official FP8) running across 2x DGX Spark

6 hours ago
  • #DeepSeek V4
  • #Multi-Node Setup
  • #AI Deployment
  • A working recipe for running DeepSeek-V4-Flash across two DGX Spark nodes.
  • Hardware: two DGX Spark nodes with 128 GB of unified memory, linked by a direct QSFP56 cable running RoCE.
  • Serving: tensor parallelism (TP=2) with distributed-executor-backend mp instead of Ray, built from specific vLLM fork commits.
  • Observed performance: ~44 tokens/s decode, 2 s time to first token (TTFT) when warm, 6 min cold start, and up to 200k tokens of context with limitations.
  • Challenges: pinning the NCCL commit was critical for cross-node initialization; other problems included copying the image and troubleshooting the hardware link.
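
To put the reported numbers in perspective, here is a rough sketch of what they imply for a single warm request. It assumes a simple model that is not from the original post: total time is the warm TTFT plus the output length divided by the decode rate. The function name and defaults are illustrative, and the estimate ignores batching, cold start, and any extra prefill time for long prompts.

```python
def estimate_response_seconds(
    output_tokens: int,
    ttft_s: float = 2.0,                # warm time to first token, as reported
    decode_tokens_per_s: float = 44.0,  # approximate decode rate, as reported
) -> float:
    """Approximate wall-clock time for one warm, unbatched request.

    Simplified model (an assumption): TTFT + output_tokens / decode rate.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if decode_tokens_per_s <= 0:
        raise ValueError("decode_tokens_per_s must be positive")
    return ttft_s + output_tokens / decode_tokens_per_s


if __name__ == "__main__":
    # Print rough estimates for a few typical response lengths.
    for n in (100, 500, 2000):
        print(f"{n:>5} output tokens: ~{estimate_response_seconds(n):.1f} s")
```

Under these assumptions a 440-token reply takes about 12 s when warm. The 6 min cold start is a one-time cost and is deliberately left out of the per-request estimate.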