Hasty Briefs (beta)

Inference cost at scale with napkin math

4 days ago
  • #Cost Estimation
  • #AI Inference
  • #GPU Scaling
  • Estimating inference costs for serving AI models at scale involves understanding hardware specs and model architecture.
  • Key GPU metrics are peak throughput (TFLOP/s) and memory bandwidth (TB/s); the estimates typically assume FP8 quantization, i.e. one byte per parameter.
  • Multiplying A (N×d) by B (d×M) takes 2NMd floating-point operations and, done naively, 2NMd memory accesses; tiling reduces the memory accesses, not the FLOPs.
  • LLMs are auto-regressive: each forward pass predicts the next token through a stack of attention layers, then converts the output to probabilities with a softmax over the vocabulary.
  • The attention mechanism computes the Q, K, and V matrices via linear transformations and applies softmax(QKᵀ/√d)V; multiple conversations are batched together.
  • The KV-cache reduces compute by storing K and V for already-processed tokens, so each forward pass only has to process the new tokens.
  • Using the NVIDIA B200 as an example: with 8 TB/s of memory bandwidth and 4500 TFLOP/s of peak compute, it takes ~331 concurrent users to balance compute against bandwidth.
  • Realistic serving is limited by VRAM: a 32B model (32 GB of weights at FP8) with a 200k-token context window needs a ~210 GB KV-cache per user, reduced to ~26 GB with Grouped-Query Attention (GQA).
  • With 160 GB of VRAM left after the weights, a B200 can serve ~6 users concurrently at full context and full duty cycle; because real context lengths vary and users are often idle, ~300-800 users per GPU is realistic.
  • Tokens per second: with 6 users, moving the weights and KV-caches takes ~23.75 ms per forward pass, or ~40 tokens/user/second, comfortably above reading speed.
  • Cost per user: owning a B200 at $40k comes to ~$133 per user at 300 users; renting at ~$4/hour comes to ~$0.013/user/hour, or ~$9.36/month.
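
The chain of estimates above can be replayed in a few lines of Python. This is a rough sketch, not the article's own calculation. The hardware and model figures are taken from the bullets. The layer count (64), model width (8192), GQA reduction factor (8), total VRAM (192 GB) and 720-hour month are assumptions, picked because they reproduce the ~210 GB, ~26 GB and 160 GB figures. The sketch only computes the raw FLOPs-per-byte ratio of the GPU; it does not try to reproduce the ~331-user balance point, whose derivation is not given here.

```python
# Napkin math for serving a 32B model on one NVIDIA B200.
GB = 10**9

# Hardware figures from the summary.
BANDWIDTH_BYTES_PER_S = 8 * 10**12   # 8 TB/s
PEAK_FLOPS = 4500 * 10**12           # 4500 TFLOP/s at FP8
VRAM_BYTES = 192 * GB                # assumption: 32 GB weights + 160 GB free

# Model figures from the summary.
WEIGHT_BYTES = 32 * GB               # 32B parameters at 1 byte each (FP8)
CONTEXT_TOKENS = 200_000
BYTES_PER_VALUE = 1                  # FP8

# Assumed architecture, chosen to match the ~210 GB and ~26 GB figures.
N_LAYERS = 64
D_MODEL = 8192
GQA_FACTOR = 8                       # assumed reduction in K/V width under GQA


def kv_cache_bytes(tokens, n_layers, d_model, bytes_per_value, gqa_factor=1):
    """Bytes of K plus V stored for one conversation of `tokens` tokens."""
    kv_width = d_model // gqa_factor
    return 2 * n_layers * kv_width * tokens * bytes_per_value


# FLOPs the GPU can perform per byte it can load from memory.
machine_balance = PEAK_FLOPS / BANDWIDTH_BYTES_PER_S

kv_full = kv_cache_bytes(CONTEXT_TOKENS, N_LAYERS, D_MODEL, BYTES_PER_VALUE)
kv_gqa = kv_cache_bytes(CONTEXT_TOKENS, N_LAYERS, D_MODEL, BYTES_PER_VALUE,
                        GQA_FACTOR)

# Users that fit at full context once the weights are resident.
free_vram = VRAM_BYTES - WEIGHT_BYTES
users = free_vram // kv_gqa

# Each forward pass streams the weights once plus every user's KV-cache.
bytes_per_pass = WEIGHT_BYTES + users * kv_gqa
seconds_per_pass = bytes_per_pass / BANDWIDTH_BYTES_PER_S
tokens_per_user_per_s = 1 / seconds_per_pass

# Cost per user, using the summary's 300-user and $0.013/user/hour figures.
GPU_PRICE_USD = 40_000
REALISTIC_USERS = 300
RENT_PER_USER_HOUR_USD = 0.013
HOURS_PER_MONTH = 720                # assumption: 30-day month
owned_cost_per_user = GPU_PRICE_USD / REALISTIC_USERS
rent_per_user_month = RENT_PER_USER_HOUR_USD * HOURS_PER_MONTH

print(f"machine balance: {machine_balance:.1f} FLOPs/byte")
print(f"KV-cache: {kv_full / GB:.1f} GB full, {kv_gqa / GB:.1f} GB with GQA")
print(f"users at full context: {users}")
print(f"forward pass: {seconds_per_pass * 1e3:.2f} ms, "
      f"{tokens_per_user_per_s:.1f} tokens/user/s")
print(f"owned: ${owned_cost_per_user:.0f}/user, "
      f"rented: ${rent_per_user_month:.2f}/user/month")
```

Under these assumptions the forward pass comes out slightly under the ~23.75 ms quoted above, because the sketch uses the unrounded KV-cache size rather than a rounded total of about 190 GB per pass.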