Inference cost at scale with napkin math
4 days ago
- #Cost Estimation
- #AI Inference
- #GPU Scaling
- Estimating inference costs for serving AI models at scale involves understanding hardware specs and model architecture.
- The key GPU metrics are peak throughput (TFLOP/s) and memory bandwidth (TB/s); the estimates assume FP8 quantization, i.e. one byte per value.
- Multiplying matrices A (N×d) and B (d×M) naively costs 2NMd floating-point operations and 2NMd memory accesses; tiling reduces the memory accesses, not the FLOPs.
- LLMs are auto-regressive: each forward pass runs the tokens through attention layers, and a softmax over the vocabulary converts the output into next-token probabilities.
- The attention mechanism computes the Q, K, and V matrices via linear transformations and applies softmax(QKᵀ/√d)V; multiple conversations are batched into one forward pass.
- KV-cache reduces compute by storing intermediate K and V for processed tokens, allowing only new tokens to be processed per forward pass.
- Using the NVIDIA B200 as an example: with 8 TB/s of memory bandwidth and 4500 TFLOP/s of peak compute, it takes ~331 concurrent users to balance compute against bandwidth.
- Realistic serving is limited by VRAM: a 32B model (32GB at FP8) with a 200k-token context window requires ~210GB of KV-cache per user, reduced to ~26GB with Grouped-Query Attention (GQA).
- With 160GB of free VRAM on a B200, ~6 users can be served concurrently at full context and full duty cycle; variable context lengths and idle time raise this to ~300-800 users per GPU.
- Tokens per second: for 6 users, moving the weights and KV-caches through memory takes ~23.75ms per forward pass, yielding ~40 tokens/user/second, sufficient for reading speed.
- Cost per user: owning a B200 at $40k works out to ~$133 per user for 300 users; renting at ~$4/hour comes to ~$0.013/user/hour, or ~$9.36/month.
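
The memory and cost figures above can be roughly reproduced in a few lines of Python. This is a minimal sketch, not the original derivation: the layer count (64), hidden size (8192), GQA sharing factor (8), and 720-hour month are assumptions chosen because they land close to the quoted numbers, and the function names are hypothetical. The bandwidth, model size, free VRAM, purchase price, and per-user hourly rate are taken from the bullets.

```python
# Napkin-math sketch of the per-GPU serving estimate.
# ASSUMED (not given in the text): 64 layers, hidden size 8192, GQA sharing
# each K/V head across 8 query heads, and a 30-day (720-hour) month.

GB = 1e9
HOURS_PER_MONTH = 720  # assumption: 30-day month


def kv_cache_bytes(layers, hidden, context_tokens, bytes_per_value=1, gqa_group=1):
    """Per-user KV-cache size: one K and one V vector per layer per token."""
    return 2 * layers * hidden * context_tokens * bytes_per_value / gqa_group


def max_users(free_vram_bytes, kv_bytes_per_user):
    """Users that fit in free VRAM when each holds a full-context KV-cache."""
    return int(free_vram_bytes // kv_bytes_per_user)


def forward_pass_seconds(weight_bytes, kv_bytes_per_user, users, bandwidth_bytes_per_s):
    """Bandwidth-bound step time: read the weights once plus every user's KV-cache."""
    return (weight_bytes + users * kv_bytes_per_user) / bandwidth_bytes_per_s


kv_full = kv_cache_bytes(64, 8192, 200_000)              # no GQA
kv_gqa = kv_cache_bytes(64, 8192, 200_000, gqa_group=8)  # with assumed 8x GQA
users = max_users(160 * GB, kv_gqa)                      # 160GB free VRAM
step = forward_pass_seconds(32 * GB, kv_gqa, users, 8e12)  # 8 TB/s bandwidth

print(f"KV-cache per user: {kv_full / GB:.0f} GB, with GQA {kv_gqa / GB:.0f} GB")
print(f"Concurrent full-context users: {users}")
print(f"Forward pass: {step * 1e3:.1f} ms, {1 / step:.0f} tokens/user/s")
print(f"Purchase cost per user (300 users): ${40_000 / 300:.0f}")
print(f"Rental per user per month: ${0.013 * HOURS_PER_MONTH:.2f}")
```

Under these assumptions the sketch gives roughly 210GB and 26GB of KV-cache per user, 6 concurrent users, a step time just under 24ms, and a little over 40 tokens per user per second. The small gap to the quoted ~23.75ms comes from rounding in the inputs.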