Inference cost at scale with napkin math

4 days ago

#Cost Estimation
#AI Inference
#GPU Scaling

Estimating inference costs for serving AI models at scale involves understanding hardware specs and model architecture.
Key GPU metrics are peak compute throughput (TFLOP/s) and memory bandwidth (TB/s), with FP8 quantization typically assumed.
For matrices A (N×d) and B (d×M), a naive matrix multiplication costs 2NMd floating-point operations and 2NMd memory accesses; tiling reduces the memory accesses, while the operation count stays the same.
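The matrix multiplication counts above can be sketched as a small calculator. This is a minimal illustration, not the article's own code: the function name `matmul_cost`, the square `tile` parameter, and the simplification that only reads of A and B are counted (output writes are ignored) are all assumptions made here.

```python
def matmul_cost(n, m, d, tile=1):
    """Rough cost of A (n x d) times B (d x m).

    Assumes `tile` divides both n and m, and counts only reads of A and B.
    tile=1 corresponds to the naive, untiled case.
    """
    flops = 2 * n * m * d  # one multiply and one add per (i, j, k) triple
    # Each tile x tile block of the output reads a (tile x d) strip of A and
    # a (d x tile) strip of B, so total reads fall by a factor of `tile`.
    output_tiles = (n // tile) * (m // tile)
    reads = output_tiles * 2 * tile * d
    return {"flops": flops, "reads": reads, "flops_per_read": flops / reads}


naive = matmul_cost(1024, 1024, 1024)
tiled = matmul_cost(1024, 1024, 1024, tile=32)
print(naive["flops_per_read"], tiled["flops_per_read"])
```

Under these assumptions the operations-per-read ratio equals the tile size, which is why tiling moves a matrix multiplication away from being bandwidth-bound.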
LLMs are auto-regressive: they predict the next token via attention layers and convert the final outputs to probabilities with a softmax over the vocabulary.
The attention mechanism computes the Q, K, and V matrices via linear transformations and applies softmax(QKᵀ/√d)V, with batching across multiple conversations.
KV-cache reduces compute by storing intermediate K and V for processed tokens, allowing only new tokens to be processed per forward pass.
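To make the attention formula and the KV-cache idea concrete, here is a minimal single-head sketch in plain Python. The class name `CachedAttention`, the helper names, and the toy identity weights are hypothetical; real implementations are batched, multi-head, and run on GPU tensors.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


class CachedAttention:
    """Single-head attention that keeps K and V for already-seen tokens."""

    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.keys = []    # one cached key vector per processed token
        self.values = []  # one cached value vector per processed token

    def step(self, x):
        """Process one new token embedding x and return its attention output."""
        q = matvec(self.wq, x)
        # Only the new token's K and V are computed; older ones come from the cache.
        self.keys.append(matvec(self.wk, x))
        self.values.append(matvec(self.wv, x))
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in self.keys]
        weights = softmax(scores)
        dim = len(self.values[0])
        return [sum(w * v[j] for w, v in zip(weights, self.values)) for j in range(dim)]


identity = [[1.0, 0.0], [0.0, 1.0]]
attn = CachedAttention(identity, identity, identity)
first = attn.step([1.0, 0.0])   # attends only to itself
second = attn.step([0.0, 1.0])  # attends to both cached tokens
print(first, second)
```

Each call to `step` does work proportional to the number of cached tokens, rather than recomputing K and V for the whole sequence.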
Using the NVIDIA B200 as an example: with 8 TB/s of memory bandwidth and 4500 TFLOP/s of peak compute, ~331 concurrent users are needed to balance compute and bandwidth.
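The hardware side of that balance can be sketched as an operations-per-byte ratio. This is only a partial illustration: the ~331-user figure also depends on per-token work and data-movement assumptions from the full article that this summary does not spell out, so the sketch stops at the hardware ratio. The function names are hypothetical.

```python
def ops_per_byte(peak_flops_per_s, bandwidth_bytes_per_s):
    """FLOPs the GPU can execute in the time it takes to move one byte."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def is_compute_bound(workload_flops_per_byte, hardware_ops_per_byte):
    """A workload is compute-bound once its intensity exceeds the hardware ratio."""
    return workload_flops_per_byte > hardware_ops_per_byte


# B200 figures as quoted in the summary.
b200_ratio = ops_per_byte(4500e12, 8e12)
print(b200_ratio)
```

A workload that performs fewer operations per byte moved than this ratio leaves compute idle, which is why batching more users per forward pass raises utilization.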
Realistic serving is limited by VRAM: a 32B model (32GB of weights at FP8) with a 200k-token context window requires a ~210GB KV-cache per user, reduced to ~26GB with Grouped-Query Attention (GQA).
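The KV-cache sizes can be approximated with a short calculation. The model shape used here (64 layers, a hidden width of 8192, one byte per value for FP8, and a GQA factor of 8) is an assumption chosen because it lands on the quoted figures; the summary does not state the actual configuration.

```python
def kv_cache_bytes(layers, kv_width, context_tokens, bytes_per_value=1):
    """Bytes needed to cache K and V for one sequence.

    kv_width is the total width of the K (or V) projection per layer.
    The leading 2 accounts for storing both K and V.
    """
    return 2 * layers * kv_width * context_tokens * bytes_per_value


# Assumed, not from the source: a plausible 32B-class model shape.
LAYERS = 64
HIDDEN_WIDTH = 8192
CONTEXT_TOKENS = 200_000
GQA_FACTOR = 8  # query heads sharing each K/V head

full_cache = kv_cache_bytes(LAYERS, HIDDEN_WIDTH, CONTEXT_TOKENS)
gqa_cache = kv_cache_bytes(LAYERS, HIDDEN_WIDTH // GQA_FACTOR, CONTEXT_TOKENS)
print(full_cache / 1e9, gqa_cache / 1e9)
```

With these assumed values the full cache comes to roughly 210GB and the GQA cache to roughly 26GB per sequence.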
With 160GB free VRAM on B200, ~6 users can be served concurrently at full duty cycle, but variable context lengths and idle time allow ~300-800 users per GPU.
Tokens per second: For 6 users, data movement takes ~23.75ms per forward pass, yielding ~40 tokens/user/second, sufficient for reading speeds.
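The user count and token rate can be reproduced approximately with the sketch below. It assumes that decoding is bandwidth-bound and that every forward pass reads the model weights once plus each user's full KV-cache once; the constant names and the 26.2GB per-user cache value are assumptions made for illustration.

```python
import math

WEIGHT_BYTES = 32e9        # 32B parameters at one byte each
FREE_VRAM_BYTES = 160e9    # VRAM left after loading the weights
KV_BYTES_PER_USER = 26.2e9 # assumed GQA cache at full context
BANDWIDTH_BYTES_PER_S = 8e12


def max_full_context_users(free_vram, kv_per_user):
    """How many full-context KV-caches fit in the remaining VRAM."""
    return math.floor(free_vram / kv_per_user)


def forward_pass_seconds(users, weights, kv_per_user, bandwidth):
    """Time to stream weights and all users' caches once through memory."""
    return (weights + users * kv_per_user) / bandwidth


users = max_full_context_users(FREE_VRAM_BYTES, KV_BYTES_PER_USER)
step_s = forward_pass_seconds(users, WEIGHT_BYTES, KV_BYTES_PER_USER, BANDWIDTH_BYTES_PER_S)
tokens_per_user_per_s = 1 / step_s  # one token per user per forward pass
print(users, step_s * 1000, tokens_per_user_per_s)
```

Under these assumptions the result is 6 users, a little under 24 ms per forward pass, and a little over 40 tokens per user per second, close to the figures quoted above.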
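Cost per user: Owning a B200 at $40k works out to ~$133 per user across 300 users. Renting comes to ~$0.013/user/hour, or ~$9.36/month over a 720-hour month; at 300 users that corresponds to a GPU rate of roughly $4/hour, which does not match the $43/hour quoted for rental.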
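The per-user cost arithmetic is simple enough to check directly. The sketch below assumes a 30-day month of 720 hours and 300 users per GPU; the function names are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # assumed 30-day month


def owned_cost_per_user(gpu_price, users):
    """Up-front hardware cost split evenly across users."""
    return gpu_price / users


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Monthly cost at a constant hourly rate."""
    return hourly_rate * hours


USERS = 300
PER_USER_HOURLY = 0.013  # per-user rental rate quoted in the summary

owned = owned_cost_per_user(40_000, USERS)
rented_monthly = monthly_cost(PER_USER_HOURLY)
implied_gpu_hourly = PER_USER_HOURLY * USERS  # whole-GPU rate implied by the per-user rate
print(round(owned, 2), round(rented_monthly, 2), round(implied_gpu_hourly, 2))
```

The implied whole-GPU rate is a useful cross-check on any quoted rental price, since the per-user figure only holds if the hourly price and the user count are consistent with each other.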
