GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU

3 hours ago
  • #Model Quantization
  • #GPU Memory
  • #LLM Inference
  • Estimate the VRAM needed for LLM weights with: VRAM (GB) ≈ Parameters (billions) × (effective bits per weight / 8)
  • Common weight footprints: FP16/BF16 ~2 GB per 1B parameters, FP8/INT8 ~1 GB per 1B, 4-bit ~0.5 GB per 1B
  • KV cache, activations, batching, and framework overhead significantly increase VRAM needs beyond weights
  • Rule of thumb: add 10-30% VRAM headroom on top of the weight estimate, and more for long contexts or high concurrency
  • Mixture-of-Experts (MoE) memory depends on total parameters, not active ones: every expert has to be loaded, even though only a few run per token
  • GGUF memory efficiency is specific to the llama.cpp runtime and does not carry over to other frameworks
  • Design systems by asking 'How do I want to run this?' instead of 'Can I run this?'
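
The arithmetic in the points above can be sketched in a few lines of Python. This is a rough estimator, not a measurement tool. The function names are made up for illustration, the 20% default headroom is just a midpoint of the 10-30% range, and the KV cache formula is an added assumption (a standard attention layout storing one key and one value tensor per layer) that does not come from the summary itself. Real usage also depends on activations and framework overhead, which this sketch ignores.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only estimate: parameters (billions) * (bits per weight / 8)."""
    return params_billions * bits_per_weight / 8


def with_headroom(weights_gb: float, headroom: float = 0.2) -> float:
    """Add a safety margin on top of the weights (0.2 is an assumed default)."""
    return weights_gb * (1 + headroom)


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, batch: int = 1,
                bytes_per_value: int = 2) -> float:
    """Assumed KV cache size for a standard attention layout.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_value=2 corresponds to an FP16/BF16 cache.
    """
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * batch * bytes_per_value)
    return total_bytes / 1e9  # decimal GB, matching 1B params * 1 byte = 1 GB


# A 7B model at different precisions (weights only).
for label, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("4-bit", 4)]:
    weights = weight_vram_gb(7, bits)
    print(f"7B @ {label}: {weights:.1f} GB weights, "
          f"{with_headroom(weights):.1f} GB with 20% headroom")

# Hypothetical MoE model: 47B total parameters, 13B active per token.
# Memory is sized from the total, because every expert must be loaded.
print(f"MoE @ 4-bit: {weight_vram_gb(47, 4):.1f} GB weights")

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
print(f"KV cache @ 8192 tokens: {kv_cache_gb(32, 8, 128, 8192):.2f} GB")
```

The KV cache term grows linearly with both context length and batch size, which is why long contexts or high concurrency can call for more than the 10-30% margin.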