GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU


Use VRAM (GB) ≈ Parameters (billions) * (effective bits per weight / 8) to estimate the GPU memory needed for an LLM's weights
Common memory footprints per 1B parameters: FP16/BF16 ~2 GB, FP8/INT8 ~1 GB, 4-bit ~0.5 GB
KV cache, activations, batching, and framework overhead significantly increase VRAM needs beyond weights
Rule of thumb: Add 10-30% extra VRAM for safety, more for long contexts or high concurrency
Mixture-of-Experts (MoE) memory depends on total parameters, not active ones, because every expert must stay loaded even though only a few run per token
GGUF memory efficiency is specific to the llama.cpp runtime and does not carry over to other frameworks
Design systems by asking 'How do I want to run this?' instead of 'Can I run this?'
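
The arithmetic in the points above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names, the 70B example model, and the default 10-30% margins are assumptions chosen to mirror the rule of thumb, and the result covers weights plus a flat safety margin only. KV cache, activations, and batching are not modeled and would need to be added for long contexts or high concurrency.

```python
def weights_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only estimate: parameters (billions) * (bits per weight / 8).

    For a Mixture-of-Experts model, pass the TOTAL parameter count,
    not the number of parameters active per token.
    """
    return params_billions * bits_per_weight / 8


def vram_range_gb(
    params_billions: float,
    bits_per_weight: float,
    margin_low: float = 0.10,   # assumed lower safety margin (10%)
    margin_high: float = 0.30,  # assumed upper safety margin (30%)
) -> tuple[float, float]:
    """Apply the 10-30% rule-of-thumb margin on top of the weights estimate."""
    base = weights_vram_gb(params_billions, bits_per_weight)
    return base * (1 + margin_low), base * (1 + margin_high)


if __name__ == "__main__":
    # Hypothetical 70B-parameter dense model at three common precisions.
    for label, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("4-bit", 4)]:
        base = weights_vram_gb(70, bits)
        low, high = vram_range_gb(70, bits)
        print(f"{label}: weights {base:.1f} GB, with margin {low:.1f}-{high:.1f} GB")
```

Passing the effective bits per weight, rather than the nominal bit width, matters for quantized formats, since the stored size per weight can exceed the nominal figure.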
