Quantization from the Ground Up
- #LLM
- #efficiency
- #quantization
- Qwen-3-Coder-Next is an 80-billion-parameter model requiring 159.4GB of RAM at 16-bit precision.
- Frontier models may have over 1 trillion parameters, needing at least 2TB of RAM.
- Quantization can reduce LLM size by 4x and roughly double inference speed, at the cost of a modest accuracy loss (typically 5-10%).
- Parameters (weights) are the learned numbers at the core of an LLM; inference applies billions of operations over them, organized as a computation graph.
- LLMs stack layers of nodes, each with its own parameters; modern models scale to billions or even trillions of parameters.
- Floating-point numbers trade precision for range, splitting each value into sign, exponent, and significand bits.
- Most LLM parameters cluster near zero, making them ideal for efficient floating-point representation.
- Quantization maps values from a wide continuous range onto a small set of integer levels, using techniques like symmetric and asymmetric scaling.
- Block quantization assigns a separate scale to each block of 32-256 parameters, limiting the damage a single outlier can do to model quality.
- Quality metrics for quantized models include perplexity, KL divergence, benchmark scores, and conversational tests.
- Quantized models (8-bit and 4-bit) show minimal quality loss, while 2-bit quantization often fails.
- Smaller quantizations (e.g., 4-bit) can run faster because inference is often memory-bandwidth-bound, so moving fewer bytes per parameter speeds things up.
- Post-training quantization (PTQ) differs from quantization-aware training (QAT), with QAT often yielding better results.
- Alternate quantization methods like AWQ and GPTQ offer different trade-offs.
- Model efficiency can also be improved via parameter pruning and distillation.
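The memory figures in the first bullets follow directly from parameter count times bytes per parameter. A minimal sketch (the helper name `weight_memory_gb` and the byte sizes per format are my own; the 80B parameter count is from the note above):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Storage needed for model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N = 80e9  # Qwen-3-Coder-Next: 80 billion parameters

print(weight_memory_gb(N, 2.0))  # fp16/bf16: 160.0 GB, in line with the ~159.4GB above
print(weight_memory_gb(N, 1.0))  # int8:       80.0 GB
print(weight_memory_gb(N, 0.5))  # int4:       40.0 GB, the 4x reduction vs 16-bit
```

The same arithmetic gives the 2TB estimate for trillion-parameter models at 16-bit precision.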
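The sign/exponent/significand split mentioned above can be inspected directly. A sketch using fp32 for simplicity (the function name `float32_fields` is my own):

```python
import struct

def float32_fields(x: float):
    # Reinterpret an IEEE 754 single-precision float as its raw 32 bits,
    # then slice out the three fields: 1 sign, 8 exponent, 23 significand bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    significand = bits & 0x7FFFFF    # fraction bits, implicit leading 1
    return sign, exponent, significand

print(float32_fields(1.0))  # (0, 127, 0): value = (-1)^0 * 2^(127-127) * 1.0
```

The exponent gives range and the significand gives precision; formats like fp16 and bf16 rebalance these bit budgets, which is why values clustered near zero quantize so gracefully.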
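Symmetric vs. asymmetric scaling, sketched for int8 (function names are my own; this is the standard scheme, not necessarily the exact one any given library uses):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    # Map [-max|x|, max|x|] onto the signed integer range; zero maps to zero.
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                  # dequantize: q * scale

def quantize_asymmetric(x, bits=8):
    # Map [min(x), max(x)] onto the unsigned range via a zero-point offset;
    # better when the value distribution is skewed.
    qmax = 2 ** bits - 1                             # 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point                      # dequantize: (q - zero_point) * scale

x = np.array([-0.8, -0.1, 0.0, 0.3, 1.2], dtype=np.float32)
q, s = quantize_symmetric(x)
print(q * s)  # dequantized values, each within half a step of the original
```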
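Block quantization, sketched with one scale per block (block size, names, and the demo data are illustrative):

```python
import numpy as np

def quantize_blocks(x, block_size=32, bits=8):
    # One scale per block of parameters, so an outlier only degrades its own block.
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)            # assumes len(x) % block_size == 0
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                     # guard all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 128).astype(np.float32)   # weights clustered near zero
w[5] = 4.0                                        # one outlier
w_hat = dequantize_blocks(*quantize_blocks(w))
```

With a single per-tensor scale, the outlier would stretch the quantization step so far that all the near-zero weights round away; per-block scales confine that damage to one block of 32.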
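Of the quality metrics listed above, KL divergence compares the quantized model's next-token distribution against the full-precision model's. A sketch with made-up probability vectors (the specific numbers are hypothetical, chosen to mimic the pattern in the bullets: 8-bit barely diverges, 2-bit diverges badly):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # KL(p || q) in nats: how far distribution q drifts from reference p.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p_full = [0.70, 0.20, 0.08, 0.02]  # full-precision next-token probabilities
p_8bit = [0.69, 0.21, 0.08, 0.02]  # hypothetical 8-bit model: tiny drift
p_2bit = [0.30, 0.40, 0.20, 0.10]  # hypothetical 2-bit model: large drift

print(kl_divergence(p_full, p_8bit) < kl_divergence(p_full, p_2bit))  # True
```

Averaged over many tokens of held-out text, this gives a single drift score; perplexity works similarly but measures each model against the text rather than against each other.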