Quantization from the Ground Up
- #LLM
- #efficiency
- #quantization
- Qwen-3-Coder-Next is an 80-billion-parameter model requiring 159.4GB of RAM at 16-bit precision.
- Frontier models may have over 1 trillion parameters, needing at least 2TB of RAM.
- Quantization can reduce LLM size by 4x and roughly double inference speed, at the cost of a modest accuracy loss (typically 5-10%).
- Parameters (weights) are the learned numbers at the core of an LLM; inference applies billions of operations over them, organized as a computation graph.
- LLMs stack layers of nodes, each with its own parameters; modern models scale to billions or even trillions of parameters.
- Floating-point numbers trade precision for range, splitting each value into sign, exponent, and significand bits.
- Most LLM parameters cluster near zero, making them ideal for efficient floating-point representation.
- Quantization maps values from a wide continuous range onto a small set of integer levels, using techniques like symmetric and asymmetric scaling.
- Block quantization assigns a separate scale to each block of 32-256 parameters, limiting the damage a single outlier can do to model quality.
- Quality metrics for quantized models include perplexity, KL divergence, benchmark scores, and conversational tests.
- Quantized models (8-bit and 4-bit) show minimal quality loss, while 2-bit quantization often fails.
- Smaller quantizations (e.g., 4-bit) can run faster because inference is often memory-bandwidth-bound, so moving fewer bytes per parameter speeds things up.
- Post-training quantization (PTQ) differs from quantization-aware training (QAT), with QAT often yielding better results.
- Alternate quantization methods like AWQ and GPTQ offer different trade-offs.
- Model efficiency can also be improved via parameter pruning and distillation.
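The memory figures in the first bullets follow directly from parameter count times bytes per parameter. A minimal sketch (the helper name `weight_memory_gb` and the byte sizes per format are my own; the 80B parameter count is from the note above):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Storage needed for model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N = 80e9  # Qwen-3-Coder-Next: 80 billion parameters

print(weight_memory_gb(N, 2.0))  # fp16/bf16: 160.0 GB, in line with the ~159.4GB above
print(weight_memory_gb(N, 1.0))  # int8:       80.0 GB
print(weight_memory_gb(N, 0.5))  # int4:       40.0 GB, the 4x reduction vs 16-bit
```

The same arithmetic gives the 2TB estimate for trillion-parameter models at 16-bit precision.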
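The sign/exponent/significand split mentioned above can be inspected directly. A sketch using fp32 for simplicity (the function name `float32_fields` is my own):

```python
import struct

def float32_fields(x: float):
    # Reinterpret an IEEE 754 single-precision float as its raw 32 bits,
    # then slice out the three fields: 1 sign, 8 exponent, 23 significand bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    significand = bits & 0x7FFFFF    # fraction bits, implicit leading 1
    return sign, exponent, significand

print(float32_fields(1.0))  # (0, 127, 0): value = (-1)^0 * 2^(127-127) * 1.0
```

The exponent gives range and the significand gives precision; formats like fp16 and bf16 rebalance these bit budgets, which is why values clustered near zero quantize so gracefully.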
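Symmetric vs. asymmetric scaling, sketched for int8 (function names are my own; this is the standard scheme, not necessarily the exact one any given library uses):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    # Map [-max|x|, max|x|] onto the signed integer range; zero maps to zero.
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                  # dequantize: q * scale

def quantize_asymmetric(x, bits=8):
    # Map [min(x), max(x)] onto the unsigned range via a zero-point offset;
    # better when the value distribution is skewed.
    qmax = 2 ** bits - 1                             # 255 for uint8
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point                      # dequantize: (q - zero_point) * scale

x = np.array([-0.8, -0.1, 0.0, 0.3, 1.2], dtype=np.float32)
q, s = quantize_symmetric(x)
print(q * s)  # dequantized values, each within half a step of the original
```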
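Block quantization, sketched with one scale per block (block size, names, and the demo data are illustrative):

```python
import numpy as np

def quantize_blocks(x, block_size=32, bits=8):
    # One scale per block of parameters, so an outlier only degrades its own block.
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)            # assumes len(x) % block_size == 0
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                     # guard all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 128).astype(np.float32)   # weights clustered near zero
w[5] = 4.0                                        # one outlier
w_hat = dequantize_blocks(*quantize_blocks(w))
```

With a single per-tensor scale, the outlier would stretch the quantization step so far that all the near-zero weights round away; per-block scales confine that damage to one block of 32.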
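Of the quality metrics listed above, KL divergence compares the quantized model's next-token distribution against the full-precision model's. A sketch with made-up probability vectors (the specific numbers are hypothetical, chosen to mimic the pattern in the bullets: 8-bit barely diverges, 2-bit diverges badly):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # KL(p || q) in nats: how far distribution q drifts from reference p.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p_full = [0.70, 0.20, 0.08, 0.02]  # full-precision next-token probabilities
p_8bit = [0.69, 0.21, 0.08, 0.02]  # hypothetical 8-bit model: tiny drift
p_2bit = [0.30, 0.40, 0.20, 0.10]  # hypothetical 2-bit model: large drift

print(kl_divergence(p_full, p_8bit) < kl_divergence(p_full, p_2bit))  # True
```

Averaged over many tokens of held-out text, this gives a single drift score; perplexity works similarly but measures each model against the text rather than against each other.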