Hasty Briefs (beta)


Quantization from the Ground Up

6 hours ago
  • #LLM
  • #efficiency
  • #quantization
  • Qwen-3-Coder-Next is an 80 billion parameter model requiring 159.4GB of RAM.
  • Frontier models may have over 1 trillion parameters, needing at least 2TB of RAM.
  • Quantization can shrink an LLM by roughly 4x and double its speed, at the cost of a modest accuracy loss (around 5-10%).
  • Parameters (weights) are the learned numbers at the core of LLMs, combined through billions of operations in a computational graph.
  • LLMs use layers of nodes with parameters, scaling up to billions or trillions in modern models.
  • Floating-point numbers trade precision for range, encoding each value with sign, exponent, and significand bits.
  • Most LLM parameters cluster near zero, making them ideal for efficient floating-point representation.
  • Quantization maps values from a large range to a smaller one, using techniques like symmetric and asymmetric scaling.
  • Block quantization (32-256 parameters at a time) mitigates outlier impact on model quality.
  • Quality metrics for quantized models include perplexity, KL divergence, benchmark scores, and conversational tests.
  • Quantized models (8-bit and 4-bit) show minimal quality loss, while 2-bit quantization often fails.
  • Smaller quantizations (e.g., 4-bit) can run faster due to reduced data movement in GPUs.
  • Post-training quantization (PTQ) differs from quantization-aware training (QAT), with QAT often yielding better results.
  • Alternate post-training methods like AWQ and GPTQ use calibration data and offer different accuracy/speed trade-offs.
  • Model efficiency can also be improved via parameter pruning and distillation.
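The sign/exponent/significand split mentioned above can be inspected directly. A minimal sketch (the helper name is illustrative, not from the article) that unpacks a float32 into its three bit fields:

```python
import struct

def float32_parts(x):
    """Split a float32 into its sign, exponent, and significand bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    significand = bits & 0x7FFFFF     # 23 bits (implicit leading 1 for normals)
    return sign, exponent, significand

# -1.5 = (-1)^1 * 1.1b * 2^0 -> sign=1, exponent field=127, significand=0x400000
sign, exp, sig = float32_parts(-1.5)
```

Because the exponent sets the scale, values near zero (where most LLM weights cluster) get the densest representation.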
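The symmetric and asymmetric scaling the summary mentions can be sketched in a few lines. This is a generic int8 sketch, not the article's exact implementation; symmetric quantization uses one scale and keeps zero exact, while asymmetric adds a zero point so an uneven range like [min, max] fills the whole integer grid:

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    """One scale, zero maps exactly to zero; range is [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(w, bits=8):
    """Scale plus zero point; [w.min(), w.max()] fills the unsigned range."""
    qmax = 2 ** bits - 1                          # 255 for uint8
    scale = (w.max() - w.min()) / qmax
    zero_point = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point=0):
    return (q.astype(np.float32) - zero_point) * scale
```

Either way, the round-trip error per weight is at most half a quantization step (scale / 2), which is why a single large outlier, by inflating the scale, hurts every other weight sharing it.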
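Block quantization, as described above, gives each group of 32-256 weights its own scale so an outlier only degrades its own block. A sketch under the assumption of signed 4-bit blocks (function names are illustrative):

```python
import numpy as np

def quantize_blocks(w, block_size=32, bits=4):
    """Quantize each block with its own scale; an outlier only affects its block."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                     # avoid dividing by zero in all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)
```

The per-block scales add a small storage overhead (one float per 32 weights here), but the error in blocks without outliers stays small even when one block contains a huge weight.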
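Of the quality metrics listed, KL divergence directly measures how far the quantized model's next-token distribution drifts from the full-precision one. A minimal sketch assuming both models produce raw logits over the same vocabulary:

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between full-precision (P) and quantized (Q) next-token distributions."""
    def softmax(z):
        e = np.exp(z - z.max())                   # subtract max for numerical stability
        return e / e.sum()
    p, q = softmax(np.asarray(p_logits, float)), softmax(np.asarray(q_logits, float))
    return float(np.sum(p * np.log(p / q)))
```

A value of 0 means the distributions match exactly; averaging this over many tokens gives a sensitive single-number check that, unlike benchmark scores, does not depend on the quantized model happening to pick the same argmax token.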