Integer Quantization: Deep Dive

4 hours ago
  • #quantization
  • #efficiency
  • #transformer
  • Quantization reduces memory usage by representing values in fewer bits, e.g., 8-bit reduces memory by 2x and 4-bit by 4x compared to 16-bit.
  • It offers hardware advantages: integer arithmetic consumes less energy (e.g., an int8 add uses 30x less energy than an fp32 add), and integer units are faster and take less silicon area.
  • Benefits vary by workload: improves throughput in compute-bound tasks (e.g., LLM prefill) and reduces memory bandwidth in memory-bound tasks (e.g., LLM decoding).
  • Quantization involves scaling and shifting floating-point values to an integer grid using scale (s) and zero-point (z), with clamping to handle out-of-range values.
  • Fake quantization simulates quantization effects on general hardware by inserting quantize-dequantize pairs, enabling studies like Quantization-Aware Training (QAT).
  • Quantization error consists of rounding error (from mapping to grid points) and clipping error (from out-of-range values), requiring a balance between them.
  • The scale and zero-point can be set with methods such as min-max or abs-max quantization, but outliers may necessitate loss-aware techniques or range clipping.
  • Quantization schemes differ along three axes: affine (asymmetric) vs. symmetric mapping; per-tensor, per-channel, or per-block granularity; and static vs. dynamic activation quantization.
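  • In practice, symmetric weights paired with asymmetric activations are common: with a weight zero-point of 0, the accumulation term that depends on the input data disappears, while the activation zero-point term depends only on the weights and can be precomputed. Per-channel quantization is used for weights but avoided for activations due to hardware inefficiency.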
  • Integer models run on Multiply-Accumulate (MAC) units using low-precision inputs, accumulate in higher precision (e.g., int32), and requantize outputs for subsequent layers.
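The scale-and-shift mapping, min-max parameter selection, and fake quantization described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code: the function names (`minmax_params`, `quantize`, `dequantize`, `fake_quantize`) and the unsigned 8-bit grid are assumptions chosen for the example.

```python
from typing import List, Tuple


def minmax_params(values: List[float], num_bits: int = 8) -> Tuple[float, int]:
    """Min-max calibration (hypothetical helper): pick scale s and zero-point z
    so that [min, max] of the data covers the unsigned integer grid."""
    qmax = 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # widen the range so 0.0 stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, num_bits: int = 8) -> int:
    """Scale, round, shift, then clamp to the grid [0, 2^b - 1]."""
    q = round(x / scale) + zero_point  # Python's round() breaks ties to even
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map a grid point back to a real value."""
    return scale * (q - zero_point)


def fake_quantize(x: float, scale: float, zero_point: int, num_bits: int = 8) -> float:
    """Quantize-dequantize pair: the result is a float that sits on the grid."""
    return dequantize(quantize(x, scale, zero_point, num_bits), scale, zero_point)


calibration = [-1.0, -0.5, 0.0, 0.37, 2.0]
s, z = minmax_params(calibration)

# In-range value: only rounding error, bounded by half a grid step.
rounding_error = abs(fake_quantize(0.37, s, z) - 0.37)

# Out-of-range value: clamped to the top of the grid, so clipping error dominates.
clipping_error = abs(fake_quantize(5.0, s, z) - 5.0)
```

Shrinking the calibrated range makes the grid finer (less rounding error) but pushes more values outside it (more clipping error), which is the balance the list refers to.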
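The integer execution path in the last bullet (low-precision inputs, wide accumulation, requantization) can be illustrated with a single dot product. The sketch below assumes symmetric weights (zero-point 0) and asymmetric 8-bit activations; the function names and all numeric values are hypothetical, and Python's unbounded integers stand in for an int32 accumulator.

```python
from typing import List


def accumulate(w_q: List[int], x_q: List[int], z_x: int) -> int:
    """Integer MAC loop with activation zero-point correction.

    Computes sum(w * (x - z_x)) as sum(w * x) - z_x * sum(w). The second
    term depends only on the weights, so it could be precomputed offline.
    """
    acc = sum(w * x for w, x in zip(w_q, x_q))  # wide accumulator (int32 on hardware)
    return acc - z_x * sum(w_q)


def requantize(acc: int, s_w: float, s_x: float, s_out: float, z_out: int) -> int:
    """Rescale the accumulator onto the next layer's unsigned 8-bit grid."""
    q = round(acc * (s_w * s_x / s_out)) + z_out
    return max(0, min(255, q))


# Hypothetical quantized weights and activations with their parameters.
w_q, s_w = [10, -20, 30], 0.02          # symmetric weights: zero-point is 0
x_q, s_x, z_x = [100, 150, 200], 0.05, 128
s_out, z_out = 0.1, 0

acc = accumulate(w_q, x_q, z_x)
y_q = requantize(acc, s_w, s_x, s_out, z_out)

# Float reference for comparison: dequantize the inputs and do the dot product.
y_ref = sum((s_w * w) * (s_x * (x - z_x)) for w, x in zip(w_q, x_q))
y_deq = s_out * (y_q - z_out)
```

Here `y_deq` should match `y_ref` to within half an output grid step, since the only error introduced by this step is the rounding in `requantize`.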