Integer Quantization: Deep Dive
4 hours ago
- #quantization
- #efficiency
- #transformer
- Quantization reduces memory usage by representing values in fewer bits, e.g., 8-bit reduces memory by 2x and 4-bit by 4x compared to 16-bit.
- Integer arithmetic also has hardware advantages: it consumes less energy (e.g., an int8 add uses about 30x less energy than an fp32 add), runs faster, and needs less silicon area.
- Benefits vary by workload: quantization raises arithmetic throughput in compute-bound tasks (e.g., LLM prefill) and reduces memory traffic in memory-bound tasks (e.g., LLM decoding).
- Quantization maps floating-point values onto an integer grid by scaling with a scale (s), shifting by a zero-point (z), rounding to the nearest integer, and clamping out-of-range values to the grid limits.
- Fake quantization simulates quantization effects on general-purpose hardware by inserting quantize-dequantize pairs into a floating-point model, which enables techniques such as Quantization-Aware Training (QAT).
- Quantization error consists of rounding error (from mapping to grid points) and clipping error (from out-of-range values), requiring a balance between them.
- The scale and zero-point can be set with min-max or abs-max quantization, but outliers stretch the range and inflate rounding error, so they may call for loss-aware techniques or range clipping.
- Quantization categories include affine (asymmetric) vs. symmetric mapping, per-tensor, per-channel, or per-block granularity, and static vs. dynamic activation quantization.
- In practice, symmetric weights with asymmetric activations are common, since asymmetric weights would add a data-dependent term to every accumulation; per-channel quantization is used for weights but avoided for activations because it is inefficient in hardware.
- Integer models run on Multiply-Accumulate (MAC) units using low-precision inputs, accumulate in higher precision (e.g., int32), and requantize outputs for subsequent layers.
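
The quantize, dequantize, and parameter-selection points above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`minmax_params`, `absmax_params`, `quantize`, `dequantize`, `fake_quantize`) and the choice of an unsigned grid for the affine case and a signed grid for the symmetric case are assumptions made for the example.

```python
from typing import List, Tuple


def minmax_params(x: List[float], bits: int = 8) -> Tuple[float, int]:
    """Affine (asymmetric) scale s and zero-point z from the observed min and max."""
    qmin, qmax = 0, 2**bits - 1
    # Widen the range to include 0.0 so that zero maps exactly onto a grid point.
    lo, hi = min(min(x), 0.0), max(max(x), 0.0)
    s = (hi - lo) / (qmax - qmin)
    if s == 0.0:  # all-zero input: any positive scale works
        s = 1.0
    z = round(-lo / s)
    return s, z


def absmax_params(x: List[float], bits: int = 8) -> Tuple[float, int]:
    """Symmetric scale from the largest magnitude; the zero-point is fixed at 0."""
    qmax = 2 ** (bits - 1) - 1
    s = max(abs(v) for v in x) / qmax
    if s == 0.0:
        s = 1.0
    return s, 0


def quantize(x: List[float], s: float, z: int, qmin: int, qmax: int) -> List[int]:
    """Scale, round, shift by z, then clamp to [qmin, qmax]."""
    # Note: Python's round() rounds exact ties to the nearest even integer.
    return [min(max(round(v / s) + z, qmin), qmax) for v in x]


def dequantize(q: List[int], s: float, z: int) -> List[float]:
    """Map integers back to the real axis: x_hat = s * (q - z)."""
    return [s * (qi - z) for qi in q]


def fake_quantize(x: List[float], s: float, z: int, qmin: int, qmax: int) -> List[float]:
    """Quantize-dequantize pair: output is float but restricted to the grid."""
    return dequantize(quantize(x, s, z, qmin, qmax), s, z)


if __name__ == "__main__":
    x = [-1.0, -0.5, 0.0, 0.3, 2.0]

    # Min-max covers the full range, so only rounding error remains.
    s, z = minmax_params(x)
    x_hat = fake_quantize(x, s, z, 0, 255)
    print("rounding errors:", [abs(a - b) for a, b in zip(x, x_hat)])

    # A range calibrated on [-1, 1] clips the value 2.0 to roughly 1.0.
    s_sym, z_sym = absmax_params([-1.0, 0.5, 1.0])
    print("clipped:", fake_quantize([2.0], s_sym, z_sym, -127, 127))
```

With min-max parameters every in-range value lands within half a step (s/2) of its original, while the second call shows the other side of the trade-off: a tighter range gives a finer grid but turns the out-of-range value into a large clipping error.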
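
The integer execution path described in the last bullet (low-precision inputs, wide accumulator, requantized output) can also be sketched for a single dot product. The example below assumes symmetric int8 weights (zero-point 0) and asymmetric uint8 activations; the helper name `int_dot_requant` and all numeric values are hypothetical. Plain Python integers stand in for the int32 accumulator, and the rescaling factor is applied in floating point for readability, whereas real integer kernels typically use a fixed-point multiplier.

```python
from typing import List


def int_dot_requant(
    w_q: List[int],   # symmetric int8 weights (zero-point 0)
    x_q: List[int],   # asymmetric uint8 activations
    z_x: int,         # activation zero-point
    s_w: float,       # weight scale
    s_x: float,       # activation scale
    s_y: float,       # output scale
    z_y: int,         # output zero-point
    qmin: int = 0,
    qmax: int = 255,
) -> int:
    """Integer dot product with wide accumulation, then requantization."""
    # Multiply-accumulate on the raw integers (an int32 accumulator on hardware).
    acc = sum(w * x for w, x in zip(w_q, x_q))
    # Remove the activation zero-point. This term depends only on the weights,
    # so it can be precomputed offline and folded into the bias.
    acc -= z_x * sum(w_q)
    # The real-valued result is s_w * s_x * acc; rescale it to the output grid.
    m = s_w * s_x / s_y
    return min(max(round(m * acc) + z_y, qmin), qmax)


if __name__ == "__main__":
    w_q = [10, -20, 30]
    x_q = [100, 128, 200]
    y_q = int_dot_requant(w_q, x_q, z_x=128, s_w=0.01, s_x=0.02, s_y=0.05, z_y=10)
    # Dequantize the output to compare against the floating-point dot product.
    print("y_q =", y_q, "y_hat =", 0.05 * (y_q - 10))
```

Because the weights are symmetric, the only zero-point correction is `z_x * sum(w_q)`, which is known before any input arrives. An asymmetric weight zero-point would add a matching term proportional to the sum of the activations, which has to be recomputed for every input.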