Integer Quantization: Deep Dive

4 hours ago

#quantization
#efficiency
#transformer

Quantization reduces memory usage by representing values in fewer bits: relative to a 16-bit baseline, 8-bit storage halves memory and 4-bit storage cuts it to a quarter.
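The memory arithmetic can be checked with a short sketch. The parameter count below is a hypothetical round number, and the calculation ignores the small overhead of storing scales and zero-points.

```python
def weight_bytes(num_params, bits):
    """Bytes needed to store num_params values at the given bit width."""
    return num_params * bits / 8

n = 1_000_000_000  # hypothetical parameter count, chosen only for illustration
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_bytes(n, bits) / 1e9:.1f} GB")
```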
It also offers hardware advantages: integer arithmetic consumes less energy (e.g., an int8 add uses about 30x less energy than an fp32 add), runs faster, and takes less silicon area.
The benefit depends on the workload: quantization raises throughput in compute-bound tasks (e.g., LLM prefill) and cuts memory traffic in memory-bound tasks (e.g., LLM decoding).
Quantization maps floating-point values onto an integer grid by dividing by a scale (s) and adding a zero-point (z), then rounding and clamping so that out-of-range values saturate at the grid limits.
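A minimal NumPy sketch of this mapping, assuming the convention q = clamp(round(x / s) + z) and a signed 8-bit grid. The function names and the example scale and zero-point are illustrative choices, not taken from the article.

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Scale, shift, round, then clamp onto the int8 grid."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map grid points back to approximate float values."""
    return scale * (q.astype(np.float64) - zero_point)

x = np.array([-1.0, 0.0, 0.5, 3.0])
scale, zero_point = 0.02, 10          # hypothetical parameters
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q)      # the last value saturates at qmax because 3.0 is out of range
print(x_hat)
```

With these parameters the largest representable value is 0.02 * (127 - 10) = 2.34, so the input 3.0 is clamped rather than represented exactly.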
Fake quantization simulates quantization effects on general-purpose hardware by inserting quantize-dequantize pairs into a floating-point model, which makes techniques such as Quantization-Aware Training (QAT) possible.
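A quantize-dequantize pair can be sketched as a single function whose output is still floating point but takes only values on the quantization grid. The name `fake_quantize` and the example parameters are assumptions for illustration.

```python
import numpy as np

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then immediately dequantize; the result stays in float."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)

w = np.array([0.013, -0.42, 0.27, 1.9])
w_fq = fake_quantize(w, scale=0.01, zero_point=0)
print(w_fq)  # 0.013 snaps to the grid; 1.9 saturates at 127 * 0.01
```

In QAT the rounding step has zero gradient almost everywhere, so training frameworks typically pass gradients through it unchanged (a straight-through estimator); that part is not shown here.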
Quantization error consists of rounding error (from mapping values to grid points) and clipping error (from out-of-range values). A wider range reduces clipping but coarsens the grid, so the two must be balanced.
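The trade-off can be made concrete by splitting the squared error into the two parts for a tight and a wide range. This is a sketch assuming symmetric quantization and synthetic Gaussian data; `error_split` is a hypothetical helper.

```python
import numpy as np

def error_split(x, scale, qmin=-128, qmax=127):
    """Return (clipping error, rounding error) as sums of squared differences."""
    x_clipped = np.clip(x, qmin * scale, qmax * scale)
    x_hat = scale * np.clip(np.round(x / scale), qmin, qmax)
    clip_err = np.sum((x - x_clipped) ** 2)
    round_err = np.sum((x_clipped - x_hat) ** 2)
    return clip_err, round_err

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

clip_tight, round_tight = error_split(x, scale=1.0 / 127)  # range about [-1, 1]
clip_wide, round_wide = error_split(x, scale=8.0 / 127)    # range about [-8, 8]
print(clip_tight, round_tight)
print(clip_wide, round_wide)
```

The tight range has a fine grid but clips a large share of the samples; the wide range clips nothing but rounds every sample more coarsely.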
The scale and zero-point can be set with methods such as min-max or abs-max quantization, but outliers may call for loss-aware techniques or range clipping.
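A sketch of both parameter choices, assuming the convention q = round(x / s) + z. The percentile-based variant at the end is one simple form of range clipping chosen here for illustration, not necessarily the technique the article has in mind; all function names are hypothetical.

```python
import numpy as np

def minmax_params(x, qmin=0, qmax=255):
    """Asymmetric parameters from the observed min and max (assumes x is not all zeros)."""
    x_min, x_max = min(x.min(), 0.0), max(x.max(), 0.0)  # keep 0.0 exactly representable
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def absmax_params(x, qmax=127):
    """Symmetric parameters from the largest magnitude; zero-point is 0."""
    return np.abs(x).max() / qmax, 0

def percentile_params(x, pct=99.0, qmax=127):
    """Symmetric parameters that ignore the largest magnitudes (range clipping)."""
    return np.percentile(np.abs(x), pct) / qmax, 0

x = np.array([-0.5, 0.0, 1.0, 2.0])
print(minmax_params(x))   # scale 2.5 / 255, zero-point 51
print(absmax_params(x))   # scale 2.0 / 127, zero-point 0

rng = np.random.default_rng(0)
x_out = np.concatenate([rng.normal(size=1000), [100.0]])  # one large outlier
s_abs, _ = absmax_params(x_out)
s_clip, _ = percentile_params(x_out)
print(s_abs, s_clip)      # the outlier inflates the abs-max scale
```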
Quantization categories include affine (asymmetric) vs. symmetric mapping, per-tensor, per-channel, or per-block granularity, and static vs. dynamic activation quantization.
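To illustrate the granularity axis only, the sketch below compares one abs-max scale for a whole weight matrix against one scale per output channel (row). The matrix values are made up so that the two rows have very different ranges.

```python
import numpy as np

def fake_quant_sym(x, scale, qmax=127):
    """Symmetric quantize-dequantize; scale may be a scalar or broadcastable array."""
    return scale * np.clip(np.round(x / scale), -qmax, qmax)

W = np.array([[0.01, -0.02, 0.03],
              [1.00, -2.00, 3.00]])   # two output channels, ranges differ by 100x

s_tensor = np.abs(W).max() / 127                         # per-tensor: a single scale
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127   # per-channel: one scale per row

row0_err_tensor = np.abs(W[0] - fake_quant_sym(W, s_tensor)[0]).max()
row0_err_channel = np.abs(W[0] - fake_quant_sym(W, s_channel)[0]).max()
print(row0_err_tensor, row0_err_channel)
```

With a single scale the small-valued row is squeezed onto one or two grid points; a per-row scale gives it the full grid.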
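In practice, symmetric weights with asymmetric activations are common: a weight zero-point would add a term that depends on the activation data and must be computed at runtime, whereas the activation zero-point only adds a term that depends on the weights and can be precomputed. Per-channel quantization is used for weights but generally avoided for activations because it is inefficient in hardware.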
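The reason this combination is cheap can be sketched with integer arithmetic. Assuming x ≈ s_x (q_x - z_x) and w ≈ s_w q_w, the activation zero-point contributes z_x times the row sums of the integer weights, which is known before any input arrives. The shapes and the zero-point value below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
q_w = rng.integers(-127, 128, size=(4, 8)).astype(np.int32)  # symmetric weights, z_w = 0
q_x = rng.integers(0, 256, size=8).astype(np.int32)          # asymmetric uint8-range activations
z_x = 17                                                     # hypothetical activation zero-point

# Offline: this correction depends only on the weights and z_x.
correction = z_x * q_w.sum(axis=1)

# Runtime: a plain integer matrix-vector product minus the precomputed constant.
acc = q_w @ q_x - correction

# Reference: subtract the zero-point from every activation before multiplying.
ref = q_w @ (q_x - z_x)
print(np.array_equal(acc, ref))
```

A non-zero weight zero-point would add a similar term built from the sum of the activations, which changes with every input and so cannot be folded away in advance.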
Integer models run on Multiply-Accumulate (MAC) units that take low-precision inputs, accumulate in higher precision (e.g., int32), and requantize the outputs for the next layer.
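A sketch of that pipeline for one linear layer, assuming symmetric int8 inputs and weights (all zero-points 0). The requantization multiplier is applied in floating point here for readability; integer-only hardware typically approximates it with a fixed-point multiply and shift. `int8_linear` and the scales are illustrative.

```python
import numpy as np

def int8_linear(q_x, q_w, s_x, s_w, s_out):
    """int8 inputs -> int32 accumulation -> requantized int8 output."""
    acc = q_w.astype(np.int32) @ q_x.astype(np.int32)   # accumulate in int32
    rescaled = acc * (s_x * s_w / s_out)                # move to the output scale
    return np.clip(np.round(rescaled), -128, 127).astype(np.int8)

q_x = np.array([10, -20, 30], dtype=np.int8)               # represents [1.0, -2.0, 3.0]
q_w = np.array([[1, 2, 3], [-4, 5, -6]], dtype=np.int8)    # represents q_w * 0.05
q_y = int8_linear(q_x, q_w, s_x=0.1, s_w=0.05, s_out=0.02)
print(q_y, q_y * 0.02)   # integer output and the real values it represents
```

The int32 accumulators here are 60 and -320; multiplying by 0.1 * 0.05 / 0.02 = 0.25 puts them on the output grid.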
