All About Transformer Inference
- #Transformer Inference
- #Model Sharding
- #KV Cache Optimization
- Transformer inference differs from training because latency becomes a first-class constraint alongside throughput.
- Sampling is autoregressive: at each step the model produces log-probabilities over the vocabulary, one token is sampled, appended to the sequence, and fed back in (see the sampling sketch after this list).
- A KV cache avoids reprocessing the whole prefix at every step: the cost of generating n tokens falls from O(n²) to O(n) for the FFW layers and from O(n³) to O(n²) for attention (cache sketch below).
- Inference splits into a prefill phase, which processes the whole prompt at once and is typically compute-bound, and a generation phase, which produces one token per step and is typically memory-bandwidth-bound.
- Latency is measured on two axes: Time To First Token (TTFT), dominated by prefill, and per-token latency during generation (see the roofline arithmetic below).
- Model sharding strategies differ between prefill and generation: generation's small per-step batch favors more aggressive model parallelism to reduce per-token latency.
- The KV cache often dominates memory at long contexts and large batch sizes; optimizations like grouped multi-query attention (GMQA), which shares each key/value head across several query heads, shrink its footprint (size estimate below).
- Disaggregated serving runs prefill and generation on separate servers so that long prompts do not inflate per-token latency, at the cost of shipping the KV cache between them.
- Continuous batching admits new requests into the generation batch as earlier ones finish, improving throughput; speculative sampling has a small draft model propose tokens that the large model verifies in one pass, cutting per-token latency without changing the output distribution (simplified sketch below).
- Quantization (of weights, activations, or the KV cache) and architectural changes such as local attention layers further cut memory traffic and cache size (int8 sketch below).
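
A minimal sketch of the autoregressive sampling loop from the sampling bullet. `model_apply` is a hypothetical function returning logits of shape `[batch, seq, vocab]`; temperature sampling via `jax.random.categorical` stands in for whatever sampler is actually used. No KV cache yet, so the full prefix is reprocessed each step.

```python
import jax
import jax.numpy as jnp

def sample(model_apply, params, prompt_ids, steps, key, temperature=1.0):
    """Naive autoregressive sampling: re-run the model on the full prefix
    each step (no KV cache), then sample the next token from its logits."""
    tokens = prompt_ids  # [batch, prompt_len]
    for _ in range(steps):
        logits = model_apply(params, tokens)          # [batch, len, vocab]
        next_logits = logits[:, -1, :] / temperature  # logits for the last position
        key, subkey = jax.random.split(key)
        next_token = jax.random.categorical(subkey, next_logits, axis=-1)
        tokens = jnp.concatenate([tokens, next_token[:, None]], axis=-1)
    return tokens
```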
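A sketch of why the KV cache helps, assuming single-head attention and a preallocated cache; the names `k_cache`, `v_cache`, and `step` are illustrative, not from the source.

```python
import jax
import jax.numpy as jnp

def attend_with_cache(q_t, k_t, v_t, k_cache, v_cache, step):
    """One generation step of single-head attention with a KV cache.
    q_t, k_t, v_t: [d_head] projections of the single new token.
    k_cache, v_cache: [max_len, d_head] preallocated caches.
    Cost per step is O(step) instead of recomputing O(step^2) attention."""
    k_cache = k_cache.at[step].set(k_t)
    v_cache = v_cache.at[step].set(v_t)
    keys = k_cache[: step + 1]                       # [step+1, d_head]
    values = v_cache[: step + 1]                     # [step+1, d_head]
    scores = keys @ q_t / jnp.sqrt(q_t.shape[-1])    # [step+1]
    weights = jax.nn.softmax(scores)
    out = weights @ values                           # [d_head]
    return out, k_cache, v_cache
```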
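Rough arithmetic behind the compute-bound vs. bandwidth-bound split and the resulting latency floors (TTFT and per-token). The chip and model numbers below are illustrative assumptions, not from the source.

```python
# Illustrative numbers; swap in real chip and model specs.
peak_flops = 2.0e15        # bf16 FLOP/s per chip (hypothetical accelerator)
hbm_bandwidth = 1.6e12     # bytes/s of HBM bandwidth (hypothetical)
n_params = 8e9             # parameters
bytes_per_param = 2        # bf16

# Prefill: every prompt token does ~2 * n_params FLOPs while the weights are
# read once for the whole prompt, so arithmetic intensity grows with prompt
# length and the phase is usually compute-bound.
prompt_len = 2048
prefill_time = 2 * n_params * prompt_len / peak_flops            # TTFT floor (s)

# Generation: each step does ~2 * n_params FLOPs but must stream all the
# weights (plus the KV cache, ignored here) from HBM, so at small batch
# sizes it is memory-bandwidth-bound.
per_token_latency = n_params * bytes_per_param / hbm_bandwidth   # seconds/token

print(f"prefill (TTFT floor): {prefill_time * 1e3:.1f} ms")      # ~16.4 ms
print(f"per-token latency floor: {per_token_latency * 1e3:.2f} ms")  # ~10 ms
```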
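A back-of-the-envelope KV cache size calculation showing how fewer KV heads (GMQA) shrinks the footprint. The model shape is an illustrative 8B-class configuration, not taken from the source.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2x for keys and values, one entry per layer, per token, per KV head.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 layers, head_dim 128, bf16 cache.
mha = kv_cache_bytes(batch=8, seq_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)
gmqa = kv_cache_bytes(batch=8, seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"MHA cache:  {mha / 1e9:.1f} GB")   # ~34.4 GB
print(f"GMQA cache: {gmqa / 1e9:.1f} GB")  # ~8.6 GB with 4x fewer KV heads
```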
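A simplified, greedy-verification sketch of speculative decoding for the continuous-batching/speculative-sampling bullet. `draft_apply` and `target_apply` are hypothetical functions returning `[batch, seq, vocab]` logits; the real algorithm replaces the greedy match with a stochastic accept/reject rule that exactly preserves the target model's distribution.

```python
import jax.numpy as jnp

def speculative_step(draft_apply, target_apply, tokens, k=4):
    """Greedy speculative decoding, simplified (batch size 1 assumed):
    1. The small draft model proposes k tokens autoregressively.
    2. The large target model scores the whole proposal in one forward pass.
    3. Keep the longest prefix where the target's greedy choice matches."""
    draft = tokens
    for _ in range(k):
        next_tok = jnp.argmax(draft_apply(draft)[:, -1, :], axis=-1)
        draft = jnp.concatenate([draft, next_tok[:, None]], axis=-1)

    # One target pass over prompt + k drafted tokens.
    target_logits = target_apply(draft)                  # [1, len, vocab]
    target_greedy = jnp.argmax(target_logits, axis=-1)   # [1, len]

    n_prompt = tokens.shape[1]
    accepted = tokens
    for i in range(k):
        # The target's logits at position p predict the token at position p+1,
        # so drafted token n_prompt+i is checked against prediction n_prompt+i-1.
        if target_greedy[0, n_prompt + i - 1] == draft[0, n_prompt + i]:
            accepted = draft[:, : n_prompt + i + 1]
        else:
            break
    return accepted
```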
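A minimal per-channel int8 weight-quantization sketch for the last bullet; this is generic symmetric quantization under assumed shapes, not any particular library's scheme.

```python
import jax.numpy as jnp

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization of a [d_in, d_out] weight."""
    scale = jnp.maximum(jnp.max(jnp.abs(w), axis=0), 1e-8) / 127.0   # [d_out]
    w_q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return w_q, scale

def int8_matmul(x, w_q, scale):
    """x @ dequantized(w): weights stay int8 in memory, halving (vs. bf16)
    the bytes streamed from HBM during bandwidth-bound generation."""
    return (x @ w_q.astype(jnp.bfloat16)) * scale
```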