All About Transformer Inference
- #Transformer Inference
- #Model Sharding
- #KV Cache Optimization
- Transformer inference differs from training because latency becomes a first-class constraint alongside throughput.
- Sampling is autoregressive: at each step the model produces log-probabilities over the vocabulary, one token is sampled, appended to the sequence, and fed back in (see the sampling sketch after this list).
- A KV cache avoids reprocessing the whole prefix at every step: the cost of generating n tokens falls from O(n²) to O(n) for the FFW layers and from O(n³) to O(n²) for attention (cache sketch below).
- Inference splits into a prefill phase, which processes the whole prompt at once and is typically compute-bound, and a generation phase, which produces one token per step and is typically memory-bandwidth-bound.
- Latency is measured on two axes: Time To First Token (TTFT), dominated by prefill, and per-token latency during generation (see the roofline arithmetic below).
- Model sharding strategies differ between prefill and generation: generation's small per-step batch favors more aggressive model parallelism to reduce per-token latency.
- The KV cache often dominates memory at long contexts and large batch sizes; optimizations like grouped multi-query attention (GMQA), which shares each key/value head across several query heads, shrink its footprint (size estimate below).
- Disaggregated serving runs prefill and generation on separate servers so that long prompts do not inflate per-token latency, at the cost of shipping the KV cache between them.
- Continuous batching admits new requests into the generation batch as earlier ones finish, improving throughput; speculative sampling has a small draft model propose tokens that the large model verifies in one pass, cutting per-token latency without changing the output distribution (simplified sketch below).
- Quantization (of weights, activations, or the KV cache) and architectural changes such as local attention layers further cut memory traffic and cache size (int8 sketch below).
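
A minimal sketch of the autoregressive sampling loop from the sampling bullet. `model_apply` is a hypothetical function returning logits of shape `[batch, seq, vocab]`; temperature sampling via `jax.random.categorical` stands in for whatever sampler is actually used. No KV cache yet, so the full prefix is reprocessed each step.

```python
import jax
import jax.numpy as jnp

def sample(model_apply, params, prompt_ids, steps, key, temperature=1.0):
    """Naive autoregressive sampling: re-run the model on the full prefix
    each step (no KV cache), then sample the next token from its logits."""
    tokens = prompt_ids  # [batch, prompt_len]
    for _ in range(steps):
        logits = model_apply(params, tokens)          # [batch, len, vocab]
        next_logits = logits[:, -1, :] / temperature  # logits for the last position
        key, subkey = jax.random.split(key)
        next_token = jax.random.categorical(subkey, next_logits, axis=-1)
        tokens = jnp.concatenate([tokens, next_token[:, None]], axis=-1)
    return tokens
```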
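A sketch of why the KV cache helps, assuming single-head attention and a preallocated cache; the names `k_cache`, `v_cache`, and `step` are illustrative, not from the source.

```python
import jax
import jax.numpy as jnp

def attend_with_cache(q_t, k_t, v_t, k_cache, v_cache, step):
    """One generation step of single-head attention with a KV cache.
    q_t, k_t, v_t: [d_head] projections of the single new token.
    k_cache, v_cache: [max_len, d_head] preallocated caches.
    Cost per step is O(step) instead of recomputing O(step^2) attention."""
    k_cache = k_cache.at[step].set(k_t)
    v_cache = v_cache.at[step].set(v_t)
    keys = k_cache[: step + 1]                       # [step+1, d_head]
    values = v_cache[: step + 1]                     # [step+1, d_head]
    scores = keys @ q_t / jnp.sqrt(q_t.shape[-1])    # [step+1]
    weights = jax.nn.softmax(scores)
    out = weights @ values                           # [d_head]
    return out, k_cache, v_cache
```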
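Rough arithmetic behind the compute-bound vs. bandwidth-bound split and the resulting latency floors (TTFT and per-token). The chip and model numbers below are illustrative assumptions, not from the source.

```python
# Illustrative numbers; swap in real chip and model specs.
peak_flops = 2.0e15        # bf16 FLOP/s per chip (hypothetical accelerator)
hbm_bandwidth = 1.6e12     # bytes/s of HBM bandwidth (hypothetical)
n_params = 8e9             # parameters
bytes_per_param = 2        # bf16

# Prefill: every prompt token does ~2 * n_params FLOPs while the weights are
# read once for the whole prompt, so arithmetic intensity grows with prompt
# length and the phase is usually compute-bound.
prompt_len = 2048
prefill_time = 2 * n_params * prompt_len / peak_flops            # TTFT floor (s)

# Generation: each step does ~2 * n_params FLOPs but must stream all the
# weights (plus the KV cache, ignored here) from HBM, so at small batch
# sizes it is memory-bandwidth-bound.
per_token_latency = n_params * bytes_per_param / hbm_bandwidth   # seconds/token

print(f"prefill (TTFT floor): {prefill_time * 1e3:.1f} ms")      # ~16.4 ms
print(f"per-token latency floor: {per_token_latency * 1e3:.2f} ms")  # ~10 ms
```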
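A back-of-the-envelope KV cache size calculation showing how fewer KV heads (GMQA) shrinks the footprint. The model shape is an illustrative 8B-class configuration, not taken from the source.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2x for keys and values, one entry per layer, per token, per KV head.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 layers, head_dim 128, bf16 cache.
mha = kv_cache_bytes(batch=8, seq_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)
gmqa = kv_cache_bytes(batch=8, seq_len=8192, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"MHA cache:  {mha / 1e9:.1f} GB")   # ~34.4 GB
print(f"GMQA cache: {gmqa / 1e9:.1f} GB")  # ~8.6 GB with 4x fewer KV heads
```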
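A simplified, greedy-verification sketch of speculative decoding for the continuous-batching/speculative-sampling bullet. `draft_apply` and `target_apply` are hypothetical functions returning `[batch, seq, vocab]` logits; the real algorithm replaces the greedy match with a stochastic accept/reject rule that exactly preserves the target model's distribution.

```python
import jax.numpy as jnp

def speculative_step(draft_apply, target_apply, tokens, k=4):
    """Greedy speculative decoding, simplified (batch size 1 assumed):
    1. The small draft model proposes k tokens autoregressively.
    2. The large target model scores the whole proposal in one forward pass.
    3. Keep the longest prefix where the target's greedy choice matches."""
    draft = tokens
    for _ in range(k):
        next_tok = jnp.argmax(draft_apply(draft)[:, -1, :], axis=-1)
        draft = jnp.concatenate([draft, next_tok[:, None]], axis=-1)

    # One target pass over prompt + k drafted tokens.
    target_logits = target_apply(draft)                  # [1, len, vocab]
    target_greedy = jnp.argmax(target_logits, axis=-1)   # [1, len]

    n_prompt = tokens.shape[1]
    accepted = tokens
    for i in range(k):
        # The target's logits at position p predict the token at position p+1,
        # so drafted token n_prompt+i is checked against prediction n_prompt+i-1.
        if target_greedy[0, n_prompt + i - 1] == draft[0, n_prompt + i]:
            accepted = draft[:, : n_prompt + i + 1]
        else:
            break
    return accepted
```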
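A minimal per-channel int8 weight-quantization sketch for the last bullet; this is generic symmetric quantization under assumed shapes, not any particular library's scheme.

```python
import jax.numpy as jnp

def quantize_int8(w):
    """Symmetric per-output-channel int8 quantization of a [d_in, d_out] weight."""
    scale = jnp.maximum(jnp.max(jnp.abs(w), axis=0), 1e-8) / 127.0   # [d_out]
    w_q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return w_q, scale

def int8_matmul(x, w_q, scale):
    """x @ dequantized(w): weights stay int8 in memory, halving (vs. bf16)
    the bytes streamed from HBM during bandwidth-bound generation."""
    return (x @ w_q.astype(jnp.bfloat16)) * scale
```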