Hasty Briefs

All About Transformer Inference

15 days ago
  • #Transformer Inference
  • #Model Sharding
  • #KV Cache Optimization
  • Inference on Transformers differs from training due to latency considerations.
  • Sampling generates tokens one at a time, drawing each new token from the model's next-token log-probabilities (see the sampling sketch after this list).
  • KV caching avoids reprocessing earlier tokens at every step, cutting the cost of generating an n-token sequence from O(n²) to O(n) in the FFW layers and from O(n³) to O(n²) in attention (a decode-step sketch follows this list).
  • Inference is divided into a prefill phase (compute-bound) and a generation phase (memory bandwidth-bound); the arithmetic-intensity estimate after this list shows why.
  • Latency is a critical factor in inference, measured as Time To First Token (TTFT) for prefill and per-token latency during generation.
  • Model sharding strategies differ between prefill and generation, with generation favoring more sharding to reduce latency.
  • KV cache size significantly impacts memory usage and performance; optimizations like grouped multi-query attention (GMQA/GQA) shrink its footprint (a sizing example follows this list).
  • Disaggregated serving separates prefill and generation tasks to optimize latency and throughput.
  • Continuous batching and speculative sampling are techniques to improve throughput and reduce latency (the speculative-sampling acceptance rule is sketched after this list).
  • Quantization and architectural modifications like local attention layers can further optimize inference performance.
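
To make the sampling bullet concrete, here is a minimal sketch of temperature sampling from a vector of next-token logits. The function name and the numpy-based setup are illustrative assumptions, not code from the article.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0, rng=None) -> int:
    """Draw one token id from unnormalized next-token log-probabilities (logits). Sketch only."""
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        return int(np.argmax(logits))          # greedy decoding
    scaled = logits / temperature              # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - scaled.max())      # softmax with max subtraction for stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Autoregressive generation repeatedly calls this on the logits for the last position and feeds the sampled token back into the model.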
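
The KV-cache bullet can be illustrated with a single decode step for one attention head; batching, multiple heads, and the surrounding FFW layer are omitted, and all names and shapes are assumptions for illustration.

```python
import numpy as np

def decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """One generation step with a KV cache (single head, no batch, illustration only).

    Only the newly generated token is projected, so the per-step projection cost is O(1)
    in sequence length; attention reads the t cached keys/values, so each step costs O(t)
    instead of recomputing O(t^2) attention from scratch.
    """
    q = x_new @ Wq                                   # [d_head] query for the new token
    k_cache.append(x_new @ Wk)                       # cache this step's key ...
    v_cache.append(x_new @ Wv)                       # ... and value
    K, V = np.stack(k_cache), np.stack(v_cache)      # [t, d_head] each
    scores = K @ q / np.sqrt(q.shape[-1])            # [t] attention logits over cached positions
    w = np.exp(scores - scores.max()); w /= w.sum()  # softmax
    return w @ V                                     # [d_head] attention output for the new token
```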
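
The prefill/generation split comes down to arithmetic intensity: prefill reuses each loaded weight across many prompt tokens, while generation loads every weight to produce one token per sequence. The back-of-the-envelope below uses assumed numbers (a 70B-parameter bf16 model, a 2048-token prompt) purely for illustration.

```python
# Rough arithmetic intensity: FLOPs performed per byte of weights streamed from memory.
params = 70e9                  # assumed parameter count
bytes_per_param = 2            # bf16 weights
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token in the forward pass

prefill_tokens = 2048          # prompt tokens processed together during prefill
decode_tokens = 1              # tokens produced per sequence per generation step

weight_bytes = params * bytes_per_param
prefill_intensity = flops_per_token * prefill_tokens / weight_bytes   # ~2048 FLOPs/byte
decode_intensity = flops_per_token * decode_tokens / weight_bytes     # ~1 FLOP/byte at batch 1

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte, decode: {decode_intensity:.0f} FLOPs/byte")
# Accelerators need hundreds of FLOPs per byte to stay compute-bound, so small-batch decoding
# is limited by memory bandwidth; batching more sequences raises decode intensity.
```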
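
To see why the KV cache dominates memory at long context and why GMQA/GQA helps, here is a simple sizing formula; the layer, head, and sequence counts below are made-up example values, not figures from the article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV-cache size = 2 (K and V) * layers * KV heads * head_dim * tokens * batch * dtype bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Full multi-head attention: one KV head per query head (64 here, assumed).
mha  = kv_cache_bytes(n_layers=64, n_kv_heads=64, head_dim=128, seq_len=8192, batch=8)
# GMQA/GQA: a few KV heads shared across all query heads (8 here, assumed).
gmqa = kv_cache_bytes(n_layers=64, n_kv_heads=8,  head_dim=128, seq_len=8192, batch=8)

print(f"MHA: {mha / 1e9:.0f} GB, GMQA: {gmqa / 1e9:.0f} GB")  # ~137 GB vs ~17 GB: an 8x reduction
```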
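
Finally, the core of speculative sampling is an accept/reject rule that preserves the target model's output distribution while letting a cheap draft model propose tokens. The sketch below shows that rule for a single proposed token; the function and argument names are chosen for illustration.

```python
import numpy as np

def accept_or_resample(draft_token, p_draft, p_target, rng=None):
    """Accept the draft token with probability min(1, p_target/p_draft); otherwise resample
    from the residual distribution max(0, p_target - p_draft), renormalized. Samples produced
    this way follow the target model's distribution exactly."""
    rng = rng or np.random.default_rng()
    if rng.random() < min(1.0, p_target[draft_token] / p_draft[draft_token]):
        return int(draft_token)                       # draft proposal accepted
    residual = np.maximum(p_target - p_draft, 0.0)    # probability mass the draft under-covered
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))
```

Because generation is bandwidth-bound, the target model can score several draft tokens in one forward pass for roughly the cost of one decode step, which is where the latency win comes from.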