Three types of LLM workloads and how to serve them
- #inference
- #workloads
- #LLM
- LLM workloads fall into three types: offline (batch mode, optimized for throughput), online (streaming mode, optimized for low latency), and semi-online (bursty traffic that needs flexible infrastructure).
- Offline workloads prioritize throughput per dollar, keeping GPUs saturated via mixed batching (running prefill and decode in the same batch). vLLM is recommended for these workloads (see the batch-inference sketch after this list).
- Online workloads require low latency and face bottlenecks such as host (CPU) overhead and memory-bandwidth-bound decoding. SGLang with speculative decoding is recommended (see the streaming sketch after this list).
- Semi-online workloads need flexible scaling to handle variable, bursty demand. Solutions include multi-tenancy and GPU memory snapshotting to reduce cold starts (see the cold-start sketch after this list).
- Future trends include more lossy optimizations for speed, exotic hardware for online workloads, and the rise of long-running agent applications.
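
To make the offline recommendation concrete, here is a minimal vLLM batch-inference sketch. The model name and prompt set are placeholders; the point is that you submit the whole batch at once and let vLLM's continuous batching keep the GPU saturated, since throughput per dollar, not per-request latency, is the goal.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; swap in whatever model you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

# Offline jobs hand over the full prompt list up front.
prompts = [f"Summarize document {i}:" for i in range(10_000)]

# vLLM schedules the whole batch internally, mixing prefill and decode
# to keep utilization high; nothing is streamed back token by token.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text)
```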
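
For the online case, a rough sketch of measuring time-to-first-token against an SGLang server through its OpenAI-compatible API. The launch command in the comment, including the speculative-decoding flags, is illustrative and may vary across SGLang versions; the port, endpoint, and model name are assumptions.

```python
import time
from openai import OpenAI

# Assumes an SGLang server is already running locally, launched roughly like
# (flag names per your SGLang version; the EAGLE flags are illustrative):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --speculative-algorithm EAGLE --port 30000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Record when the first content token arrives; for online workloads,
    # time-to-first-token is the latency metric users actually feel.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
print(f"time to first token: {first_token_at - start:.3f}s")
```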
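
For semi-online scaling, a hypothetical sketch of the cold-start path a scale-to-zero worker might follow. None of these functions are real APIs: `save_snapshot` and `restore_engine` stand in for whatever GPU memory snapshotting your platform provides, and `SNAPSHOT_PATH` is a placeholder.

```python
import os
import time

SNAPSHOT_PATH = "/snapshots/llama-8b.img"  # hypothetical path

def load_engine_from_scratch():
    """Full cold start: pull weights, allocate KV cache, warm up kernels."""
    ...

def restore_engine(path):
    """Hypothetical restore of a previously captured GPU memory image."""
    ...

def save_snapshot(engine, path):
    """Hypothetical capture of the engine's GPU state after warm-up."""
    ...

def get_engine():
    t0 = time.perf_counter()
    if os.path.exists(SNAPSHOT_PATH):
        # Restoring a memory image skips weight loading and warm-up,
        # turning a minutes-long cold start into seconds.
        engine = restore_engine(SNAPSHOT_PATH)
    else:
        engine = load_engine_from_scratch()
        save_snapshot(engine, SNAPSHOT_PATH)
    print(f"engine ready in {time.perf_counter() - t0:.1f}s")
    return engine
```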