Three types of LLM workloads and how to serve them
- #inference
- #workloads
- #LLM
- LLM workloads fall into three types: offline (batch mode, optimized for throughput), online (streaming mode, optimized for low latency), and semi-online (bursty traffic that needs flexible infrastructure).
- Offline workloads prioritize throughput per dollar, keeping GPUs saturated via mixed batching (running prefill and decode in the same batch). vLLM is recommended for these workloads (see the batch-inference sketch after this list).
- Online workloads require low latency and face bottlenecks such as host (CPU) overhead and memory-bandwidth-bound decoding. SGLang with speculative decoding is recommended (see the streaming sketch after this list).
- Semi-online workloads need flexible scaling to handle variable, bursty demand. Solutions include multi-tenancy and GPU memory snapshotting to reduce cold starts (see the cold-start sketch after this list).
- Future trends include more lossy optimizations for speed, exotic hardware for online workloads, and the rise of long-running agent applications.
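
To make the offline recommendation concrete, here is a minimal vLLM batch-inference sketch. The model name and prompt set are placeholders; the point is that you submit the whole batch at once and let vLLM's continuous batching keep the GPU saturated, since throughput per dollar, not per-request latency, is the goal.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; swap in whatever model you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

# Offline jobs hand over the full prompt list up front.
prompts = [f"Summarize document {i}:" for i in range(10_000)]

# vLLM schedules the whole batch internally, mixing prefill and decode
# to keep utilization high; nothing is streamed back token by token.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text)
```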
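
For the online case, a rough sketch of measuring time-to-first-token against an SGLang server through its OpenAI-compatible API. The launch command in the comment, including the speculative-decoding flags, is illustrative and may vary across SGLang versions; the port, endpoint, and model name are assumptions.

```python
import time
from openai import OpenAI

# Assumes an SGLang server is already running locally, launched roughly like
# (flag names per your SGLang version; the EAGLE flags are illustrative):
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
#       --speculative-algorithm EAGLE --port 30000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    # Record when the first content token arrives; for online workloads,
    # time-to-first-token is the latency metric users actually feel.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
print(f"time to first token: {first_token_at - start:.3f}s")
```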
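
For semi-online scaling, a hypothetical sketch of the cold-start path a scale-to-zero worker might follow. None of these functions are real APIs: `save_snapshot` and `restore_engine` stand in for whatever GPU memory snapshotting your platform provides, and `SNAPSHOT_PATH` is a placeholder.

```python
import os
import time

SNAPSHOT_PATH = "/snapshots/llama-8b.img"  # hypothetical path

def load_engine_from_scratch():
    """Full cold start: pull weights, allocate KV cache, warm up kernels."""
    ...

def restore_engine(path):
    """Hypothetical restore of a previously captured GPU memory image."""
    ...

def save_snapshot(engine, path):
    """Hypothetical capture of the engine's GPU state after warm-up."""
    ...

def get_engine():
    t0 = time.perf_counter()
    if os.path.exists(SNAPSHOT_PATH):
        # Restoring a memory image skips weight loading and warm-up,
        # turning a minutes-long cold start into seconds.
        engine = restore_engine(SNAPSHOT_PATH)
    else:
        engine = load_engine_from_scratch()
        save_snapshot(engine, SNAPSHOT_PATH)
    print(f"engine ready in {time.perf_counter() - t0:.1f}s")
    return engine
```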