Three types of LLM workloads and how to serve them

3 months ago
  • #inference
  • #workloads
  • #LLM
  • LLM workloads are categorized into three types: offline (batch mode, high throughput), online (streaming mode, low latency), and semi-online (bursty traffic that needs flexible infrastructure).
  • Offline workloads prioritize throughput per dollar, keeping GPUs saturated with mixed batching. vLLM is recommended for these workloads (first sketch after this list).
  • Online workloads require low latency and are constrained by host overhead and memory bandwidth. SGLang with speculative decoding is recommended (second sketch below).
  • Semi-online workloads need flexible scaling to handle variable demand. Solutions include multi-tenancy and GPU memory snapshotting to reduce cold starts (third sketch below).
  • Future trends include more lossy optimizations for speed, exotic hardware for online workloads, and the rise of long-running agent applications.
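
Below is a minimal sketch of the offline pattern using vLLM's batch API. The model checkpoint, prompts, and sampling settings are illustrative, not taken from the article; the point is that vLLM's offline engine batches requests internally, which is what drives throughput per dollar.

```python
# Hedged sketch: offline batch inference with vLLM. The checkpoint and
# sampling settings below are placeholders, not the article's choices.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the plot of Hamlet in one sentence.",
    "Translate 'good morning' into French.",
]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules these requests with continuous batching, keeping the GPU
# saturated: the throughput-per-dollar behavior offline workloads want.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```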
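For the online case, here is a sketch of SGLang with EAGLE speculative decoding. The draft model and speculative settings are assumptions drawn from SGLang's public examples, not from the article; in production the same arguments would usually go to `python -m sglang.launch_server` rather than the offline `Engine`.

```python
# Hedged sketch: SGLang with EAGLE speculative decoding. The draft model
# and speculative settings are assumptions from SGLang's examples.
import sglang as sgl

# sgl.Engine accepts the same arguments as sglang.launch_server.
llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE",
    speculative_draft_model_path="lmsys/sglang-EAGLE-LLaMA3-Instruct-8B",
    speculative_num_steps=3,        # draft steps per verification pass
    speculative_eagle_topk=4,       # branches kept per draft step
    speculative_num_draft_tokens=16,
)

# Speculative decoding spends extra draft-model compute to cut per-token
# latency: the target model verifies several drafted tokens in one pass.
out = llm.generate("What is speculative decoding?", {"max_new_tokens": 64})
print(out["text"])
```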
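Finally, a hypothetical sketch of the snapshotting idea for semi-online scaling: keep a deserialized copy of the weights in pinned host RAM so that spinning a replica back up is a fast host-to-device copy instead of a full load from disk. The function names are invented for illustration; real snapshotters work at a lower level, capturing full GPU memory and CUDA state.

```python
# Hypothetical sketch of GPU memory snapshotting to shrink cold starts.
# Function names are invented; real systems capture full GPU state.
import torch

def snapshot_to_host(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    # Pinned (page-locked) host memory allows fast host-to-device copies
    # when the replica is scaled back up.
    return {
        name: t.detach().cpu().pin_memory()
        for name, t in model.state_dict().items()
    }

def restore_to_gpu(model: torch.nn.Module,
                   snapshot: dict[str, torch.Tensor]) -> None:
    # load_state_dict copies the pinned host tensors into the model's
    # existing GPU-resident parameters, skipping disk I/O and weight
    # deserialization: the two biggest cold-start costs.
    model.load_state_dict(snapshot)
```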