Hasty Briefs (beta)

Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint

2 days ago
  • #AI Inference
  • #Serverless Computing
  • #GPU Optimization
  • Modal uses serverless computing to optimize AI inference workloads, which are variable and unpredictable, by leveraging cloud buffers, a custom filesystem, and checkpoint/restore mechanisms.
  • The system cuts the time to scale up a GPU inference replica from kiloseconds to tens of seconds, a 40x reduction in cold-start time (from 2000 seconds to 50 seconds).
  • Key techniques include: cloud buffers with idle GPUs for quick allocation, a custom FUSE-based filesystem for lazy loading of container images, CPU checkpoint/restore to skip host-side initialization, and CUDA checkpoint/restore to skip GPU-side setup.
  • These optimizations maximize GPU Allocation Utilization by matching supply to demand, avoiding over-provisioning and underutilization.
  • Implementation details involve tiered caching, tuning of libfuse parameters, skipping gzip compression, and integrating with gVisor for container security and checkpointing.
  • Benchmarks show significant speedups: vLLM and SGLang server replicas boot up 4-10x faster with snapshots, reducing latency from minutes to seconds.
  • The system supports millions of replicas across various use cases, including Reducto's document processing, which scales to thousands of GPUs with low cold starts.
  • Modal aims to improve AI infrastructure efficiency and invites engineers to join their efforts in advancing cloud systems for AI workloads.
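The lazy loading and tiered caching mentioned above can be illustrated with a small sketch. This is not Modal's implementation: the class names (`DictTier`, `LazyImage`), the content keys, and the three-tier layout are all hypothetical, and in-memory dictionaries stand in for RAM, local disk, and a remote store. It only shows the general idea that a file's bytes are fetched on first read, from the fastest tier that has them, and then copied into the faster tiers.

```python
from typing import Dict, List, Optional


class DictTier:
    """Hypothetical cache tier backed by a dict (stands in for RAM, SSD, or a blob store)."""

    def __init__(self, name: str, blobs: Optional[Dict[str, bytes]] = None,
                 writable: bool = True) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = dict(blobs or {})
        self.writable = writable
        self.reads = 0  # number of lookups, to make cache behaviour observable

    def get(self, key: str) -> Optional[bytes]:
        self.reads += 1
        return self.blobs.get(key)

    def put(self, key: str, data: bytes) -> None:
        if self.writable:
            self.blobs[key] = data


class LazyImage:
    """Hypothetical lazily loaded image: a path index is known up front, contents are not."""

    def __init__(self, index: Dict[str, str], tiers: List[DictTier]) -> None:
        self.index = index  # file path -> content key
        self.tiers = tiers  # ordered fastest first

    def read(self, path: str) -> bytes:
        key = self.index[path]
        for depth, tier in enumerate(self.tiers):
            data = tier.get(key)
            if data is not None:
                # Promote the blob into every faster tier so the next read is cheaper.
                for faster in self.tiers[:depth]:
                    faster.put(key, data)
                return data
        raise FileNotFoundError(path)


if __name__ == "__main__":
    memory = DictTier("memory")
    disk = DictTier("disk")
    remote = DictTier("remote", {"k1": b"weights"}, writable=False)
    image = LazyImage({"/model/weights.bin": "k1"}, [memory, disk, remote])
    image.read("/model/weights.bin")  # first read falls through to the remote tier
    image.read("/model/weights.bin")  # second read is served from memory
    print(remote.reads)
```

In this sketch, files that are never read are never fetched, which is the property that lets a container start before its whole image has been transferred.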