Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint

2 days ago

Modal runs AI inference workloads, whose demand is variable and unpredictable, on a serverless platform built from cloud buffers, a custom filesystem, and checkpoint/restore mechanisms.
The system reduces GPU inference replica scaling from kiloseconds to tens of seconds, cutting cold starts by 40x (from 2000 seconds to 50 seconds).
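The 40x figure follows directly from the two quoted timings. A quick check of the arithmetic, using only the numbers stated above (the variable names are illustrative):

```python
# Cold-start timings quoted in the summary, in seconds.
cold_start_before_s = 2000
cold_start_after_s = 50

# Ratio of old to new cold-start time, and the absolute time saved.
speedup = cold_start_before_s / cold_start_after_s
saved_s = cold_start_before_s - cold_start_after_s

print(f"speedup: {speedup:.0f}x, saved per cold start: {saved_s} s")
```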
Key techniques include: cloud buffers with idle GPUs for quick allocation, a custom FUSE-based filesystem for lazy loading of container images, CPU checkpoint/restore to skip host-side initialization, and CUDA checkpoint/restore to skip GPU-side setup.
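To make the lazy-loading idea concrete, here is a minimal sketch of a file that fetches fixed-size chunks only when they are first read. It is not Modal's filesystem and does not use FUSE; `LazyFile`, `CHUNK_SIZE`, `fetch_from_remote`, and the `REMOTE` bytes are all hypothetical names invented for illustration. The point is only that a container can start reading before the whole image has been downloaded.

```python
from typing import Callable, Dict

CHUNK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


class LazyFile:
    """Read-only file whose chunks are fetched on first access."""

    def __init__(self, size: int, fetch_chunk: Callable[[int], bytes]) -> None:
        self.size = size
        self._fetch_chunk = fetch_chunk
        self._chunks: Dict[int, bytes] = {}  # chunks already fetched
        self.fetch_count = 0  # number of remote fetches performed

    def _chunk(self, index: int) -> bytes:
        # Fetch a chunk only if no earlier read has pulled it in.
        if index not in self._chunks:
            self._chunks[index] = self._fetch_chunk(index)
            self.fetch_count += 1
        return self._chunks[index]

    def read(self, offset: int, length: int) -> bytes:
        # Assemble the requested byte range from the chunks it overlaps.
        end = min(offset + length, self.size)
        out = bytearray()
        pos = offset
        while pos < end:
            index, start = divmod(pos, CHUNK_SIZE)
            take = min(CHUNK_SIZE - start, end - pos)
            out += self._chunk(index)[start:start + take]
            pos += take
        return bytes(out)


# Stand-in for image contents held in remote storage (an assumption).
REMOTE = b"layer-0:python3.12;layer-1:torch;layer-2:model-code"


def fetch_from_remote(index: int) -> bytes:
    return REMOTE[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]


image = LazyFile(len(REMOTE), fetch_from_remote)
first = image.read(0, 7)  # touches only the first two chunks
print(first, image.fetch_count)
```

Reading the first seven bytes fetches two chunks and leaves the rest of the image untouched; repeated reads of the same range trigger no further fetches.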
These optimizations maximize GPU Allocation Utilization by matching supply to demand, avoiding over-provisioning and underutilization.
Implementation details include tiered caching, tuning libfuse parameters, skipping gzip compression, and integrating with gVisor for container security and checkpointing.
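Tiered caching can be sketched as a lookup that tries the fastest tier first and promotes data on the way back. The summary does not say which tiers Modal uses, so the three tiers here (memory, local disk, origin storage) are an assumption, and `TieredCache` is a hypothetical class with plain dicts standing in for real storage.

```python
from typing import Callable, Dict


class TieredCache:
    """Look up blobs in memory, then local disk, then origin storage."""

    def __init__(self, origin: Callable[[str], bytes]) -> None:
        self.memory: Dict[str, bytes] = {}  # fastest tier
        self.disk: Dict[str, bytes] = {}    # dict standing in for local SSD
        self.origin = origin                # slowest tier, e.g. object storage
        self.hits = {"memory": 0, "disk": 0, "origin": 0}

    def get(self, key: str) -> bytes:
        if key in self.memory:
            self.hits["memory"] += 1
            return self.memory[key]
        if key in self.disk:
            self.hits["disk"] += 1
            value = self.disk[key]
        else:
            self.hits["origin"] += 1
            value = self.origin(key)
            self.disk[key] = value  # fill the disk tier on an origin read
        self.memory[key] = value    # promote to memory for the next read
        return value


cache = TieredCache(lambda key: f"blob:{key}".encode())
cache.get("weights")    # served by origin, fills disk and memory
cache.get("weights")    # served by memory
cache.memory.clear()    # simulate memory pressure or a restart
cache.get("weights")    # served by disk, promoted back to memory
print(cache.hits)
```

Each of the three reads is served by a different tier, so every counter in `hits` ends at 1.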
Benchmarks show significant speedups: vLLM and SGLang server replicas boot up 4-10x faster with snapshots, reducing latency from minutes to seconds.
The system supports millions of replicas across various use cases, including Reducto's document processing, which scales to thousands of GPUs with short cold starts.
Modal aims to improve AI infrastructure efficiency and invites engineers to join its work on cloud systems for AI workloads.
