Popping the GPU Bubble


During AI model inference the GPU often sits idle while it waits for the CPU to issue the next piece of work; these idle gaps are called GPU bubbles.
Text generation is autoregressive: tokens are produced one at a time, and each token requires a round trip between CPU and GPU.
Pipelined decoding overlaps CPU and GPU work by launching the GPU forward pass for the next token while the CPU is still processing the previous one.
Ping-pong slots use two sets of buffers to prevent collisions: the CPU processes the results in one slot while the GPU runs the next forward pass into the other.
"Forward now, sample later" decouples the GPU forward pass from its dependency on sampling, which lets constrained decoding work without special-casing.
Finished sequences become "zombies": they are marked finalized, but their release is delayed until the number of in-flight references drops to zero.
Prefill and decode work share the same pipeline, so the two can overlap instead of serializing behind each other.
Speed gains from pipelining scale with GPU speed: the faster the GPU, the larger the share of each step that is lost to bubbles, so there is more idle time to recover.
Pipelining reduces idle time significantly, with observed speedups ranging from ~6% to over 35% depending on hardware and batch size.
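
The relationship between GPU speed and pipelining gains can be sketched with a toy timing model. This is an illustration under stated assumptions, not the article's benchmark: it assumes every step has a fixed GPU forward time and a fixed CPU processing time, and that the CPU work for one token fully overlaps the GPU forward for the next. The function names and the millisecond figures in the usage note are hypothetical.

```python
def serial_time(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """No overlap: every step pays for the GPU forward and then the CPU work."""
    return steps * (gpu_ms + cpu_ms)


def pipelined_time(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """CPU work for step i overlaps the GPU forward for step i + 1.

    The first forward and the last CPU pass cannot be hidden. In between,
    each step costs whichever side is slower (assumed model, not measured).
    """
    if steps == 0:
        return 0.0
    return gpu_ms + (steps - 1) * max(gpu_ms, cpu_ms) + cpu_ms


def speedup(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """Ratio of serial to pipelined wall time for the same workload."""
    return serial_time(steps, gpu_ms, cpu_ms) / pipelined_time(steps, gpu_ms, cpu_ms)


if __name__ == "__main__":
    # Hypothetical numbers: same 2 ms of CPU work, two different GPU speeds.
    for gpu_ms in (10.0, 5.0):
        print(f"gpu={gpu_ms} ms  speedup={speedup(100, gpu_ms, 2.0):.3f}")
```

With the CPU cost held constant, halving the GPU forward time raises the modelled speedup, which matches the direction of the claim above. The absolute values depend entirely on the assumed timings and should not be read as the observed 6% to 35% range.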
