Two Leaps to 1000 Tokens/s on a 1T-Parameter Model

5 hours ago

The optimization of LLM inference systems has historically focused on kernels, operators, and scheduling to maximize hardware utilization.
The shift to ultra-low-latency applications has exposed execution overheads that were previously hidden, making microsecond-level optimization critical.
TileRT introduces a Persistent Engine that consolidates the computational pipeline to eliminate execution gaps caused by operator boundaries.
Persistent execution enables continuous prefetching and tile-level pipelining, improving data movement overlap within the hardware.
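The benefit of overlapping data movement with computation can be illustrated with a simple timing model. The sketch below is not TileRT code; it assumes a two-stage pipeline (load a tile, then compute on it) with double buffering, and the function names and tile timings are hypothetical values chosen for illustration.

```python
def serial_time_us(num_tiles: int, load_us: float, compute_us: float) -> float:
    """Total time when each tile is loaded and then computed, with no overlap."""
    return num_tiles * (load_us + compute_us)


def pipelined_time_us(num_tiles: int, load_us: float, compute_us: float) -> float:
    """Total time when the load of the next tile overlaps the current compute.

    Assumes a two-stage pipeline with double buffering: the first load and
    the last compute are exposed, and every step in between is bounded by
    the slower of the two stages.
    """
    if num_tiles == 0:
        return 0.0
    return load_us + (num_tiles - 1) * max(load_us, compute_us) + compute_us


if __name__ == "__main__":
    # Hypothetical numbers: 4 tiles, 2 us to load and 3 us to compute each.
    print(serial_time_us(4, 2.0, 3.0))
    print(pipelined_time_us(4, 2.0, 3.0))
```

In this model the saving grows with the number of tiles, because only the first load and the last compute remain exposed once the pipeline is full.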
Warp Specialization and Heterogeneous Workers assign different roles to different groups of GPU threads, turning the GPU into a coordinated, heterogeneous execution system.
Breaking the barrier of 1000 tokens per second (TPS) requires hardware-software co-design to address microsecond-scale bottlenecks such as auxiliary operations.
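A back-of-the-envelope calculation shows why microseconds matter at this speed. The sketch below assumes single-stream decoding that produces one token per step, which the source does not state; the operation count and per-operation overhead are hypothetical values, not measurements.

```python
def per_token_budget_us(tokens_per_second: float) -> float:
    """Time available to produce one token, in microseconds."""
    return 1_000_000 / tokens_per_second


def overhead_fraction(tokens_per_second: float, num_ops: int,
                      overhead_per_op_us: float) -> float:
    """Share of the per-token budget consumed by fixed per-operation overhead."""
    total_overhead_us = num_ops * overhead_per_op_us
    return total_overhead_us / per_token_budget_us(tokens_per_second)


if __name__ == "__main__":
    # Hypothetical workload: 200 operations per token, 2 us of overhead each.
    for tps in (50, 1000):
        print(tps, per_token_budget_us(tps), overhead_fraction(tps, 200, 2.0))
```

Under these assumed numbers, the same 400 microseconds of overhead is about 2% of the budget at 50 TPS but 40% of it at 1000 TPS, where the whole token must be produced in 1000 microseconds.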
Collaboration with Xiaomi's MiMo team led to optimizations such as mixed-precision quantization and DFlash integration to reduce latency.
Speed is emerging as a new scaling law, where inference speed directly influences model capabilities and application viability.
Achieving 1000+ TPS represents a paradigm shift toward the co-evolution of models and systems in ultra-low-latency regimes.
