4x faster LLM inference (Flash Attention guy's company)
- #Inference Optimization
- #AI
- #Machine Learning
- ATLAS introduces an adaptive-learning speculator system for LLM inference, offering up to 4x faster performance.
- Unlike static speculators, ATLAS dynamically improves at runtime, learning from historical and live traffic patterns.
- ATLAS reaches up to 500 tokens/sec (TPS) on DeepSeek-V3.1 and 460 TPS on Kimi-K2, outperforming both standard decoding and specialized inference hardware like Groq.
- Speculative decoding accelerates inference by having a cheap draft model propose several tokens ahead, which the target model then verifies in a single parallel pass (see the first sketch after this list).
- ATLAS pairs a heavyweight static speculator (broad coverage of common workloads) with a lightweight adaptive speculator that learns from live traffic, and a confidence-aware controller chooses between them at runtime (second sketch below).
- The system is particularly effective in RL training, where it adapts to evolving policies, reducing rollout times.
- ATLAS is part of Together Turbo’s optimization suite, integrating with other techniques like quantization and TurboBoost-TTFT for end-to-end acceleration.
- The adaptive speculator shines on narrow input distributions, where it reaches the peak efficiency behind the DeepSeek-V3.1 number above.
- Together AI is hiring research scientists and engineers to advance efficient AI deployment.
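
For readers new to speculative decoding, here is a minimal sketch of one decode step. `target_model`, `draft_model`, and the greedy accept-on-match rule are illustrative assumptions, not Together's implementation; production systems typically use rejection sampling so the output distribution exactly matches the target model's.

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One speculative decoding step (illustrative sketch).

    The draft model proposes k tokens greedily; the target model scores
    the whole proposed span in a single forward pass and accepts the
    longest prefix on which the two models agree. Both models are
    assumed to map a [batch, seq] token tensor to per-position logits.
    """
    # 1) Draft: propose k tokens autoregressively with the cheap model.
    proposed = []
    ctx = prefix.clone()
    for _ in range(k):
        logits = draft_model(ctx.unsqueeze(0))[0, -1]   # next-token logits
        tok = torch.argmax(logits)
        proposed.append(tok)
        ctx = torch.cat([ctx, tok.view(1)])

    # 2) Verify: one target-model pass over prefix + proposals yields the
    #    target's prediction at every proposed position in parallel.
    logits = target_model(ctx.unsqueeze(0))[0]
    n = prefix.shape[0]
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = torch.argmax(logits[n - 1 + i])
        if target_tok == tok:
            accepted.append(tok)          # draft matched target: keep it
        else:
            accepted.append(target_tok)   # mismatch: take target's token, stop
            break
    else:
        # All k drafts accepted: the verify pass gives one bonus token free.
        accepted.append(torch.argmax(logits[-1]))

    return torch.stack(accepted)
```

The speedup comes from the verify step: the expensive target model runs once over k proposed tokens instead of k times, and whenever the cheap draft model agrees with it, several tokens are emitted per target-model pass.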
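
The confidence-aware controller is only described at a high level in the post; the sketch below shows one plausible routing policy. The threshold, the `confidence()` score, and the draft lengths are all hypothetical, not Together's published algorithm.

```python
def choose_speculator(static_spec, adaptive_spec, context,
                      confidence_threshold=0.8):
    """Hypothetical confidence-aware routing between two speculators.

    Returns the speculator to use for the next step plus a draft
    length k; the idea is to lean on the adaptive speculator (and
    speculate deeper) once it has learned the live traffic pattern.
    """
    # e.g. mean top-1 probability of the adaptive model on recent context
    adaptive_conf = adaptive_spec.confidence(context)

    if adaptive_conf >= confidence_threshold:
        # Adaptive speculator has specialized to this workload:
        # use it and propose a longer draft per verify pass.
        return adaptive_spec, 8

    # Otherwise fall back to the broad-coverage static speculator
    # with a conservative draft length.
    return static_spec, 4
```

Routing on confidence captures the trade-off the post describes: the static speculator is a safe default for arbitrary traffic, while the adaptive one pays off once it has seen enough of a narrow distribution to draft long, high-acceptance spans.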