4x faster LLM inference (Flash Attention guy's company)
- #Inference Optimization
- #AI
- #Machine Learning
- ATLAS introduces an adaptive-learning speculator system for LLM inference, offering up to 4x faster performance.
- Unlike static speculators, ATLAS dynamically improves at runtime, learning from historical and live traffic patterns.
- ATLAS reaches up to 500 tokens/sec (TPS) on DeepSeek-V3.1 and 460 TPS on Kimi-K2, outperforming both standard decoding and specialized inference hardware like Groq.
- Speculative decoding accelerates inference by having a cheap draft model propose several tokens ahead, which the target model then verifies in a single parallel pass (see the first sketch after this list).
- ATLAS pairs a heavyweight static speculator (broad coverage of common workloads) with a lightweight adaptive speculator that learns from live traffic, and a confidence-aware controller chooses between them at runtime (second sketch below).
- The system is particularly effective in RL training, where it adapts to evolving policies, reducing rollout times.
- ATLAS is part of Together Turbo’s optimization suite, integrating with other techniques like quantization and TurboBoost-TTFT for end-to-end acceleration.
- The adaptive speculator shines on narrow input distributions, where it reaches the peak efficiency behind the DeepSeek-V3.1 number above.
- Together AI is hiring research scientists and engineers to advance efficient AI deployment.
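
For readers new to speculative decoding, here is a minimal sketch of one decode step. `target_model`, `draft_model`, and the greedy accept-on-match rule are illustrative assumptions, not Together's implementation; production systems typically use rejection sampling so the output distribution exactly matches the target model's.

```python
import torch

def speculative_decode_step(target_model, draft_model, prefix, k=4):
    """One speculative decoding step (illustrative sketch).

    The draft model proposes k tokens greedily; the target model scores
    the whole proposed span in a single forward pass and accepts the
    longest prefix on which the two models agree. Both models are
    assumed to map a [batch, seq] token tensor to per-position logits.
    """
    # 1) Draft: propose k tokens autoregressively with the cheap model.
    proposed = []
    ctx = prefix.clone()
    for _ in range(k):
        logits = draft_model(ctx.unsqueeze(0))[0, -1]   # next-token logits
        tok = torch.argmax(logits)
        proposed.append(tok)
        ctx = torch.cat([ctx, tok.view(1)])

    # 2) Verify: one target-model pass over prefix + proposals yields the
    #    target's prediction at every proposed position in parallel.
    logits = target_model(ctx.unsqueeze(0))[0]
    n = prefix.shape[0]
    accepted = []
    for i, tok in enumerate(proposed):
        target_tok = torch.argmax(logits[n - 1 + i])
        if target_tok == tok:
            accepted.append(tok)          # draft matched target: keep it
        else:
            accepted.append(target_tok)   # mismatch: take target's token, stop
            break
    else:
        # All k drafts accepted: the verify pass gives one bonus token free.
        accepted.append(torch.argmax(logits[-1]))

    return torch.stack(accepted)
```

The speedup comes from the verify step: the expensive target model runs once over k proposed tokens instead of k times, and whenever the cheap draft model agrees with it, several tokens are emitted per target-model pass.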
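
The confidence-aware controller is only described at a high level in the post; the sketch below shows one plausible routing policy. The threshold, the `confidence()` score, and the draft lengths are all hypothetical, not Together's published algorithm.

```python
def choose_speculator(static_spec, adaptive_spec, context,
                      confidence_threshold=0.8):
    """Hypothetical confidence-aware routing between two speculators.

    Returns the speculator to use for the next step plus a draft
    length k; the idea is to lean on the adaptive speculator (and
    speculate deeper) once it has learned the live traffic pattern.
    """
    # e.g. mean top-1 probability of the adaptive model on recent context
    adaptive_conf = adaptive_spec.confidence(context)

    if adaptive_conf >= confidence_threshold:
        # Adaptive speculator has specialized to this workload:
        # use it and propose a longer draft per verify pass.
        return adaptive_spec, 8

    # Otherwise fall back to the broad-coverage static speculator
    # with a conservative draft length.
    return static_spec, 4
```

Routing on confidence captures the trade-off the post describes: the static speculator is a safe default for arbitrary traffic, while the adaptive one pays off once it has seen enough of a narrow distribution to draft long, high-acceptance spans.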