Hasty Briefs

4x faster LLM inference (Flash Attention guy's company)

  • #Inference Optimization
  • #AI
  • #Machine Learning
  • ATLAS introduces an adaptive-learning speculator system for LLM inference, offering up to 4x faster performance.
  • Unlike static speculators, ATLAS keeps improving at runtime, learning from historical and live traffic patterns (an illustrative adaptive drafter is sketched after this list).
  • ATLAS achieves up to 500 tokens per second (TPS) on DeepSeek-V3.1 and 460 TPS on Kimi-K2, outperforming standard decoding and specialized hardware like Groq.
  • Speculative decoding accelerates inference by having a cheap draft model propose several tokens ahead, which the target model then verifies in a single parallel pass (a minimal sketch follows this list).
  • ATLAS combines a heavyweight static speculator with a lightweight adaptive speculator, plus a confidence-aware controller that picks between them at runtime (one possible routing policy is sketched after this list).
  • The system is particularly effective in RL training, where it adapts to evolving policies, reducing rollout times.
  • ATLAS is part of Together Turbo’s optimization suite, integrating with other techniques like quantization and TurboBoost-TTFT for end-to-end acceleration.
  • The adaptive system excels on narrow input distributions, where repetitive traffic lets the speculator reach the peak figures quoted above.
  • Together AI is hiring research scientists and engineers to advance efficient AI deployment.
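
To make the speculative-decoding bullet concrete, here is a minimal greedy-verification sketch in Python. The hash-based `target_logits`/`draft_logits` functions are toy stand-ins, not real models or ATLAS's API; a real engine verifies all draft positions in one batched forward pass (which is where the speedup comes from) and may use probabilistic rather than greedy acceptance.

```python
import hashlib

VOCAB = 32  # toy vocabulary size (matches the 32-byte sha256 digest below)

def target_logits(tokens: list[int]) -> list[int]:
    """Stand-in for the big target model's forward pass: deterministic
    pseudo-logits derived from the recent context."""
    digest = hashlib.sha256(str(tokens[-4:]).encode()).digest()
    return list(digest)  # 32 bytes -> 32 "logits"

def draft_logits(tokens: list[int]) -> list[int]:
    """Cheap draft model: a lightly perturbed copy of the target, so its
    greedy choices usually (but not always) match the target's."""
    return [l + (i % 3) for i, l in enumerate(target_logits(tokens))]

def argmax(xs: list[int]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)

def speculative_step(tokens: list[int], k: int = 4) -> list[int]:
    """One greedy speculative-decoding step: the draft proposes k tokens,
    the target verifies them, and the matched prefix is accepted for free."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(tokens)
    for _ in range(k):
        tok = argmax(draft_logits(ctx))
        proposed.append(tok)
        ctx.append(tok)
    # 2) The target scores every position (one batched pass in a real engine;
    #    a loop here for clarity).
    accepted, ctx = [], list(tokens)
    for tok in proposed:
        target_tok = argmax(target_logits(ctx))
        if target_tok != tok:
            accepted.append(target_tok)  # first mismatch: take target's token
            return accepted
        accepted.append(tok)  # agreement: a token gained without a serial pass
        ctx.append(tok)
    # 3) All k drafts accepted: the same target pass yields one bonus token.
    accepted.append(argmax(target_logits(ctx)))
    return accepted

# Generate ~20 tokens; each step may emit several tokens at once.
tokens = [0]
while len(tokens) < 20:
    tokens += speculative_step(tokens)
print(tokens)
```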
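The post does not publish ATLAS's speculator internals, so as a loose illustration of "learning from live traffic," here is an assumed adaptive drafter that builds n-gram statistics from tokens the target model has already verified. `AdaptiveNgramSpeculator` and its methods are hypothetical names for this sketch, not ATLAS components.

```python
from collections import Counter, defaultdict

class AdaptiveNgramSpeculator:
    """Toy adaptive drafter: learns n-gram continuation statistics from
    verified traffic at runtime (assumed mechanism, for illustration only)."""

    def __init__(self, n: int = 3):
        self.n = n
        self.table = defaultdict(Counter)  # context n-gram -> next-token counts

    def observe(self, tokens: list[int]) -> None:
        """Update counts from a sequence the target model has verified."""
        for i in range(len(tokens) - self.n):
            ctx = tuple(tokens[i : i + self.n])
            self.table[ctx][tokens[i + self.n]] += 1

    def propose(self, tokens: list[int], k: int = 4) -> list[int]:
        """Draft up to k tokens; stop early when the context is unseen."""
        out, ctx = [], list(tokens)
        for _ in range(k):
            counts = self.table.get(tuple(ctx[-self.n:]))
            if not counts:
                break  # no evidence yet: propose nothing further
            tok = counts.most_common(1)[0][0]
            out.append(tok)
            ctx.append(tok)
        return out

    def confidence(self, tokens: list[int]) -> float:
        """Share of observed mass on the top continuation (0.0 if unseen)."""
        counts = self.table.get(tuple(tokens[-self.n:]))
        if not counts:
            return 0.0
        return counts.most_common(1)[0][1] / sum(counts.values())

# The more repetitive the workload, the better the drafts get: this is the
# narrow-input-distribution effect noted in the bullets above.
spec = AdaptiveNgramSpeculator(n=2)
spec.observe([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
print(spec.propose([9, 1, 2], k=3))  # -> [3, 4, 1]
print(spec.confidence([9, 1, 2]))    # -> 1.0
```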
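Likewise, the confidence-aware controller is only named in the post; the routing policy below is an assumption about what such a controller could look like: prefer the lightweight adaptive drafter when it is confident about the current context (and speculate deeper, since more of its drafts will be accepted), otherwise fall back to the heavyweight static speculator. All function names and the threshold are illustrative.

```python
import random

def static_propose(tokens: list[int], k: int) -> list[int]:
    """Stub for the heavyweight static speculator (in practice, a trained
    draft model with broad coverage); placeholder tokens here."""
    rng = random.Random(tokens[-1] if tokens else 0)
    return [rng.randrange(32) for _ in range(k)]

def adaptive_confidence(tokens: list[int]) -> float:
    """Stub standing in for how sure the lightweight adaptive speculator is
    about the current context (e.g. the n-gram confidence sketched above)."""
    return 0.9 if len(tokens) > 16 else 0.2

def adaptive_propose(tokens: list[int], k: int) -> list[int]:
    return [tokens[-1]] * k  # naive placeholder drafter

def controller_propose(tokens: list[int], threshold: float = 0.8) -> list[int]:
    # Assumed routing policy: route learned contexts to the cheap adaptive
    # drafter with a deeper draft budget; cover everything else with the
    # static drafter.
    if adaptive_confidence(tokens) >= threshold:
        return adaptive_propose(tokens, k=8)
    return static_propose(tokens, k=4)

print(len(controller_propose(list(range(20)))))  # learned context -> 8 drafts
print(len(controller_propose([1, 2, 3])))        # cold context    -> 4 drafts
```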