Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

a year ago
  • #inference-engine
  • #LLM
  • #high-throughput
  • Tokasaurus is a new LLM inference engine optimized for high-throughput workloads.
  • It excels with small models by minimizing CPU overhead and using dynamic Hydragen grouping for shared prefixes (the attention decomposition is sketched after this list).
  • For larger models, Tokasaurus supports async tensor parallelism on GPUs with NVLink and pipeline parallelism on those without (a pipeline-schedule sketch follows the list).
  • Tokasaurus can outperform vLLM and SGLang by up to 3x in throughput-focused benchmarks.
  • Key optimizations include adaptive CPU-side management that keeps the GPU fed and dynamic identification of shared prefixes (see the manager sketch below).
  • Tokasaurus is available on GitHub and PyPI, supporting models from the Llama-3 and Qwen-2 families.
  • Benchmarks show significant throughput improvements, especially in shared-prefix scenarios.
  • Acknowledgements include Prime Intellect and Together AI for compute support.
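
On the shared-prefix point above: Hydragen-style grouping works by decomposing attention so the shared prefix's keys and values are read once for the whole group rather than once per sequence. The sketch below shows that decomposition in plain PyTorch; it is an illustration under assumed tensor shapes, not Tokasaurus's actual kernels.

```python
import torch

def attend(q, k, v):
    # Scaled dot-product attention that also returns the log-sum-exp of the
    # scores, so two partial results can be merged exactly afterwards.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)
    return torch.softmax(scores, dim=-1) @ v, lse

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # q:          (batch, q_len, dim)       queries for every sequence in the group
    # k/v_prefix: (prefix_len, dim)         one copy of the shared prefix's KV cache
    # k/v_suffix: (batch, suffix_len, dim)  per-sequence unique KV cache
    out_p, lse_p = attend(q, k_prefix, v_prefix)   # prefix KV read once for the group
    out_s, lse_s = attend(q, k_suffix, v_suffix)   # per-sequence suffixes
    # Merge the two partial attentions with weights derived from their
    # log-sum-exps; this equals attending over prefix + suffix jointly.
    w = torch.sigmoid(lse_p - lse_s)
    return w * out_p + (1.0 - w) * out_s
```

Because the prefix half is one dense matmul shared by the whole group, it runs at much higher arithmetic intensity than re-reading the same prefix KV cache separately for every sequence, which is where the shared-prefix throughput gains come from.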
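For the no-NVLink case, pipeline parallelism limits inter-GPU traffic to activations handed between adjacent stages, and microbatching keeps every stage busy. The following is a generic fill-and-drain (GPipe-style) schedule for illustration, not Tokasaurus's scheduler:

```python
def pipeline_schedule(num_stages: int, num_microbatches: int):
    # Yield, for each time step, the (stage, microbatch) pairs that run
    # concurrently under a simple fill-and-drain schedule.
    for step in range(num_stages + num_microbatches - 1):
        yield [(s, step - s) for s in range(num_stages)
               if 0 <= step - s < num_microbatches]

# With 4 stages and 8 microbatches the whole batch finishes in 11 steps,
# versus 32 if each microbatch had to drain before the next one entered.
for step, active in enumerate(pipeline_schedule(4, 8)):
    print(f"step {step}: stages running {active}")
```

The deeper the microbatch queue relative to the number of stages, the smaller the fraction of time lost to the fill and drain bubbles at either end, which is what makes this attractive for throughput-focused workloads.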
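"Minimizing CPU overhead" in practice means keeping Python-side bookkeeping off the GPU's critical path. Below is a hedged sketch of that general pattern, assuming hypothetical prepare_batch and run_forward callables; it shows the idea of an asynchronous CPU-side manager, not Tokasaurus's actual implementation.

```python
import queue
import threading

def serve(prepare_batch, run_forward, queue_depth: int = 4):
    # A manager thread builds batches ahead of time into a bounded queue;
    # the GPU loop only pops ready work, so scheduling, tokenization, and
    # stop-condition checks overlap with GPU compute instead of stalling it.
    ready: queue.Queue = queue.Queue(maxsize=queue_depth)

    def manager():
        while True:
            batch = prepare_batch()      # returns None when no work remains
            ready.put(batch)
            if batch is None:
                return

    threading.Thread(target=manager, daemon=True).start()

    while (batch := ready.get()) is not None:
        run_forward(batch)               # GPU runs forward passes back-to-back
```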