Tokasaurus: An LLM Inference Engine for High-Throughput Workloads
- #inference-engine
- #LLM
- #high-throughput
- Tokasaurus is a new LLM inference engine optimized for high-throughput workloads.
- It excels with small models by minimizing CPU overhead and using dynamic Hydragen grouping for shared prefixes.
- For larger models, Tokasaurus supports async tensor parallelism for GPUs with NVLink and pipeline parallelism for those without.
- Tokasaurus can outperform vLLM and SGLang by up to 3x in throughput-focused benchmarks.
- Key optimizations include adaptive CPU management and dynamic prefix identification.
- Tokasaurus is available on GitHub and PyPI, supporting models from the Llama-3 and Qwen-2 families.
- Benchmarks show significant throughput improvements, especially in shared-prefix scenarios.
- Acknowledgements include Prime Intellect and Together AI for compute support.
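The Hydragen grouping mentioned above exploits the fact that attention over a shared prefix can be computed once and combined exactly with per-sequence suffix attention. A minimal numpy sketch of this decomposition (the function names here are illustrative, not Tokasaurus's actual API): each block's attention output is tracked together with its log-sum-exp, and the two partial outputs are merged with softmax-consistent weights.

```python
import numpy as np

def attn_partial(q, K, V):
    """Attention of query q over one key/value block,
    returning the block output and its log-sum-exp of scores."""
    s = K @ q                       # raw attention scores
    m = s.max()
    e = np.exp(s - m)               # numerically stable softmax
    lse = m + np.log(e.sum())       # log-sum-exp for later merging
    o = (e / e.sum()) @ V
    return o, lse

def attn_prefix_suffix(q, K_prefix, V_prefix, K_suffix, V_suffix):
    """Exact attention over [prefix; suffix] computed blockwise.
    The prefix block can be shared across many sequences."""
    o1, l1 = attn_partial(q, K_prefix, V_prefix)
    o2, l2 = attn_partial(q, K_suffix, V_suffix)
    w1 = np.exp(l1 - np.logaddexp(l1, l2))  # prefix's softmax mass
    return w1 * o1 + (1.0 - w1) * o2
```

Because the merge is exact, a batch of requests sharing a prompt can run the prefix block once (as one big matmul) and only the short suffixes per-sequence, which is where the shared-prefix throughput gains come from.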