Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

a year ago
  • #inference-engine
  • #LLM
  • #high-throughput
  • Tokasaurus is a new LLM inference engine optimized for high-throughput workloads.
  • It excels with small models by minimizing CPU overhead and using dynamic Hydragen grouping for shared prefixes (the attention decomposition is sketched after this list).
  • For larger models, Tokasaurus supports async tensor parallelism on GPUs with NVLink and pipeline parallelism on those without (a pipeline-schedule sketch follows the list).
  • Tokasaurus can outperform vLLM and SGLang by up to 3x in throughput-focused benchmarks.
  • Key optimizations include adaptive CPU-side management that keeps the GPU fed and dynamic identification of shared prefixes (see the manager sketch below).
  • Tokasaurus is available on GitHub and PyPI, supporting models from the Llama-3 and Qwen-2 families.
  • Benchmarks show significant throughput improvements, especially in shared-prefix scenarios.
  • Acknowledgements include Prime Intellect and Together AI for compute support.
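
On the shared-prefix point above: Hydragen-style grouping works by decomposing attention so the shared prefix's keys and values are read once for the whole group rather than once per sequence. The sketch below shows that decomposition in plain PyTorch; it is an illustration under assumed tensor shapes, not Tokasaurus's actual kernels.

```python
import torch

def attend(q, k, v):
    # Scaled dot-product attention that also returns the log-sum-exp of the
    # scores, so two partial results can be merged exactly afterwards.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)
    return torch.softmax(scores, dim=-1) @ v, lse

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # q:          (batch, q_len, dim)       queries for every sequence in the group
    # k/v_prefix: (prefix_len, dim)         one copy of the shared prefix's KV cache
    # k/v_suffix: (batch, suffix_len, dim)  per-sequence unique KV cache
    out_p, lse_p = attend(q, k_prefix, v_prefix)   # prefix KV read once for the group
    out_s, lse_s = attend(q, k_suffix, v_suffix)   # per-sequence suffixes
    # Merge the two partial attentions with weights derived from their
    # log-sum-exps; this equals attending over prefix + suffix jointly.
    w = torch.sigmoid(lse_p - lse_s)
    return w * out_p + (1.0 - w) * out_s
```

Because the prefix half is one dense matmul shared by the whole group, it runs at much higher arithmetic intensity than re-reading the same prefix KV cache separately for every sequence, which is where the shared-prefix throughput gains come from.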
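For the no-NVLink case, pipeline parallelism limits inter-GPU traffic to activations handed between adjacent stages, and microbatching keeps every stage busy. The following is a generic fill-and-drain (GPipe-style) schedule for illustration, not Tokasaurus's scheduler:

```python
def pipeline_schedule(num_stages: int, num_microbatches: int):
    # Yield, for each time step, the (stage, microbatch) pairs that run
    # concurrently under a simple fill-and-drain schedule.
    for step in range(num_stages + num_microbatches - 1):
        yield [(s, step - s) for s in range(num_stages)
               if 0 <= step - s < num_microbatches]

# With 4 stages and 8 microbatches the whole batch finishes in 11 steps,
# versus 32 if each microbatch had to drain before the next one entered.
for step, active in enumerate(pipeline_schedule(4, 8)):
    print(f"step {step}: stages running {active}")
```

The deeper the microbatch queue relative to the number of stages, the smaller the fraction of time lost to the fill and drain bubbles at either end, which is what makes this attractive for throughput-focused workloads.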
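"Minimizing CPU overhead" in practice means keeping Python-side bookkeeping off the GPU's critical path. Below is a hedged sketch of that general pattern, assuming hypothetical prepare_batch and run_forward callables; it shows the idea of an asynchronous CPU-side manager, not Tokasaurus's actual implementation.

```python
import queue
import threading

def serve(prepare_batch, run_forward, queue_depth: int = 4):
    # A manager thread builds batches ahead of time into a bounded queue;
    # the GPU loop only pops ready work, so scheduling, tokenization, and
    # stop-condition checks overlap with GPU compute instead of stalling it.
    ready: queue.Queue = queue.Queue(maxsize=queue_depth)

    def manager():
        while True:
            batch = prepare_batch()      # returns None when no work remains
            ready.put(batch)
            if batch is None:
                return

    threading.Thread(target=manager, daemon=True).start()

    while (batch := ready.get()) is not None:
        run_forward(batch)               # GPU runs forward passes back-to-back
```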