Show HN: Tokenflood – simulate arbitrary loads on instruction-tuned LLMs
- #LLM
- #load-testing
- #performance
- Tokenflood is a load-testing tool for instruction-tuned LLMs that can simulate arbitrary load profiles without requiring specific prompt/response data.
- It allows defining prompt lengths, prefix lengths, output lengths, and request rates to simulate workloads.
- Tokenflood is built on litellm and supports every provider litellm covers (a request-level sketch follows this list).
- Caution: a misconfigured run against a pay-per-token service can get expensive quickly; estimate token usage up front and keep workloads within budget (see the cost sketch after this list).
- Common usage scenarios include load testing self-hosted LLMs, assessing hardware/quantization effects, and evaluating hosted LLM providers.
- Example graphs show latency impacts of changing prompt parameters (e.g., increasing prefix tokens or reducing output tokens).
- Load generation is heuristic: inputs are built from token sets rather than real prompt data, so no prompt corpus is needed (see the prompt-generation sketch after this list).
- Safety measures include token-usage estimation, budget limits, error-rate monitoring, and warm-up request checks (an illustrative guard loop follows this list).
- Installation uses pip, with vllm setup instructions and quick-start configs for initial runs.
- Endpoint specifications define the target (provider, model, base_url, etc.) and work with any supported LLM provider.
- Run suites define test phases with request rates, load types, and token budgets for controlled testing (an illustrative config sketch for both follows this list).
- Contributions are welcome, with guidelines for forking, testing, and submitting pull requests.
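
As a rough illustration of the request-level knobs above, here is a minimal Python sketch of the kind of call a token-level load generator can issue through litellm. The filler-word padding and the specific model name are assumptions for illustration, not Tokenflood's actual implementation.

```python
# Minimal sketch (not Tokenflood's actual code) of the kind of request a
# token-level load generator can issue through litellm: the prompt is padded
# to a target length on top of a shared prefix, and the output is capped.
# Requires the provider's API key in the environment (OPENAI_API_KEY here).
import litellm

PREFIX = "You are a helpful assistant. " * 10   # shared prefix across requests
FILLER = "lorem "                               # filler text used to pad prompts

def build_prompt(prompt_tokens: int, prefix: str = PREFIX) -> str:
    """Pad the prompt with filler words up to roughly the target token count."""
    # Assumes ~1 token per word; a real tool would count tokens properly.
    padding = FILLER * max(prompt_tokens - len(prefix.split()), 0)
    return prefix + padding

response = litellm.completion(
    model="openai/gpt-4o-mini",   # any provider/model string litellm accepts
    messages=[{"role": "user", "content": build_prompt(prompt_tokens=500)}],
    max_tokens=64,                # caps the output length
)
print(response.usage)             # actual prompt/output token counts reported back
```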
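
The cost warning is easy to make concrete with back-of-the-envelope arithmetic before a pay-per-token run; the prices and workload figures below are placeholders, so substitute your provider's actual rates.

```python
# Rough pre-run cost estimate for a single load phase; all numbers are placeholders.
requests_per_second = 5
duration_s          = 600        # a 10-minute phase
prompt_tokens       = 500        # per request
output_tokens       = 64         # per request

input_price_per_1k  = 0.00015    # USD per 1K input tokens (placeholder)
output_price_per_1k = 0.0006     # USD per 1K output tokens (placeholder)

total_requests = requests_per_second * duration_s
estimated_cost = total_requests * (
    prompt_tokens / 1000 * input_price_per_1k
    + output_tokens / 1000 * output_price_per_1k
)
print(f"{total_requests} requests, estimated cost ~ ${estimated_cost:.2f}")
```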
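
For heuristic input generation, one plausible approach (not necessarily what Tokenflood does internally) is to sample words from a small pool until the encoded prompt reaches the target token count; tiktoken is used here purely for counting.

```python
# Sketch of heuristic prompt generation from a token pool; the pool, the
# tokenizer, and the sampling strategy are all assumptions for illustration.
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_POOL = ["data", "model", "latency", "cloud", "vector", "query", "index"]

def synthetic_prompt(target_tokens: int) -> str:
    """Grow a prompt word by word until it encodes to the target token count."""
    words: list[str] = []
    # Re-encoding on every iteration is slow but keeps the sketch simple.
    while len(enc.encode(" ".join(words))) < target_tokens:
        words.append(random.choice(TOKEN_POOL))
    return " ".join(words)

print(len(enc.encode(synthetic_prompt(200))))   # roughly 200 tokens
```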
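
The listed safeguards could look roughly like the guard loop below; `send_request` is a hypothetical callable and the thresholds are invented, so read this as a sketch of the idea rather than the project's actual safety logic.

```python
# Illustrative guard loop: a few warm-up calls, then abort if the observed
# error rate exceeds a limit; sleeping holds the target request rate crudely.
import time
from typing import Callable

MAX_ERROR_RATE  = 0.2   # abort once more than 20% of requests have failed
WARMUP_REQUESTS = 3     # ignore the error-rate check for the first few calls

def run_phase(send_request: Callable[[], None], n_requests: int, rate_per_s: float) -> None:
    errors = 0
    for i in range(n_requests):
        start = time.monotonic()
        try:
            send_request()
        except Exception:
            errors += 1
        # Stop early if the endpoint is clearly unhealthy.
        if i >= WARMUP_REQUESTS and errors / (i + 1) > MAX_ERROR_RATE:
            raise RuntimeError(f"error rate {errors / (i + 1):.0%} exceeded limit, aborting")
        # Sleep away whatever is left of this request's time slot.
        time.sleep(max(0.0, 1.0 / rate_per_s - (time.monotonic() - start)))
```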
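
Finally, a purely illustrative shape for the endpoint specification and run suite mentioned above; only provider/model/base_url and the run-suite concepts (phases, request rates, load types, token budgets) come from the summary, every key name is a guess, and the project's shipped example configs define the real schema.

```python
# Hypothetical endpoint spec and run suite, written as Python dicts for brevity;
# key names are invented and will not match the project's actual config format.
endpoint = {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "base_url": "http://localhost:8000/v1",   # e.g. a local vllm server
}

run_suite = {
    "phases": [
        {"requests_per_second": 1, "duration_s": 60},    # gentle warm-up phase
        {"requests_per_second": 5, "duration_s": 300},   # main load phase
    ],
    "load": {"prompt_tokens": 500, "prefix_tokens": 200, "output_tokens": 64},
    "token_budget": 2_000_000,   # hard cap on tokens spent across the suite
}
```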