Show HN: Tokenflood – simulate arbitrary loads on instruction-tuned LLMs
- #LLM
- #load-testing
- #performance
- Tokenflood is a load-testing tool for instruction-tuned LLMs that can simulate arbitrary load profiles without requiring specific prompt/response data.
- It allows defining prompt lengths, prefix lengths, output lengths, and request rates to simulate workloads.
- Tokenflood is built on litellm and supports every provider litellm covers (a request-level sketch follows this list).
- Caution: a misconfigured run against a pay-per-token service can get expensive quickly; estimate token usage up front and keep workloads within budget (see the cost sketch after this list).
- Common usage scenarios include load testing self-hosted LLMs, assessing hardware/quantization effects, and evaluating hosted LLM providers.
- Example graphs show latency impacts of changing prompt parameters (e.g., increasing prefix tokens or reducing output tokens).
- Load generation is heuristic: inputs are built from token sets rather than real prompt data, so no prompt corpus is needed (see the prompt-generation sketch after this list).
- Safety measures include token-usage estimation, budget limits, error-rate monitoring, and warm-up request checks (an illustrative guard loop follows this list).
- Installation uses pip, with vllm setup instructions and quick-start configs for initial runs.
- Endpoint specifications define the target (provider, model, base_url, etc.) and work with any supported LLM provider.
- Run suites define test phases with request rates, load types, and token budgets for controlled testing (an illustrative config sketch for both follows this list).
- Contributions are welcome, with guidelines for forking, testing, and submitting pull requests.
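
As a rough illustration of the request-level knobs above, here is a minimal Python sketch of the kind of call a token-level load generator can issue through litellm. The filler-word padding and the specific model name are assumptions for illustration, not Tokenflood's actual implementation.

```python
# Minimal sketch (not Tokenflood's actual code) of the kind of request a
# token-level load generator can issue through litellm: the prompt is padded
# to a target length on top of a shared prefix, and the output is capped.
# Requires the provider's API key in the environment (OPENAI_API_KEY here).
import litellm

PREFIX = "You are a helpful assistant. " * 10   # shared prefix across requests
FILLER = "lorem "                               # filler text used to pad prompts

def build_prompt(prompt_tokens: int, prefix: str = PREFIX) -> str:
    """Pad the prompt with filler words up to roughly the target token count."""
    # Assumes ~1 token per word; a real tool would count tokens properly.
    padding = FILLER * max(prompt_tokens - len(prefix.split()), 0)
    return prefix + padding

response = litellm.completion(
    model="openai/gpt-4o-mini",   # any provider/model string litellm accepts
    messages=[{"role": "user", "content": build_prompt(prompt_tokens=500)}],
    max_tokens=64,                # caps the output length
)
print(response.usage)             # actual prompt/output token counts reported back
```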
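
The cost warning is easy to make concrete with back-of-the-envelope arithmetic before a pay-per-token run; the prices and workload figures below are placeholders, so substitute your provider's actual rates.

```python
# Rough pre-run cost estimate for a single load phase; all numbers are placeholders.
requests_per_second = 5
duration_s          = 600        # a 10-minute phase
prompt_tokens       = 500        # per request
output_tokens       = 64         # per request

input_price_per_1k  = 0.00015    # USD per 1K input tokens (placeholder)
output_price_per_1k = 0.0006     # USD per 1K output tokens (placeholder)

total_requests = requests_per_second * duration_s
estimated_cost = total_requests * (
    prompt_tokens / 1000 * input_price_per_1k
    + output_tokens / 1000 * output_price_per_1k
)
print(f"{total_requests} requests, estimated cost ~ ${estimated_cost:.2f}")
```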
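
For heuristic input generation, one plausible approach (not necessarily what Tokenflood does internally) is to sample words from a small pool until the encoded prompt reaches the target token count; tiktoken is used here purely for counting.

```python
# Sketch of heuristic prompt generation from a token pool; the pool, the
# tokenizer, and the sampling strategy are all assumptions for illustration.
import random
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TOKEN_POOL = ["data", "model", "latency", "cloud", "vector", "query", "index"]

def synthetic_prompt(target_tokens: int) -> str:
    """Grow a prompt word by word until it encodes to the target token count."""
    words: list[str] = []
    # Re-encoding on every iteration is slow but keeps the sketch simple.
    while len(enc.encode(" ".join(words))) < target_tokens:
        words.append(random.choice(TOKEN_POOL))
    return " ".join(words)

print(len(enc.encode(synthetic_prompt(200))))   # roughly 200 tokens
```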
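
The listed safeguards could look roughly like the guard loop below; `send_request` is a hypothetical callable and the thresholds are invented, so read this as a sketch of the idea rather than the project's actual safety logic.

```python
# Illustrative guard loop: a few warm-up calls, then abort if the observed
# error rate exceeds a limit; sleeping holds the target request rate crudely.
import time
from typing import Callable

MAX_ERROR_RATE  = 0.2   # abort once more than 20% of requests have failed
WARMUP_REQUESTS = 3     # ignore the error-rate check for the first few calls

def run_phase(send_request: Callable[[], None], n_requests: int, rate_per_s: float) -> None:
    errors = 0
    for i in range(n_requests):
        start = time.monotonic()
        try:
            send_request()
        except Exception:
            errors += 1
        # Stop early if the endpoint is clearly unhealthy.
        if i >= WARMUP_REQUESTS and errors / (i + 1) > MAX_ERROR_RATE:
            raise RuntimeError(f"error rate {errors / (i + 1):.0%} exceeded limit, aborting")
        # Sleep away whatever is left of this request's time slot.
        time.sleep(max(0.0, 1.0 / rate_per_s - (time.monotonic() - start)))
```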
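
Finally, a purely illustrative shape for the endpoint specification and run suite mentioned above; only provider/model/base_url and the run-suite concepts (phases, request rates, load types, token budgets) come from the summary, every key name is a guess, and the project's shipped example configs define the real schema.

```python
# Hypothetical endpoint spec and run suite, written as Python dicts for brevity;
# key names are invented and will not match the project's actual config format.
endpoint = {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "base_url": "http://localhost:8000/v1",   # e.g. a local vllm server
}

run_suite = {
    "phases": [
        {"requests_per_second": 1, "duration_s": 60},    # gentle warm-up phase
        {"requests_per_second": 5, "duration_s": 300},   # main load phase
    ],
    "load": {"prompt_tokens": 500, "prefix_tokens": 200, "output_tokens": 64},
    "token_budget": 2_000_000,   # hard cap on tokens spent across the suite
}
```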