Hasty Briefs (beta)

Show HN: Tokenflood – simulate arbitrary loads on instruction-tuned LLMs

11 days ago
  • #LLM
  • #load-testing
  • #performance
  • Tokenflood is a load-testing tool for instruction-tuned LLMs that can simulate arbitrary load profiles without requiring specific prompt/response data.
  • It lets you define prompt lengths, prefix lengths, output lengths, and request rates to simulate a workload (see the configuration sketch after this list).
  • Tokenflood uses litellm and supports all providers covered by litellm.
  • Caution: a misconfigured run against a pay-per-token service can incur high costs; make sure the configured workload stays within budget.
  • Common usage scenarios include load testing self-hosted LLMs, assessing hardware/quantization effects, and evaluating hosted LLM providers.
  • Example graphs show latency impacts of changing prompt parameters (e.g., increasing prefix tokens or reducing output tokens).
  • Tokenflood provides heuristic load testing without needing real prompt data, generating inputs from token sets (see the prompt-generation sketch after this list).
  • Safety measures include token usage estimation, budget limits, error-rate monitoring, and warm-up request checks (a back-of-the-envelope cost estimate is sketched after this list).
  • Installation is via pip, plus a vLLM setup and quick-start configs for initial runs.
  • Endpoint specifications define the target (provider, model, base_url, etc.), supporting various LLM providers.
  • Run suites define test phases with request rates, load types, and token budgets for controlled testing (a rate-driven request loop is sketched after this list).
  • Contributions are welcome, with guidelines for forking, testing, and submitting pull requests.
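The configuration sketch referenced above is written as plain Python dicts purely to illustrate the kind of parameters an endpoint specification, a load profile, and a run suite bring together. The field names (provider, model, base_url, prompt_tokens, prefix_tokens, output_tokens, requests_per_second, and so on) are assumptions for illustration, not Tokenflood's actual schema.

```python
# Hypothetical illustration only: these field names are assumptions,
# not Tokenflood's actual configuration schema.

endpoint = {
    "provider": "openai",                     # any provider supported by litellm
    "model": "gpt-4o-mini",
    "base_url": "http://localhost:8000/v1",   # e.g. a self-hosted vLLM server
}

load_profile = {
    "prompt_tokens": 1500,      # total prompt length per request
    "prefix_tokens": 1000,      # shared (cacheable) prefix of the prompt
    "output_tokens": 200,       # requested completion length
    "requests_per_second": 4,   # steady-state request rate
}

run_suite = [
    {"phase": "warm-up", "duration_s": 30, **load_profile, "requests_per_second": 1},
    {"phase": "steady", "duration_s": 300, **load_profile},
]
```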
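To make the heuristic-load-testing idea concrete, here is a minimal sketch of generating fixed-length prompts from a small token set and sending them through litellm. It is an independent illustration of the approach, not Tokenflood's implementation, and it treats one short word as roughly one token for simplicity.

```python
import random

import litellm

# A small token set used to pad prompts to a target length; the words themselves
# do not matter for load testing, only the resulting token counts do.
# Simplifying assumption for this sketch: one short word ~ one token.
TOKEN_SET = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def build_prompt(prefix_tokens: int, prompt_tokens: int) -> str:
    """Build a prompt of roughly prompt_tokens words, of which the first
    prefix_tokens are identical across requests (and thus cache-friendly)."""
    prefix = ["prefix"] * prefix_tokens
    tail = random.choices(TOKEN_SET, k=max(0, prompt_tokens - prefix_tokens))
    return " ".join(prefix + tail)


def send_request(model: str, prefix_tokens: int, prompt_tokens: int, output_tokens: int):
    """Send one synthetic request; litellm routes it to whichever provider
    the model string refers to."""
    prompt = build_prompt(prefix_tokens, prompt_tokens)
    return litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=output_tokens,
    )
```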
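Because a misconfigured run against pay-per-token pricing can get expensive, a back-of-the-envelope check like the one below helps keep a planned run within budget. The prices are made up for the example; Tokenflood's own estimation and budget features may work differently.

```python
def estimate_phase(prompt_tokens: int, output_tokens: int, requests_per_second: float,
                   duration_s: float, usd_per_1m_input: float, usd_per_1m_output: float):
    """Rough estimate of token usage and cost for a single load phase."""
    requests = requests_per_second * duration_s
    input_tokens = requests * prompt_tokens
    completion_tokens = requests * output_tokens
    cost = (input_tokens * usd_per_1m_input
            + completion_tokens * usd_per_1m_output) / 1_000_000
    return input_tokens, completion_tokens, cost


# Example: 4 req/s for 5 minutes with 1,500-token prompts and 200-token outputs,
# at hypothetical prices of $0.15 / $0.60 per million input/output tokens.
tokens_in, tokens_out, usd = estimate_phase(1500, 200, 4, 300, 0.15, 0.60)
print(f"~{tokens_in:,.0f} input tokens, ~{tokens_out:,.0f} output tokens, ~${usd:.2f}")
```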
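Finally, a sketch of what driving one test phase at a fixed request rate while recording per-request latencies can look like, using litellm's async completion call. This is a simplified stand-in for what a run-suite phase does, not the tool's actual loop.

```python
import asyncio
import time

import litellm


async def run_phase(model: str, prompt: str, output_tokens: int,
                    requests_per_second: float, duration_s: float) -> list[float]:
    """Fire requests at a fixed rate and collect per-request latencies."""
    latencies: list[float] = []

    async def one_request() -> None:
        start = time.perf_counter()
        await litellm.acompletion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=output_tokens,
        )
        latencies.append(time.perf_counter() - start)

    tasks = []
    interval = 1.0 / requests_per_second
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        tasks.append(asyncio.create_task(one_request()))
        await asyncio.sleep(interval)  # keep the request rate roughly constant
    await asyncio.gather(*tasks)
    return latencies
```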