The noisy neighbor problem: serving LLMs

2 days ago
  • #GPU resource management
  • #multi-tenant scheduling
  • #fairness algorithms
  • The noisy-neighbor problem in multi-tenant LLM platforms arises when one tenant's burst of requests raises latency for every other tenant, because requests from all tenants are batched inefficiently out of a single global queue.
  • Cohere's solution is layered: a Rate Limiter for admission control, a Performance Tier for SLA-based prioritization, Deficit Round Robin (DRR) for fair scheduling across tenants, and a Priority selector for ordering within a tenant's queue.
  • DRR measures fairness with either a request-based or a token-based budget: the request-based budget charges every request equally, while the token-based budget charges by token count, which better reflects actual resource use.
  • Within a tenant's queue, the Priority selector orders requests by priority, deadline, and arrival time, so urgent requests go first and ordering stays predictable.
  • The integrated system shares GPU resources fairly and absorbs bursts while honoring commercial tiers and preventing overload, which improves inference latency and efficiency for all customers.
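
The two scheduling layers summarized above (DRR across tenants, priority ordering within a tenant) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Cohere's implementation: the names `Request`, `DRRScheduler`, `quantum`, and `next_batch` are hypothetical, lower `priority` values are assumed to be more urgent, and ties are assumed to be broken by deadline and then arrival order. The Rate Limiter and Performance Tier layers are left out.

```python
import heapq
from collections import OrderedDict
from dataclasses import dataclass, field
from itertools import count

_arrival = count()  # global arrival counter, used as the final tie-breaker


@dataclass(order=True)
class Request:
    # Comparison order is (priority, deadline, arrival); this ordering is an assumption.
    priority: int                      # lower value = more urgent (assumption)
    deadline: float
    arrival: int = field(default_factory=lambda: next(_arrival))
    tenant: str = field(default="", compare=False)
    tokens: int = field(default=1, compare=False)
    name: str = field(default="", compare=False)


class DRRScheduler:
    """Deficit Round Robin across tenants, priority heap within each tenant."""

    def __init__(self, quantum, token_based=True):
        self.quantum = quantum          # budget added to each backlogged tenant per round
        self.token_based = token_based  # True: charge by token count; False: 1 per request
        self.queues = OrderedDict()     # tenant -> heap of pending requests
        self.deficit = {}               # tenant -> unspent budget

    def cost(self, req):
        return req.tokens if self.token_based else 1

    def submit(self, req):
        heapq.heappush(self.queues.setdefault(req.tenant, []), req)
        self.deficit.setdefault(req.tenant, 0)

    def next_batch(self):
        """Run one DRR round, visiting each backlogged tenant once."""
        batch = []
        for tenant in list(self.queues):
            heap = self.queues[tenant]
            self.deficit[tenant] += self.quantum
            # Serve the tenant's most urgent requests while the budget covers them.
            while heap and self.cost(heap[0]) <= self.deficit[tenant]:
                req = heapq.heappop(heap)
                self.deficit[tenant] -= self.cost(req)
                batch.append(req)
            if not heap:
                # An idle tenant does not bank budget for a later burst.
                del self.queues[tenant]
                self.deficit[tenant] = 0
        return batch
```

With `quantum=100` and token-based charging, a tenant that submits five 100-token requests ahead of another tenant's single 100-token request still gets only one request into the first round, so the quieter tenant is served in that same round instead of waiting behind the burst. Setting `token_based=False` with `quantum=1` gives the request-based variant, where each backlogged tenant gets one request per round regardless of size.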