The noisy neighbor problem: serving LLMs
2 days ago
- #GPU resource management
- #multi-tenant scheduling
- #fairness algorithms
- The noisy-neighbor problem in multi-tenant LLM platforms arises when one tenant's burst of requests raises latency for other tenants, because requests from all tenants are batched inefficiently out of a single global queue.
- Cohere's solution layers four components: a Rate Limiter for admission control, a Performance Tier for SLA-based prioritization, Deficit Round Robin (DRR) for fair scheduling across tenants, and a Priority selector for ordering within each tenant's queue.
- DRR measures fairness with either request-based or token-based budgets: request-based budgeting charges every request the same cost, while token-based budgeting charges by token count, which better reflects actual resource use.
- The Priority selector orders each tenant's queue by priority, deadline, and arrival time, so urgent requests go first and ordering stays predictable.
- Together, these layers provide fair, burst-proof sharing of GPU resources while honoring commercial tiers and preventing overload, which improves inference latency and efficiency for all customers.
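
To make the DRR bullet concrete, here is a minimal sketch of Deficit Round Robin with a pluggable cost function. It is not Cohere's implementation: the names `Request`, `TenantQueue`, `request_cost`, `token_cost`, and `drr_schedule`, the `tokens` field, and the quantum values are all hypothetical, chosen only to show how request-based and token-based budgets can produce different dispatch orders.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    tenant: str
    tokens: int  # hypothetical estimate of tokens the request will consume


@dataclass
class TenantQueue:
    pending: deque = field(default_factory=deque)
    deficit: int = 0  # unspent budget carried over from earlier rounds


def request_cost(req: Request) -> int:
    """Request-based budgeting: every request costs the same."""
    return 1


def token_cost(req: Request) -> int:
    """Token-based budgeting: cost scales with token count."""
    return req.tokens


def drr_schedule(queues: dict, quantum: int, cost) -> list:
    """Drain all tenant queues with DRR and return the dispatch order."""
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    order = []
    while any(q.pending for q in queues.values()):
        for q in queues.values():
            if not q.pending:
                q.deficit = 0  # idle tenants do not bank credit
                continue
            q.deficit += quantum  # each visit grants one quantum of budget
            # Dispatch head-of-line requests while the budget covers them.
            while q.pending and cost(q.pending[0]) <= q.deficit:
                req = q.pending.popleft()
                q.deficit -= cost(req)
                order.append(req)
            if not q.pending:
                q.deficit = 0
    return order


def example_queues() -> dict:
    # Tenant "a" sends two large requests; tenant "b" sends three small ones.
    return {
        "a": TenantQueue(deque([Request("a", 300), Request("a", 300)])),
        "b": TenantQueue(deque([Request("b", 100) for _ in range(3)])),
    }


by_request = drr_schedule(example_queues(), quantum=1, cost=request_cost)
by_token = drr_schedule(example_queues(), quantum=100, cost=token_cost)
print([r.tenant for r in by_request])  # ['a', 'b', 'a', 'b', 'b']
print([r.tenant for r in by_token])    # ['b', 'b', 'a', 'b', 'a']
```

With request-based costs the two tenants simply alternate, even though each of tenant "a"'s requests is three times larger. With token-based costs, tenant "a" has to accumulate three quanta before a 300-token request is dispatched, so tenant "b"'s small requests are served first.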
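
The within-tenant ordering from the Priority selector bullet can be sketched with a heap keyed on (priority, deadline, arrival time). The summary lists the three criteria but not how they combine, so the lexicographic tie-breaking here is an assumption, as is the convention that a lower `priority` number means more urgent. `QueuedRequest` and `PriorityTenantQueue` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    # Compared field by field in this order (assumed lexicographic ordering).
    priority: int     # assumption: lower number = more urgent
    deadline: float   # earlier deadline wins among equal priorities
    arrival: float    # earlier arrival wins among equal deadlines
    request_id: str = field(compare=False)  # excluded from ordering


class PriorityTenantQueue:
    """One tenant's queue, always popping the most urgent request first."""

    def __init__(self) -> None:
        self._heap: list = []

    def push(self, req: QueuedRequest) -> None:
        heapq.heappush(self._heap, req)

    def pop(self) -> QueuedRequest:
        return heapq.heappop(self._heap)

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityTenantQueue()
queue.push(QueuedRequest(1, 50.0, 3.0, "low-priority"))
queue.push(QueuedRequest(0, 90.0, 2.0, "late-deadline"))
queue.push(QueuedRequest(0, 60.0, 4.0, "early-deadline"))

popped = [queue.pop().request_id for _ in range(len(queue))]
print(popped)  # ['early-deadline', 'late-deadline', 'low-priority']
```

In a layered design like the one described, a scheduler such as DRR would choose which tenant to serve next, and a structure like this would choose which of that tenant's requests to dispatch.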