Client-Side Load Balancing at a Million Requests per Second
5 hours ago
- #distributed systems
- #performance optimization
- #client-side load balancing
- Built an in-process client-side load balancer that handles a million requests per second, so internal fan-out traffic no longer passes through the shared edge ingress. This removed a source of latency uncertainty and reduced costs.
- Achieved hash parity with Skipper by implementing the same xxHash64 virtual-node ring to maintain cache locality, ensuring identical routing during migration and preventing cache fragmentation.
- Used a Kubernetes watch-based informer for endpoint discovery instead of polling, which avoids overloading the control plane, and added a debounce mechanism so that the burst of endpoint updates during a scale event is handled smoothly.
- Rolled out safely with runtime toggles and percentage-based traffic ramping. Latency dropped immediately, and Skipper's fleet was scaled down from over 50 pods to 8, saving over $1,000 per day.
- Implemented N-ring fade-in with independent fading windows for scale-up events, so that new pods take traffic gradually while their caches warm up, preventing cold-cache spikes to DynamoDB.
- Replaced in-flight request count with occupancy (seconds of work per second) as the load signal for bounded load and combined it with latency weighting, which better identifies genuinely busy pods and redistributes load away from them.
- Added a capped walk of up to 10 hops in bounded load to prevent ring-wide stampedes during transient network issues, maintaining cache locality and system stability.
- Hardened the fan-out path with fast retries, a FIFO buffer, and enhanced logging, making it resilient to node-level network freezes and reducing incident frequency.
- Paused availability-zone-aware routing because of edge cases and complexity. It trades cache locality against cost savings, and the plan is to resume it with safeguards once the team is ready.
- Emphasized the importance of owning routing telemetry for visibility, maintaining a fast deployment pipeline for safety, and considering client-side load balancing only for extreme edge cases because of its ongoing ownership costs.
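
To make the ring mechanics in the summary concrete, here is a minimal Python sketch of a virtual-node hash ring with an occupancy-based bounded-load check and a walk capped at 10 hops. It is an illustration under stated assumptions, not the production implementation: `hash64` uses BLAKE2b from the standard library as a stand-in for xxHash64, the load limit of 1.25 times the mean occupancy is an assumed factor, a "hop" is counted as one distinct pod past the key's home pod, and falling back to the home pod when the walk runs out is an assumed policy. The names `Ring`, `walk`, `pick`, and `MAX_HOPS` are hypothetical, and latency weighting and N-ring fade-in are left out.

```python
import bisect
import hashlib

MAX_HOPS = 10  # cap on the bounded-load walk, as in the summary above


def hash64(key: str) -> int:
    """64-bit hash. Stand-in only: the post uses xxHash64, not BLAKE2b."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, pods, vnodes=100):
        # Sorting by (hash, pod) makes the ring independent of input order,
        # so two clients with the same pod set route identically.
        points = sorted(
            (hash64(f"{pod}-{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._pods = [p for _, p in points]

    def walk(self, key):
        """Yield distinct pods clockwise, starting at the key's home pod."""
        n = len(self._pods)
        start = bisect.bisect_left(self._hashes, hash64(key))
        seen = set()
        for offset in range(n):
            pod = self._pods[(start + offset) % n]
            if pod not in seen:
                seen.add(pod)
                yield pod

    def pick(self, key, occupancy, factor=1.25):
        """Pick the first pod within MAX_HOPS whose occupancy is within the limit.

        occupancy maps pod -> seconds of work per second; missing pods count as 0.
        The limit (factor * mean occupancy) is an assumed bounded-load rule.
        """
        avg = sum(occupancy.values()) / len(occupancy) if occupancy else 0.0
        limit = factor * avg
        home = None
        for hops, pod in enumerate(self.walk(key)):
            if home is None:
                home = pod  # hop 0 is the key's home pod
            if hops > MAX_HOPS:
                break  # stop walking rather than stampede around the ring
            if occupancy.get(pod, 0.0) <= limit:
                return pod
        # Assumed fallback: keep cache locality by returning the home pod.
        return home
```

With no load information, `Ring(pods).pick("user-42", {})` returns the key's home pod. When the home pod's occupancy exceeds the limit, the request moves to the next pod on the ring, and if every pod within 10 hops is over the limit it stays on the home pod instead of walking further.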