Intelligent Kubernetes Load Balancing at Databricks
7 hours ago
- #gRPC
- #Load Balancing
- #Kubernetes
- Databricks runs its internal systems on Kubernetes, but default networking primitives such as ClusterIP services and kube-proxy have performance limitations.
- Built a client-side load balancing system to improve traffic distribution, reduce tail latencies, and enhance service-to-service communication resilience.
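- High-performance gRPC in Kubernetes is hard to balance: gRPC multiplexes requests over persistent HTTP/2 connections, so connection-level balancing by kube-proxy pins a client's requests to a single pod.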
- Default load balancing fails in high-throughput, latency-sensitive environments due to traffic skew, DNS dependency, and lack of per-request decisions.
- Developed a proxyless, client-driven load balancing system with a custom service discovery control plane for Layer 7 protocols.
- Key advantages include per-request decisions, reduced tail latency, efficient resource use, and minimal DNS dependency.
- Implemented strategies like Power of Two Choices (P2C), zone-aware routing, and least-loaded routing.
- Control plane watches the Kubernetes API for service changes and supplies clients with live endpoint metadata for intelligent routing.
- Deployment led to reduced tail latency, lower error rates, and better resource utilization.
- Challenges included operational complexity, client library adoption, and debugging.
- Evaluated alternatives like headless services and Istio but found them unsuitable for Databricks' scale and needs.
- Future enhancements include cross-cluster/region load balancing and advanced strategies for AI workloads.
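
To make the per-request routing idea concrete, here is a minimal sketch of Power of Two Choices (P2C) combined with a simple zone preference. It is an illustration under stated assumptions, not Databricks' implementation: the `Endpoint` and `P2CBalancer` names, the `in_flight` load signal, and the example addresses and zone labels are all hypothetical. A real client would receive the endpoint list from the service discovery control plane rather than hard-coding it.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical endpoint record, as a discovery control plane might supply it."""
    address: str
    zone: str
    in_flight: int = 0  # requests currently outstanding (assumed load signal)


class P2CBalancer:
    """Per-request picker: sample two endpoints, send to the less loaded one."""

    def __init__(self, endpoints, local_zone=None, rng=None):
        self.endpoints = list(endpoints)
        self.local_zone = local_zone
        self.rng = rng or random.Random()

    def update(self, endpoints):
        # Called when the endpoint set changes (e.g. pods added or removed).
        self.endpoints = list(endpoints)

    def _candidates(self):
        # Zone preference: use same-zone endpoints if any exist, else all of them.
        local = [e for e in self.endpoints if e.zone == self.local_zone]
        return local or self.endpoints

    def pick(self):
        pool = self._candidates()
        if not pool:
            raise RuntimeError("no endpoints available")
        if len(pool) == 1:
            return pool[0]
        # P2C: choose two distinct endpoints at random, keep the less loaded one.
        a, b = self.rng.sample(pool, 2)
        return a if a.in_flight <= b.in_flight else b


if __name__ == "__main__":
    endpoints = [
        Endpoint("10.0.0.1:50051", "zone-a", in_flight=5),
        Endpoint("10.0.0.2:50051", "zone-a", in_flight=1),
        Endpoint("10.0.0.3:50051", "zone-b", in_flight=0),
    ]
    balancer = P2CBalancer(endpoints, local_zone="zone-a")
    chosen = balancer.pick()
    chosen.in_flight += 1  # caller decrements when the request completes
    print(chosen.address)
```

Because the choice is made per request rather than per connection, long-lived HTTP/2 connections no longer determine where load lands. Comparing only two random candidates also keeps each decision cheap and avoids every client converging on the single least-loaded endpoint at once.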