Intelligent Kubernetes Load Balancing at Databricks

Databricks runs its internal systems on Kubernetes, but default networking primitives such as ClusterIP services and kube-proxy have performance limitations.
Databricks built a client-side load balancing system to improve traffic distribution, reduce tail latencies, and make service-to-service communication more resilient.
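High-performance gRPC communication in Kubernetes is hampered by persistent HTTP/2 connections: kube-proxy balances at the connection level, so every request multiplexed over a long-lived connection lands on the same pod.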
Default load balancing fails in high-throughput, latency-sensitive environments due to traffic skew, DNS dependency, and lack of per-request decisions.
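
The traffic skew described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a measurement from Databricks: the client counts, request volumes, and the function names `connection_level`, `per_request`, and `skew` are all hypothetical. It compares pinning each client to one backend (as a persistent connection behind a ClusterIP effectively does) with choosing a backend for every request.

```python
import random
from collections import Counter


def connection_level(num_clients, num_backends, requests_per_client, rng):
    """Each client picks one backend once and sends all its requests there,
    mimicking a long-lived HTTP/2 connection balanced at Layer 4."""
    load = Counter({b: 0 for b in range(num_backends)})
    for c in range(num_clients):
        backend = rng.randrange(num_backends)
        load[backend] += requests_per_client[c]
    return load


def per_request(num_clients, num_backends, requests_per_client, rng):
    """Each request independently picks a backend (Layer 7, per-request)."""
    load = Counter({b: 0 for b in range(num_backends)})
    for c in range(num_clients):
        for _ in range(requests_per_client[c]):
            load[rng.randrange(num_backends)] += 1
    return load


def skew(load):
    """Ratio of the busiest backend's load to the mean load (1.0 is ideal)."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical workload: two heavy clients and eighteen light ones.
    requests = [1000, 1000] + [50] * 18
    print("connection-level skew:", skew(connection_level(20, 10, requests, rng)))
    print("per-request skew:     ", skew(per_request(20, 10, requests, rng)))
```

With uneven clients, whichever backend receives a heavy client's connection carries that client's entire volume, while per-request selection spreads the same volume across all backends.
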
The team developed a proxyless, client-driven load balancing system with a custom service discovery control plane for Layer 7 protocols.
Key advantages include per-request decisions, reduced tail latency, efficient resource use, and minimal DNS dependency.
Implemented routing strategies include Power of Two Choices (P2C), zone-aware routing, and least-loaded routing.
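
As a rough illustration of how two of these strategies can combine, the sketch below implements P2C over a client-side view of in-flight requests and layers a simple zone preference on top. It is a minimal sketch, not Databricks' implementation: the `Endpoint` type, the `in_flight` load signal, the `min_local` threshold, and the zone names are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str
    zone: str
    in_flight: int = 0  # outstanding requests as seen by this client (assumed load signal)


def pick_p2c(endpoints, rng):
    """Power of Two Choices: sample two distinct endpoints at random and
    return the one with fewer in-flight requests."""
    if not endpoints:
        raise ValueError("no endpoints available")
    if len(endpoints) == 1:
        return endpoints[0]
    a, b = rng.sample(endpoints, 2)
    return a if a.in_flight <= b.in_flight else b


def pick_zone_aware(endpoints, local_zone, rng, min_local=2):
    """Prefer endpoints in the caller's zone when at least `min_local` exist
    (a hypothetical threshold); otherwise fall back to every endpoint."""
    local = [e for e in endpoints if e.zone == local_zone]
    candidates = local if len(local) >= min_local else endpoints
    return pick_p2c(candidates, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    pool = [
        Endpoint("10.0.0.1:443", "zone-a", in_flight=9),
        Endpoint("10.0.0.2:443", "zone-a", in_flight=1),
        Endpoint("10.0.1.1:443", "zone-b", in_flight=0),
    ]
    print(pick_zone_aware(pool, "zone-a", rng).address)
```

P2C needs only two load comparisons per request, which avoids the herd behavior of always picking the single least-loaded endpoint when many clients share a stale view of load.
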
The control plane watches the Kubernetes API for service changes and provides clients with live endpoint metadata for intelligent routing.
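
One way to picture the bookkeeping such a control plane performs is an in-memory registry that applies a stream of change events. The sketch below is an assumption-heavy simplification: it borrows the `ADDED`, `MODIFIED`, and `DELETED` event types from Kubernetes watch semantics, but `EndpointRegistry`, `EndpointInfo`, and the per-endpoint event shape are hypothetical, and no real Kubernetes client is involved.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EndpointInfo:
    address: str
    zone: str
    ready: bool


@dataclass
class EndpointRegistry:
    # service name -> {address -> EndpointInfo}
    services: dict = field(default_factory=dict)

    def apply(self, event_type, service, endpoint):
        """Apply one watch-style event to the in-memory view."""
        table = self.services.setdefault(service, {})
        if event_type in ("ADDED", "MODIFIED"):
            table[endpoint.address] = endpoint
        elif event_type == "DELETED":
            table.pop(endpoint.address, None)
        else:
            raise ValueError(f"unknown event type: {event_type}")

    def routable(self, service):
        """Return ready endpoints for a service, sorted by address."""
        table = self.services.get(service, {})
        return sorted((e for e in table.values() if e.ready), key=lambda e: e.address)


if __name__ == "__main__":
    registry = EndpointRegistry()
    registry.apply("ADDED", "billing", EndpointInfo("10.0.0.1:443", "zone-a", True))
    registry.apply("ADDED", "billing", EndpointInfo("10.0.0.2:443", "zone-b", False))
    registry.apply("MODIFIED", "billing", EndpointInfo("10.0.0.2:443", "zone-b", True))
    registry.apply("DELETED", "billing", EndpointInfo("10.0.0.1:443", "zone-a", True))
    print([e.address for e in registry.routable("billing")])
```

A client library could subscribe to such a registry and pass the routable set, with its zone metadata, to whichever routing strategy is configured.
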
Deployment led to reduced tail latency, lower error rates, and better resource utilization.
Challenges included operational complexity, client library adoption, and debugging.
The team evaluated alternatives such as headless services and Istio but found them unsuitable for Databricks' scale and needs.
Future enhancements include cross-cluster/region load balancing and advanced strategies for AI workloads.
