Intelligent Kubernetes Load Balancing at Databricks

8 hours ago
  • #gRPC
  • #Load Balancing
  • #Kubernetes
  • Databricks uses Kubernetes for internal systems, but default networking components such as ClusterIP services and kube-proxy have performance limitations.
  • Built a client-side load balancing system to improve traffic distribution, reduce tail latencies, and enhance service-to-service communication resilience.
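  • High-performance gRPC in Kubernetes is hard to balance: gRPC multiplexes requests over persistent HTTP/2 connections, while kube-proxy picks a backend once per connection (Layer 4), so every request on a connection lands on the same pod.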
  • Default load balancing fails in high-throughput, latency-sensitive environments due to traffic skew, DNS dependency, and lack of per-request decisions.
  • Developed a proxyless, client-driven load balancing system with a custom service discovery control plane for Layer 7 protocols.
  • Key advantages include per-request decisions, reduced tail latency, efficient resource use, and minimal DNS dependency.
  • Implemented strategies like Power of Two Choices (P2C), zone-aware routing, and least-loaded routing.
  • The control plane watches the Kubernetes API for service changes and gives clients live endpoint metadata for intelligent routing.
  • Deployment led to reduced tail latency, lower error rates, and better resource utilization.
  • Challenges included operational complexity, client library adoption, and debugging.
  • Evaluated alternatives like headless services and Istio but found them unsuitable for Databricks' scale and needs.
  • Future enhancements include cross-cluster/region load balancing and advanced strategies for AI workloads.
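
The per-request strategies named above (Power of Two Choices, zone-aware routing, least-loaded routing) can be illustrated with a small sketch. This is not Databricks' implementation. The `Endpoint` and `P2CBalancer` names, the `in_flight` counter used as the load signal, and the "prefer the local zone, otherwise use all endpoints" rule are all assumptions made for illustration. In a real client the endpoint list would come from the service discovery control plane rather than being passed in by hand.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    """A backend pod as seen by the client (hypothetical shape)."""
    address: str
    zone: str
    in_flight: int = 0  # requests sent by this client that have not completed


class P2CBalancer:
    """Client-side picker: zone preference first, then Power of Two Choices."""

    def __init__(self, endpoints, local_zone=None, rng=None):
        if not endpoints:
            raise ValueError("need at least one endpoint")
        self.endpoints = list(endpoints)
        self.local_zone = local_zone
        self.rng = rng or random.Random()

    def _candidates(self):
        # Zone-aware step (assumed policy): use same-zone endpoints when any
        # exist, otherwise fall back to the full endpoint list.
        local = [e for e in self.endpoints if e.zone == self.local_zone]
        return local or self.endpoints

    def pick(self):
        """Choose an endpoint for ONE request and mark it as in flight."""
        pool = self._candidates()
        if len(pool) == 1:
            chosen = pool[0]
        else:
            # P2C: sample two distinct endpoints at random and keep the one
            # with fewer outstanding requests.
            a, b = self.rng.sample(pool, 2)
            chosen = a if a.in_flight <= b.in_flight else b
        chosen.in_flight += 1
        return chosen

    def done(self, endpoint):
        """Call when the request finishes, whether it succeeded or failed."""
        endpoint.in_flight = max(0, endpoint.in_flight - 1)
```

A caller would wrap each request in `ep = balancer.pick()` followed by `balancer.done(ep)` in a `finally` block. Because the choice is made per request rather than per connection, a long-lived HTTP/2 connection no longer pins all traffic to one pod. Sampling only two endpoints keeps each decision cheap while still steering traffic away from overloaded backends.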