Netflix Simplified Batch Compute with Kueue

5 days ago
  • #Kubernetes
  • #Cloud Infrastructure
  • #Batch Processing
  • Netflix transitioned from its custom managed batch solution (CMB) to the Kubernetes-native Kueue for batch workload queuing and scheduling.
  • CMB managed batch jobs using a tenant hierarchy with reserved and shared capacity, but lacked features like preemption and faced development challenges.
  • Kueue was chosen for its compatibility with Titus scheduling, its multi-tenant quota support, and native features such as preemption and fair sharing.
  • The migration to Netflix Batch, which is built on Kueue, was transparent to users and maintained throughput. It converted tenants to Kueue primitives such as ClusterQueue and LocalQueue.
  • Key lessons included maintaining API parity, migrating complex use cases early, and adjusting Kueue configurations for high throughput.
  • Kueue now runs fully in production at Netflix, where it manages millions of batch jobs, improves resource utilization, and enables fair sharing with preemption.
  • Future plans include enrolling more Titus workloads and leveraging Kueue for internal Kubernetes-native training infrastructure.
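
To make the tenant conversion mentioned above more concrete, here is a minimal Python sketch of how a tenant with reserved capacity might be mapped to a Kueue ClusterQueue and LocalQueue. The `Tenant` record, its fields, the `default-flavor` name, and the choice to map reserved capacity onto `nominalQuota` are assumptions made for illustration; the summary does not describe Netflix's actual mapping. The manifest fields follow the Kueue `kueue.x-k8s.io/v1beta1` API.

```python
import json
from dataclasses import dataclass


@dataclass
class Tenant:
    """Hypothetical tenant record; not Netflix's actual CMB schema."""
    name: str
    namespace: str
    cohort: str            # tenants in one cohort can borrow unused quota
    reserved_cpu: int      # reserved capacity, in CPU cores
    reserved_memory_gi: int


def to_cluster_queue(t: Tenant) -> dict:
    """Build a ClusterQueue manifest; reserved capacity becomes nominalQuota."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": f"{t.name}-cq"},
        "spec": {
            "cohort": t.cohort,
            "namespaceSelector": {},  # empty selector matches all namespaces
            "resourceGroups": [{
                "coveredResources": ["cpu", "memory"],
                "flavors": [{
                    "name": "default-flavor",  # assumed ResourceFlavor name
                    "resources": [
                        {"name": "cpu", "nominalQuota": str(t.reserved_cpu)},
                        {"name": "memory",
                         "nominalQuota": f"{t.reserved_memory_gi}Gi"},
                    ],
                }],
            }],
            # Reclaim lent quota from the cohort; preempt lower priority
            # workloads inside this queue.
            "preemption": {
                "reclaimWithinCohort": "Any",
                "withinClusterQueue": "LowerPriority",
            },
        },
    }


def to_local_queue(t: Tenant) -> dict:
    """Build the namespaced LocalQueue that points at the tenant's ClusterQueue."""
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "LocalQueue",
        "metadata": {"name": f"{t.name}-lq", "namespace": t.namespace},
        "spec": {"clusterQueue": f"{t.name}-cq"},
    }


if __name__ == "__main__":
    tenant = Tenant("encoding", "batch-encoding", "shared-pool", 100, 400)
    for manifest in (to_cluster_queue(tenant), to_local_queue(tenant)):
        print(json.dumps(manifest, indent=2))
```

Jobs would then be submitted to the tenant's namespace with the `kueue.x-k8s.io/queue-name` label set to the LocalQueue name. Putting several ClusterQueues in one cohort is what lets unused reserved quota be shared, with preemption reclaiming it when the owner needs it back.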