Netflix Simplified Batch Compute with Kueue
5 days ago
- #Kubernetes
- #Cloud Infrastructure
- #Batch Processing
- Netflix transitioned from its custom managed batch solution (CMB) to the Kubernetes-native Kueue for batch workload queuing and scheduling.
- CMB managed batch jobs through a tenant hierarchy with reserved and shared capacity, but it lacked features such as preemption, and its continued development faced challenges.
- Kueue was chosen for its compatibility with Titus scheduling, its multi-tenant quota support, and its native features such as preemption and fair sharing.
- The migration to Netflix Batch (using Kueue) was transparent to users, maintained throughput, and involved converting tenants to Kueue primitives like ClusterQueue and LocalQueue.
- Key lessons included maintaining API parity, migrating complex use cases early, and adjusting Kueue configurations for high throughput.
- Kueue is now fully in production at Netflix, where it manages millions of batch jobs, improves resource utilization, and enables fair sharing with preemption.
- Future plans include enrolling more Titus workloads and leveraging Kueue for internal Kubernetes-native training infrastructure.
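
To make the tenant conversion mentioned above more concrete, here is a minimal Python sketch of how one tenant with reserved and shared capacity might be rendered as a Kueue ClusterQueue and LocalQueue manifest. It is an illustration, not Netflix's migration code. The `Tenant` class, its field names, the example tenant, the `default-flavor` ResourceFlavor name, and the mapping of reserved capacity to `nominalQuota` and shared capacity to `borrowingLimit` are all assumptions; only the manifest field names come from Kueue's `v1beta1` API.

```python
import json
from dataclasses import dataclass

KUEUE_API_VERSION = "kueue.x-k8s.io/v1beta1"


@dataclass(frozen=True)
class Tenant:
    """Hypothetical stand-in for a CMB tenant; all field names are assumptions."""
    name: str
    namespace: str
    cohort: str        # ClusterQueues in the same cohort can share unused quota
    reserved_cpu: int  # capacity reserved for this tenant
    shared_cpu: int    # extra capacity the tenant may borrow from the cohort


def to_cluster_queue(tenant: Tenant, flavor: str = "default-flavor") -> dict:
    """Build a ClusterQueue manifest (as a dict) for one tenant.

    Assumed mapping: reserved capacity -> nominalQuota,
    shared capacity -> borrowingLimit.
    """
    return {
        "apiVersion": KUEUE_API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": tenant.name},
        "spec": {
            "cohort": tenant.cohort,
            "namespaceSelector": {},  # empty selector: accept all namespaces
            "resourceGroups": [{
                "coveredResources": ["cpu"],
                "flavors": [{
                    "name": flavor,  # assumed ResourceFlavor name
                    "resources": [{
                        "name": "cpu",
                        "nominalQuota": str(tenant.reserved_cpu),
                        "borrowingLimit": str(tenant.shared_cpu),
                    }],
                }],
            }],
            # Illustrative preemption policy: reclaim lent quota from the
            # cohort, and preempt lower-priority workloads in this queue.
            "preemption": {
                "reclaimWithinCohort": "Any",
                "withinClusterQueue": "LowerPriority",
            },
        },
    }


def to_local_queue(tenant: Tenant) -> dict:
    """Build the namespaced LocalQueue that points at the tenant's ClusterQueue."""
    return {
        "apiVersion": KUEUE_API_VERSION,
        "kind": "LocalQueue",
        "metadata": {"name": tenant.name, "namespace": tenant.namespace},
        "spec": {"clusterQueue": tenant.name},
    }


if __name__ == "__main__":
    # Hypothetical tenant, used only to show the shape of the output.
    example = Tenant("ml-training", "ml", "batch", reserved_cpu=100, shared_cpu=50)
    print(json.dumps([to_cluster_queue(example), to_local_queue(example)], indent=2))
```

Jobs would then target the LocalQueue in their own namespace, while quota, borrowing, and preemption are governed by the ClusterQueue and its cohort. A real conversion would likely cover more resource types and flavors than the single `cpu` entry shown here.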