Building the largest known Kubernetes cluster, with 130k nodes
- #AI Workloads
- #Scalability
- #Kubernetes
- Google Kubernetes Engine (GKE) successfully ran a 130,000-node cluster in experimental mode, doubling the previous limit.
- Scaling is about more than node count: Pod creation, scheduling throughput, and distributed storage all had to keep pace, sustaining 1,000 Pod creations per second.
- AI workloads are driving demand for mega-clusters, with power constraints shifting focus to multi-cluster solutions like MultiKueue.
- Key innovations include optimized read scalability with Consistent Reads from Cache and Snapshottable API Server Cache.
- A proprietary key-value store based on Google’s Spanner database supports massive scale with 13,000 QPS for lease updates.
- Kueue provides advanced job queueing, enabling workload prioritization and 'all-or-nothing' scheduling for AI/ML environments.
- Future scheduling enhancements aim for workload-aware scheduling, moving from Pod-centric to workload-centric approaches.
- GCS FUSE and Google Cloud Managed Lustre offer scalable, high-throughput data access for AI workloads.
- A four-phase benchmark validated GKE’s performance, showing efficient preemption, scheduling, and elasticity under extreme loads.
- GKE demonstrated stability with low latency, high throughput (1,000 Pods/s), and over 1 million objects in the database.
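The 13,000 QPS lease-update figure follows directly from the node count and the kubelet's default heartbeat cadence. A back-of-the-envelope check, assuming the Kubernetes default of one node Lease renewal every 10 seconds:

```python
# Sanity-check the lease-update load cited above, assuming the
# Kubernetes default node Lease renewal interval of 10 seconds.
NODES = 130_000
LEASE_RENEW_INTERVAL_S = 10  # kubelet default heartbeat cadence

lease_qps = NODES / LEASE_RENEW_INTERVAL_S
print(lease_qps)  # 13000.0
```

Each node's kubelet renews its own Lease object as a heartbeat, so lease traffic grows linearly with cluster size and becomes a dominant write load at this scale.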
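The Consistent Reads from Cache optimization (Kubernetes KEP-2340) lets the API server answer strongly consistent LIST/GET requests from its in-memory watch cache instead of the backing store, once the cache has provably caught up to the store's latest revision. A toy model of that catch-up check, not the actual API server code:

```python
class WatchCache:
    """Toy model of serving consistent reads from an API server watch cache.

    The real mechanism (KEP-2340) fetches the store's current revision,
    waits until the cache has observed events up to that revision, then
    serves the read from memory rather than hitting the store.
    """

    def __init__(self):
        self.revision = 0   # highest revision the cache has seen
        self.objects = {}

    def apply_event(self, revision, key, value):
        # Watch events arrive in revision order and advance the cache.
        self.objects[key] = value
        self.revision = revision

    def list_consistent(self, store_revision):
        # Serve from cache only if it is at least as fresh as the store;
        # otherwise the caller must wait or fall back to the store.
        if self.revision < store_revision:
            raise RuntimeError("cache not yet caught up")
        return dict(self.objects)

cache = WatchCache()
cache.apply_event(5, "pod/a", {"phase": "Running"})
snapshot = cache.list_consistent(store_revision=5)  # fresh enough: served from memory
```

Serving reads from the cache shifts expensive LIST traffic off the key-value store, which is what makes read scalability at 130k nodes tractable.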
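Kueue's "all-or-nothing" admission can be illustrated with a toy gang-scheduling check: a job is admitted only if every one of its Pods fits at once, never partially. This is a conceptual sketch under simplified assumptions (CPU as the only resource; the `Job` and `Cluster` types are hypothetical), not Kueue's implementation:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    pods: int           # Pods the job needs simultaneously
    cpu_per_pod: float  # CPUs requested per Pod

@dataclass
class Cluster:
    free_cpu: float

def admit_all_or_nothing(cluster: Cluster, job: Job) -> bool:
    """Admit the job only if ALL of its Pods fit; otherwise keep it queued.

    Partial admission would strand resources on a gang workload, e.g. a
    distributed training job that cannot make progress with fewer replicas.
    """
    needed = job.pods * job.cpu_per_pod
    if needed <= cluster.free_cpu:
        cluster.free_cpu -= needed  # reserve capacity for the whole gang
        return True
    return False

cluster = Cluster(free_cpu=1000.0)
a = admit_all_or_nothing(cluster, Job("train-a", pods=100, cpu_per_pod=8))  # fits: admitted
b = admit_all_or_nothing(cluster, Job("train-b", pods=100, cpu_per_pod=8))  # would only partially fit: queued
```

For AI/ML workloads this matters because a half-started training job holds accelerators while doing no useful work; queueing the whole job until capacity exists avoids that deadlock.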