Scaling OpenComputer from 1 VM to 1 million sandboxes
8 hours ago
- #cloud-computing
- #virtualization
- #scalability
- OpenComputer initially faced scaling limitations due to Azure's regional CPU quotas, starting with a 300-CPU ceiling in a busy data center.
- To scale beyond single-region constraints, the architecture was redesigned into 'cells'—self-contained units for VM orchestration that can deploy across any cloud region independently.
- Each cell includes a stripped-down control plane managing VM lifecycle (scheduling, hibernation, migration) and 5–10 workers, with no dependencies on external components like billing or UI.
- A global registry at the edge, built on Cloudflare Workers and a D1 database, routes each sandbox creation request to the cell with the most available capacity, which keeps placement latency low.
- Post-creation, all frequent operations (exec, file I/O, PTY) are handled directly by the cell's control plane using signed JWTs, avoiding edge lookups.
- Heartbeats enable real-time billing: VMs report activity every 10 seconds via an event stream to Durable Objects, which allows per-second billing and immediate registry updates.
- Performance optimizations reduced QEMU VM boot times to under 1 second at p95, hibernation to ~6 seconds, and wake times to 1–2 seconds, with wake time depending on how warm the S3 checkpoint is.
- The system now scales across multiple cloud providers (AWS, Azure, GCP, OCI) by adding cells as needed, effectively decoupling capacity from any single region's quota.
- The entire platform is open source, allowing inspection of the scheduler and event pipeline for transparency and community contributions.
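
The registry's "most capacity" routing from the list above can be illustrated with a small sketch. This is not OpenComputer's scheduler: the `Cell` record, its fields, the `pick_cell` helper, and the cell names are all hypothetical, and capacity is assumed to be measured in free CPUs. Only the 300-CPU figure comes from the summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    """Hypothetical registry row describing one cell (not the real schema)."""
    cell_id: str
    total_cpus: int
    used_cpus: int
    healthy: bool = True

    @property
    def free_cpus(self) -> int:
        # Never report negative capacity if usage is momentarily over-counted.
        return max(self.total_cpus - self.used_cpus, 0)


def pick_cell(cells: List[Cell], cpus_needed: int = 1) -> Optional[Cell]:
    """Return the healthy cell with the most free CPUs, or None if none fits."""
    candidates = [c for c in cells if c.healthy and c.free_cpus >= cpus_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_cpus)


# Illustrative data: one cell near the 300-CPU quota, one with headroom,
# and one that is empty but marked unhealthy.
cells = [
    Cell("azure-eastus-1", total_cpus=300, used_cpus=280),
    Cell("aws-us-west-2-1", total_cpus=512, used_cpus=100),
    Cell("gcp-europe-west1-1", total_cpus=256, used_cpus=0, healthy=False),
]
chosen = pick_cell(cells, cpus_needed=2)
```

Here `chosen` is the AWS cell, since it has the most free CPUs among healthy cells. A request that no cell can fit returns `None`, which in this sketch would be the signal to add another cell.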
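
The signed-JWT path for post-creation operations can be sketched as follows, showing how a cell could verify a token locally without calling back to the edge. The summary does not say which signing algorithm or claims OpenComputer uses, so HS256 with a shared secret, the claim names, and the `sign_token` and `verify_token` helpers are assumptions, chosen so the example runs on the standard library alone.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Issue an HS256 JWT (assumed to happen once, at sandbox creation)."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url(json.dumps(header, separators=(",", ":")).encode()),
        _b64url(json.dumps(claims, separators=(",", ":")).encode()),
    ]
    signing_input = ".".join(parts)
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        # Covers malformed segments, bad base64, and bad JSON.
        return None
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None
    return claims


# Hypothetical claims: which sandbox the token covers and when it expires.
secret = b"cell-local-secret"
token = sign_token({"sandbox_id": "sb_123", "exp": 1000}, secret)
claims = verify_token(token, secret, now=500)
```

Because verification needs only the token and the cell's key, exec, file I/O, and PTY calls never touch the global registry. A production design might well prefer an asymmetric algorithm, so that cells hold only a public key.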
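
To make the heartbeat billing bullet concrete, here is a rough sketch of turning a heartbeat stream into billable seconds. The 10-second interval comes from the summary. The grace window, the rule for handling missed heartbeats, and the `billable_seconds` helper are assumptions made for illustration, not the platform's actual accounting.

```python
from typing import Iterable

HEARTBEAT_INTERVAL_S = 10            # from the summary above
GRACE_S = 2 * HEARTBEAT_INTERVAL_S   # assumption: tolerate one missed beat


def billable_seconds(heartbeats: Iterable[int]) -> int:
    """Sum billable seconds from sorted heartbeat timestamps (unix seconds).

    Assumed rule: time between two heartbeats is billed in full when the gap
    is within the grace window. A longer gap is treated as the VM having
    stopped (for example, hibernated), and only one interval is billed.
    """
    total = 0
    prev = None
    for ts in heartbeats:
        if prev is not None:
            gap = ts - prev
            total += gap if gap <= GRACE_S else HEARTBEAT_INTERVAL_S
        prev = ts
    return total


# Four regular beats, a 70-second silence, then two more beats.
usage = billable_seconds([0, 10, 20, 30, 100, 110])
```

With these timestamps, `usage` is 50: three 10-second gaps, one capped gap billed as 10 seconds, and one final 10-second gap. The result is in whole seconds, which is what makes per-second invoicing possible even though reports arrive only every 10 seconds.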