Hasty Briefs (beta)

Scaling opencomputer from 1 VM to 1 million sandboxes

6 hours ago
  • #cloud-computing
  • #virtualization
  • #scalability
  • OpenComputer initially faced scaling limitations due to Azure's regional CPU quotas, starting with a 300-CPU ceiling in a busy data center.
  • To scale beyond single-region constraints, the architecture was redesigned around 'cells': self-contained VM orchestration units, each of which can be deployed independently in any cloud region.
  • Each cell includes a stripped-down control plane managing VM lifecycle (scheduling, hibernation, migration) and 5–10 workers, with no dependencies on external components like billing or UI.
  • A global registry at the edge, built on Cloudflare Workers and a D1 database, routes each sandbox creation request to the cell with the most available capacity, keeping placement latency low.
  • Post-creation, all frequent operations (exec, file I/O, PTY) are handled directly by the cell's control plane using signed JWTs, avoiding edge lookups.
  • Heartbeats enable real-time billing: each VM reports activity every 10 seconds over an event stream to Durable Objects, which allows per-second billing and immediate registry updates.
  • Performance optimizations reduced QEMU VM boot times to under 1 second at p95, hibernation to ~6 seconds, and wake times to 1–2 seconds, depending on S3 checkpoint warmth.
  • The system now scales across multiple cloud providers (AWS, Azure, GCP, OCI) by adding cells as needed, effectively decoupling capacity from any single region's quota.
  • The entire platform is open source, allowing inspection of the scheduler and event pipeline for transparency and community contributions.
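
The registry's placement rule (send each creation request to the cell with the most capacity) can be illustrated with a small sketch. This is not the project's scheduler, which the summary says runs on Cloudflare Workers and D1. The `Cell` fields, the slot-based capacity measure, and the `pick_cell` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    # All field names are hypothetical; the source does not describe the schema.
    cell_id: str
    region: str
    total_slots: int      # assumed capacity unit: sandbox slots
    used_slots: int
    healthy: bool = True

    @property
    def free_slots(self) -> int:
        # Never report negative capacity if usage briefly overshoots.
        return max(self.total_slots - self.used_slots, 0)


def pick_cell(cells: List[Cell]) -> Optional[Cell]:
    """Return the healthy cell with the most free slots, or None if all are full."""
    candidates = [c for c in cells if c.healthy and c.free_slots > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_slots)


cells = [
    Cell("cell-a", "eastus", total_slots=100, used_slots=90),
    Cell("cell-b", "westeurope", total_slots=100, used_slots=40),
    Cell("cell-c", "us-west-2", total_slots=200, used_slots=10, healthy=False),
]
chosen = pick_cell(cells)
print(chosen.cell_id if chosen else "no capacity")
```

Here `cell-c` has the most free slots but is skipped because it is marked unhealthy, so the request goes to `cell-b`. A real registry would presumably also weigh region proximity, which this sketch ignores.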
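
The signed-JWT flow for post-creation operations can be sketched with the standard library alone. The summary does not say which signing algorithm or claims are used, so HS256, the `sub`/`cell`/`exp` claims, and the `issue_token` and `verify_token` helpers below are assumptions. The point is only that a cell holding the key can validate a request locally, without an edge lookup.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: bytes, sandbox_id: str, cell_id: str, exp: int) -> str:
    """Issue an HS256 JWT binding a sandbox to a cell (claim names are assumed)."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": sandbox_id, "cell": cell_id, "exp": exp}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret: bytes, token: str, cell_id: str, now: int) -> Optional[dict]:
    """Return the claims if the token is valid for this cell, else None."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:  # wrong segment count, bad base64, or bad JSON
        return None
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":  # reject any other declared algorithm
        return None
    expected = _sign(secret, header_b64 + "." + claims_b64)
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        return None
    exp = claims.get("exp")
    if not isinstance(exp, int) or exp <= now:
        return None
    if claims.get("cell") != cell_id:
        return None
    return claims


token = issue_token(b"demo-secret", "sbx-1", "cell-a", exp=1000)
print(verify_token(b"demo-secret", token, "cell-a", now=999))
```

Binding the token to a cell via a claim means a token issued for one cell is rejected by another, which fits the summary's description of cells as independent units. A production setup might prefer asymmetric signatures so cells hold only a public key.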
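
The step from 10-second heartbeats to per-second billing can be made concrete with a small meter. This is a sketch under stated assumptions: the `UsageMeter` class, integer Unix timestamps, and the `MAX_GAP_S` rule (gaps longer than three missed heartbeats are treated as idle time, for example hibernation, and not billed) are illustrative and not taken from the source. Only the 10-second interval comes from the summary.

```python
from typing import Dict

HEARTBEAT_INTERVAL_S = 10             # from the summary: one heartbeat every 10 seconds
MAX_GAP_S = 3 * HEARTBEAT_INTERVAL_S  # assumption: longer gaps count as idle time


class UsageMeter:
    """Accumulates billable seconds per sandbox from heartbeat timestamps."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}
        self._billable: Dict[str, int] = {}

    def heartbeat(self, sandbox_id: str, ts: int) -> None:
        last = self._last_seen.get(sandbox_id)
        if last is not None and ts <= last:
            return  # duplicate or out-of-order heartbeat: ignore it
        if last is not None and ts - last <= MAX_GAP_S:
            # The VM was alive for the whole gap, so bill every second of it.
            self._billable[sandbox_id] = self._billable.get(sandbox_id, 0) + (ts - last)
        self._last_seen[sandbox_id] = ts

    def billable_seconds(self, sandbox_id: str) -> int:
        return self._billable.get(sandbox_id, 0)


meter = UsageMeter()
for ts in (0, 10, 20):          # three heartbeats: 20 seconds of activity
    meter.heartbeat("sbx-1", ts)
meter.heartbeat("sbx-1", 100)   # 80-second gap: treated as idle, not billed
meter.heartbeat("sbx-1", 110)   # activity resumes: 10 more seconds
print(meter.billable_seconds("sbx-1"))
```

Because each heartbeat carries a timestamp, usage is accounted at one-second granularity even though reports arrive only every 10 seconds. In the described system this state would live in Durable Objects rather than in-process dictionaries.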