Corrosion
6 months ago
- #rust
- #flyio
- #distributed-systems
- Fly.io transforms Docker containers into Fly Machines, micro-VMs that run globally on the company's own hardware.
- State synchronization is the hardest part of the platform: every edge proxy needs an accurate, up-to-date routing table.
- A major outage occurred on September 1, 2024, due to a Rust concurrency bug causing a system-wide deadlock.
- Distributed systems amplify bugs, as Fly.io saw with Corrosion, its state distribution system.
- Fly.io's orchestration model differs from mainstream systems by making individual servers the source of truth.
- Corrosion is a Rust-based system that uses a gossip protocol to propagate global routing state without distributed consensus.
- Corrosion uses SQLite with CRDT extensions (cr-sqlite) for conflict-free updates and efficient state propagation.
- Past issues with Corrosion include schema changes that triggered global reconciliation meltdowns, as well as certificate expirations.
- Improvements include watchdog mechanisms, extensive testing, and regionalization to reduce blast radius.
- Corrosion avoids traditional distributed consensus models and presents itself as a simple, highly distributed SQLite database.
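
To make the "conflict-free updates without consensus" idea more concrete, here is a minimal sketch of a last-write-wins merge, the general kind of rule a CRDT can use so that replicas converge no matter in which order they exchange state. It is a simplified illustration, not Corrosion's or cr-sqlite's actual algorithm: the `Cell` and `Replica` types, the `version` counter, and the `node_id` tie-break are all hypothetical names chosen for this example.

```rust
use std::collections::HashMap;

/// One replicated cell: a value plus the metadata used to order writes.
/// All fields are hypothetical; this is not cr-sqlite's storage format.
#[derive(Clone, Debug, PartialEq)]
struct Cell {
    value: String,
    version: u64, // logical clock, bumped on every local write
    node_id: u32, // id of the writing node, used only to break version ties
}

/// A node's local view of the shared state: key -> cell.
#[derive(Clone, Debug, Default, PartialEq)]
struct Replica {
    cells: HashMap<String, Cell>,
}

impl Replica {
    /// Local write: bump the version past whatever this node has seen for the key.
    fn write(&mut self, node_id: u32, key: &str, value: &str) {
        let version = self.cells.get(key).map_or(1, |c| c.version + 1);
        let cell = Cell { value: value.to_string(), version, node_id };
        self.cells.insert(key.to_string(), cell);
    }

    /// Merge a peer's state: for each key, keep the cell with the larger
    /// (version, node_id) pair. The rule is commutative, associative, and
    /// idempotent, so no coordinator is needed to agree on a winner.
    fn merge(&mut self, other: &Replica) {
        for (key, theirs) in &other.cells {
            let take = match self.cells.get(key) {
                Some(ours) => (theirs.version, theirs.node_id) > (ours.version, ours.node_id),
                None => true,
            };
            if take {
                self.cells.insert(key.clone(), theirs.clone());
            }
        }
    }
}

fn main() {
    let mut a = Replica::default();
    let mut b = Replica::default();

    // Two nodes write the same key concurrently, without talking to each other.
    a.write(1, "app-1", "machine in ord");
    b.write(2, "app-1", "machine in fra");

    // They later exchange snapshots, in opposite orders.
    let (a_snapshot, b_snapshot) = (a.clone(), b.clone());
    a.merge(&b_snapshot);
    b.merge(&a_snapshot);

    // Both replicas end up with the same winner.
    assert_eq!(a, b);
    println!("{:?}", a.cells["app-1"]);
}
```

In this sketch both writes carry version 1, so the higher `node_id` wins on both replicas. A gossip layer would only need to deliver each node's changes to its peers eventually; the merge rule, not a consensus round, settles conflicts.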