Matrix: Post-mortem of the September 2 outage
6 months ago
- #database-outage
- #postgresql
- #disaster-recovery
- The Matrix.org homeserver was down for roughly 24 hours after its database failed during routine maintenance.
- Attempts to restore the primary database led to the loss of the secondary, forcing a lengthy restore from 51 TB S3 backups.
- No data was lost, but the outage lasted from 2025-09-02 17:45 UTC to 2025-09-03 18:00 UTC.
- The incident highlighted issues with database server naming conventions, backup strategies, and recovery processes.
- Lessons learned include the need for better safeguards, improved tools, and community communication during outages.