We recovered from nightmare Postgres corruption on the matrix.org homeserver

9 months ago
  • #Matrix.org
  • #PostgreSQL
  • #DatabaseCorruption
  • The matrix.org homeserver suffered an incident in which rooms stopped working: operations such as sending messages or joining rooms failed.
  • The problem was traced to corrupted parts of an index in a large PostgreSQL database, affecting state groups, which are crucial for room operations.
  • Because of the corrupt index, a background maintenance task wrongly deleted data that was still in use, which corrupted the affected rooms.
  • After identifying the issue, the team rebuilt the corrupted index and restored data from backups, repairing most affected rooms.
  • The corruption likely occurred at least a year prior but only became noticeable when the maintenance task interacted with the corrupted index.
  • Investigations revealed that the corruption was extensive, affecting millions of state groups, but was confined to a specific range.
  • The root cause remains unclear: candidates include hardware failure and kernel or disk firmware bugs, but no definitive answer was found.
  • The incident highlighted the challenges of maintaining large-scale services and the importance of robust database management and backup strategies.
  • The Matrix.org Foundation, which relies on donations, plays a critical role in maintaining the Matrix ecosystem and ensuring digital privacy and dignity.
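
The failure mode summarized above (a cleanup task trusting an index that has silently lost entries) can be illustrated with a toy model. This is only a sketch under stated assumptions: the dictionary standing in for a table, the set standing in for an index, and the function names `find_unreferenced` and `cross_check` are all hypothetical and are not taken from Synapse or PostgreSQL.

```python
# Toy model of a table plus a separate index that should mirror it.
# All names are hypothetical; this is not Synapse or PostgreSQL code.

def find_unreferenced(candidates, index):
    """Return the state groups that the index claims nothing references."""
    return [sg for sg in candidates if sg not in index]

def cross_check(table_rows, index):
    """Return state groups present in the table but missing from the index."""
    return sorted(sg for sg in table_rows if sg not in index)

# Ground truth: state groups 1..5 are all still used by live rooms.
table_rows = {1: "room_a", 2: "room_a", 3: "room_b", 4: "room_b", 5: "room_c"}

# A healthy index covers every row; the corrupt one has lost entries 3 and 4.
healthy_index = set(table_rows)
corrupt_index = healthy_index - {3, 4}

# The cleanup task decides what to delete by consulting the index only.
to_delete_healthy = find_unreferenced(table_rows, healthy_index)
to_delete_corrupt = find_unreferenced(table_rows, corrupt_index)

# Comparing the index against the table itself exposes the discrepancy.
missing = cross_check(table_rows, corrupt_index)

print(to_delete_healthy)  # []
print(to_delete_corrupt)  # [3, 4]
print(missing)            # [3, 4]
```

With the healthy index nothing is scheduled for deletion, while the corrupt index makes live state groups 3 and 4 look unused. The cross-check mirrors the general idea of comparing index results against the underlying table before acting on them.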