Disk can lie to you when you write to it
2 days ago
- #database
- #WAL
- #durability
- A write-ahead log (WAL) is essential for database durability, but disks can fail silently.
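- Common issues include writes that are acknowledged while still sitting in the page cache, disks that report success for data that never reached stable media, writes completing out of order, and a single log file acting as a single point of failure.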
- Five layers of defense for a robust WAL: checksums, dual WAL files, O_DIRECT + O_DSYNC, linked I/O ordering, and post-fsync verification reads.
- Checksums (CRC32C) detect silent data corruption from hardware or firmware errors.
- Dual WAL files protect against latent sector errors (LSEs) by maintaining redundant copies.
- O_DIRECT bypasses the kernel's page cache, and O_DSYNC makes each write return only after the device reports the data as durable.
- Linked I/O ordering (io_uring on Linux) ensures the fsync starts only after the write it depends on has completed.
- Post-fsync verification reads catch silent failures immediately by re-reading and validating written data.
- Recovery involves scanning both WAL files, merging valid records, and replaying operations to restore consistent state.
- Real-world scenarios such as silent corruption and page cache surprises show why each of these layers matters.
- A production-grade WAL must include checksums, redundancy, direct writes, operation ordering, and verification to fulfill its durability contract.