Hunting a 16-year-old SQLite WAL bug with TLA+
3 days ago
- #Database Corruption
- #TLA+
- #SQLite
- SQLite recently fixed a 16-year-old bug in its Write Ahead Log (WAL) checkpointing mechanism that could cause database corruption.
- The bug involved a race condition between appending new pages to the WAL and checkpointing, where a writer could reset the WAL while a checkpoint was in progress, leading to data loss.
- Using TLA+, the dqlite team modeled SQLite's behavior to understand the bug's sequence and then created a separate model for dqlite to check if it was affected.
- Dqlite's design prevents the bug because it uses stricter locking: checkpoints are stop-the-world operations that take the write lock, eliminating the concurrency that causes the race condition in SQLite.
- SQLite's fix adds a check during checkpointing to verify that the WAL's salt value hasn't changed since the checkpoint started, preventing the data race.