Hasty Briefsbeta

Bilingual

Hunting a 16-year-old SQLite WAL bug with TLA+

3 days ago
  • #Database Corruption
  • #TLA+
  • #SQLite
  • SQLite recently fixed a 16-year-old bug in its Write Ahead Log (WAL) checkpointing mechanism that could cause database corruption.
  • The bug involved a race condition between appending new pages to the WAL and checkpointing, where a writer could reset the WAL while a checkpoint was in progress, leading to data loss.
  • Using TLA+, the dqlite team modeled SQLite's behavior to understand the bug's sequence and then created a separate model for dqlite to check if it was affected.
  • Dqlite's design prevents the bug because it uses stricter locking: checkpoints are stop-the-world operations that take the write lock, eliminating the concurrency that causes the race condition in SQLite.
  • SQLite's fix adds a check during checkpointing to verify that the WAL's salt value hasn't changed since the checkpoint started, preventing the data race.