Hasty Briefs (beta)

Jepsen: NATS 2.12.1

3 days ago
  • #NATS
  • #DataDurability
  • #JetStream
  • NATS JetStream uses Raft consensus for replication, promising at-least-once delivery and a total order of messages.
  • JetStream's documentation claims both high availability and linearizability, a combination the CAP theorem rules out during a network partition.
  • Testing revealed vulnerabilities in NATS JetStream, including data loss from file corruption, lazy fsync defaults, and split-brain scenarios.
  • Corruption of .blk or snapshot files can cause significant data loss or even deletion of the stream, even when only a minority of nodes are affected.
  • The default fsync interval is two minutes, so acknowledged writes may not yet be durable and can be lost in a power failure.
  • A single OS crash combined with network issues can cause persistent split-brain and data loss in JetStream clusters.
  • NATS has fixed some of these issues in later releases, but others, such as the impact of file corruption and the risks of lazy fsync, remain unresolved.
  • Recommendations include changing default fsync to 'always' or clearly documenting the risks of lazy persistence.
  • Future work could explore exactly-once semantics, consumer message order, and safe cluster membership changes.
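
The lazy-fsync risk described above can be illustrated with a toy append-only log. This is a minimal sketch, not NATS code: the `AppendLog` class, the `at_risk_after` helper, and the `sync_interval` parameter are hypothetical names chosen for illustration. It counts how many acknowledged records may still live only in the OS page cache, which is roughly the data a power failure could erase.

```python
import os
import tempfile
import time


class AppendLog:
    """Toy append-only log (hypothetical; not how NATS is implemented)."""

    def __init__(self, path, sync_interval=None):
        # sync_interval=None models an "always" policy: fsync before every ack.
        # A number models a lazy policy: fsync at most once per that many seconds.
        self.sync_interval = sync_interval
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.last_sync = time.monotonic()
        self.unsynced = 0  # acknowledged records not yet covered by an fsync

    def append(self, record: bytes) -> int:
        """Write one record and return how many acked records are still unsynced."""
        os.write(self.fd, record + b"\n")  # reaches the page cache, not necessarily the disk
        self.unsynced += 1
        now = time.monotonic()
        if self.sync_interval is None or now - self.last_sync >= self.sync_interval:
            os.fsync(self.fd)  # ask the OS to flush this file to stable storage
            self.last_sync = now
            self.unsynced = 0
        # Returning plays the role of acknowledging the write to the client.
        return self.unsynced

    def close(self):
        os.close(self.fd)


def at_risk_after(n_writes, sync_interval):
    """Append n_writes records and report how many acked records are unsynced."""
    with tempfile.TemporaryDirectory() as tmp:
        log = AppendLog(os.path.join(tmp, "toy.log"), sync_interval)
        try:
            at_risk = 0
            for i in range(n_writes):
                at_risk = log.append(f"msg-{i}".encode())
            return at_risk
        finally:
            log.close()
```

With the "always" policy, `at_risk_after(100, None)` returns 0, because every acknowledgement follows an fsync. With a two-minute interval, `at_risk_after(100, 120.0)` returns 100, since all 100 acknowledged records are still waiting for the next flush. The trade-off is throughput: fsync on every write is much slower. In NATS itself the corresponding knob appears to be the `sync_interval` setting in the server's JetStream configuration; treat that name as an assumption and check the NATS server documentation before relying on it.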