Postgres data stored in Parquet on S3: LTAP architecture explained
3 days ago
- #OLTP Innovation
- #Database Architecture
- #Cloud Storage
- The author started out believing OLTP databases were a solved problem, but building Databricks showed them to be clunky and fragile, which led to the creation of Lakebase.
- Traditional monolithic databases like Postgres use a write-ahead log (WAL) for fast writes and data files for reads, but keeping both inside one tightly coupled system leads to risk of data loss, difficulty scaling, and interference between workloads.
- Lakebase addresses these by externalizing storage: the WAL goes to a distributed SafeKeeper service for durability via Paxos replication, and data files go to a PageServer service that materializes pages into cloud object storage.
- This stateless compute architecture enables unlimited storage, serverless elastic compute, durable writes with zero data loss, simpler high availability, and instant branching/cloning without physical copying.
- Lakebase evolves into LTAP (Lake Transactional/Analytical Processing): a single copy of the data is stored in open columnar formats (e.g., Parquet) and is accessible both to Postgres for transactions and to Lakehouse engines for analytics, eliminating the delays of change data capture (CDC) or mirroring.
- LTAP ensures freshness by having analytical queries obtain a log sequence number (LSN) from Postgres, read the data in object storage, and merge in recent changes up to that LSN from the PageServer, all without impacting transactional performance.
- Unlike HTAP systems that try to unify workloads in one engine and face feature, ecosystem, and isolation issues, LTAP unifies at the storage layer while using specialized engines, offering better performance and compatibility.
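
The LSN-based freshness mechanism summarized above can be illustrated with a small sketch. This is not Lakebase code: the `Change` type, the `read_at_lsn` function, and the in-memory dictionaries standing in for Parquet files and the PageServer are all hypothetical, and the real system's interfaces are not described in this summary. The sketch only shows the general idea of reading a base snapshot written at one LSN and overlaying newer changes up to the LSN the reader asked for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One committed row change (hypothetical stand-in for PageServer data)."""
    lsn: int              # log sequence number of the change
    key: str              # primary key of the affected row
    row: Optional[dict]   # new row contents; None marks a delete


def read_at_lsn(base_rows, base_lsn, recent_changes, read_lsn):
    """Return the table contents as of read_lsn.

    base_rows      -- rows from the columnar snapshot (assumed written at base_lsn)
    recent_changes -- changes not yet folded into the snapshot
    """
    if read_lsn < base_lsn:
        raise ValueError("snapshot is newer than the requested LSN")
    view = dict(base_rows)  # copy, so the snapshot itself is never mutated
    for change in sorted(recent_changes, key=lambda c: c.lsn):
        # Apply only changes after the snapshot and at or before the read point.
        if base_lsn < change.lsn <= read_lsn:
            if change.row is None:
                view.pop(change.key, None)
            else:
                view[change.key] = change.row
    return view


# Snapshot written at LSN 100, followed by three newer changes.
base = {"a": {"qty": 1}, "b": {"qty": 2}}
changes = [
    Change(lsn=101, key="a", row={"qty": 5}),   # update
    Change(lsn=102, key="b", row=None),         # delete
    Change(lsn=103, key="c", row={"qty": 9}),   # insert, after the read point
]

snapshot = read_at_lsn(base, base_lsn=100, recent_changes=changes, read_lsn=102)
print(snapshot)  # {'a': {'qty': 5}}
```

Because the reader only consumes the snapshot and a list of changes, nothing in this sketch touches the transactional write path, which is the property the summary attributes to LTAP.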