The Equality Delete Problem in Apache Iceberg
12 days ago
- #Data Infrastructure
- #CDC
- #Apache Iceberg
- Apache Iceberg is a hot topic in data infrastructure, with major acquisitions by Databricks and Snowflake highlighting its importance.
- Streaming data from Postgres to Iceberg in real time is complex, especially with CDC (Change Data Capture) systems like Debezium.
- Iceberg supports two types of deletes: position deletes (which identify a row by data file path and row position) and equality deletes (which identify rows by column values, typically the primary key).
- Equality deletes are the only viable option for streaming CDC scenarios, because the writer does not have to look up where an existing row is stored. The trade-off is query performance: readers must apply the deletes at query time.
- Major query engines like Snowflake, Databricks, and Redshift have limited or no support for equality deletes, complicating CDC ingestion.
- RisingWave provides an end-to-end solution for streaming CDC into Iceberg, optimizing for high-frequency updates and deletes.
- RisingWave uses a hybrid approach: position deletes for in-batch updates and equality deletes for out-of-batch changes.
- Compaction in RisingWave reduces read amplification by periodically merging equality delete files into data files.
- RisingWave ensures cross-engine compatibility by producing 'clean' versions of data for engines that don't support equality deletes.
- Siemens adopted RisingWave to streamline its CDC-to-Iceberg pipeline, reducing latency and operational overhead.
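
To make the delete types and the hybrid strategy from the points above more concrete, here is a minimal in-memory sketch. It is a toy model under stated assumptions, not RisingWave's or Iceberg's actual implementation: `ToyTable`, `DataFile`, `commit_batch`, `scan`, and `compact` are hypothetical names, rows are plain dicts keyed by an assumed primary-key column `id`, and files are just Python lists. The one rule borrowed from the Iceberg spec is that an equality delete only applies to data files with a strictly lower sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    seq: int                      # sequence number of the commit that wrote the file
    rows: list = field(default_factory=list)


@dataclass
class ToyTable:
    files: list = field(default_factory=list)
    pos_deletes: set = field(default_factory=set)   # {(file path, row index)}
    eq_deletes: list = field(default_factory=list)  # [(seq, primary key)]
    next_seq: int = 1

    def commit_batch(self, events):
        """Apply one batch of CDC events: ("upsert", row) or ("delete", row)."""
        seq = self.next_seq
        self.next_seq += 1
        new_file = DataFile(path=f"data-{seq}.parquet", seq=seq)
        written_at = {}  # primary key -> row index inside new_file
        for op, row in events:
            key = row["id"]
            if key in written_at:
                # In-batch change: the writer knows the exact position.
                self.pos_deletes.add((new_file.path, written_at.pop(key)))
            else:
                # Out-of-batch change: older location unknown, delete by key.
                self.eq_deletes.append((seq, key))
            if op == "upsert":
                written_at[key] = len(new_file.rows)
                new_file.rows.append(row)
        self.files.append(new_file)

    def scan(self):
        """Merge-on-read: filter every row against both kinds of delete."""
        live = []
        for f in self.files:
            for i, row in enumerate(f.rows):
                if (f.path, i) in self.pos_deletes:
                    continue
                # An equality delete only hides rows from strictly older files.
                if any(k == row["id"] and s > f.seq for s, k in self.eq_deletes):
                    continue
                live.append(row)
        return live

    def compact(self):
        """Rewrite live rows into one file and drop all delete records."""
        rows = self.scan()
        seq = self.next_seq
        self.next_seq += 1
        self.files = [DataFile(f"compacted-{seq}.parquet", seq, rows)]
        self.pos_deletes.clear()
        self.eq_deletes.clear()


table = ToyTable()
table.commit_batch([("upsert", {"id": 1, "v": "a"}), ("upsert", {"id": 2, "v": "b"})])
table.commit_batch([
    ("upsert", {"id": 1, "v": "a2"}),  # out-of-batch: equality delete on id=1
    ("upsert", {"id": 1, "v": "a3"}),  # in-batch: position delete on the a2 row
    ("delete", {"id": 2}),             # out-of-batch: equality delete on id=2
])
before = table.scan()
table.compact()
after = table.scan()
```

In this sketch every `scan` has to check each row against the accumulated equality deletes, which is the read amplification that compaction removes. After `compact`, the table holds only data files, which is roughly the kind of "clean" state an engine without equality-delete support could read.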