Hasty Briefs (beta)

The Equality Delete Problem in Apache Iceberg

12 days ago
  • #Data Infrastructure
  • #CDC
  • #Apache Iceberg
  • Apache Iceberg is a hot topic in data infrastructure, with major acquisitions by Databricks and Snowflake highlighting its importance.
  • Streaming data from Postgres to Iceberg in real time is complex, especially with CDC (Change Data Capture) systems like Debezium.
  • Iceberg supports two types of row-level deletes: position deletes (which identify a row by data file path and row position) and equality deletes (which identify rows by column values, typically the primary key).
  • Equality deletes are the only viable option for streaming CDC, because the writer does not have to look up where the old row is stored, but they shift the cost to query time, when readers must apply the deletes.
  • Major query engines like Snowflake, Databricks, and Redshift have limited or no support for equality deletes, complicating CDC ingestion.
  • RisingWave provides an end-to-end solution for streaming CDC into Iceberg, optimizing for high-frequency updates and deletes.
  • RisingWave uses a hybrid approach: position deletes for rows that were written earlier in the same batch, and equality deletes for rows that were written in earlier batches.
  • Compaction in RisingWave reduces read amplification by periodically merging equality delete files into data files.
  • RisingWave ensures cross-engine compatibility by producing 'clean' versions of data for engines that don't support equality deletes.
  • Siemens successfully adopted RisingWave to streamline their CDC-to-Iceberg pipeline, reducing latency and operational overhead.
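
To make the two delete types and the hybrid approach above concrete, here is a minimal sketch in plain Python. It is not the Iceberg or RisingWave API: the functions `write_batch` and `scan`, the tuple layouts, and the `id`/`value` columns are all hypothetical and chosen only for illustration. The one rule taken from the Iceberg spec is that an equality delete applies only to data files with a strictly smaller sequence number.

```python
# Illustrative sketch only: plain Python, not the Iceberg or RisingWave API.
# All names (write_batch, scan, the tuple layouts) are hypothetical.

def write_batch(changes, path):
    """Turn one batch of CDC changes into rows plus delete records.

    changes: list of (op, key, value), op in {"insert", "update", "delete"}.
    Returns (rows, pos_deletes, eq_deletes).
    """
    rows = []            # rows written to this batch's data file
    pos_deletes = set()  # (file path, row position) pairs
    eq_deletes = set()   # primary-key values
    live = {}            # key -> position of its current row in this batch

    for op, key, value in changes:
        if op in ("update", "delete"):
            if key in live:
                # Old row was written in this batch: its position is known.
                pos_deletes.add((path, live.pop(key)))
            else:
                # Old row lives in some earlier file: delete it by key.
                eq_deletes.add(key)
        if op in ("insert", "update"):
            live[key] = len(rows)
            rows.append({"id": key, "value": value})
    return rows, pos_deletes, eq_deletes


def scan(data_files, pos_deletes, eq_deletes):
    """Merge-on-read over data files and delete records.

    data_files: list of (seq, path, rows); eq_deletes: list of (seq, key).
    """
    result = []
    for seq, path, rows in data_files:
        # An equality delete only hides rows from strictly older files.
        dead_keys = {key for delete_seq, key in eq_deletes if delete_seq > seq}
        for pos, row in enumerate(rows):
            if (path, pos) in pos_deletes or row["id"] in dead_keys:
                continue
            result.append(row)
    return result


# Two batches, committed with sequence numbers 1 and 2.
rows1, pos1, eq1 = write_batch(
    [("insert", 1, "a"), ("insert", 2, "b"), ("update", 1, "a2")], "f1")
rows2, pos2, eq2 = write_batch(
    [("update", 2, "b2"), ("delete", 1, None),
     ("insert", 3, "c"), ("delete", 3, None)], "f2")

table = scan(
    [(1, "f1", rows1), (2, "f2", rows2)],
    pos1 | pos2,
    [(1, k) for k in eq1] + [(2, k) for k in eq2],
)
print(table)
```

In this sketch every reader pays for the key lookup in `scan`, which is the query-time cost the summary refers to. Writing the output of `scan` back as new data files, with no delete records attached, is a rough picture of what compaction achieves and of the "clean" version that engines without equality-delete support can read.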