Parquet Content-Defined Chunking
- #Parquet
- #Hugging Face
- #Data Deduplication
- Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas, enabling efficient deduplication of Parquet files on content-addressable storage systems like Hugging Face's Xet storage layer.
- CDC dramatically reduces data transfer and storage costs by uploading or downloading only the changed data chunks.
- Enable CDC by passing the `use_content_defined_chunking` argument to the Parquet write functions in PyArrow and Pandas (see the first sketch after this list).
- Hugging Face hosts nearly 21 PB of datasets, with Parquet files accounting for over 4 PB, making optimization a priority.
- The Xet storage layer leverages content-defined chunking to deduplicate chunks of data efficiently, reducing storage costs and improving download/upload speeds.
- Parquet's layout and column-chunk-based compression can produce very different byte-level representations after minor data changes, leading to suboptimal deduplication performance.
- Content-defined chunking derives chunk boundaries from the data itself, minimizing byte-level differences between similar files and improving deduplication performance (see the rolling-hash sketch after this list).
- Demonstrated use cases include re-uploading exact copies, adding/removing columns, changing column types, appending new rows, inserting/deleting rows, varying row-group sizes, and file-level splits.
- Parquet CDC combined with the Xet storage layer efficiently deduplicates data across multiple files, even when the data is split at different boundaries.
- Pandas also supports the Parquet CDC feature, enabling efficient deduplication when filtering and re-uploading datasets (see the last sketch below).
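A minimal sketch of enabling the flag, assuming a recent PyArrow release that ships the `use_content_defined_chunking` writer option; the table contents and file names here are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; a real dataset would come from your own pipeline.
table = pa.table({
    "id": list(range(1_000)),
    "value": [f"row-{i}" for i in range(1_000)],
})

# PyArrow: opt in via the writer keyword.
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

# Pandas forwards extra keyword arguments to the pyarrow engine,
# so the same flag works from DataFrame.to_parquet.
table.to_pandas().to_parquet(
    "data.parquet", engine="pyarrow", use_content_defined_chunking=True
)
```

With `huggingface_hub` installed, the same calls can target an `hf://datasets/<user>/<repo>/...` path, which is where the Xet-backed deduplication actually pays off.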
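To see why content-defined boundaries help, here is a toy chunker; it is only an illustration, and the rolling hash, mask, and chunk sizes are invented rather than what Xet or PyArrow actually use. Because a boundary depends only on the last few bytes, an insertion disturbs just the chunks around the edit while the rest stay byte-identical:

```python
import random

def cdc_chunks(data: bytes, mask: int = 0x3F) -> list[bytes]:
    """Split where a rolling hash of the most recent bytes matches
    `mask`, so boundaries depend on content, not absolute offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # toy rolling hash
        if h & mask == mask:                # ~1 in 64 bytes ends a chunk
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

random.seed(0)
original = random.randbytes(100_000)
edited = original[:1000] + b"INSERTED" + original[1000:]

shared = set(cdc_chunks(original)) & set(cdc_chunks(edited))
print(f"{len(shared)} of {len(cdc_chunks(original))} chunks survive the edit")
```

With fixed-size chunking, the 8 inserted bytes would shift every later boundary and invalidate almost all chunks; here only the chunks touching the edit change.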
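Finally, a sketch of the filter-and-upload workflow from the last bullet, reusing the file written in the first sketch; against a Xet-backed `hf://` destination, only the chunks not already stored would actually be transferred:

```python
import pandas as pd

df = pd.read_parquet("data.parquet")

# Keep a contiguous slice; with CDC the retained rows map onto the
# same content chunks, so a Xet-backed store can skip re-uploading them.
subset = df[df["id"] < 500]

subset.to_parquet(
    "filtered.parquet", engine="pyarrow", use_content_defined_chunking=True
)
```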