Parquet Content-Defined Chunking
- #Parquet
- #Hugging Face
- #Data Deduplication
- Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas, enabling efficient deduplication of Parquet files on content-addressable storage systems like Hugging Face's Xet storage layer.
- CDC dramatically reduces data transfer and storage costs by uploading or downloading only the changed data chunks.
- Enable CDC by passing the `use_content_defined_chunking` argument to the Parquet write functions in PyArrow and Pandas (see the first sketch after this list).
- Hugging Face hosts nearly 21 PB of datasets, with Parquet files accounting for over 4 PB, making optimization a priority.
- The Xet storage layer leverages content-defined chunking to deduplicate chunks of data efficiently, reducing storage costs and improving download/upload speeds.
- Parquet's layout and column-chunk-based compression can produce very different byte-level representations after minor data changes, leading to suboptimal deduplication performance.
- Content-defined chunking derives chunk boundaries from the data itself, minimizing byte-level differences between similar files and improving deduplication performance (see the rolling-hash sketch after this list).
- Demonstrated use cases include re-uploading exact copies, adding/removing columns, changing column types, appending new rows, inserting/deleting rows, varying row-group sizes, and file-level splits.
- Parquet CDC combined with the Xet storage layer efficiently deduplicates data across multiple files, even when the data is split at different boundaries.
- Pandas also supports the Parquet CDC feature, enabling efficient deduplication when filtering and re-uploading datasets (see the last sketch below).
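A minimal sketch of enabling the flag, assuming a recent PyArrow release that ships the `use_content_defined_chunking` writer option; the table contents and file names here are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table; a real dataset would come from your own pipeline.
table = pa.table({
    "id": list(range(1_000)),
    "value": [f"row-{i}" for i in range(1_000)],
})

# PyArrow: opt in via the writer keyword.
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

# Pandas forwards extra keyword arguments to the pyarrow engine,
# so the same flag works from DataFrame.to_parquet.
table.to_pandas().to_parquet(
    "data.parquet", engine="pyarrow", use_content_defined_chunking=True
)
```

With `huggingface_hub` installed, the same calls can target an `hf://datasets/<user>/<repo>/...` path, which is where the Xet-backed deduplication actually pays off.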
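To see why content-defined boundaries help, here is a toy chunker; it is only an illustration, and the rolling hash, mask, and chunk sizes are invented rather than what Xet or PyArrow actually use. Because a boundary depends only on the last few bytes, an insertion disturbs just the chunks around the edit while the rest stay byte-identical:

```python
import random

def cdc_chunks(data: bytes, mask: int = 0x3F) -> list[bytes]:
    """Split where a rolling hash of the most recent bytes matches
    `mask`, so boundaries depend on content, not absolute offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # toy rolling hash
        if h & mask == mask:                # ~1 in 64 bytes ends a chunk
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

random.seed(0)
original = random.randbytes(100_000)
edited = original[:1000] + b"INSERTED" + original[1000:]

shared = set(cdc_chunks(original)) & set(cdc_chunks(edited))
print(f"{len(shared)} of {len(cdc_chunks(original))} chunks survive the edit")
```

With fixed-size chunking, the 8 inserted bytes would shift every later boundary and invalidate almost all chunks; here only the chunks touching the edit change.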
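Finally, a sketch of the filter-and-upload workflow from the last bullet, reusing the file written in the first sketch; against a Xet-backed `hf://` destination, only the chunks not already stored would actually be transferred:

```python
import pandas as pd

df = pd.read_parquet("data.parquet")

# Keep a contiguous slice; with CDC the retained rows map onto the
# same content chunks, so a Xet-backed store can skip re-uploading them.
subset = df[df["id"] < 500]

subset.to_parquet(
    "filtered.parquet", engine="pyarrow", use_content_defined_chunking=True
)
```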