
Parquet Content-Defined Chunking

  • #Parquet
  • #Hugging Face
  • #Data Deduplication
  • Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas, enabling efficient deduplication of Parquet files on content-addressable storage systems like Hugging Face's Xet storage layer.
  • CDC dramatically reduces data transfer and storage costs by uploading or downloading only the changed data chunks.
  • Enable CDC by passing the `use_content_defined_chunking` argument to PyArrow or Pandas write functions (see the PyArrow and Pandas sketches after this list).
  • Hugging Face hosts nearly 21 PB of datasets, with Parquet files accounting for over 4 PB, making optimization a priority.
  • The Xet storage layer uses content-defined chunking to deduplicate chunks of data, reducing storage costs and improving download and upload speeds.
  • Parquet's file layout and column-chunk-based compression mean that minor data changes can shift the byte-level representation of much of the file, which hurts deduplication.
  • Content-defined chunking derives chunk boundaries from the data itself, minimizing byte-level differences between similar files and improving deduplication (a toy boundary-finding sketch follows this list).
  • Demonstrated use cases include re-uploading exact copies, adding/removing columns, changing column types, appending new rows, inserting/deleting rows, varying row-group sizes, and file-level splits.
  • Parquet CDC combined with the Xet storage layer efficiently deduplicates data across multiple files, even when the data is split at different boundaries.
  • Pandas also supports the Parquet CDC option, enabling efficient deduplication when filtering and re-uploading datasets (see the Pandas sketch below).
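
To make the boundary-finding idea concrete, here is a toy content-defined chunker. It is a minimal sketch only: real implementations use a fast Gear-style rolling hash rather than re-hashing a window with BLAKE2, and `cdc_chunks`, `mask_bits`, and the 48-byte window are illustrative choices, not the algorithm PyArrow or Xet actually uses.

```python
import hashlib
import os

def cdc_chunks(data: bytes, mask_bits: int = 12) -> list[bytes]:
    """Split data into content-defined chunks (toy version).

    A boundary is declared wherever the low `mask_bits` bits of a hash
    over a sliding window are all zero. Boundaries therefore depend only
    on local content: an insertion early in the stream shifts nearby
    boundaries, while later chunks remain byte-identical.
    """
    mask = (1 << mask_bits) - 1  # expected chunk size ~ 2**mask_bits bytes
    window = 48                  # bytes hashed at each position
    chunks, start = [], 0
    for i in range(window, len(data)):
        digest = hashlib.blake2b(data[i - window:i], digest_size=8).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

# One inserted byte near the front only disturbs the first chunk or two;
# the remaining chunks are byte-identical and deduplicate.
a = os.urandom(1 << 16)
b = a[:100] + b"X" + a[100:]
ca, cb = cdc_chunks(a), cdc_chunks(b)
print(len(set(ca) & set(cb)), "of", len(ca), "chunks shared")
```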
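
A minimal sketch of enabling the feature from PyArrow, assuming a recent PyArrow release that exposes the `use_content_defined_chunking` writer option; the table contents and the `data.parquet` path are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(1_000)),
    "value": [i * 0.5 for i in range(1_000)],
})

# Write with content-defined chunking: data page boundaries are derived
# from the content rather than fixed row counts, so small edits only
# perturb nearby pages and the rest of the file deduplicates on
# content-addressable storage such as Xet.
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```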
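
And the equivalent from Pandas, which forwards extra keyword arguments to the PyArrow writer; the filter and file names are illustrative, and the same call should work when writing to an `hf://datasets/...` path with `huggingface_hub` installed:

```python
import pandas as pd

df = pd.read_parquet("data.parquet")

# Filter, then rewrite with CDC enabled; pandas passes the keyword
# through to the PyArrow writer, so only the chunks that actually
# changed need to be transferred when the file is re-uploaded.
df[df["value"] > 100].to_parquet(
    "filtered.parquet",
    engine="pyarrow",
    use_content_defined_chunking=True,
)
```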