Hasty Briefs

Pg_lake: Postgres with Iceberg and data lake access

5 months ago
  • #Data Lake
  • #PostgreSQL
  • #Iceberg
  • pg_lake integrates Iceberg and data lake files into Postgres, enabling it to function as a standalone lakehouse system.
  • Supports transactions and fast queries on Iceberg tables, and works directly with raw data files in object stores like S3.
  • Allows creating and modifying Iceberg tables from PostgreSQL with transactional guarantees.
  • Enables querying and importing data from Parquet, CSV, JSON, and Iceberg files stored in S3 or compatible object stores.
  • Supports exporting query results back to S3 in Parquet, CSV, or JSON formats using COPY commands.
  • Reads geospatial formats like GeoJSON and Shapefiles, and supports compressed files with .gz and .zst extensions.
  • Features a built-in map type for semi-structured or key-value data.
  • Combines heap, Iceberg, and external files in the same SQL queries with full transactional guarantees.
  • Infers table columns and types from external data sources like Iceberg, Parquet, JSON, and CSV files.
  • Leverages DuckDB’s query engine for fast execution within Postgres.
  • Two setup methods: Docker for easy testing and building from source for manual setup or development.
  • A working setup includes the PostgreSQL extensions, the pgduck_server application, and S3-compatible storage.
  • pgduck_server is a standalone process using DuckDB to execute queries, accessible via psql on port 5332.
  • Supports setting memory limits, init file paths, and cache directories for pgduck_server.
  • Relies on the DuckDB secrets manager for credentials, with support for AWS and GCP.
  • Allows creating Iceberg tables with 'USING iceberg' and querying them directly.
  • Supports COPY commands for importing/exporting data in Parquet, CSV, or JSON formats.
  • Modular design with components like pg_lake_iceberg, pg_lake_table, pg_lake_copy, and pg_lake_engine.
  • Originally developed by Crunchy Data, which was later acquired by Snowflake; open-sourced as pg_lake in 2025.
  • Depends on the third-party projects Apache Avro and DuckDB, with patches applied during the build.
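
To make the 'USING iceberg' and COPY points above concrete, here is a minimal Python sketch that only builds the SQL text one might send to a Postgres instance with pg_lake installed. The helper names, the table name `events`, and the bucket `example-bucket` are hypothetical. The sketch also assumes that the file format can be inferred from the file extension in the URL; if that does not hold, an explicit format option would be needed on the COPY statements.

```python
# Illustrative only: builds SQL strings, does not connect to a database.
# Helper names, table names, and bucket URLs are hypothetical examples.

def create_iceberg_table(name: str, columns: dict[str, str]) -> str:
    """Build a CREATE TABLE statement that uses the 'USING iceberg' clause."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns.items())
    return f"CREATE TABLE {name} ({cols}) USING iceberg;"


def copy_from(table: str, url: str) -> str:
    """Build a COPY ... FROM statement that imports a file from object storage."""
    return f"COPY {table} FROM '{url}';"


def copy_to(query: str, url: str) -> str:
    """Build a COPY (query) TO statement that exports query results."""
    return f"COPY ({query}) TO '{url}';"


# Note: identifiers and URLs are interpolated without quoting or escaping,
# so this is not safe for untrusted input.
statements = [
    create_iceberg_table("events", {"id": "bigint", "payload": "text"}),
    copy_from("events", "s3://example-bucket/in/events.parquet"),
    copy_to("SELECT id FROM events", "s3://example-bucket/out/ids.parquet"),
]

for statement in statements:
    print(statement)
```

The printed statements could then be run through psql or any Postgres client against a server where the pg_lake extensions are set up.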