pg_lake: Postgres with Iceberg and data lake access

6 months ago

#Data Lake
#PostgreSQL
#Iceberg

pg_lake integrates Iceberg and data lake files into Postgres, enabling it to function as a standalone lakehouse system.
Supports transactions and fast queries on Iceberg tables, and works directly with raw data files in object stores like S3.
Allows creating and modifying Iceberg tables from PostgreSQL with transactional guarantees.
Enables querying and importing data from Parquet, CSV, JSON, and Iceberg files stored in S3 or compatible object stores.
Supports exporting query results back to S3 in Parquet, CSV, or JSON formats using COPY commands.
Reads geospatial formats like GeoJSON and Shapefiles, and supports compression with .gz and .zst.
Features a built-in map type for semi-structured or key-value data.
Combines heap tables, Iceberg tables, and external files in the same SQL queries with full transactional guarantees.
Infers table columns and types from external data sources like Iceberg, Parquet, JSON, and CSV files.
Leverages DuckDB’s query engine for fast execution within Postgres.
Two setup methods: Docker for easy testing and building from source for manual setup or development.
A working setup includes the PostgreSQL extensions, the pgduck_server application, and S3-compatible storage.
pgduck_server is a standalone process using DuckDB to execute queries, accessible via psql on port 5332.
Supports setting memory limits, init file paths, and cache directories for pgduck_server.
Relies on DuckDB secrets manager for credentials, with support for AWS and GCP.
Allows creating Iceberg tables with 'USING iceberg' and querying them directly.
Supports COPY commands for importing/exporting data in Parquet, CSV, or JSON formats.
Modular design with components like pg_lake_iceberg, pg_lake_table, pg_lake_copy, and pg_lake_engine.
Originally developed by Crunchy Data, which was later acquired by Snowflake; open-sourced as pg_lake in 2025.
Depends on the third-party projects Apache Avro and DuckDB, with patches applied during the build.
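
The table-creation, COPY, and schema-inference points above can be illustrated with a minimal SQL sketch. It assumes the pg_lake extensions are installed, pgduck_server is running, and credentials for the object store are already configured. The bucket name `my-bucket`, the table names, the `pg_lake_iceberg.default_location_prefix` setting, and the `pg_lake` foreign server name are assumptions for illustration and should be checked against the project's documentation.

```sql
-- Assumed setting: where new Iceberg tables store their files (hypothetical bucket).
SET pg_lake_iceberg.default_location_prefix TO 's3://my-bucket';

-- Create an Iceberg table from a query, using the 'USING iceberg' clause.
CREATE TABLE events_iceberg USING iceberg AS
SELECT i AS id, 'val_' || i AS val
FROM generate_series(0, 99) AS i;

-- Query it like any other Postgres table.
SELECT count(*) FROM events_iceberg;

-- Export query results to object storage; the format is assumed to be
-- inferred from the .parquet file extension.
COPY (SELECT * FROM events_iceberg)
TO 's3://my-bucket/exports/events.parquet';

-- Import the same file back into the table.
COPY events_iceberg
FROM 's3://my-bucket/exports/events.parquet';

-- Query raw files in place; the empty column list asks pg_lake to infer
-- columns and types from the files (server name 'pg_lake' is an assumption).
CREATE FOREIGN TABLE events_external ()
SERVER pg_lake
OPTIONS (path 's3://my-bucket/exports/*.parquet');
```

Because Iceberg tables and foreign tables are ordinary relations from the planner's point of view, `events_iceberg` and `events_external` could be joined with a regular heap table in a single query.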

