DuckDB Internals: Why Is DuckDB Fast? (Part 1)

3 days ago

DuckDB has grown from a 2019 research project into a widely adopted in-process analytical SQL database.
It is used in various applications including notebooks, ETL pipelines, dashboards, and embedded analytics.
Companies like MotherDuck, Hex, Omni, Evidence, Fivetran, and Rill build products around DuckDB.
DuckDB is fast and easy to use, with a single binary under 20 MB and no external dependencies.
It allows direct querying of files like Parquet, CSV, and JSON without needing to create tables.
Because DuckDB runs as a library inside the host process, it avoids client-server overhead and can exchange data in Arrow format without copying.
Query processing runs through parsing, binding, an optimizer of roughly 30 passes, and physical planning.
Optimizations include filter pushdown, subquery unnesting, join ordering, and runtime filter generation.
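
To make filter pushdown concrete, here is a minimal sketch in plain Python. It is not DuckDB's optimizer: the Scan, Join, and Filter classes and the push_down function are hypothetical names invented for this illustration, and the sketch assumes inner joins and column names that are unique across tables. It shows the core idea only: a filter sitting above a join is moved into the scan that produces the filtered column, so fewer rows reach the join.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    table: str
    columns: tuple
    filters: tuple = ()  # predicates evaluated while scanning


@dataclass
class Join:  # assumed to be an inner join in this sketch
    left: object
    right: object
    on: str


@dataclass
class Filter:
    child: object
    column: str
    op: str
    value: object


def _columns(node):
    """Collect the column names a plan node produces."""
    if isinstance(node, Scan):
        return set(node.columns)
    if isinstance(node, Join):
        return _columns(node.left) | _columns(node.right)
    return _columns(node.child)


def _push_filter(node, f):
    """Try to sink filter f below node; fall back to keeping it on top."""
    predicate = (f.column, f.op, f.value)
    if isinstance(node, Scan) and f.column in node.columns:
        return Scan(node.table, node.columns, node.filters + (predicate,))
    if isinstance(node, Join):
        if f.column in _columns(node.left):
            return Join(_push_filter(node.left, f), node.right, node.on)
        if f.column in _columns(node.right):
            return Join(node.left, _push_filter(node.right, f), node.on)
    return Filter(node, f.column, f.op, f.value)


def push_down(plan):
    """Rewrite the plan bottom-up, pushing every filter as low as possible."""
    if isinstance(plan, Filter):
        return _push_filter(push_down(plan.child), plan)
    if isinstance(plan, Join):
        return Join(push_down(plan.left), push_down(plan.right), plan.on)
    return plan


# Logical plan for: SELECT ... FROM orders JOIN customers ... WHERE country = 'NL'
plan = Filter(
    Join(
        Scan("orders", ("order_id", "customer_id", "amount")),
        Scan("customers", ("cust_id", "country")),
        on="customer_id = cust_id",
    ),
    "country", "=", "NL",
)
optimized = push_down(plan)
```

After the rewrite, the top-level Filter is gone and the predicate is attached to the customers scan, while the orders scan is unchanged.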
Execution uses vectorized processing and morsel-driven parallelism with pipelines and sinks.
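
The sketch below illustrates how vectorized processing and morsel-driven parallelism fit together, using only the Python standard library. It is a toy model, not DuckDB's engine: the function names are hypothetical, and VECTOR_SIZE and MORSEL_SIZE are illustrative values chosen for this example. Each worker takes a morsel (a contiguous range of rows), runs the pipeline over it one vector (a small batch) at a time, and the per-morsel partial results are merged at the end, which plays the role of the sink.

```python
from concurrent.futures import ThreadPoolExecutor

VECTOR_SIZE = 2048     # rows per batch; illustrative value for this sketch
MORSEL_SIZE = 10_000   # rows per unit of parallel work; illustrative value


def filter_and_sum_vector(values):
    """Process one vector in a tight loop: keep even values and sum them."""
    return sum(v for v in values if v % 2 == 0)


def process_morsel(column, start, end):
    """Run the pipeline over one morsel, vector by vector, into a partial sum."""
    partial = 0
    for offset in range(start, end, VECTOR_SIZE):
        vector = column[offset:min(offset + VECTOR_SIZE, end)]
        partial += filter_and_sum_vector(vector)
    return partial


def parallel_sum_of_evens(column, workers=4):
    """Split the column into morsels, process them on a pool, merge the partials."""
    morsels = [
        (start, min(start + MORSEL_SIZE, len(column)))
        for start in range(0, len(column), MORSEL_SIZE)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda m: process_morsel(column, *m), morsels)
        return sum(partials)  # the "sink": combine per-morsel results


column = list(range(100_000))
total = parallel_sum_of_evens(column)
```

Python threads will not speed this up because of the interpreter lock; the example is meant to show how the work is divided, not to measure performance.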
Storage uses a single-file columnar format with blocks, checksums, row groups, and zone maps.
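
As a rough illustration of how zone maps help, the following sketch stores a minimum and maximum per row group and uses them to skip groups that cannot contain a match. The RowGroup class, the group size of 100 rows, and the function names are assumptions made for this example and do not reflect DuckDB's actual storage layout.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list
    min_value: int  # zone map: smallest value in this group
    max_value: int  # zone map: largest value in this group


def build_row_groups(values, rows_per_group):
    """Split a column into row groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the zone map proves no row here can match, so skip it
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


values = list(range(1000))
groups = build_row_groups(values, rows_per_group=100)
matches, groups_read = scan_greater_than(groups, threshold=949)
```

With sorted data like this, the filter value > 949 touches a single row group out of ten. On randomly ordered data the min/max ranges overlap heavily and far fewer groups can be skipped.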
DuckDB can query Parquet files efficiently by leveraging their columnar structure and statistics.
CSV files are handled with a sniffer that detects dialect, column types, and headers automatically.
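
The three detection steps can be sketched with a toy sniffer built on Python's csv module. This is a simplified stand-in, not DuckDB's algorithm: the candidate delimiter list, the type names, and the header heuristic (a first row made entirely of text above at least one numeric column) are assumptions chosen for this illustration.

```python
import csv
import io

CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]  # assumed candidates for this sketch


def infer_type(text):
    """Classify one cell as BIGINT, DOUBLE, or VARCHAR by trying casts in order."""
    for caster, name in ((int, "BIGINT"), (float, "DOUBLE")):
        try:
            caster(text)
            return name
        except ValueError:
            pass
    return "VARCHAR"


def column_type(cells):
    """Pick the narrowest type that fits every cell in the column."""
    types = {infer_type(cell) for cell in cells}
    if not types:
        return "VARCHAR"
    if types == {"BIGINT"}:
        return "BIGINT"
    if types <= {"BIGINT", "DOUBLE"}:
        return "DOUBLE"
    return "VARCHAR"


def sniff(sample):
    """Detect delimiter, header presence, and column types from a text sample."""
    # Step 1, dialect: keep delimiters that give every row the same width (> 1),
    # preferring the one that yields the most columns.
    best = None
    for delimiter in CANDIDATE_DELIMITERS:
        rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
        widths = {len(row) for row in rows}
        if rows and len(widths) == 1 and len(rows[0]) > 1:
            if best is None or len(rows[0]) > len(best[1][0]):
                best = (delimiter, rows)
    if best is None:
        raise ValueError("could not detect a delimiter")
    delimiter, rows = best

    # Step 2, types: infer each column's type from every row after the first.
    body_types = [column_type(col) for col in zip(*rows[1:])]

    # Step 3, header: the first row is all text while the body has a numeric column.
    first_row_is_text = all(infer_type(cell) == "VARCHAR" for cell in rows[0])
    has_header = first_row_is_text and any(t != "VARCHAR" for t in body_types)

    types = body_types if has_header else [column_type(col) for col in zip(*rows)]
    return {"delimiter": delimiter, "has_header": has_header, "types": types}


sample = "id;name;price\n1;apple;0.5\n2;pear;1.25\n"
result = sniff(sample)
```

For this sample the sketch settles on a semicolon delimiter, treats the first row as a header, and types the columns as BIGINT, VARCHAR, and DOUBLE. A production sniffer also has to deal with quoting, escapes, dates, and inconsistent rows, which this example leaves out.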
