DuckDB Internals: Why Is DuckDB Fast? (Part 1)
3 days ago
- #OpenSource
- #Analytics
- #Database
- DuckDB evolved from a research project in 2019 to a widely adopted in-process analytical SQL database.
- It is used in various applications including notebooks, ETL pipelines, dashboards, and embedded analytics.
- Companies like MotherDuck, Hex, Omni, Evidence, Fivetran, and Rill build products around DuckDB.
- DuckDB is fast and easy to adopt: it ships as a single binary under 20 MB with no external dependencies.
- It can query files such as Parquet, CSV, and JSON directly, without first loading them into tables.
- Because DuckDB runs as a library inside the host process, it avoids client-server overhead and can access Arrow-format data without copying it.
- Query processing proceeds through parsing, binding, optimization (about 30 passes), and physical planning.
- Optimizations include filter pushdown, subquery unnesting, join ordering, and runtime filter generation.
- Execution uses vectorized processing and morsel-driven parallelism with pipelines and sinks.
- Storage uses a single-file columnar format with blocks, checksums, row groups, and zone maps.
- DuckDB can query Parquet files efficiently by leveraging their columnar structure and statistics.
- CSV files are handled with a sniffer that detects dialect, column types, and headers automatically.
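
The zone maps mentioned in the storage bullet can be illustrated with a small sketch. This is not DuckDB's implementation: the `RowGroup` class, the helper names, and the group size of 100 are all hypothetical, chosen only to show how per-group min/max statistics let a scan skip data that cannot match a filter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus its zone map (min and max of the values)."""
    values: List[int]
    min_value: int
    max_value: int


def build_row_groups(column: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size row groups and record min/max for each."""
    groups = []
    for start in range(0, len(column), group_size):
        chunk = column[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Zone-map check: if even the largest value fails the predicate,
        # no row in this group can match, so skip it without reading values.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Illustrative data: a sorted column, where zone maps are most selective.
column = list(range(1000))
groups = build_row_groups(column, group_size=100)
matches, groups_read = scan_greater_than(groups, threshold=949)
print(f"read {groups_read} of {len(groups)} row groups, {len(matches)} matches")
```

On sorted or clustered data most groups are skipped. On randomly ordered data nearly every group's range overlaps the predicate, so the statistics help far less.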
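
To make the CSV sniffer bullet concrete, here is a toy sniffer built on Python's standard `csv` module. It is a simplified sketch under stated assumptions, not DuckDB's algorithm: the candidate delimiter list, the three-type ladder (`BIGINT`, `DOUBLE`, `VARCHAR`), and the header heuristic are all choices made for illustration. A real sniffer also has to consider quoting, escapes, dates, and many more types.

```python
import csv
import io
from typing import List, Tuple

# Hypothetical candidate set; a real sniffer would try more dialect options.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def detect_delimiter(text: str) -> str:
    """Pick a delimiter that gives every row the same width, preferring more columns."""
    best, best_columns = ",", 0
    for delimiter in CANDIDATE_DELIMITERS:
        rows = csv.reader(io.StringIO(text), delimiter=delimiter)
        widths = {len(row) for row in rows if row}
        if len(widths) == 1:
            columns = widths.pop()
            if columns > best_columns:
                best, best_columns = delimiter, columns
    return best


def infer_type(values: List[str]) -> str:
    """Return the narrowest of BIGINT, DOUBLE, VARCHAR that parses every value."""
    for type_name, parser in (("BIGINT", int), ("DOUBLE", float)):
        try:
            for value in values:
                parser(value)
            return type_name
        except ValueError:
            continue
    return "VARCHAR"


def sniff(text: str) -> Tuple[str, bool, List[str]]:
    """Return (delimiter, has_header, column_types) for a small CSV sample."""
    delimiter = detect_delimiter(text)
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    rest = rows[1:]
    types_with_first = [infer_type(list(col)) for col in zip(*rows)]
    types_without_first = [infer_type(list(col)) for col in zip(*rest)]
    # Heuristic: the first row is a header if including it changes the inferred
    # types, i.e. it holds text where the remaining rows hold numbers.
    has_header = bool(rest) and types_with_first != types_without_first
    return delimiter, has_header, types_without_first if has_header else types_with_first


sample = "id;name;price\n1;apple;0.5\n2;pear;1.25\n"
print(sniff(sample))
```

One limitation of this heuristic: a file whose columns are all text never shows a type change, so its header row goes undetected.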