DuckDB Internals: Why Is DuckDB Fast? (Part 1)

3 days ago
  • #OpenSource
  • #Analytics
  • #Database
  • DuckDB has grown from a research project in 2019 into a widely adopted in-process analytical SQL database.
  • It is used in various applications including notebooks, ETL pipelines, dashboards, and embedded analytics.
  • Companies like MotherDuck, Hex, Omni, Evidence, Fivetran, and Rill build products around DuckDB.
  • DuckDB is fast and easy to use, with a single binary under 20 MB and no external dependencies.
  • It allows direct querying of files like Parquet, CSV, and JSON without needing to create tables.
  • Because DuckDB runs as a library inside the host process, it avoids client-server overhead and can exchange data in Arrow format without copying it.
  • Query processing involves parsing, binding, optimization with ~30 passes, and physical planning.
  • Optimizations include filter pushdown, subquery unnesting, join ordering, and runtime filter generation.
  • Execution uses vectorized processing and morsel-driven parallelism with pipelines and sinks.
  • Storage uses a single-file columnar format with blocks, checksums, row groups, and zone maps.
  • DuckDB can query Parquet files efficiently by leveraging their columnar structure and statistics.
  • CSV files are handled with a sniffer that detects dialect, column types, and headers automatically.
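
The zone-map point above can be made concrete with a small sketch. It is not DuckDB's storage code: the `RowGroup` class, the `build_row_groups` and `scan_greater_than` helpers, and the group size are all hypothetical names and values chosen for illustration. The idea shown is only that keeping a minimum and maximum per row group lets a scan skip groups that cannot contain a match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus its min/max statistics (the zone map)."""
    values: List[int]
    min_value: int
    max_value: int


def build_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into fixed-size row groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # If even the largest value fails the predicate, skip the whole group.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = build_row_groups(list(range(100)), group_size=10)
matches, groups_read = scan_greater_than(groups, threshold=74)
print(len(groups), groups_read, len(matches))
```

With sorted input, as here, most row groups are pruned. With randomly ordered data the min/max ranges overlap and far fewer groups can be skipped, which is why such statistics help most on naturally ordered columns such as timestamps.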
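
The execution point above (vectorized processing, pipelines, sinks) can also be sketched in a few lines. This is a single-threaded toy under stated assumptions, not DuckDB's engine: `scan`, `filter_even`, `SumSink`, `run_pipeline`, and the batch size are hypothetical, and morsel-driven parallelism is deliberately left out. It shows only that operators pass batches of values rather than single rows, and that a pipeline ends in a sink that accumulates the result.

```python
from typing import Iterable, Iterator, List

VECTOR_SIZE = 2048  # batch size picked for this sketch


def scan(column: List[int], vector_size: int = VECTOR_SIZE) -> Iterator[List[int]]:
    """Source operator: emit the column one batch (vector) at a time."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def filter_even(batches: Iterable[List[int]]) -> Iterator[List[int]]:
    """Streaming operator: apply the predicate to a whole batch per call."""
    for batch in batches:
        yield [v for v in batch if v % 2 == 0]


class SumSink:
    """Sink operator: consumes batches and holds the aggregate state."""

    def __init__(self) -> None:
        self.total = 0
        self.batches_seen = 0

    def consume(self, batch: List[int]) -> None:
        self.total += sum(batch)
        self.batches_seen += 1


def run_pipeline(column: List[int], vector_size: int = VECTOR_SIZE) -> SumSink:
    """Drive the pipeline scan -> filter -> sink over the whole column."""
    sink = SumSink()
    for batch in filter_even(scan(column, vector_size)):
        sink.consume(batch)
    return sink


result = run_pipeline(list(range(10)), vector_size=4)
print(result.total, result.batches_seen)
```

Working on batches amortizes per-call overhead across many values. A parallel version would hand different ranges of the input to different worker threads and merge their sink states at the end, which this sketch does not attempt.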