Vortex – An extensible, state-of-the-art columnar file format
10 days ago
- #data-processing
- #open-source
- #columnar-format
- Vortex is a next-generation columnar file format designed for high-performance data processing.
- The project reports 100x faster random-access reads, 10-20x faster scans, and 5x faster writes compared to Apache Parquet.
- Its architecture is extensible, with pluggable encodings, type system, compression, and layout strategies.
- Vortex is open source under the Apache-2.0 license and governed by the Linux Foundation (LF AI & Data).
- Integrations include Arrow, DataFusion, DuckDB, Spark, Pandas, and Polars, with Apache Iceberg support upcoming.
- The file format has been stable since version 0.36.0, so later releases remain backward compatible with files written from that version onward.
- The logical layer (types) and the physical layer (encodings) are strictly separated, and encodings can be either built-in or supplied as extensions.
- Other features include zero-copy Arrow integration, extensible encodings, cascading compression, and rich statistics.
- It can be installed with Cargo for Rust or uv for Python, and the 'vx' CLI tool is available for browsing files.
- For best performance, the project suggests using the MiMalloc allocator.
- Security vulnerabilities can be reported to [email protected].
- Vortex acknowledges contributions from academic and open-source communities, including BtrBlocks, FastLanes, FSST, and Apache projects.
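
The "cascading compression" item above refers to feeding the output of one lightweight encoding into another. The sketch below is a toy illustration of that general idea in plain Python, not Vortex's implementation or API: it dictionary-encodes a string column and then run-length-encodes the resulting integer codes. All function names (`dict_encode`, `rle_encode`, `cascade_encode`, `cascade_decode`) are hypothetical and exist only for this example.

```python
from itertools import groupby


def dict_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    table = sorted(set(values))
    index = {v: i for i, v in enumerate(table)}
    return table, [index[v] for v in values]


def rle_encode(codes):
    """Collapse runs of equal codes into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def cascade_encode(values):
    """First stage: dictionary encoding. Second stage: RLE over the codes."""
    table, codes = dict_encode(values)
    return table, rle_encode(codes)


def cascade_decode(table, runs):
    """Undo the cascade: expand each run, then map codes back to values."""
    return [table[code] for code, length in runs for _ in range(length)]


# Illustrative data only: a low-cardinality column with long runs.
column = ["us", "us", "us", "de", "de", "us", "fr", "fr", "fr", "fr"]
table, runs = cascade_encode(column)

# The cascade is lossless: decoding returns the original column.
assert cascade_decode(table, runs) == column
```

Each stage works on the previous stage's output, so the second encoding operates on small integers rather than strings. A real columnar format would presumably choose the stages per column based on statistics; this sketch hard-codes one fixed pipeline.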