Lance v2: A columnar container format for modern data (2024)
9 months ago
- #data-format
- #AI-ML
- #performance
- Lance v2 is introduced to serve AI/ML workloads that existing formats such as Parquet handle inefficiently.
- Key use cases for Lance v2 include point lookups, wide columns, very wide schemas, flexible encodings, and flexible metadata.
- Lance v2 eliminates row groups, which lets pages be written at an ideal size and decouples I/O from compute for better performance.
- The format allows columns to be of different lengths and supports writing data 'array at a time' or 'batch at a time'.
- Lance v2 treats encodings as extensions, making it easy to add new encodings without modifying the file format.
- The format does not enforce a type system, keeping the specification simple and avoiding ecosystem fragmentation.
- Flexibility in data placement (page buffer, column buffer, file buffer) enables new use cases beyond traditional tabular data.
- Statistics in Lance v2 are part of the encoding process, allowing for various forms like zone maps and bloom filters.
- The initial implementation of Lance v2 is available, with performance on par with the best Parquet readers.
- Community help is sought for additional use cases, benchmarks, integration tests, and ecosystem integrations.
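
As an illustration of the zone maps mentioned in the statistics point above, here is a minimal Python sketch of the general technique: keep a min/max summary per page, then skip pages whose range cannot match a filter. This is not Lance's API or on-disk layout. `PageZone`, `build_zone_map`, and `pages_to_scan` are hypothetical names, and the fixed page size is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PageZone:
    """Min/max summary for one page of a single column (hypothetical structure)."""
    first_row: int
    num_rows: int
    min_value: int
    max_value: int


def build_zone_map(values: Sequence[int], page_size: int) -> List[PageZone]:
    """Split a column into fixed-size pages and record min/max for each page."""
    zones = []
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        zones.append(PageZone(start, len(page), min(page), max(page)))
    return zones


def pages_to_scan(zones: List[PageZone], low: int, high: int) -> List[PageZone]:
    """Keep only pages whose [min, max] range overlaps the query range [low, high]."""
    return [z for z in zones if z.max_value >= low and z.min_value <= high]


# Toy column: three pages of four values each (assumed data, for illustration only).
column = [3, 7, 1, 9, 42, 40, 45, 41, 12, 15, 11, 18]
zones = build_zone_map(column, page_size=4)

# A filter such as 40 <= x <= 50 only needs the middle page.
candidates = pages_to_scan(zones, low=40, high=50)
print([(z.first_row, z.num_rows) for z in candidates])  # prints [(4, 4)]
```

A reader that consults such a summary first can avoid fetching and decoding the other two pages entirely. A bloom filter would play the same role for equality lookups on values that min/max ranges cannot exclude.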