Bauplan – Git-for-data pipelines on object storage
a year ago
- #python
- #serverless
- #data-platform
- Bauplan is a Pythonic data platform for large-scale data pipelines and Git-for-data over S3 data lakes.
- It allows running ML workflows, AI applications, and data transformation pipelines without managing infrastructure.
- Built by ML and data engineers to simplify cloud infrastructure management.
- Simple: Write pipelines as Python functions without containerization or Spark.
- Robust: Features Git-for-data and Refs for versioning, reproducibility, and auditability.
- Pythonic by design: No DSLs, YAML, or Spark required.
- Work with tables in S3: Convert Parquet and CSV files into Iceberg tables with ACID transactions.
- Git-for-data: Create zero-copy branches for safe collaboration.
- Serverless pipelines: Run stateless Python functions in the cloud.
- SQL everywhere: Run queries across branches and tables in S3.
- CI/CD for data: Automate testing and deployment of pipelines.
- Version and reproduce with Refs: Track pipeline runs for reproducibility and audits.
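
To make the Git-for-data and Refs ideas above concrete, here is a minimal, self-contained sketch in plain Python. It does not use the Bauplan SDK and is not how Bauplan is implemented. `ToyCatalog`, `Commit`, the method names, and the `s3://example-bucket/...` paths are all hypothetical, chosen only to illustrate the concept. The assumption is that a branch is just a named pointer to an immutable snapshot of table metadata, so creating a branch copies a pointer rather than any data files, and a ref (here a commit id) can be read back later for reproducibility.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot mapping table names to their data-file paths."""
    id: int
    tables: Dict[str, Tuple[str, ...]]
    parent: Optional[int] = None


class ToyCatalog:
    """Hypothetical in-memory catalog; illustrates the idea only."""

    def __init__(self) -> None:
        self.commits: Dict[int, Commit] = {0: Commit(id=0, tables={})}
        self.branches: Dict[str, int] = {"main": 0}

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # Zero-copy: only the pointer to the head commit is duplicated.
        self.branches[name] = self.branches[from_branch]

    def write(self, branch: str, table: str, files: list) -> int:
        # Each write creates a new commit; older commits are never mutated.
        head = self.commits[self.branches[branch]]
        tables = dict(head.tables)
        tables[table] = tuple(files)
        commit = Commit(id=len(self.commits), tables=tables, parent=head.id)
        self.commits[commit.id] = commit
        self.branches[branch] = commit.id
        return commit.id

    def read(self, ref: Union[str, int], table: str) -> Tuple[str, ...]:
        # A ref is either a branch name or a commit id.
        commit_id = self.branches[ref] if isinstance(ref, str) else ref
        return self.commits[commit_id].tables[table]

    def _is_ancestor(self, ancestor_id: int, commit_id: Optional[int]) -> bool:
        while commit_id is not None:
            if commit_id == ancestor_id:
                return True
            commit_id = self.commits[commit_id].parent
        return False

    def merge(self, source: str, into: str = "main") -> None:
        # Fast-forward merges only, to keep the sketch short.
        if not self._is_ancestor(self.branches[into], self.branches[source]):
            raise ValueError("not a fast-forward merge")
        self.branches[into] = self.branches[source]


catalog = ToyCatalog()
v1 = catalog.write("main", "trips", ["s3://example-bucket/trips/part-0.parquet"])

catalog.create_branch("dev")  # no data files are copied
catalog.write("dev", "trips", [
    "s3://example-bucket/trips/part-0.parquet",
    "s3://example-bucket/trips/part-1.parquet",
])

main_before = catalog.read("main", "trips")  # main is unaffected by the dev write
catalog.merge("dev")
main_after = catalog.read("main", "trips")   # main now sees the dev changes
pinned = catalog.read(v1, "trips")           # reading by ref reproduces the old state
```

Work on `dev` stays invisible to `main` until the merge, and the commit id `v1` keeps resolving to the pre-merge state, which is the property that makes runs auditable and reproducible.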