Building Jetflow: a framework for performant data pipelines at Cloudflare
9 months ago
- #big-data
- #data-engineering
- #cloudflare
- Cloudflare's Business Intelligence team manages a petabyte-scale data lake, ingesting 141 billion rows daily from various sources.
- Existing ELT solutions couldn't meet Cloudflare's growing data needs, leading to the development of Jetflow, a custom framework.
- Jetflow achieved over a 100x efficiency improvement measured in GB-seconds (memory × runtime), reducing job times from 48 hours to 5.5 hours while using less memory.
- Throughput improved by more than 10x, with ingestion rates rising from 60,000-80,000 rows per second to 2-5 million rows per second per database connection.
- Jetflow's modular design supports extensibility, working with ClickHouse, Postgres, Kafka, SaaS APIs, and Google BigQuery among others.
- Key requirements for Jetflow included performance, backwards compatibility, ease of use, customizability, and testability.
- The framework breaks pipelines down into Consumer, Transformer, and Loader stages, wired together via YAML configuration for flexibility and ease of use (see the interface sketch after this list).
- Data is divided into a RunInstance, Partitions, and Batches, enabling idempotent processing and efficient parallelization (see the type sketch after this list).
- Jetflow uses Apache Arrow as its internal in-memory data format for compatibility, efficiency, and minimal serialization overhead (see the example after this list).
- Optimizations include reading data in columnar form where sources support it, avoiding unnecessary row-to-column conversions.
- Case studies on ClickHouse and Postgres show significant performance gains from optimized drivers that handle the data closer to its wire format (see the Postgres sketch after this list).
- As of early July 2025, Jetflow ingests 77 billion records daily, with plans to migrate all remaining jobs and reach the full 141 billion records per day.
- Future plans include open-sourcing Jetflow and expanding the team to further develop such tools.
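The summary does not include Jetflow's actual API, but a minimal sketch of how the Consumer, Transformer, and Loader stages could be expressed as Go-style interfaces might look like the following; all names and signatures here are illustrative assumptions, not the framework's real code:

```go
package jetflow

import "context"

// Batch stands in for the unit of data that moves between stages
// (internally Jetflow uses Arrow record batches for this).
type Batch interface{}

// Consumer reads batches from a source system (ClickHouse, Postgres, Kafka,
// a SaaS API, BigQuery, ...) and streams them downstream.
type Consumer interface {
	Consume(ctx context.Context, out chan<- Batch) error
}

// Transformer applies an optional per-batch transformation.
type Transformer interface {
	Transform(ctx context.Context, in Batch) (Batch, error)
}

// Loader writes batches into the destination, e.g. the data lake.
type Loader interface {
	Load(ctx context.Context, in <-chan Batch) error
}
```

A YAML pipeline definition would then name a consumer, zero or more transformers, and a loader, so a new source or sink can be added by implementing a single interface.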
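Similarly, the RunInstance / Partition / Batch hierarchy can be pictured with a few hypothetical types; the field names below are assumptions for illustration only:

```go
package jetflow

import "time"

// RunInstance identifies one idempotent run of a pipeline, e.g. one time
// window of data: re-running it replaces rather than duplicates its output.
type RunInstance struct {
	Pipeline string
	Window   time.Time
}

// Partition is an independently processable slice of a run, so partitions
// can be fanned out across workers or database connections in parallel.
type Partition struct {
	Run RunInstance
	Key string // e.g. a shard, table segment, or ID range
}

// Batch describes one unit of data streamed through a partition's stages.
type Batch struct {
	Partition Partition
	Rows      int
}
```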
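For the Arrow point, here is a self-contained example of building a columnar record batch with the Apache Arrow Go library, the kind of structure a framework like Jetflow can hand between stages without per-row serialization; the schema and values are made up for illustration:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// Hypothetical schema for a batch flowing between stages.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "zone_id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "requests", Type: arrow.PrimitiveTypes.Int64},
	}, nil)

	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()

	// Columnar appends: each column is built contiguously, so passing the
	// batch to the next stage requires no row-by-row conversion.
	b.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3}, nil)
	b.Field(1).(*array.Int64Builder).AppendValues([]int64{100, 250, 75}, nil)

	rec := b.NewRecord()
	defer rec.Release()

	for i, col := range rec.Columns() {
		fmt.Printf("column[%d] %q: %v\n", i, rec.ColumnName(i), col)
	}
}
```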
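Finally, one way to get the kind of direct data handling the Postgres case study refers to is to stream a table in binary COPY format via pgx's low-level pgconn API and decode the byte stream straight into columnar batches. This is an illustrative approach under that assumption, not necessarily what Jetflow does; the connection string and query are placeholders:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5/pgconn"
)

func main() {
	ctx := context.Background()

	// Placeholder connection string.
	conn, err := pgconn.Connect(ctx, "postgres://user:pass@localhost:5432/analytics")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Stream the result set in Postgres binary COPY format rather than doing
	// row-by-row scans through a generic driver. A custom reader can decode
	// this stream directly into columnar batches, skipping per-row
	// deserialization overhead.
	_, err = conn.CopyTo(ctx, os.Stdout,
		"COPY (SELECT * FROM events) TO STDOUT (FORMAT binary)")
	if err != nil {
		log.Fatal(err)
	}
}
```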