Building Jetflow: a framework for performant data pipelines at Cloudflare
9 months ago
- #big-data
- #data-engineering
- #cloudflare
- Cloudflare's Business Intelligence team manages a petabyte-scale data lake, ingesting 141 billion rows daily from various sources.
- Existing ELT solutions couldn't meet Cloudflare's growing data needs, leading to the development of Jetflow, a custom framework.
- Jetflow achieved over a 100x efficiency improvement measured in GB-seconds (memory × runtime), reducing job times from 48 hours to 5.5 hours while using less memory.
- Throughput improved by more than 10x, with ingestion rates rising from 60,000-80,000 rows per second to 2-5 million rows per second per database connection.
- Jetflow's modular design supports extensibility, working with ClickHouse, Postgres, Kafka, SaaS APIs, and Google BigQuery among others.
- Key requirements for Jetflow included performance, backwards compatibility, ease of use, customizability, and testability.
- The framework breaks pipelines down into Consumer, Transformer, and Loader stages, wired together via YAML configuration for flexibility and ease of use (see the interface sketch after this list).
- Data is divided into a RunInstance, Partitions, and Batches, enabling idempotent processing and efficient parallelization (see the type sketch after this list).
- Jetflow uses Apache Arrow as its internal in-memory data format for compatibility, efficiency, and minimal serialization overhead (see the example after this list).
- Optimizations include reading data in columnar form where sources support it, avoiding unnecessary row-to-column conversions.
- Case studies on ClickHouse and Postgres show significant performance gains from optimized drivers that handle the data closer to its wire format (see the Postgres sketch after this list).
- As of early July 2025, Jetflow ingests 77 billion records daily, with plans to migrate all remaining jobs and reach the full 141 billion records per day.
- Future plans include open-sourcing Jetflow and expanding the team to further develop such tools.
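The summary does not include Jetflow's actual API, but a minimal sketch of how the Consumer, Transformer, and Loader stages could be expressed as Go-style interfaces might look like the following; all names and signatures here are illustrative assumptions, not the framework's real code:

```go
package jetflow

import "context"

// Batch stands in for the unit of data that moves between stages
// (internally Jetflow uses Arrow record batches for this).
type Batch interface{}

// Consumer reads batches from a source system (ClickHouse, Postgres, Kafka,
// a SaaS API, BigQuery, ...) and streams them downstream.
type Consumer interface {
	Consume(ctx context.Context, out chan<- Batch) error
}

// Transformer applies an optional per-batch transformation.
type Transformer interface {
	Transform(ctx context.Context, in Batch) (Batch, error)
}

// Loader writes batches into the destination, e.g. the data lake.
type Loader interface {
	Load(ctx context.Context, in <-chan Batch) error
}
```

A YAML pipeline definition would then name a consumer, zero or more transformers, and a loader, so a new source or sink can be added by implementing a single interface.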
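Similarly, the RunInstance / Partition / Batch hierarchy can be pictured with a few hypothetical types; the field names below are assumptions for illustration only:

```go
package jetflow

import "time"

// RunInstance identifies one idempotent run of a pipeline, e.g. one time
// window of data: re-running it replaces rather than duplicates its output.
type RunInstance struct {
	Pipeline string
	Window   time.Time
}

// Partition is an independently processable slice of a run, so partitions
// can be fanned out across workers or database connections in parallel.
type Partition struct {
	Run RunInstance
	Key string // e.g. a shard, table segment, or ID range
}

// Batch describes one unit of data streamed through a partition's stages.
type Batch struct {
	Partition Partition
	Rows      int
}
```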
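For the Arrow point, here is a self-contained example of building a columnar record batch with the Apache Arrow Go library, the kind of structure a framework like Jetflow can hand between stages without per-row serialization; the schema and values are made up for illustration:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// Hypothetical schema for a batch flowing between stages.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "zone_id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "requests", Type: arrow.PrimitiveTypes.Int64},
	}, nil)

	b := array.NewRecordBuilder(pool, schema)
	defer b.Release()

	// Columnar appends: each column is built contiguously, so passing the
	// batch to the next stage requires no row-by-row conversion.
	b.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3}, nil)
	b.Field(1).(*array.Int64Builder).AppendValues([]int64{100, 250, 75}, nil)

	rec := b.NewRecord()
	defer rec.Release()

	for i, col := range rec.Columns() {
		fmt.Printf("column[%d] %q: %v\n", i, rec.ColumnName(i), col)
	}
}
```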
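Finally, one way to get the kind of direct data handling the Postgres case study refers to is to stream a table in binary COPY format via pgx's low-level pgconn API and decode the byte stream straight into columnar batches. This is an illustrative approach under that assumption, not necessarily what Jetflow does; the connection string and query are placeholders:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5/pgconn"
)

func main() {
	ctx := context.Background()

	// Placeholder connection string.
	conn, err := pgconn.Connect(ctx, "postgres://user:pass@localhost:5432/analytics")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Stream the result set in Postgres binary COPY format rather than doing
	// row-by-row scans through a generic driver. A custom reader can decode
	// this stream directly into columnar batches, skipping per-row
	// deserialization overhead.
	_, err = conn.CopyTo(ctx, os.Stdout,
		"COPY (SELECT * FROM events) TO STDOUT (FORMAT binary)")
	if err != nil {
		log.Fatal(err)
	}
}
```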