Hasty Briefs (beta)


Building Jetflow: a framework for performant data pipelines at Cloudflare

9 months ago
  • #big-data
  • #data-engineering
  • #cloudflare
  • Cloudflare's Business Intelligence team manages a petabyte-scale data lake, ingesting 141 billion rows daily from various sources.
  • Existing ELT solutions couldn't meet Cloudflare's growing data needs, leading to the development of Jetflow, a custom framework.
  • Jetflow achieved an efficiency improvement of over 100x measured in GB-seconds, cutting one job's runtime from 48 hours to 5.5 hours while using less memory.
  • Throughput improved by over 10x, with ingestion rates jumping from 60,000-80,000 rows per second to 2-5 million rows per second per database connection.
  • Jetflow's modular design supports extensibility, working with ClickHouse, Postgres, Kafka, SaaS APIs, and Google BigQuery, among others.
  • Key requirements for Jetflow included performance, backwards compatibility, ease of use, customizability, and testability.
  • The framework breaks down pipelines into Consumers, Transformers, and Loaders, configurable via YAML for flexibility and ease of use.
  • Data is divided into RunInstance, Partition, and Batch for idempotent processing and efficient parallelization.
  • Jetflow uses Apache Arrow as its internal data format for compatibility, efficiency, and minimal serialization overhead.
  • Optimizations include reading data in columnar formats to avoid unnecessary row-to-column conversions, improving performance.
  • Case studies on ClickHouse and Postgres highlight significant performance gains through optimized drivers and direct data handling.
  • As of early July 2025, Jetflow ingests 77 billion records daily, with plans to migrate all jobs to reach 141 billion records.
  • Future plans include open-sourcing Jetflow and growing the team to develop it further.
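The summary also notes that pipelines are wired together via YAML. A configuration for a three-stage pipeline might look roughly like the fragment below; the keys, stage names, and connection strings are invented for illustration and are not Jetflow's real schema.

```yaml
# Hypothetical pipeline config: one Consumer, one Transformer, one Loader.
name: orders_to_clickhouse
consumer:
  type: postgres
  dsn: postgres://example/orders      # illustrative source
  query: SELECT * FROM orders
transformer:
  type: rename_columns
  mapping:
    order_ts: created_at
loader:
  type: clickhouse
  table: analytics.orders             # illustrative destination
```

Driving stages from declarative configuration like this is what lets non-specialists assemble pipelines from reusable parts, which matches the ease-of-use and customizability requirements listed above.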