
Why was Apache Kafka created?

  • #Kafka
  • #Data Integration
  • #Schemas
  • LinkedIn created Kafka to solve data integration problems, particularly handling site activity data for various uses like fraud detection, ML model training, and website features.
  • The old infrastructure had two main pipelines: a batch pipeline for data warehousing and a real-time pipeline for observability, both requiring manual maintenance and suffering from integration issues.
  • Key problems included schema parsing difficulties, system brittleness, schema evolution challenges, lag in data availability, and separation of operational metrics from activity data.
  • LinkedIn needed a robust, scalable, real-time system with proper schema handling, high fan-out capabilities, plug-and-play integration, and decentralized ownership.
  • Kafka addressed robustness, scalability, high read fan-out, real-time processing, and decoupling of writers from readers (see the producer/consumer sketch after this list), but schemas, data integration, and ownership required additional solutions.
  • LinkedIn adopted Apache Avro for schemas, built a schema registry for versioning, and enforced a compatibility model so that schema changes stayed backward compatible; the Avro evolution sketch after this list illustrates the idea.
  • To enable plug-and-play integration, they moved to a 'Schema on Write' model, ensuring clean, structured data was available in real-time for multiple consumers.
  • Ownership of schemas was shifted to the teams generating the data, with a mandatory code review process to ensure schema uniformity and documentation.
  • Kafka's lack of first-class schema support is noted as a significant drawback, with fragmentation in schema registry options and no server-side validation.
  • The article highlights the importance of schemas in data integration and questions why Kafka didn't incorporate first-class schema support despite LinkedIn's early emphasis on it.
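
To make the decoupling and fan-out points concrete, here is a minimal sketch of one producer feeding two independent consumer groups on the same topic. It assumes a local broker at localhost:9092 and uses hypothetical names (a page-views topic and warehouse-loader / fraud-detector group IDs); it illustrates the pattern, not LinkedIn's actual code.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PageViewFanOut {

    // Writer side: emits activity events without knowing who will read them.
    static void produce(String bootstrap) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\":\"/jobs\"}"));
        } // close() flushes any buffered records
    }

    // Reader side: each group.id tracks its own offsets and gets the full
    // stream, so the warehouse loader and the fraud detector read independently.
    static void consume(String bootstrap, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        props.put("group.id", groupId);
        props.put("auto.offset.reset", "earliest"); // new groups start from the beginning
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s read %s=%s%n", groupId, record.key(), record.value());
            }
        }
    }

    public static void main(String[] args) {
        String bootstrap = "localhost:9092"; // hypothetical local broker
        produce(bootstrap);
        consume(bootstrap, "warehouse-loader");
        consume(bootstrap, "fraud-detector");
    }
}
```

Because each consumer group maintains its own position in the log, adding another downstream system is just another group ID; the producer never has to change.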
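The schema story can be sketched the same way. Below is a small Apache Avro example built around an illustrative PageView record (an assumption for the example, not LinkedIn's real schema) that shows the backward-compatibility rule the summary describes: a new field is added with a default, so a reader on the new schema can still decode records written with the old one.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionSketch {

    // Version 1 of a hypothetical page-view event.
    static final String V1 = "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
            + "{\"name\":\"memberId\",\"type\":\"long\"},"
            + "{\"name\":\"page\",\"type\":\"string\"}]}";

    // Version 2 adds an optional field with a default, which keeps it
    // backward compatible: readers on v2 can still decode v1 records.
    static final String V2 = "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
            + "{\"name\":\"memberId\",\"type\":\"long\"},"
            + "{\"name\":\"page\",\"type\":\"string\"},"
            + "{\"name\":\"referrer\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        Schema writerSchema = new Schema.Parser().parse(V1);
        Schema readerSchema = new Schema.Parser().parse(V2);

        // Encode a record with the old (writer) schema.
        GenericRecord oldRecord = new GenericData.Record(writerSchema);
        oldRecord.put("memberId", 42L);
        oldRecord.put("page", "/jobs");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(oldRecord, encoder);
        encoder.flush();

        // Decode it with the new (reader) schema; the missing field takes its default.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded =
                new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);

        System.out.println(decoded); // referrer resolves to null via the default
    }
}
```

In a registry-based setup like the one the article describes, the writer schema is typically looked up by an ID carried with each message, and the consumer resolves it against its own reader schema in much the same way as this sketch.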