Why was Apache Kafka created?
- #Kafka
- #Data Integration
- #Schemas
- LinkedIn created Kafka to solve data integration problems, particularly handling site activity data for various uses like fraud detection, ML model training, and website features.
- The old infrastructure had two main pipelines: a batch pipeline for data warehousing and a real-time pipeline for observability, both requiring manual maintenance and suffering from integration issues.
- Key problems included schema parsing difficulties, system brittleness, schema evolution challenges, lag in data availability, and separation of operational metrics from activity data.
- LinkedIn needed a robust, scalable, real-time system with proper schema handling, high fan-out capabilities, plug-and-play integration, and decentralized ownership.
- Kafka addressed robustness, scalability, high read fan-out, real-time processing, and decoupling of writers from readers (see the consumer-group sketch after this list), but schemas, data integration, and ownership still required additional solutions.
- LinkedIn adopted Apache Avro for schemas, developed a schema registry for versioning, and enforced a compatibility model so that schema changes remained backward compatible (sketched below).
- To enable plug-and-play integration, they moved to a 'Schema on Write' model, ensuring clean, structured data was available in real time to multiple consumers (see the producer sketch below).
- Ownership of schemas was shifted to the teams generating the data, with a mandatory code review process to ensure schema uniformity and documentation.
- Kafka's lack of first-class schema support is noted as a significant drawback, with fragmentation across schema registry options and no server-side validation (illustrated below).
- The article highlights the importance of schemas in data integration and questions why Kafka didn't incorporate first-class schema support despite LinkedIn's early emphasis on it.
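- A minimal sketch of the read fan-out and writer/reader decoupling mentioned above, using the confluent_kafka Python client; the topic name, group ids, and broker address are placeholders, not details from the article.

```python
# Two independent consumer groups read the same activity topic: each group
# tracks its own offsets, so neither the producer nor the other consumer
# is affected by how fast (or whether) a given reader consumes.
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "localhost:9092",   # placeholder broker
        "group.id": group_id,                    # separate offsets per group
        "auto.offset.reset": "earliest",
    })

warehouse = make_consumer("warehouse-loader")    # batch-style reader
fraud = make_consumer("fraud-detection")         # real-time reader
for c in (warehouse, fraud):
    c.subscribe(["page-views"])                  # hypothetical topic

msg = fraud.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.value())
```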
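- A sketch of the backward-compatibility rule behind that model: data written with an old schema must remain readable with the new one, which in practice means added fields need defaults. This uses the fastavro library; the PageView schema and its fields are illustrative, not LinkedIn's actual schemas.

```python
# Backward compatibility: a consumer on schema v2 can still read records
# written with schema v1, because the field added in v2 has a default.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

v1 = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
})

# v2 adds an optional field with a default, a backward-compatible change.
v2 = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, v1, {"member_id": 42, "url": "/jobs"})  # old producer
buf.seek(0)

# New consumer reads old data; the missing field falls back to its default.
record = schemaless_reader(buf, v1, reader_schema=v2)
print(record)  # {'member_id': 42, 'url': '/jobs', 'referrer': None}
```

- A registry compatibility check rejects changes that would break this resolution, for example removing a field or adding one without a default.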
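- A sketch of what 'Schema on Write' means in practice: the producer serializes, and therefore validates, each record against the agreed schema before it ever reaches the topic, so consumers see clean, structured data. This combines confluent_kafka and fastavro directly rather than LinkedIn's internal tooling; the broker address and topic name are placeholders.

```python
# Schema on Write: encode against the schema at produce time, so malformed
# events fail in the producer instead of surfacing downstream.
import io
from confluent_kafka import Producer
from fastavro import parse_schema, schemaless_writer

page_view_schema = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
})

def encode(record: dict) -> bytes:
    # fastavro serializes against the schema and fails on mismatched records.
    buf = io.BytesIO()
    schemaless_writer(buf, page_view_schema, record)
    return buf.getvalue()

producer = Producer({"bootstrap.servers": "localhost:9092"})   # placeholder
producer.produce("page-views", value=encode({"member_id": 42, "url": "/jobs"}))
# encode({"member_id": "oops", "url": "/jobs"})  # would fail before produce
producer.flush()
```

- Registry-aware serializers usually also embed a schema id in the payload so consumers can look up the exact writer schema; that wire-format detail is omitted here.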
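- To illustrate the 'no server-side validation' point: brokers treat payloads as opaque bytes, so nothing stops a misconfigured producer from writing undecodable records next to well-formed ones. A small sketch with confluent_kafka; the topic and broker are the same placeholders as above.

```python
# The broker does not inspect message contents; schema enforcement is purely
# a client-side convention, so this junk record is accepted without complaint.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("page-views", value=b"not-avro-at-all \xff\xfe")
producer.flush()
# Every consumer of "page-views" now has to cope with an undecodable record.
```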