Why was Apache Kafka created?
- #Kafka
- #Data Integration
- #Schemas
- LinkedIn created Kafka to solve data integration problems, particularly handling site activity data for various uses like fraud detection, ML model training, and website features.
- The old infrastructure had two main pipelines: a batch pipeline for data warehousing and a real-time pipeline for observability, both requiring manual maintenance and suffering from integration issues.
- Key problems included schema parsing difficulties, system brittleness, schema evolution challenges, lag in data availability, and separation of operational metrics from activity data.
- LinkedIn needed a robust, scalable, real-time system with proper schema handling, high fan-out capabilities, plug-and-play integration, and decentralized ownership.
- Kafka addressed robustness, scalability, high read fan-out, real-time processing, and decoupling of writers from readers (see the consumer-group sketch after this list), but schemas, data integration, and ownership still required additional solutions.
- LinkedIn adopted Apache Avro for schemas, developed a schema registry for versioning, and enforced a compatibility model so that schema changes remained backward compatible (sketched below).
- To enable plug-and-play integration, they moved to a 'Schema on Write' model, ensuring clean, structured data was available in real time to multiple consumers (see the producer sketch below).
- Ownership of schemas was shifted to the teams generating the data, with a mandatory code review process to ensure schema uniformity and documentation.
- Kafka's lack of first-class schema support is noted as a significant drawback, with fragmentation across schema registry options and no server-side validation (illustrated below).
- The article highlights the importance of schemas in data integration and questions why Kafka didn't incorporate first-class schema support despite LinkedIn's early emphasis on it.
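- A minimal sketch of the read fan-out and writer/reader decoupling mentioned above, using the confluent_kafka Python client; the topic name, group ids, and broker address are placeholders, not details from the article.

```python
# Two independent consumer groups read the same activity topic: each group
# tracks its own offsets, so neither the producer nor the other consumer
# is affected by how fast (or whether) a given reader consumes.
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "localhost:9092",   # placeholder broker
        "group.id": group_id,                    # separate offsets per group
        "auto.offset.reset": "earliest",
    })

warehouse = make_consumer("warehouse-loader")    # batch-style reader
fraud = make_consumer("fraud-detection")         # real-time reader
for c in (warehouse, fraud):
    c.subscribe(["page-views"])                  # hypothetical topic

msg = fraud.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.value())
```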
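- A sketch of the backward-compatibility rule behind that model: data written with an old schema must remain readable with the new one, which in practice means added fields need defaults. This uses the fastavro library; the PageView schema and its fields are illustrative, not LinkedIn's actual schemas.

```python
# Backward compatibility: a consumer on schema v2 can still read records
# written with schema v1, because the field added in v2 has a default.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

v1 = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
})

# v2 adds an optional field with a default, a backward-compatible change.
v2 = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, v1, {"member_id": 42, "url": "/jobs"})  # old producer
buf.seek(0)

# New consumer reads old data; the missing field falls back to its default.
record = schemaless_reader(buf, v1, reader_schema=v2)
print(record)  # {'member_id': 42, 'url': '/jobs', 'referrer': None}
```

- A registry compatibility check rejects changes that would break this resolution, for example removing a field or adding one without a default.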
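- A sketch of what 'Schema on Write' means in practice: the producer serializes, and therefore validates, each record against the agreed schema before it ever reaches the topic, so consumers see clean, structured data. This combines confluent_kafka and fastavro directly rather than LinkedIn's internal tooling; the broker address and topic name are placeholders.

```python
# Schema on Write: encode against the schema at produce time, so malformed
# events fail in the producer instead of surfacing downstream.
import io
from confluent_kafka import Producer
from fastavro import parse_schema, schemaless_writer

page_view_schema = parse_schema({
    "type": "record", "name": "PageView", "namespace": "example",
    "fields": [
        {"name": "member_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
})

def encode(record: dict) -> bytes:
    # fastavro serializes against the schema and fails on mismatched records.
    buf = io.BytesIO()
    schemaless_writer(buf, page_view_schema, record)
    return buf.getvalue()

producer = Producer({"bootstrap.servers": "localhost:9092"})   # placeholder
producer.produce("page-views", value=encode({"member_id": 42, "url": "/jobs"}))
# encode({"member_id": "oops", "url": "/jobs"})  # would fail before produce
producer.flush()
```

- Registry-aware serializers usually also embed a schema id in the payload so consumers can look up the exact writer schema; that wire-format detail is omitted here.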
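- To illustrate the 'no server-side validation' point: brokers treat payloads as opaque bytes, so nothing stops a misconfigured producer from writing undecodable records next to well-formed ones. A small sketch with confluent_kafka; the topic and broker are the same placeholders as above.

```python
# The broker does not inspect message contents; schema enforcement is purely
# a client-side convention, so this junk record is accepted without complaint.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("page-views", value=b"not-avro-at-all \xff\xfe")
producer.flush()
# Every consumer of "page-views" now has to cope with an undecodable record.
```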