What is Apache Kafka and how does it work?
- #stream-processing
- #data-engineering
- #apache-kafka
- Apache Kafka is an open-source, distributed, durable, scalable, fault-tolerant pub/sub messaging system with stream processing and rich integration capabilities.
- It uses a log data structure for sequential append-only storage, with topics as logical data separators and partitions for sharding to enable parallelism.
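The log/topic/partition model above can be sketched in a few lines. This is an illustrative toy, not Kafka's implementation: the `Topic` class and the use of `md5` are assumptions (Kafka's default partitioner actually hashes keys with murmur2), but the invariant it demonstrates is real — same key, same partition, append-only ordering.

```python
from hashlib import md5

class Topic:
    """Toy model: a topic is a set of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        # Each partition is a sequential, append-only log.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: bytes, value: bytes) -> tuple[int, int]:
        # Messages with the same key land in the same partition,
        # preserving per-key ordering (md5 here is illustrative;
        # Kafka's default partitioner uses murmur2).
        p = int(md5(key).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append((key, value))
        offset = len(self.partitions[p]) - 1  # position in the log = offset
        return p, offset

topic = Topic("orders", num_partitions=3)
p1, o1 = topic.append(b"user-42", b"created")
p2, o2 = topic.append(b"user-42", b"paid")
assert p1 == p2 and o2 == o1 + 1  # same key -> same partition, consecutive offsets
```

Sharding a topic this way is what lets Kafka parallelize: independent partitions can live on different brokers and be consumed concurrently.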
- Messages are key-value pairs stored as raw bytes, requiring client-side serialization and deserialization; offsets provide unique ordering within partitions.
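Because brokers store only raw bytes, producers and consumers must agree on a wire format out of band. A minimal JSON serializer/deserializer pair, assuming JSON as the format (Avro and Protobuf are common production choices):

```python
import json

def serialize(obj) -> bytes:
    # Producer side: turn a structured value into the raw bytes Kafka stores.
    return json.dumps(obj).encode("utf-8")

def deserialize(raw: bytes):
    # Consumer side: recover the structure from the stored bytes.
    return json.loads(raw.decode("utf-8"))

record = {"order_id": 7, "status": "paid"}
raw = serialize(record)
assert isinstance(raw, bytes)
assert deserialize(raw) == record  # round-trips losslessly
```

If the two sides disagree on the format, the broker won't notice — it never inspects payloads — which is exactly the gap a Schema Registry (mentioned below) is meant to close.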
- Kafka operates as a distributed system with brokers, replication for fault tolerance, and a single-leader model per partition for consistency, using KRaft (which replaces ZooKeeper) for metadata consensus.
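The single-leader replication model can be sketched as a toy simulation. The class names and the simplifying assumption that every follower is always in-sync are mine; the core idea is Kafka's: writes go to the partition leader, followers copy its log, and only records replicated to all in-sync replicas count as committed (the boundary Kafka calls the high watermark).

```python
class Replica:
    def __init__(self):
        self.log = []

class Partition:
    """Toy single-leader replication; all replicas assumed in-sync."""

    def __init__(self, replication_factor: int):
        self.leader = Replica()
        self.followers = [Replica() for _ in range(replication_factor - 1)]
        self.high_watermark = 0  # offsets below this are committed

    def produce(self, record: bytes):
        # All writes go through the leader.
        self.leader.log.append(record)

    def replicate(self):
        # Followers fetch from the leader; the high watermark advances to
        # the smallest log end offset across all in-sync replicas.
        for f in self.followers:
            f.log = list(self.leader.log)
        replicas = [self.leader] + self.followers
        self.high_watermark = min(len(r.log) for r in replicas)

p = Partition(replication_factor=3)
p.produce(b"a")
p.produce(b"b")
assert p.high_watermark == 0  # written to the leader, but not yet committed
p.replicate()
assert p.high_watermark == 2  # committed once all replicas have caught up
```

Consumers only ever read up to the high watermark, which is why a leader failure can't expose data that might be lost.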
- Features include data retention for replayability, tiered storage for cost efficiency, consumer groups for coordinated reading, and transactions for exactly-once processing.
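Consumer-group coordination boils down to one invariant: each partition is owned by exactly one consumer in the group, so the group reads in parallel with no overlap. A sketch of a round-robin-style assignment (an assumption for illustration; Kafka ships several assignor strategies, including range, round-robin, and sticky):

```python
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Deal partitions to consumers round-robin, one owner per partition."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

result = assign(partitions=[0, 1, 2, 3, 4, 5], consumers=["c1", "c2", "c3"])
assert result == {"c1": [0, 3], "c2": [1, 4], "c3": [2, 5]}

# Invariant: every partition has exactly one owner.
owned = sorted(p for ps in result.values() for p in ps)
assert owned == [0, 1, 2, 3, 4, 5]
```

When a consumer joins or leaves, the group rebalances by recomputing this assignment, which is also why partition count caps a group's useful parallelism.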
- Extended components include Kafka Streams for stream processing, Kafka Connect for system integrations, and Schema Registry for data structure management.
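To make the Kafka Streams idea concrete, here is a toy version of its canonical word-count example: consume records, transform them, and maintain aggregated state. This plain function is only a sketch; real Kafka Streams keeps such state in fault-tolerant, changelogged state stores and emits each update downstream.

```python
from collections import Counter

def word_count(stream_of_lines):
    """Toy aggregation in the spirit of the Kafka Streams word-count demo."""
    counts = Counter()
    for line in stream_of_lines:
        # Transform step: split each record into words.
        for word in line.lower().split():
            # Aggregate step: update the running count table.
            counts[word] += 1
    return counts

incoming = ["Kafka streams Kafka", "streams process records"]
table = word_count(incoming)
assert table["kafka"] == 2
assert table["streams"] == 2
```

The key difference from a batch job is that the real thing runs continuously over an unbounded topic, emitting updated counts as new records arrive.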
- Use Kafka when you need high durability, high availability, large read fan-out, or event replay; avoid it for simple async tasks, strict queue semantics (per-message acknowledgment and redelivery), ultra-low-latency request-response, or small-scale workloads where its operational overhead isn't justified.