Wednesday, October 6, 2021

What is Apache Kafka?

Introduction
Apache Kafka is an event streaming platform.

Let's talk about it in more detail.
Kafka is an event-based messaging system that safely moves data between systems. Depending on how each component is configured, it can act as a transport for real-time event tracking or as a replicated, distributed database.
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:
  • To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
  • To store streams of events durably and reliably for as long as you want.
  • To process streams of events as they occur or retrospectively.
And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner.

How does Apache Kafka work?
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.

Servers: Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters.

Clients: They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
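For example, here is a minimal sketch of the Java producer client writing a single event. The broker address localhost:9092 and the topic name page-views are illustrative assumptions, not something defined in this post.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a broker is listening on localhost:9092 and a "page-views" topic exists.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") influences which partition the event lands on.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
            producer.flush();
        }
    }
}
```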

Kafka APIs
In addition to command line tooling for management and administration tasks, Kafka has five core APIs for Java and Scala:
  1. The Admin API to manage and inspect topics, brokers, and other Kafka objects (a minimal sketch follows this list).
  2. The Producer API to publish (write) a stream of events to one or more Kafka topics.
  3. The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
  4. The Kafka Streams API to implement stream processing applications and microservices.
  5. The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka.
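As a sketch of the Admin API, the snippet below creates a topic and then lists the topics in the cluster. It assumes a broker at localhost:9092 and reuses the hypothetical page-views topic name from above.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker is reachable at localhost:9092.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a topic with 3 partitions and a replication factor of 1.
            NewTopic topic = new NewTopic("page-views", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();

            // List all topics in the cluster to confirm it was created.
            System.out.println(admin.listTopics().names().get());
        }
    }
}
```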

Let's see some use cases
  • Messaging: Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). In comparison to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications.
  • Website Activity Tracking: The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type.
  • Metrics: Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralised feeds of operational data.
  • Log Aggregation: Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place for processing; Kafka abstracts away the details of files and presents log data as a stream of messages.
  • Stream Processing: Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. A sketch of such a pipeline follows this list.
  • Event Sourcing: Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
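To illustrate the stream processing use case, here is a minimal Kafka Streams sketch that consumes raw events from one topic, applies a trivial transformation (a stand-in for real aggregation or enrichment logic), and writes the result to a new topic. The topic names, application id, and broker address are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichmentPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw events from one topic, transform them, and write the result to a new topic.
        KStream<String, String> raw = builder.stream("page-views");
        raw.mapValues(value -> value.toUpperCase())   // stand-in for real enrichment logic
           .to("page-views-enriched");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```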

Let's see some common terms

  • Cluster: The collective group of machines that Kafka is running on.
  • Broker: A single Kafka server instance.
  • Topic: Topics are used to organize data. You always read from and write to a particular topic.
  • Partition: Data in a topic is spread across a number of partitions. Each partition can be thought of as a time-ordered log file. To guarantee that messages are read in the correct order, only one consumer in a group reads from a particular partition at a time.
  • Producer: A client that writes data to one or more Kafka topics.
  • Consumer: A client that reads data from one or more Kafka topics.
  • Replica: A copy of a partition stored on another broker to avoid data loss.
  • Leader: Although a partition may be replicated to one or more brokers, a single broker is elected leader for that partition, and it is the only one that clients write to and read from for that partition.
  • Consumer group: A collective group of consumer instances, identified by a groupId. In a horizontally scaled application, each instance is a consumer and together they act as a consumer group.
  • Group coordinator: The broker that manages the membership of a consumer group and coordinates which partitions are assigned to which consumers in the group.
  • Offset: A position in the partition log. When a consumer has processed a message, it "commits" that offset, telling the broker that the consumer group has consumed the message. If the consumer group is restarted, it resumes from the last committed offset (see the sketch after this list).
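To tie the consumer group, group coordinator, and offset terms together, here is a minimal sketch of a Java consumer that joins a group and commits offsets manually. The group id page-view-readers, the topic name, and the broker address are illustrative assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All instances started with the same group.id share the topic's partitions.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets manually
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Tell the broker these offsets are processed; a restart resumes after them.
                consumer.commitSync();
            }
        }
    }
}
```

If a second instance of this program is started with the same group id, the group coordinator rebalances the topic's partitions between the two consumers.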



So this was an introductory post about Apache Kafka; in the next post I will explain its basic setup.
