In this blog post, I will walk through the first few concepts to understand when you want to learn Apache Kafka. The documentation explains them well, but I find that concrete examples give a better feel for the product and its features.

Message and Broker

Kafka uses the message (also known as an event or record) as its unit of data. A message always has a value and, optionally, a key.

The broker is the process in charge of storing and serving this data. Typically, for high availability, a Kafka cluster is made of three brokers.

In my lab, I set up a Kafka cluster with 3 hosts using KRaft. I will not go into too much detail, but what you must know is that KRaft is the Kafka implementation of the Raft consensus protocol. It maintains cluster state and metadata.

Keep in mind that messages are immutable.
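Conceptually, a record is just an immutable pairing of an optional key with a value (plus metadata such as a timestamp). A minimal Python sketch of that idea — not Kafka's actual classes, purely an illustration:

```python
from dataclasses import dataclass
from typing import Optional

# frozen=True makes instances read-only, mirroring message immutability
@dataclass(frozen=True)
class Record:
    value: bytes                 # the payload; always present
    key: Optional[bytes] = None  # optional; used to pick a partition

r = Record(value=b"Hello", key=b"1")
# r.value = b"changed" would raise FrozenInstanceError
```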

Topics

A topic is a way to organize messages within Kafka.

To create a topic, I simply run the following command:

$ ./bin/kafka-topics.sh --create --topic events --replication-factor 1 --partitions 1 --bootstrap-server localhost:9092
Created topic events.

The bootstrap server is the initial list of Kafka brokers to which the request is addressed. I will explain replication factor and partitions later.

Let’s run a consumer with the tool provided with the Kafka package:

$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events

And now, I can produce a simple “Hello” message:

$ ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic events
> Hello

The message will appear in the first window (the consumer) as soon as it is produced.

Partition

For now, with the topic I have created, I have no high availability. Let’s get more information about the events topic:

$ ./bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic events
Topic: events   TopicId: 64J4x3HcTF-WyJD_V9kpnw PartitionCount: 1       ReplicationFactor: 1    Configs: segment.bytes=1073741824
        Topic: events   Partition: 0    Leader: 3       Replicas: 3     Isr: 3

From that command output, I see:

  • There is only one partition (PartitionCount), numbered 0
  • The partition leader is the node id 3 (the third node of the cluster, host kafka-03). The leader is responsible for handling all read and write operations for that partition.
  • The Replicas field lists the node ids hosting a copy of the partition (here, node 3 only); it is not the number of replicas.
  • The In-Sync Replicas (Isr) field also shows node id 3. Node 3 is obviously in sync with itself.

What is interesting to note is that I triggered the topic creation command from node 1 and, still, the single partition landed on node 3.

Without any replication, this means that if I stop the Kafka service on node 3, I will not be able to read messages from the topic. kafka-console-consumer.sh will not display anything (no error, no output). As soon as the server is started again, the message will be displayed.

One important point to keep in mind is that consumers do not delete messages when they consume them, as is the case with JMS (Java Message Service) queues. Messages are only deleted by the retention (purge) mechanism.
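This behaviour can be pictured with a toy append-only log: each consumer keeps its own offset, and reading never removes data. A simplified sketch, ignoring partitions and retention:

```python
class ToyTopicLog:
    """Toy append-only log: reads never delete, consumers track their own offsets."""
    def __init__(self):
        self.messages = []

    def produce(self, msg):
        self.messages.append(msg)

    def consume(self, offset):
        # Return messages from the given offset on; the log itself is untouched
        return self.messages[offset:]

log = ToyTopicLog()
log.produce("Hello")
log.produce("World")

# Two independent consumers read the same data; nothing is deleted
assert log.consume(0) == ["Hello", "World"]
assert log.consume(0) == ["Hello", "World"]  # a second read sees everything too
assert len(log.messages) == 2
```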

Topic with 3 Partitions

Let’s create another topic, this time with 3 partitions, and check the describe output:

$ ./bin/kafka-topics.sh --create --topic events3p --replication-factor 1 --partitions 3 --bootstrap-server localhost:9092
Created topic events3p.

$ ./bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic events3p

Topic: events3p TopicId: j3ootU5ESCCVQSZIsFdWqw PartitionCount: 3       ReplicationFactor: 1    Configs: segment.bytes=1073741824
        Topic: events3p Partition: 0    Leader: 1       Replicas: 1     Isr: 1
        Topic: events3p Partition: 1    Leader: 2       Replicas: 2     Isr: 2
        Topic: events3p Partition: 2    Leader: 3       Replicas: 3     Isr: 3

From that command output, I see:

  • Topic has 3 partitions as expected 🙂
  • Each partition has its own leader.

To load balance messages across partitions, we need to produce messages with a key. I create a simple text file named messages with the following content:

1:message1
2:message2
3:message3
4:message4
5:message5
6:message6

And pipe content to the console producer script:

$ cat messages | ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --property "key.separator=:" --property "parse.key=true" --topic events3p

As you might have noticed, I added two options:

  • key.separator to indicate the character that splits the key from the value.
  • parse.key to indicate that each message contains a key, which determines the partition the message is sent to.

The key is hashed, and all messages with the same key are sent to the same partition.
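This hash-then-modulo routing can be sketched in a few lines of Python. Note that Kafka's default partitioner actually uses a murmur2 hash of the key bytes; CRC32 from the standard library stands in here purely to show that the mapping is deterministic:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner:
    # hash the key bytes, then take the hash modulo the partition count.
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition
assert partition_for(b"1", 3) == partition_for(b"1", 3)

for key in (b"1", b"2", b"3"):
    print(key.decode(), "->", partition_for(key, 3))
```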

Let’s run consumer again and see what is displayed:

$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true
1       message1
5       message5
2       message2
3       message3
4       message4
6       message6

As you can see, messages are not ordered globally. Ordering is only guaranteed within a partition. So, give it a try by adding the --partition argument to the command:

$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true --partition 0
1       message1
5       message5
$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true --partition 1
4       message4
6       message6
$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true --partition 2
2       message2
3       message3
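The per-partition ordering guarantee can be reproduced with a toy model: keyed messages are appended to their partition in production order, so reading a single partition returns its messages in that order, while the interleaving across partitions is arbitrary. A simplified sketch (CRC32 again stands in for Kafka's murmur2 hash):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions  # stand-in hash, not Kafka's murmur2

NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]

# Produce six keyed messages, like the 1:message1 ... 6:message6 file above
for i in range(1, 7):
    partitions[partition_for(str(i).encode(), NUM_PARTITIONS)].append(f"message{i}")

# Within each partition, messages keep their production order
for p in partitions:
    numbers = [int(m.removeprefix("message")) for m in p]
    assert numbers == sorted(numbers)
```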

What if a broker is down?

If I try to consume while node 3 is down, data from the partition hosted on that node will not be displayed (partition 2, as shown in the describe output). This is not good if high availability is a requirement. Let’s see how to improve that.

Replica

To overcome the limitation from the previous chapter, we can increase the replication factor, which is the number of copies of each partition. Let’s create a topic with 3 partitions and 3 replicas:

$ ./bin/kafka-topics.sh --create --topic events3p3r --replication-factor 3 --partitions 3  --bootstrap-server localhost:9092
Created topic events3p3r.

And check the describe output:

Topic: events3p3r       TopicId: A28jo7arTSCGZfpUIdl2Ag PartitionCount: 3       ReplicationFactor: 3    Configs: segment.bytes=1073741824
        Topic: events3p3r       Partition: 0    Leader: 1       Replicas: 1,2,3 Isr: 1,2,3
        Topic: events3p3r       Partition: 1    Leader: 2       Replicas: 2,3,1 Isr: 2,3,1
        Topic: events3p3r       Partition: 2    Leader: 3       Replicas: 3,1,2 Isr: 3,1,2

What do we learn from that output?

  • Each partition has its own leader.
  • The first id in the Replicas list is the preferred leader (and is currently the actual leader of each partition).

If I stop node 3, the output will change:

Topic: events3p3r       TopicId: A28jo7arTSCGZfpUIdl2Ag PartitionCount: 3       ReplicationFactor: 3    Configs: segment.bytes=1073741824
        Topic: events3p3r       Partition: 0    Leader: 1       Replicas: 1,2,3 Isr: 1,2
        Topic: events3p3r       Partition: 1    Leader: 2       Replicas: 2,3,1 Isr: 2,1
        Topic: events3p3r       Partition: 2    Leader: 1       Replicas: 3,1,2 Isr: 1,2

As you can see, the leader of the third partition (#2) changed to node 1. The Isr column no longer shows node 3 as being in sync for any of the partitions.
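The failover behaviour can be sketched as: when the current leader drops out of the ISR, a new leader is picked from the replicas that are still in sync. This is a strong simplification of Kafka's actual controller logic, for illustration only:

```python
def elect_leader(replicas, isr):
    """Return the first replica (in assignment order) still in the ISR.
    A simplification of what the Kafka controller does on leader failure."""
    for broker in replicas:
        if broker in isr:
            return broker
    raise RuntimeError("no in-sync replica available")

# Partition 2 of events3p3r: replicas 3,1,2 but broker 3 just went down
print(elect_leader(replicas=[3, 1, 2], isr={1, 2}))  # -> 1
```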

If I bring node 3 back up, the leader of the third partition will remain node 1 (at least until a preferred leader election is triggered):

Topic: events3p3r       TopicId: A28jo7arTSCGZfpUIdl2Ag PartitionCount: 3       ReplicationFactor: 3    Configs: segment.bytes=1073741824
        Topic: events3p3r       Partition: 0    Leader: 1       Replicas: 1,2,3 Isr: 1,2,3
        Topic: events3p3r       Partition: 1    Leader: 2       Replicas: 2,3,1 Isr: 2,1,3
        Topic: events3p3r       Partition: 2    Leader: 1       Replicas: 3,1,2 Isr: 1,2,3

Next

This was quite a long, but necessary, introduction to some very important Apache Kafka concepts. We have only scratched the surface of many of the notions required to understand Kafka.

There are many more topics ( 🙂 ) to cover regarding Kafka, such as:

  • Kafka Connect, to interface any system with Apache Kafka to ingest and consume messages
  • Kafka Streams, to transform and enrich messages within Kafka
  • Schema Registry, to centralize, manage, and validate message data formats
  • Security and encryption