In this blog post, I will demonstrate the first few concepts to understand when you want to learn Apache Kafka. The documentation explains them well, but I find that concrete examples give a better feel for the product and its features.
Message and Broker
Kafka uses the message (also known as event or record) as its unit of data. A message has a value and, optionally, a key.
The broker is the process in charge of storing and serving this data. Typically, for high availability, a Kafka cluster is made of at least three brokers.
In my lab, I set up a Kafka cluster with 3 hosts using KRaft. I will not go into too much detail, but what you must know is that KRaft is Kafka's implementation of the Raft consensus protocol. It maintains the cluster state and metadata.
Keep in mind that messages are immutable.
Topics
A topic is a way to organize messages within Kafka: producers write messages to a topic and consumers read messages from it.
To create a topic, I simply run the following command:
$ ./bin/kafka-topics.sh --create --topic events --replication-factor 1 --partitions 1 --bootstrap-server localhost:9092
Created topic events.
The bootstrap server is the initial list of brokers the client contacts to discover the rest of the cluster. I will explain replication factor and partitions later.
Let’s run a consumer with the console tool shipped with the Kafka package:
$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events
And now, I can produce a simple “Hello” message:
$ ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic events
> Hello
The message appears in the first window as soon as it is produced.
Partition
For now, with the topic I have created, I have no high availability. Let’s get more information about the events topic:
./bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic events
Topic: events TopicId: 64J4x3HcTF-WyJD_V9kpnw PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: events Partition: 0 Leader: 3 Replicas: 3 Isr: 3
From that command output, I see:
- There is only one partition (PartitionCount: 1), numbered 0.
- The partition leader is broker id 3 (the third node of the cluster, host kafka-03). The leader handles all read and write operations for that partition.
- The Replicas field lists the broker ids that hold a copy of the partition data (here, only broker 3); it is not a replica count.
- The in-sync replica list (Isr) also contains only broker 3. Broker 3 is obviously in sync with itself.
What is interesting to note is that I triggered the topic creation command from node 1 and, still, the single partition landed on node 3.
Without any replication, that means that if I stop the Kafka service on node 3, I will not be able to read messages from the topic: kafka-console-consumer.sh will not display anything (no error, no output). As soon as the broker is started again, the messages are displayed.
One important point to keep in mind is that consumers do not delete messages when they consume them, as is the case with a JMS (Java Message Service) queue. Messages are only removed by the retention (purging) mechanism.
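This append-only, offset-based model can be sketched in a few lines of Python. This is a simplified in-memory model to illustrate the concept, not real Kafka code:

```python
class PartitionLog:
    """Simplified in-memory model of a single Kafka partition:
    an immutable, append-only sequence of messages addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, key, value):
        """Append a message and return the offset it was written at."""
        offset = len(self._messages)
        self._messages.append((key, value))
        return offset

    def read(self, offset):
        # Reading never removes the message: each consumer simply
        # tracks its own position (offset) in the log.
        return self._messages[offset]


log = PartitionLog()
log.append(None, "Hello")
log.append(None, "World")

# Two independent consumers, each with its own offset.
consumer_a, consumer_b = 0, 0
print(log.read(consumer_a))  # (None, 'Hello')
print(log.read(consumer_b))  # (None, 'Hello') -- same message, still in the log
```

Because the log is never mutated by reads, any number of consumers can process the same messages independently, each at its own pace.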
Topic with 3 Partitions
Let’s create another topic with 3 partitions and check the describe output:
$ ./bin/kafka-topics.sh --create --topic events3p --replication-factor 1 --partitions 3 --bootstrap-server localhost:9092
Topic: events3p TopicId: j3ootU5ESCCVQSZIsFdWqw PartitionCount: 3 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: events3p Partition: 0 Leader: 1 Replicas: 1 Isr: 1
Topic: events3p Partition: 1 Leader: 2 Replicas: 2 Isr: 2
Topic: events3p Partition: 2 Leader: 3 Replicas: 3 Isr: 3
From that command output, I see:
- Topic has 3 partitions as expected 🙂
- Each partition has its own leader.
Messages with the same key always land in the same partition, so keys are the way to control how messages are distributed across partitions (without a key, the producer spreads messages over partitions on its own). I create a simple text file named messages with the following content:
1:message1
2:message2
3:message3
4:message4
5:message5
6:message6
And pipe content to the console producer script:
$ cat messages | ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --property "key.separator=:" --property "parse.key=true" --topic events3p
As you might have noticed, I added two options:
- key.separator to indicate how the key and the value are split.
- parse.key to indicate that each message contains a key, which is used to assign the message to a partition.
The key is hashed, and the hash (modulo the number of partitions) determines the target partition, so all messages with the same key are sent to the same partition.
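This key-to-partition mapping can be reproduced in Python. The sketch below is a best-effort transcription of the murmur2 hash used by the Kafka Java client's default partitioner (the reference implementation lives in the Java client's Utils class); treat it as illustrative rather than authoritative:

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2 hash, transcribed from the Kafka Java client."""
    length = len(data)
    seed = 0x9747B28C
    m = 0x5BD1E995
    r = 24
    mask = 0xFFFFFFFF  # emulate 32-bit integer wrap-around

    h = (seed ^ length) & mask
    # Mix the key four bytes at a time (little-endian words).
    for i in range(0, length - length % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Mix the remaining 1-3 bytes.
    tail = length - length % 4
    extra = length % 4
    if extra == 3:
        h ^= data[tail + 2] << 16
    if extra >= 2:
        h ^= data[tail + 1] << 8
    if extra >= 1:
        h ^= data[tail]
        h = (h * m) & mask
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def partition_for(key: bytes, num_partitions: int) -> int:
    """Keyed partition assignment: positive murmur2 hash modulo partition count."""
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions


# Which partition would each of our keys go to, with 3 partitions?
for key in (b"1", b"2", b"3", b"4", b"5", b"6"):
    print(key.decode(), "-> partition", partition_for(key, 3))
```

The important property is determinism: the same key bytes always hash to the same partition, which is what preserves per-key ordering.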
Let’s run consumer again and see what is displayed:
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true
1 message1
5 message5
2 message2
3 message3
4 message4
6 message6
As you can see, messages are not globally ordered: ordering is only guaranteed within a partition. Give it a try by adding the --partition argument to the command:
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true --partition 0
1 message1
5 message5
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true --partition 1
4 message4
6 message6
./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic events3p --from-beginning --property print.key=true --partition 2
2 message2
3 message3
What if a broker is down?
If I try to consume while node 3 is down, data from the partition hosted on that broker (partition 2, as shown by the describe output) will not be displayed. This is not good if high availability is a requirement. Let’s see how to improve that.
Replica
To overcome the previous chapter's limitation, we can increase the replication factor, that is, the number of replicas of each partition. Let’s create a topic with 3 partitions and 3 replicas:
$ ./bin/kafka-topics.sh --create --topic events3p3r --replication-factor 3 --partitions 3 --bootstrap-server localhost:9092
Created topic events3p3r.
And check the describe output:
Topic: events3p3r TopicId: A28jo7arTSCGZfpUIdl2Ag PartitionCount: 3 ReplicationFactor: 3 Configs: segment.bytes=1073741824
Topic: events3p3r Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: events3p3r Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: events3p3r Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1,2
What do we learn from that output?
- Each partition has its own leader.
- The first id in the Replicas field is the leader (the preferred leader for that partition).
If I stop node 3, the output changes:
Topic: events3p3r TopicId: A28jo7arTSCGZfpUIdl2Ag PartitionCount: 3 ReplicationFactor: 3 Configs: segment.bytes=1073741824
Topic: events3p3r Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2
Topic: events3p3r Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,1
Topic: events3p3r Partition: 2 Leader: 1 Replicas: 3,1,2 Isr: 1,2
As you can see, the leader of the third partition (#2) changed to node 1, and the ISR no longer lists node 3 for any of the partitions.
If I bring node 3 back up, the leader of the third partition remains node 1 for a while (when automatic leader rebalancing is enabled, Kafka will eventually move leadership back to the preferred leader):
Topic: events3p3r TopicId: A28jo7arTSCGZfpUIdl2Ag PartitionCount: 3 ReplicationFactor: 3 Configs: segment.bytes=1073741824
Topic: events3p3r Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: events3p3r Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,1,3
Topic: events3p3r Partition: 2 Leader: 1 Replicas: 3,1,2 Isr: 1,2,3
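The failover behaviour observed above can be summarised with a deliberately simplified model: the preferred leader is the first broker in the replica list, and when it is unavailable, leadership moves to the first replica that is still in sync. The real controller logic handles many more cases, but the sketch captures the idea:

```python
def elect_leader(replicas, isr, alive):
    """Pick a partition leader: the first replica, in preference order,
    that is both in the ISR and alive. Simplified model of the Kafka
    controller's behaviour, for illustration only."""
    for broker in replicas:
        if broker in isr and broker in alive:
            return broker
    return None  # no eligible leader: the partition would be offline


# Partition 2 of events3p3r: preferred leader is broker 3.
replicas = [3, 1, 2]
print(elect_leader(replicas, isr=[3, 1, 2], alive={1, 2, 3}))  # 3
# Broker 3 goes down: leadership moves to broker 1.
print(elect_leader(replicas, isr=[1, 2], alive={1, 2}))        # 1
```

This also explains why the leader stays on node 1 after node 3 comes back: nothing in this election forces a move back to the preferred leader; that is a separate (preferred leader election) step.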
Next
This is quite a long, but necessary, introduction to some very important concepts of Apache Kafka. We have only scratched the surface of the notions required to understand it.
There are many more topics ( 🙂 ) to cover regarding Kafka, such as:
- Kafka Connect to interface any system with Apache Kafka to ingest and consume messages
- Kafka Streams to transform and enrich messages within Kafka
- Schema registry to centralize, manage and validate message data format
- Security and encryption