APACHE KAFKA ESSENTIALS


TABLE OF CONTENTS

Preface
Introduction
Installing and Configuring Kafka
    Downloading Kafka
    Extracting the Archive
    Configuring Kafka
Starting Kafka and ZooKeeper
    Starting ZooKeeper
    Starting Kafka Broker
Creating and Managing Topics
    Creating a Topic
    Listing Topics
    Describing a Topic
Producing and Consuming Messages
    Producing Messages
    Consuming Messages
    Consumer Groups
Configuring Kafka Producers and Consumers
    Producer Configuration
    Consumer Configuration
Kafka Connect
    Source Connectors
    Sink Connectors
    Configuration
    Running Kafka Connect
    Monitoring Kafka Connect
Kafka Streams
    Key Concepts
    Kafka Streams Application
    Stateful Processing with Kafka Streams
    Windowing Operations
    Interactive Queries
    Error Handling and Fault Tolerance
    Integration with Kafka Connect and Kafka Producer/Consumer
Kafka Security
    Authentication and Authorization
    Encryption
    Secure Replication
Replication Factor
    How Replication Factor Works
    Modifying Replication Factor
Partitions
    How Partitions Work
    Benefits of Partitions
    Partition Key
    Choosing the Number of Partitions
    Modifying Partitions
Batch Size
    How Batch Size Works
    Configuring Batch Size
    Monitoring Batch Size
Compression
    How Compression Works
    Compression Algorithms in Kafka
    Configuring Compression in Kafka
    Compression in Kafka Streams
    Considerations for Compression
Retention Policy
    How Retention Policy Works
    Configuring Retention Policy
    Size-based Retention
    Log Compaction
    Considerations for Retention Policy
Kafka Monitoring and Management
Handling Data Serialization
Kafka Ecosystem: Additional Components
Conclusion


PREFACE

This cheatsheet is designed to be your quick and handy reference to the essential concepts, commands, and best practices associated with Apache Kafka. Whether you are a seasoned Kafka expert looking for a convenient memory aid or a newcomer exploring the world of distributed systems, this cheatsheet will serve as your reliable companion.

INTRODUCTION

Apache Kafka is a distributed event streaming platform designed to handle large-scale real-time data streams. It was originally developed by LinkedIn and later open-sourced as an Apache project. Kafka is known for its high-throughput, fault-tolerance, scalability, and low-latency characteristics, making it an excellent choice for various use cases, such as real-time data pipelines, stream processing, log aggregation, and more.

Kafka follows a publish-subscribe messaging model, where producers publish messages to topics, and consumers subscribe to those topics to receive and process the messages.

INSTALLING AND CONFIGURING KAFKA

To get started with Apache Kafka, you need to download and set up the Kafka distribution. Here's how you can do it:

DOWNLOADING KAFKA

Visit the Apache Kafka website (https://kafka.apache.org/downloads) and download the latest stable version.

EXTRACTING THE ARCHIVE

After downloading the Kafka archive, extract it to your desired location using the following commands:

# Replace kafka_version with the version you downloaded
tar -xzf kafka_version.tgz
cd kafka_version

CONFIGURING KAFKA

Navigate to the config directory and modify the following configuration files as needed:

server.properties: Main Kafka broker configuration.

zookeeper.properties: ZooKeeper configuration for Kafka.

STARTING KAFKA AND ZOOKEEPER

To run Kafka, you need to start ZooKeeper first, as Kafka depends on ZooKeeper for maintaining its cluster state. Here's how to do it:

STARTING ZOOKEEPER

bin/zookeeper-server-start.sh config/zookeeper.properties

STARTING KAFKA BROKER

To start the Kafka broker, use the following command:

bin/kafka-server-start.sh config/server.properties

CREATING AND MANAGING TOPICS

Topics in Kafka are logical channels where messages are published and consumed. Let's learn how to create and manage topics:

CREATING A TOPIC

To create a topic, use the following command:

bin/kafka-topics.sh --create --topic my_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

In this example, we create a topic named my_topic with three partitions and a replication factor of 1.


LISTING TOPICS

To list all the topics in the Kafka cluster, use the following command:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

DESCRIBING A TOPIC

To get detailed information about a specific topic, use the following command:

bin/kafka-topics.sh --describe --topic my_topic --bootstrap-server localhost:9092

PRODUCING AND CONSUMING MESSAGES

Now that we have a topic, let's explore how to produce and consume messages in Kafka.

PRODUCING MESSAGES

To produce messages to a Kafka topic, use the following command:

bin/kafka-console-producer.sh --topic my_topic --bootstrap-server localhost:9092

After running this command, you can start typing your messages. Press Enter to send each message.

CONSUMING MESSAGES

To consume messages from a Kafka topic, use the following command:

bin/kafka-console-consumer.sh --topic my_topic --bootstrap-server localhost:9092

This will start consuming messages from the specified topic in the console.

CONSUMER GROUPS

Consumer groups allow multiple consumers to work together to read from a topic. Each consumer in a group will get a subset of the messages. To use consumer groups, provide a group id when consuming messages:

bin/kafka-console-consumer.sh --topic my_topic --bootstrap-server localhost:9092 --group my_consumer_group

CONFIGURING KAFKA PRODUCERS AND CONSUMERS

Kafka provides various configurations for producers and consumers to optimize their behavior. Here are some essential configurations:

PRODUCER CONFIGURATION

To configure a Kafka producer, create a producer.properties file and set properties like bootstrap.servers, key.serializer, and value.serializer.

# producer.properties
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer

Use the following command to run the producer with the specified configuration:

bin/kafka-console-producer.sh --topic my_topic --producer.config path/to/producer.properties
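The same settings also apply when producing from Java code with the KafkaProducer client. A minimal sketch (the topic name, key, and message are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer and flushes any pending records
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("my_topic", "my_key", "hello kafka");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Sent to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
                }
            });
        }
    }
}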
CONSUMER CONFIGURATION

For consumer configuration, create a consumer.properties file with properties like bootstrap.servers, key.deserializer, and value.deserializer.


# consumer.properties
bootstrap.servers=localhost:9092
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
group.id=my_consumer_group

Run the consumer using the configuration file:

bin/kafka-console-consumer.sh --topic my_topic --consumer.config path/to/consumer.properties
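The same configuration can be used from a Java client. A minimal poll-loop sketch, reusing the topic and group names from above:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my_consumer_group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                // Poll for new records and print each one
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}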

KAFKA CONNECT

Kafka Connect is a powerful framework that allows you to easily integrate Apache Kafka with external systems. It is designed to provide scalable and fault-tolerant data movement between Kafka and other data storage systems or data processing platforms. Kafka Connect is ideal for building data pipelines and transferring data to and from Kafka without writing custom code for each integration.

Kafka Connect consists of two main components: Source Connectors and Sink Connectors.

SOURCE CONNECTORS

Source Connectors allow you to import data from various external systems into Kafka. They act as producers, capturing data from the source and writing it to Kafka topics. Some popular source connectors include:

• JDBC Source Connector: Captures data from relational databases using JDBC.

• FileStream Source Connector: Reads data from files in a specified directory and streams them to Kafka.

• Debezium Connectors: Provides connectors for capturing changes from various databases like MySQL, PostgreSQL, MongoDB, etc.

SINK CONNECTORS

Sink Connectors allow you to export data from Kafka to external systems. They act as consumers, reading data from Kafka topics and writing it to the target systems. Some popular sink connectors include:

• JDBC Sink Connector: Writes data from Kafka topics to relational databases using JDBC.

• HDFS Sink Connector: Stores data from Kafka topics in Hadoop Distributed File System (HDFS).

• Elasticsearch Sink Connector: Indexes data from Kafka topics into Elasticsearch for search and analysis.

CONFIGURATION

To configure Kafka Connect, you typically use a properties file for each connector. The properties file contains essential information like the connector name, Kafka brokers, topic configurations, and connector-specific properties. Each connector may have its own set of required and optional properties.

Here's a sample configuration for the FileStream Source Connector:

name=my-file-source-connector
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/path/to/inputfile.txt
topic=my_topic

RUNNING KAFKA CONNECT

To run Kafka Connect, you can use the connect-standalone.sh or connect-distributed.sh scripts that come with Kafka.

Standalone Mode

In standalone mode, Kafka Connect runs on a single machine, and each connector is managed by a separate process. Use the connect-standalone.sh script to run connectors in standalone mode:


bin/connect-standalone.sh config/connect-standalone.properties config/your-connector.properties

Distributed Mode

In distributed mode, Kafka Connect runs as a cluster, providing better scalability and fault tolerance. Use the connect-distributed.sh script to run connectors in distributed mode:

bin/connect-distributed.sh config/connect-distributed.properties

MONITORING KAFKA CONNECT

Kafka Connect exposes several metrics that can be monitored for understanding the performance and health of your connectors. You can use tools like JConsole, JVisualVM, or integrate Kafka Connect with monitoring systems like Prometheus and Grafana to monitor the cluster.

KAFKA STREAMS

Kafka Streams is a client library in Apache Kafka that enables real-time stream processing of data. It allows you to build applications that consume data from Kafka topics, process the data, and produce the results back to Kafka or other external systems. Kafka Streams provides a simple and lightweight approach to stream processing, making it an attractive choice for building real-time data processing pipelines.

KEY CONCEPTS

Before diving into the details of Kafka Streams, let's explore some key concepts:

• Stream: A continuous flow of data records in Kafka is represented as a stream. Each record in the stream consists of a key, a value, and a timestamp.

• Processor: A processor is a fundamental building block in Kafka Streams that processes incoming data records and produces new output records.

• Topology: A topology defines the stream processing flow by connecting processors together to form a processing pipeline.

• Windowing: Kafka Streams supports windowing operations, allowing you to group records within specified time intervals for processing.

• Stateful Processing: Kafka Streams supports stateful processing, where the processing logic considers historical data within a specified window.

KAFKA STREAMS APPLICATION

To create a Kafka Streams application, you need to set up a Kafka Streams topology and define the processing steps. Here's a high-level overview of the steps involved:

Create a Properties Object

Start by creating a Properties object to configure your Kafka Streams application. This includes properties like the Kafka broker address, application ID, default serializers, and deserializers.

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

Define the Topology

Next, define the topology of your Kafka Streams application. This involves creating processing steps and connecting them together.

StreamsBuilder builder = new StreamsBuilder();


// Create a stream from a Kafka topic
KStream<String, String> inputStream = builder.stream("input_topic");

// Perform processing operations
KStream<String, String> processedStream = inputStream
  .filter((key, value) -> value.startsWith("important_"))
  .mapValues(value -> value.toUpperCase());

// Send the processed data to another Kafka topic
processedStream.to("output_topic");

// Build the topology
Topology topology = builder.build();

Create and Start the Kafka Streams Application

Once the topology is defined, create a KafkaStreams object with the defined properties and topology, and start the application:

KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();

STATEFUL PROCESSING WITH KAFKA STREAMS

Kafka Streams provides state stores that allow you to maintain stateful processing across data records. You can define a state store and use it within your processing logic to maintain state information.
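For example, a grouped count materializes its result in a local state store. The sketch below extends the builder from the example above; the topic names and the store name "counts-store" are placeholders:

// Count records per key and keep the running totals in a queryable store
KTable<String, Long> counts = builder.<String, String>stream("events_topic")
    .groupByKey()
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));

// Forward updated counts to another topic
counts.toStream().to("counts_topic", Produced.with(Serdes.String(), Serdes.Long()));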
WINDOWING OPERATIONS

Kafka Streams supports windowing operations, allowing you to group data records within specific time windows for aggregation or processing. Windowing is essential for time-based operations and calculations.
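As a sketch (assuming Kafka Streams 3.x and reusing the builder from the earlier example), the following counts records per key in five-minute tumbling windows:

// Group records by key and count them in 5-minute tumbling windows
KTable<Windowed<String>, Long> windowedCounts = builder.<String, String>stream("events_topic")
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count();

// The result key carries the window boundaries alongside the original key
windowedCounts.toStream().foreach((windowedKey, count) ->
    System.out.printf("%s @ %s -> %d%n",
        windowedKey.key(), windowedKey.window().startTime(), count));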
INTERACTIVE QUERIES

Kafka Streams also enables interactive queries, allowing you to query the state stores used in your stream processing application.
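A minimal sketch of querying the "counts-store" defined above from the running KafkaStreams instance (the key is a placeholder; in a multi-instance deployment the queried key may be hosted by another instance):

ReadOnlyKeyValueStore<String, Long> store = streams.store(
    StoreQueryParameters.fromNameAndType("counts-store", QueryableStoreTypes.keyValueStore()));

Long count = store.get("some_key");   // null if the key is not held locally
System.out.println("Current count for some_key: " + count);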

ERROR HANDLING AND FAULT TOLERANCE

Kafka Streams applications are designed to be fault-tolerant. They automatically handle and recover from failures, ensuring continuous data processing.

INTEGRATION WITH KAFKA CONNECT AND KAFKA PRODUCER/CONSUMER

Kafka Streams can easily integrate with Kafka Connect to move data between Kafka topics and external systems. Additionally, you can use Kafka producers and consumers within Kafka Streams applications to interact with external systems and services.

KAFKA SECURITY

Ensuring the security of your Apache Kafka cluster is critical to protecting sensitive data and preventing unauthorized access. Kafka provides various security features and configurations to safeguard your data streams. Let's explore some essential aspects of Kafka security:

AUTHENTICATION AND AUTHORIZATION

Kafka supports both authentication and authorization mechanisms to control access to the cluster.

Authentication

Kafka offers several authentication options, including:

• SSL Authentication: Secure Sockets Layer (SSL) enables encrypted communication between clients and brokers, ensuring secure authentication.

• SASL Authentication: Simple Authentication and Security Layer (SASL) provides pluggable authentication mechanisms, such as PLAIN, SCRAM, and GSSAPI (Kerberos).

Authorization

Kafka allows fine-grained control over access to topics and operations using Access Control Lists (ACLs). With ACLs, you can define which users or groups are allowed to read, write, or perform other actions on specific topics.
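ACLs can be managed with the kafka-acls.sh tool or programmatically through the Admin client. A sketch using the Admin API (the principal, topic, and bootstrap address are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class CreateReadAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Allow user "alice" to read topic "my_topic" from any host
            AclBinding readAcl = new AclBinding(
                new ResourcePattern(ResourceType.TOPIC, "my_topic", PatternType.LITERAL),
                new AccessControlEntry("User:alice", "*", AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(Collections.singletonList(readAcl)).all().get();
        }
    }
}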
ENCRYPTION

Kafka provides data encryption to protect data while it's in transit between clients and brokers.

SSL Encryption

SSL encryption, when combined with authentication, ensures secure communication between clients and brokers by encrypting the data transmitted over the network.

Encryption at Rest

To protect data at rest, you can enable disk-level encryption on the Kafka brokers.

Secure ZooKeeper

As Kafka relies on ZooKeeper for cluster coordination, securing ZooKeeper is also crucial.

Chroot

Kafka allows you to isolate the ZooKeeper instance used by Kafka by using a chroot path. This helps prevent other applications from accessing Kafka's ZooKeeper instance.

Secure ACLs

Ensure that the ZooKeeper instance used by Kafka has secure ACLs set up to restrict access to authorized users and processes.

SECURE REPLICATION

If you have multiple Kafka brokers, securing replication between them is essential.

Inter-Broker Encryption

Enable SSL encryption for inter-broker communication to ensure secure data replication.

Controlled Shutdown

Configure controlled shutdown to ensure brokers shut down gracefully without causing data loss or inconsistency during replication.

Security Configuration

To enable security features in Kafka, you need to modify the Kafka broker configuration and adjust the client configurations accordingly.

Broker Configuration

In the server.properties file, you can configure the following security-related properties:

listeners=PLAINTEXT://:9092,SSL://:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/path/to/keystore.jks
ssl.keystore.password=keystore_password
ssl.key.password=key_password

Client Configuration

In the client applications, you need to set the security properties to match the broker configuration:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9093");
props.put("security.protocol", "SSL");
props.put("ssl.keystore.location", "/path/to/client_keystore.jks");
props.put("ssl.keystore.password", "client_keystore_password");
props.put("ssl.key.password", "client_key_password");

REPLICATION FACTOR


Replication factor is a crucial concept in Apache Kafka that ensures data availability and fault tolerance within a Kafka cluster. It defines the number of copies, or replicas, of each Kafka topic partition that should be maintained across the brokers in the cluster. By having multiple replicas of each partition, Kafka ensures that even if some brokers or machines fail, the data remains accessible and the cluster remains operational.

HOW REPLICATION FACTOR WORKS

When a new topic is created or when an existing topic is configured to have a specific replication factor, Kafka automatically replicates each partition across multiple brokers. The partition leader is the primary replica responsible for handling read and write requests for that partition, while the other replicas are called follower replicas.
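For example, a topic with a replication factor of 3 can be created with kafka-topics.sh --replication-factor 3 or programmatically. A sketch using the Admin client (the topic name and partition count are placeholders; the cluster must have at least three brokers):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, each kept on 3 brokers (1 leader + 2 followers)
            NewTopic topic = new NewTopic("my_replicated_topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}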
MODIFYING REPLICATION FACTOR

Changing the replication factor of an existing topic involves reassigning partitions and adding or removing replicas. This process should be performed carefully, as it may impact the performance of the cluster during rebalancing.

To increase the replication factor, you need enough brokers to host the additional replicas, and you then reassign the partitions with the new replication factor using the kafka-reassign-partitions.sh tool.

To decrease the replication factor, you reassign the partitions with fewer replicas before removing any brokers from the cluster.
PARTITIONS

Partitions are a fundamental concept in Apache Kafka that allows data to be distributed and parallelized across multiple brokers in a Kafka cluster. A topic in Kafka is divided into one or more partitions, and each partition is a linearly ordered sequence of messages. Understanding partitions is crucial for optimizing data distribution, load balancing, and managing data retention within Kafka.

HOW PARTITIONS WORK

When a topic is created, it is divided into a configurable number of partitions. Each partition is hosted on a specific broker in the Kafka cluster. The number of partitions in a topic is set when the topic is created; the count can be increased later, but it can never be reduced. Messages produced to a topic are written to one of its partitions based on the message's key or using a round-robin mechanism if no key is provided.

BENEFITS OF PARTITIONS

Partitioning provides several advantages:

• Scalability: Partitions enable horizontal scaling of Kafka, as data can be distributed across multiple brokers. This allows Kafka to handle large volumes of data and high-throughput workloads.

• Parallelism: With multiple partitions, Kafka can process and store messages in parallel. Each partition acts as an independent unit, allowing multiple consumers to process data simultaneously, which improves overall system performance.

• Load Balancing: Kafka can distribute partitions across brokers, which balances the data load and prevents any single broker from becoming a bottleneck.

PARTITION KEY

When producing messages to a Kafka topic, you can specify a key for each message. The key is optional, and if not provided, messages are distributed to partitions using a round-robin approach. When a key is provided, Kafka uses the key to determine the partition to which the message will be written.
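In the Java producer this is simply the key argument of ProducerRecord; records with the same key always land in the same partition. A sketch, reusing the producer configured earlier (topic, key, and values are placeholders):

// All records for key "user-42" go to the same partition,
// so they are consumed in the order they were produced
ProducerRecord<String, String> keyed =
    new ProducerRecord<>("my_topic", "user-42", "profile_updated");
producer.send(keyed);

// Without a key, records are spread across partitions
ProducerRecord<String, String> unkeyed =
    new ProducerRecord<>("my_topic", "some_event");
producer.send(unkeyed);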
CHOOSING THE NUMBER OF PARTITIONS

The number of partitions for a topic is an important consideration and should be chosen carefully based on your use case and requirements.


• Concurrency and Throughput: A higher number of partitions allows for more parallelism and concurrency during message production and consumption. It is particularly useful when you have multiple producers or consumers and need to achieve high throughput.

• Balanced Workload: The number of partitions should be greater than or equal to the number of consumers in a consumer group. This ensures a balanced workload distribution among consumers, avoiding idle consumers and improving overall consumption efficiency.

• Resource Considerations: Keep in mind that increasing the number of partitions increases the number of files and resources needed to manage them. Thus, it can impact disk space and memory usage on the brokers.
MODIFYING PARTITIONS

Once a topic is created, its partition count can be increased but never decreased. Changing the number of partitions still requires careful planning, because adding partitions changes which partition a given key maps to for new messages.

Increasing Partitions

To increase the number of partitions, use kafka-topics.sh --alter with the new --partitions count. Existing messages are not redistributed, so if data must be repartitioned by key, create a new topic with the desired partition count and copy the data over with a consumer/producer job or a stream processing application.
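The partition count can also be increased programmatically. A sketch using the Admin client (topic name and target count are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Grow my_topic to 6 partitions in total (the count can only grow)
            admin.createPartitions(
                Collections.singletonMap("my_topic", NewPartitions.increaseTo(6)))
                .all().get();
        }
    }
}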
Decreasing Partitions

Decreasing the number of partitions is more challenging: Kafka does not support it directly, so you have to create a new topic with fewer partitions and migrate the data to it, reassigning messages manually to maintain data integrity.
BATCH SIZE

Batch size in Apache Kafka refers to the amount of data that a producer accumulates and sends together as a single batch to the brokers. By sending messages in batches instead of individually, Kafka can achieve better performance and reduce network overhead. Configuring an appropriate batch size is essential for optimizing Kafka producer performance and message throughput.

HOW BATCH SIZE WORKS

When a Kafka producer sends messages to a broker, it can choose to batch multiple messages together before sending them over the network. The producer collects messages until the batch size reaches a configured limit or until a certain time period elapses. Once the batch size or time limit is reached, the producer sends the entire batch to the broker in a single request.

CONFIGURING BATCH SIZE

In Kafka, you can configure the batch size for a producer using the batch.size property. This property specifies the maximum number of bytes that a batch can contain. The default value is 16384 bytes (16 KB).

You can adjust the batch size based on your use case, network conditions, and message size. Setting a larger batch size can improve throughput, but it might also increase the latency for individual messages within the batch. Conversely, a smaller batch size may reduce latency but could result in a higher number of requests and increased network overhead.
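In the Java producer these are ordinary configuration properties. A sketch (the values are illustrative; linger.ms adds a small wait so batches have time to fill up):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("batch.size", 32768);   // up to 32 KB per batch (default is 16384)
props.put("linger.ms", 10);       // wait up to 10 ms for more records before sending

KafkaProducer<String, String> producer = new KafkaProducer<>(props);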
MONITORING BATCH SIZE

Monitoring the batch size is crucial for optimizing producer performance. You can use Kafka's built-in metrics and monitoring tools to track batch size-related metrics, such as average batch size, maximum batch size, and batch send time.


COMPRESSION

Compression in Apache Kafka is a feature that allows data to be compressed before it is stored on brokers or transmitted between producers and consumers. Kafka supports various compression algorithms to reduce data size, improve network utilization, and enhance overall system performance. Understanding compression options in Kafka is essential for optimizing storage and data transfer efficiency.

HOW COMPRESSION WORKS

When a producer sends messages to Kafka, it can choose to compress the messages before transmitting them to the brokers. Similarly, when messages are stored on the brokers, Kafka can apply compression to reduce the storage footprint. On the consumer side, messages can be decompressed before being delivered to consumers.

COMPRESSION ALGORITHMS IN KAFKA

Kafka supports the following compression algorithms:

• Gzip: Gzip is a widely used compression algorithm that provides good compression ratios. It is suitable for text-based data, such as logs or JSON messages.

• Snappy: Snappy is a fast and efficient compression algorithm that offers lower compression ratios compared to Gzip but with reduced processing overhead. It is ideal for scenarios where low latency is critical, such as real-time stream processing.

• LZ4: LZ4 is another fast compression algorithm that provides even lower compression ratios than Snappy but with even lower processing overhead. Like Snappy, it is well-suited for low-latency use cases.

• Zstandard (Zstd): Zstd is a more recent addition to Kafka's compression options. It provides a good balance between compression ratios and processing speed, making it a versatile choice for various use cases.

CONFIGURING COMPRESSION IN KAFKA

To enable compression in Kafka, you need to configure the producer and broker properties.

Producer Configuration

In the producer configuration, you can set the compression.type property to specify the compression algorithm to use. For example:

compression.type=gzip

Broker Configuration

In the broker configuration, you can specify the compression type for both producer and consumer requests using the compression.type property. For example:

compression.type=gzip

COMPRESSION IN KAFKA STREAMS

When using Apache Kafka Streams, you can also configure compression for the state stores used in your stream processing application. This can help reduce storage requirements for stateful data in the Kafka Streams application.


CONSIDERATIONS FOR COMPRESSION

While compression offers several benefits, it is essential to consider the following factors when deciding whether to use compression:

• Compression Overhead: Applying compression and decompression adds some processing overhead, so it's essential to evaluate the impact on producer and consumer performance.

• Message Size: Compression is more effective when dealing with larger message sizes. For very small messages, the overhead of compression might outweigh the benefits.

• Latency: Some compression algorithms, like Gzip, might introduce additional latency due to the compression process. Consider the latency requirements of your use case.

• Monitoring Compression Efficiency: Monitoring compression efficiency is crucial to understand how well compression is working for your Kafka cluster. You can use Kafka's built-in metrics to monitor the compression rate and the size of compressed and uncompressed messages.

RETENTION POLICY

Retention policy in Apache Kafka defines how long data is retained on brokers within a Kafka cluster. Kafka allows you to set different retention policies at both the topic level and the broker level. The retention policy determines when Kafka will automatically delete old data from topics, helping to manage storage usage and prevent unbounded data growth.

HOW RETENTION POLICY WORKS

When a message is produced to a Kafka topic, it is written to a partition on the broker. The retention policy defines how long messages within a partition are kept before they are eligible for deletion. Kafka uses a combination of time-based and size-based retention to determine which messages to retain and which to delete.

CONFIGURING RETENTION POLICY

The retention policy can be set at both the topic level and the broker level.
Topic-level Retention Policy

When creating a Kafka topic, you can specify the retention policy using the retention.ms property. This property sets the maximum time, in milliseconds, that a message can be retained in the topic.

For example, to set a retention policy of 7 days for a topic:

bin/kafka-topics.sh --create --topic my_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 2 --config retention.ms=604800000
Broker-level Retention Policy

You can also set a default retention policy at the broker level in the server.properties file. The log.retention.hours property specifies the default retention time for topics that don't have a specific retention policy set.

For example, to set a default retention policy of 7 days at the broker level:

log.retention.hours=168
at both the topic level and the broker level. The


SIZE-BASED RETENTION

In addition to time-based retention, Kafka also supports size-based retention. With size-based retention, you can set a maximum size for the partition log. Once the log size exceeds the specified value, the oldest messages in the log are deleted to make space for new messages.

To enable size-based retention, you can use the log.retention.bytes property. For example:

log.retention.bytes=1073741824

LOG COMPACTION

In addition to time and size-based retention, Kafka also provides a log compaction feature. Log compaction retains only the latest message for each unique key in a topic, ensuring that the most recent value for each key is always available. This feature is useful for maintaining the latest state of an entity or for storing changelog-like data.

To enable log compaction for a topic, you can use the cleanup.policy property. For example:

cleanup.policy=compact

CONSIDERATIONS FOR RETENTION POLICY

When configuring the retention policy, consider the following factors:

• Data Requirements: Choose a retention period that aligns with your data retention requirements. Consider the business needs and any regulatory or compliance requirements for data retention.

• Storage Capacity: Ensure that your Kafka cluster has sufficient storage capacity to retain data for the desired retention period, especially if you are using size-based retention or log compaction.

• Message Consumption Rate: Consider the rate at which messages are produced and consumed. If the consumption rate is slower than the production rate, you might need a longer retention period to allow consumers to catch up.

• Message Importance: For some topics, older messages might become less important over time. In such cases, you can use a shorter retention period to reduce storage usage.

KAFKA MONITORING AND MANAGEMENT

Monitoring Kafka is essential to ensure its smooth operation. Here are some tools and techniques for effective Kafka monitoring:

• JMX Metrics: Kafka exposes various metrics through Java Management Extensions (JMX). Tools like JConsole and JVisualVM can help monitor Kafka's internal metrics.


• Kafka Manager: Kafka Manager is a web-based tool that provides a graphical user interface for managing and monitoring Kafka clusters. It offers features like topic management, consumer group monitoring, and partition reassignment.

• Prometheus & Grafana: Integrate Kafka with Prometheus, a monitoring and alerting toolkit, and Grafana, a data visualization tool, to build custom dashboards for in-depth monitoring and analysis.

• Logging: Configure Kafka's logging to capture relevant information for troubleshooting and performance analysis. Proper logging enables easier identification of issues.

HANDLING DATA SERIALIZATION

Kafka allows you to use different data serializers for your messages. Here's how you can handle data serialization in Apache Kafka:

• Avro: Apache Avro is a popular data serialization system. You can use Avro with Kafka to enforce schema evolution and provide a compact, efficient binary format for messages.

• JSON: Kafka supports JSON as a data format for messages. JSON is human-readable and easy to work with, making it suitable for many use cases.

• String: Kafka allows data to be serialized as plain strings. In this method, the data is sent as strings without any specific data structure or schema.

• Bytes: The Bytes serialization is a generic way to handle arbitrary binary data. With this method, users can manually serialize their data into bytes and send it to Kafka as raw binary data.

• Protobuf: Google Protocol Buffers (Protobuf) offer an efficient binary format for data serialization. Using Protobuf can reduce message size and improve performance.
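Custom formats plug in through the Serializer and Deserializer interfaces. As a sketch, a JSON serializer for a hypothetical Person POJO using the Jackson library (an external dependency) might look like this:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

// Person is a plain POJO used here purely for illustration
public class PersonJsonSerializer implements Serializer<Person> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, Person data) {
        if (data == null) {
            return null;
        }
        try {
            return mapper.writeValueAsBytes(data);
        } catch (JsonProcessingException e) {
            throw new SerializationException("Failed to serialize Person to JSON", e);
        }
    }
}

The producer would then reference this class by its fully qualified name in the value.serializer property, with a matching Deserializer implemented on the consumer side.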
KAFKA ECOSYSTEM: ADDITIONAL COMPONENTS

Kafka's ecosystem offers various additional components that extend its capabilities. Here are some essential ones:

• Kafka MirrorMaker: Kafka MirrorMaker is a tool for replicating data between Kafka clusters, enabling data synchronization across different environments.


• Kafka Connect Converters: Converters handle data format conversion between Kafka and other systems when using Kafka Connect.

• Kafka REST Proxy: Kafka REST Proxy allows clients to interact with Kafka using HTTP/REST calls, making it easier to integrate with non-Java applications.

• Schema Registry: Schema Registry manages Avro schemas for Kafka messages, ensuring compatibility and versioning.

CONCLUSION

This was the Apache Kafka Essentials Cheatsheet, providing you with a quick reference to the fundamental concepts and commands for using Apache Kafka. As you delve deeper into the world of Kafka, remember to explore the official documentation and community resources to gain a more comprehensive understanding of this powerful event streaming platform.

JCG delivers over 1 million pages each month to more than 700K software developers, architects and decision makers. JCG offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more.

CHEATSHEET FEEDBACK WELCOME
support@javacodegeeks.com

SPONSORSHIP OPPORTUNITIES
sales@javacodegeeks.com

Copyright © 2014 Exelixis Media P.C. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

