
# APACHE KAFKA: 101

1. Overview of Apache Kafka

a. Introduction

Apache Kafka is a distributed streaming platform that provides a highly
scalable and fault-tolerant system for handling real-time data streams.
It was initially developed by LinkedIn and later open-sourced, becoming
a popular choice for building data pipelines and streaming
applications. Kafka is written in Java and provides a simple and
efficient way to publish, subscribe, and process streams of records.

Kafka is designed to handle high-volume and high-throughput data
streams, making it suitable for a wide range of use cases such as
real-time analytics, log aggregation, event sourcing, and messaging
systems. It provides durability, reliability, and low-latency
processing of data streams, enabling real-time decision-making and
stream processing.

b. Key Features

Apache Kafka offers several key features that make it a powerful
streaming platform:

1. Publish-Subscribe Model: Kafka follows a publish-subscribe
messaging pattern, where producers write messages to topics, and
consumers subscribe to those topics to read the messages.
2. Distributed and Scalable: Kafka is designed to operate in a
distributed manner, allowing you to scale horizontally across
multiple machines and clusters to handle large data volumes and
high-traffic scenarios.
3. Fault-Tolerant: Kafka provides built-in replication and
fault-tolerance mechanisms, allowing you to replicate data across
multiple brokers and recover from failures without losing data.

4. High Throughput and Low Latency: Kafka is optimized for
high-performance data streaming, providing high throughput and low
latency processing of data records.
5. Retention and Replay: Kafka stores all messages for a configurable
period, allowing consumers to rewind and replay messages from any
point in time.
6. Exactly-Once Semantics: Kafka provides support for exactly-once
message processing, ensuring that messages are processed only once
and in the correct order.

c. Use Cases

Apache Kafka finds applications in various domains due to its
versatility and flexibility. Some common use cases include:

1. Log Aggregation: Kafka can collect logs from multiple systems and
applications in real-time, allowing centralized log analysis and
monitoring.
2. Event Streaming and Processing: Kafka enables real-time event
streaming and processing for applications such as fraud detection,
real-time analytics, and complex event processing.
3. Messaging Systems: Kafka can be used as a reliable messaging
system, replacing traditional message brokers, and providing
features like pub-sub messaging and guaranteed message delivery.
4. Microservices Integration: Kafka acts as a communication medium
between microservices, facilitating reliable and scalable data
exchange.
5. Change Data Capture (CDC): Kafka can capture and stream database
changes, allowing real-time synchronization of data across
multiple systems.
6. Internet of Things (IoT): Kafka can handle massive amounts of data
generated by IoT devices, enabling real-time processing and
analytics on IoT data streams.

These are just a few examples of how Apache Kafka can be utilized. Its
flexibility and scalability make it a powerful tool for building modern
data pipelines and streaming applications.

2. Kafka Architecture
To understand Apache Kafka's architecture, let's explore its core
components and their interactions.

a. Topics and Partitions

In Kafka, data is organized into topics. A topic is a category or
stream of records, similar to a table in a database or a folder in a
file system. Topics are divided into partitions to enable parallel
processing and scalability.

Each partition is an ordered and immutable sequence of records. When a
message is produced to a topic, Kafka appends it to the end of the
corresponding partition. Each record in a partition is assigned a
unique offset, which represents its position within that partition.

The partitioning of data across multiple partitions allows for
distributed storage and processing. It ensures that related records
with the same key are written to the same partition, preserving the
order of records for a given key.
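For example, the CLI tools that ship with the Kafka distribution can create and inspect a multi-partition topic. A quick sketch, assuming a broker on localhost:9092 (the topic name and partition count are only illustrations):

bin/kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

bin/kafka-topics.sh --describe --topic orders --bootstrap-server localhost:9092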

b. Producers and Consumers

Producers are responsible for publishing (writing) records to Kafka
topics. They send messages to specific topics and partitions or let
Kafka choose the partition based on a configurable strategy. Producers
can also attach a key to each message to control how messages are
distributed across partitions.

On the other hand, consumers read records from Kafka topics. Consumers
can subscribe to one or more topics and consume messages from the
assigned partitions. Each consumer group can have multiple consumers,
with each consumer assigned to one or more partitions within a topic.

Kafka follows a pull-based model, where consumers request records from
Kafka brokers at their own pace. This decoupling allows consumers to
process data at different rates and provides fault tolerance by
allowing new consumers to join or replace existing ones without
interrupting the flow of data.

c. Brokers and Clusters

A broker is a Kafka server responsible for managing one or more
partitions. Each broker is identified by a unique numeric ID and can
handle read and write requests from producers and consumers. Brokers
store the records on disk in a commit log format, providing durability
and fault tolerance.

A collection of Kafka brokers forms a cluster. Clusters provide high
availability and fault tolerance by replicating partitions across
multiple brokers. Each partition has a leader and multiple replicas.
The leader handles all read and write requests for the partition, while
replicas replicate the partition's data for fault tolerance.

d. ZooKeeper

Kafka relies on Apache ZooKeeper for cluster coordination, leader
election, and storing metadata. ZooKeeper is a distributed coordination
service that ensures the consistency and reliability of the Kafka
cluster.

ZooKeeper keeps track of the Kafka brokers, topics, partitions, and
consumer groups. It elects a controller, responsible for managing the
leader replicas for each partition. If the controller fails, ZooKeeper
selects a new controller to maintain the cluster's operations.

e. Replication and Fault Tolerance

Kafka provides replication to ensure high availability and durability
of data. Each partition has multiple replicas, with one replica acting
as the leader and the others as followers. The leader handles all read
and write requests, while the followers replicate the data by receiving
the leader's log and staying in sync.

If a leader fails, the controller elects one of the in-sync followers
as the new leader. This automatic leader election ensures continuous operation
even in the presence of failures. Replication also enables data
durability since the data is stored on multiple brokers.

f. Message Storage and Retention

Kafka stores messages for a configurable period, which allows consumers
to consume messages at their own pace and replay messages if needed.
The retention period can be set based on time (e.g., 7 days) or size
(e.g., 1 TB).

Once the retention period or size threshold is reached, Kafka
automatically deletes old messages. However, consumers control their
own offsets, meaning they can choose to consume from any point within a
topic, as long as that data has not yet been removed by the retention
policy.
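For example, retention can be adjusted per topic with the kafka-configs.sh tool. A sketch, assuming a recent Kafka release and a topic named my-topic (older releases use --zookeeper instead of --bootstrap-server):

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config retention.ms=604800000

This keeps messages in my-topic for roughly 7 days (604800000 milliseconds).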

3. Installation and Setup


In this section, we'll guide you through the process of installing and
setting up Apache Kafka. By the end of this section, you'll have a
fully functional Kafka environment ready for development and
exploration.

a. Prerequisites

Before installing Kafka, ensure that you have the following
prerequisites:

1. Java: Kafka is built on Java, so you'll need a Java Development Kit
(JDK) installed on your machine. Kafka requires Java 8 or later
versions.
2. Operating System: Kafka can run on various operating systems,
including Linux, macOS, and Windows. Make sure your operating
system is compatible.

b. Downloading Kafka

To download Kafka, follow these steps:

1. Visit the Apache Kafka official website: https://kafka.apache.org/downloads.

2. Choose the desired Kafka version to download. It's recommended to
select the latest stable version.
3. Download the Kafka binary distribution that corresponds to your
operating system.

c. Configuring Kafka

Once you have downloaded Kafka, it's time to configure it. The
configuration files are located in the config directory within the Kafka
installation.

1. Open the server.properties file located in the config directory.
2. Review and modify the following properties as per your
requirements:
○ broker.id: A unique ID for each Kafka broker in the cluster.
○ listeners: The network interface and port on which Kafka
listens for incoming connections.
○ log.dirs: The directory where Kafka stores its data logs.
○ zookeeper.connect: The ZooKeeper connection string for Kafka
to connect to the ZooKeeper ensemble.
3. Save the changes to the server.properties file.
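For reference, a minimal server.properties for a single-broker setup might look like the following; the values roughly mirror the defaults shipped with Kafka, so adjust paths and hostnames for your environment:

broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181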

d. Starting ZooKeeper and Kafka Servers

Kafka relies on ZooKeeper for cluster coordination. Therefore, you need
to start a ZooKeeper server before starting Kafka.

● Open a terminal or command prompt.
● Navigate to the Kafka installation directory.
● Start the ZooKeeper server by running the following command:

bin/zookeeper-server-start.sh config/zookeeper.properties
This command starts the ZooKeeper server using the provided
configuration file.
● Once the ZooKeeper server is up and running, open a new terminal
or command prompt.
● Navigate to the Kafka installation directory.
● Start the Kafka server by running the following command:

bin/kafka-server-start.sh config/server.properties

This command starts the Kafka server using the provided configuration
file.

e. Verifying the Installation

To verify that Kafka is installed and running correctly, follow these
steps:

● Open a new terminal or command prompt.
● Navigate to the Kafka installation directory.
● Create a test topic by running the following command:

bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
This command creates a topic named "test-topic" with a single partition
and a replication factor of 1.
● Start a producer to publish messages to the test topic by running
the following command:
bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092
This command starts an interactive console where you can enter messages
to publish.
● Open a new terminal or command prompt.
● Navigate to the Kafka installation directory.
● Start a consumer to read messages from the test topic by running
the following command:
bin/kafka-console-consumer.sh --topic test-topic --bootstrap-server localhost:9092 --from-beginning

This command starts a console consumer that reads messages from the
beginning of the topic.

● In the producer terminal, enter a few messages and press Enter to
publish them.
● Observe that the consumer terminal receives and displays the
messages you entered.

If you can successfully publish and consume messages, it means that
Kafka is installed and working correctly on your system.

4. Kafka Producers
In Apache Kafka, producers are responsible for publishing (writing)
messages to Kafka topics. In this section, we'll dive into the details
of Kafka producers, covering topics such as producing messages,
serialization, message keys, compression, acknowledgments, error
handling, and more.
a. Producing Messages
To publish messages to a Kafka topic, follow these steps:

● Create a Kafka producer instance by specifying the Kafka bootstrap
servers and the desired configuration properties.

from kafka import KafkaProducer

# Create a Kafka producer instance
producer = KafkaProducer(bootstrap_servers='localhost:9092')
● Use the send() method of the producer to send messages to a topic.
The message should be provided as bytes or a string.

# Send a message to a topic
producer.send('my-topic', b'Hello Kafka!')
● You can send messages with different values to the same or
different topics as needed.
● Close the producer to ensure that all buffered records are flushed
and transmitted to Kafka.

# Close the Kafka producer
producer.close()

b. Serialization and Avro


When producing messages, you often need to serialize your data into a
format that Kafka can handle. By default, Kafka handles byte arrays as
messages. However, you can use serialization to convert your data into
various formats, such as JSON, Avro, or Protobuf.
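For simple formats, kafka-python lets you plug a serializer function directly into the producer. A minimal sketch using JSON (the topic name my-topic is just a placeholder):

import json
from kafka import KafkaProducer

# Producer that serializes Python dicts to JSON bytes
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# The dict is converted to JSON before being sent
producer.send('my-topic', {'id': 1, 'name': 'John Doe'})
producer.flush()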

For example, to use Avro serialization, you need to configure the
producer with an Avro serializer:

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Define the Avro schema for the message
value_schema = avro.loads("""
{
  "type": "record",
  "name": "MyMessage",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
""")

# Configure the Avro producer with the broker, schema registry, and schema
avro_producer = AvroProducer({
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}, default_value_schema=value_schema)

# Create a message that conforms to the Avro schema
message = {"id": 1, "name": "John Doe"}

# Produce the Avro-encoded message to a topic
avro_producer.produce(topic='my-topic', value=message)
avro_producer.flush()

Ensure that you have the necessary Avro libraries installed and provide
the Avro schema and schema registry URL when creating the Avro
producer.
c. Message Keys
When producing messages, you can optionally assign a message key to
each message. The key is a string or bytes that helps determine the
partition to which the message will be written. Kafka guarantees that
messages with the same key will always go to the same partition,
ensuring that messages with a specific key are ordered within a
partition.
To send messages with keys, modify the producer code as follows:

# Send a message with a key to a topic
producer.send('my-topic', key=b'key1', value=b'Hello Kafka!')

d. Message Compression
Kafka supports compressing messages to reduce network bandwidth and
storage costs. You can configure the producer to compress messages using
different compression algorithms such as GZIP, Snappy, or LZ4.
To enable compression, set the compression_type property when creating
the producer:

from kafka import KafkaProducer

# Create a Kafka producer with compression
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         compression_type='gzip')

Kafka will automatically compress messages based on the specified
compression type.
e. Message Partitioning
Kafka partitions allow for parallelism and scalability. By default,
Kafka uses a round-robin partitioning strategy to distribute messages
evenly across partitions. However, you can also implement a custom
partitioner to define how messages should be assigned to partitions
based on specific criteria.
To use a custom partitioner with kafka-python, you define a function
that receives the serialized message key together with the lists of all
partitions and currently available partitions, and returns the desired
partition number:

from kafka import KafkaProducer

# Custom partitioning function: kafka-python calls it with the serialized
# key and the lists of all/available partition ids for the topic
def my_partitioner(key_bytes, all_partitions, available_partitions):
    # Example logic: hash the key and map it onto one of the partitions
    if key_bytes is None:
        return available_partitions[0]
    return all_partitions[hash(key_bytes) % len(all_partitions)]

# Create a Kafka producer with a custom partitioner
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         partitioner=my_partitioner)

● Implement the my_partitioner() function according to your
partitioning logic, and assign it to the partitioner parameter
when creating the producer.

f. Producer Acknowledgments
Kafka allows you to configure different levels of producer
acknowledgments to control the durability and reliability of message
publishing. There are three acknowledgment modes:

● acks=0: The producer does not wait for acknowledgment. Messages
are considered sent as soon as they are handed off to the network
buffer.
● acks=1: The producer waits for the partition leader to acknowledge
the message. Data can still be lost if the leader fails before its
followers have replicated the message, and retries can introduce
duplicates.
● acks=all/-1: The producer waits for all in-sync replicas to
acknowledge the message. This mode provides the strongest
durability guarantee but introduces higher latency.

To set the acknowledgment mode, modify the producer configuration:

from kafka import KafkaProducer

# Create a Kafka producer with acknowledgment configuration
producer = KafkaProducer(bootstrap_servers='localhost:9092', acks='all')
● Choose the appropriate acknowledgment mode based on your
application's requirements for durability and latency.

g. Error Handling and Retries


When producing messages, it's important to handle errors gracefully.
Kafka provides various mechanisms for error handling and retries.

● Retrying on Errors: By default, the Kafka producer automatically
retries sending messages on certain types of errors, such as
network failures or leader unavailability. Retries are controlled
by the retries configuration property, which specifies the maximum
number of retries.
● Error Handling Callback: You can attach an error callback to each
send() call to handle failures that occur during message
production. The callback is invoked for each failed send and
receives the exception as a parameter.

from kafka import KafkaProducer

# Error handling callback function
def error_callback(exc):
    # Implement your error handling logic (e.g., log and alert)
    print('Failed to send message: %s' % exc)

# Create a Kafka producer
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Attach the callback to an individual send; it runs if that send fails
producer.send('my-topic', b'Hello Kafka!').add_errback(error_callback)

Implement the error_callback() function to handle errors according to
your application's requirements.

5. Kafka Consumers
In Apache Kafka, consumers read messages from Kafka topics. In this
section, we'll explore the details of Kafka consumers, covering topics
such as consuming messages, consumer groups, offsets, message
processing semantics, error handling, and rebalancing.

a. Consuming Messages

To consume messages from a Kafka topic, follow these steps:

● Create a Kafka consumer instance by specifying the Kafka bootstrap
servers and the desired configuration properties.

from kafka import KafkaConsumer

# Create a Kafka consumer instance
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

● Subscribe to one or more topics to receive messages from.


# Subscribe to a single topic
consumer.subscribe(['my-topic'])

# Subscribe to multiple topics
consumer.subscribe(['topic1', 'topic2'])

● Use the poll() method of the consumer to fetch messages from
Kafka. The poll() method returns a dict that maps each assigned
partition to a batch of records from the subscribed topics.
# Fetch messages from Kafka
records = consumer.poll(timeout_ms=1000)

# Iterate over the fetched records, partition by partition
for partition_records in records.values():
    for record in partition_records:
        print(record.value)

Process the received messages as needed.


● Optionally, commit the offsets to mark the messages as processed.
This ensures that the consumer starts from the correct position if
it is restarted.

# Commit the offsets
consumer.commit()

Close the consumer when you're done consuming messages.

# Close the Kafka consumer
consumer.close()

b. Consumer Groups

Kafka allows multiple consumers to work together as part of a consumer
group. A consumer group provides scalability, fault tolerance, and
parallel processing of messages.

When consumers join a consumer group and subscribe to a topic, Kafka
automatically assigns different partitions of the topic to each
consumer within the group. Each consumer in the group processes
messages from its assigned partitions independently.

To leverage consumer groups, ensure that all consumers within a group
use the same group_id while creating the Kafka consumer:

from kafka import KafkaConsumer

# Create a Kafka consumer with a group ID
consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         group_id='my-consumer-group')

By having multiple consumers in the same consumer group, you can scale
the processing capacity by adding more consumers. Kafka automatically
balances the partition assignment across the consumers in the group.

c. Offsets and Committing

Kafka keeps track of the offsets that indicate the position of a
consumer within a partition. The offset represents the next message the
consumer will read from.

By default, Kafka consumers automatically manage the offsets and keep
track of the last consumed message. However, you can also manually
control the offsets to have more fine-grained control over message
processing.

To manually commit offsets, follow these steps:

● Disable automatic offset commits when creating the consumer by
setting the enable_auto_commit configuration property to False.

from kafka import KafkaConsumer

# Create a Kafka consumer with manual offset control
# (a group_id is required for committing offsets)
consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         group_id='my-consumer-group',
                         enable_auto_commit=False)

● After processing a batch of messages, commit the offsets to mark
them as processed.

# Commit the offsets
consumer.commit()

● By committing the offsets, the consumer's position is updated, and
Kafka knows that the messages have been processed. On the next
poll, the consumer will continue consuming from the next available
offset.

d. Message Ordering and Processing Semantics

Kafka ensures message ordering within a partition, meaning messages
with the same key or messages without a key are processed in the order
they are written to the partition.

However, when using multiple partitions, the ordering of messages
across partitions is not guaranteed. If strict message ordering is
required, consider using a single partition or designing the message
key to ensure ordering.

In terms of processing semantics, Kafka offers three modes:

● At Most Once: The consumer commits the offset as soon as it
receives the message, before processing it. If the consumer fails
during processing, the message is not redelivered and may be lost.
● At Least Once: The consumer commits the offset only after ensuring
that the message has been fully processed and acknowledged. If the
consumer fails before committing, the message will be processed
again after recovery.

● Exactly Once: This is the strongest guarantee and requires
cooperation between the producer and consumer. It ensures that
each message is processed once and only once, even in the presence
of failures and retries. Achieving exactly-once semantics
typically involves leveraging Kafka's transactional API and
careful application design.

The choice of processing semantics depends on your application's
requirements for data consistency and fault tolerance.
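As an illustration of the at-least-once pattern with kafka-python (a sketch; the topic and group names are placeholders), the consumer processes each record before committing its offset:

from kafka import KafkaConsumer

consumer = KafkaConsumer('my-topic',
                         bootstrap_servers='localhost:9092',
                         group_id='my-consumer-group',
                         enable_auto_commit=False)

for record in consumer:
    # Process the record first...
    print(record.offset, record.value)
    # ...then commit, so an unprocessed record is re-read after a crash
    consumer.commit()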

e. Error Handling and Retry Policies

When consuming messages, it's essential to handle and manage errors
gracefully. Kafka consumers provide various mechanisms for error
handling and retries.

● Retrying on Errors: By default, Kafka consumers automatically
handle certain types of errors, such as network issues or leader
unavailability, by retrying the consumption of failed messages.
Retries are controlled by the consumer's configuration properties.
● Handling Exceptions: You can wrap the consumption loop in a
try/except block and delegate errors that occur during message
consumption to your own error handling function, which receives
the exception as a parameter.

from kafka import KafkaConsumer
from kafka.errors import KafkaError

# Error handling function
def handle_error(exc):
    # Implement your error handling logic (e.g., log and alert)
    print('Consumer error: %s' % exc)

# Create a Kafka consumer
consumer = KafkaConsumer('my-topic', bootstrap_servers='localhost:9092')

# Wrap consumption in a try/except and delegate failures to the handler
try:
    for record in consumer:
        print(record.value)
except KafkaError as exc:
    handle_error(exc)

● Implement the handle_error() function to handle errors according
to your application's requirements.

f. Rebalancing

Consumer groups in Kafka support rebalancing when consumers join or
leave the group. Rebalancing ensures that the partitions are
distributed evenly across the active consumers, even if consumers join
or leave the group dynamically.

During rebalancing, Kafka redistributes the partitions among the
consumers, and each consumer is assigned a fair share of partitions to
process. The rebalancing process is handled automatically by Kafka, and
consumers continue processing messages once the rebalancing is
complete.

Rebalancing may trigger a brief interruption in message processing, so
it's important to design your application to handle rebalancing
gracefully without losing any data.

6. Kafka Streams
Apache Kafka Streams is a powerful library that allows you to build
real-time stream processing applications. It provides a high-level API
for processing and analyzing data streams in a scalable and
fault-tolerant manner. In this section, we'll dive into Kafka Streams,
covering concepts such as stream processing, building stream processing
applications, stateful processing, and exactly-once processing
semantics.

a. Introduction to Kafka Streams

Kafka Streams simplifies the development of stream processing
applications by providing a lightweight and integrated solution within
the Kafka ecosystem. It enables you to process data streams directly
from Kafka topics, perform transformations and aggregations, and
produce output streams back to Kafka or other systems.

With Kafka Streams, you can write stream processing applications using
familiar programming constructs, such as filtering, mapping, and
joining, in a way that is scalable and fault-tolerant.

b. Stream Processing Concepts

Before diving into building stream processing applications, let's
understand some key concepts:

● Streams: Streams represent an unbounded sequence of records in
Kafka. Each record consists of a key, a value, and a timestamp.
Streams allow processing of data records as they arrive, enabling
real-time analysis and transformations.
● Tables: Tables are an abstraction over streams that provide a
logical view of the data as a table. They allow you to perform
SQL-like operations, such as filtering, aggregations, and joins.
● Windowing: Windowing is a concept in stream processing that allows
you to group records based on time or other criteria. Windows
enable operations such as time-based aggregations or computations
over fixed-size subsets of records.
● State Stores: State stores provide a mechanism for storing and
maintaining state during stream processing. They allow you to
store and access data that is required for computations or joins
across multiple records.

c. Building Stream Processing Applications

To build a stream processing application using Kafka Streams, follow
these steps. Note that Kafka Streams is a Java library, so the snippets
below are Python-style pseudocode that mirrors the Java Streams DSL:

● Set up the necessary dependencies in your project. Ensure that you
have the Kafka Streams library added to your project's
dependencies.
● Create a StreamsBuilder instance to define the processing topology.

# Pseudocode: in Java these correspond to
# org.apache.kafka.streams.StreamsBuilder and
# org.apache.kafka.streams.KafkaStreams

# Create a StreamsBuilder instance
builder = StreamsBuilder()

● Define the input and output topics for your application.


input_topic = 'input-topic'
output_topic = 'output-topic'

● Build the processing logic by defining the stream operations using
the builder.

# Pseudocode: Serdes corresponds to
# org.apache.kafka.common.serialization.Serdes in Java

# Create the input stream from the input topic
input_stream = builder.stream(input_topic)

# Perform stream transformations
transformed_stream = input_stream.filter(lambda key, value: value > 10)

# Write the transformed stream to the output topic
transformed_stream.to(output_topic, key_serde=Serdes.String(),
                      value_serde=Serdes.Integer())

In this example, the stream is filtered based on a condition, and the
filtered records are written to the output topic.

● Create a KafkaStreams instance with the defined processing
topology.
# Create a KafkaStreams instance
streams = KafkaStreams(builder.build(), bootstrap_servers='localhost:9092')

● Start the Kafka Streams application.

# Start the Kafka Streams application
streams.start()
Once started, the application will continuously process incoming
records from the input topic and produce the results to the output
topic.
● Optionally, handle application termination and cleanup.

try:
    # Wait for the application to be terminated (e.g., by user interrupt)
    streams.await_termination()
finally:
    # Clean up resources
    streams.close()

d. Stateful Processing and Interactive Queries

Kafka Streams supports stateful processing by providing built-in
mechanisms for managing and maintaining state during stream processing.
Stateful processing allows you to perform complex computations and
aggregations that require information from multiple records.

Kafka Streams also enables interactive queries, which allow you to
query the state stores associated with your stream processing
application. You can query the current state of the application's
tables or perform point lookups based on keys.

Interactive queries are useful for building real-time dashboards,
serving real-time data to external systems, or providing responses to
user queries based on the processed data.

e. Exactly-Once Processing Semantics

Kafka Streams provides support for achieving exactly-once processing
semantics. Exactly-once semantics ensure that each record is processed
exactly once, even in the presence of failures and retries.

To achieve exactly-once semantics, Kafka Streams leverages Kafka's
transactional API and provides end-to-end exactly-once processing
guarantees. This involves atomic processing and transactional writes to
both input and output Kafka topics.

By leveraging Kafka Streams and its capabilities, you can build
powerful and scalable stream processing applications that process
real-time data from Kafka topics. The library simplifies the development
process and provides robust support for stateful processing and
exactly-once semantics.

7. Kafka Connect
Apache Kafka Connect is a framework that enables seamless integration
of Kafka with external systems. It simplifies the process of building
and managing data pipelines, allowing you to efficiently import and
export data from Kafka to various data sources and sinks. In this
section, we'll delve into Kafka Connect, covering concepts such as
connectors, source/sink tasks, configuration, monitoring, and scaling.

a. Introduction to Kafka Connect

Kafka Connect acts as a bridge between Kafka and other systems,
enabling reliable and scalable data ingestion and egress. It provides a
framework for easily building, deploying, and managing connectors that
define the integration between Kafka and external systems.

Connectors are responsible for moving data between Kafka and the
connected system. They handle tasks such as data serialization,
transformation, and delivery, ensuring seamless and efficient data
movement.

b. Connectors and Source/Sink Tasks

Connectors in Kafka Connect are divided into source connectors and sink
connectors:

● Source Connectors: Source connectors ingest data from external
systems and publish it as messages to Kafka topics. They enable
capturing data changes or events from various sources and making
them available in Kafka.
● Sink Connectors: Sink connectors consume messages from Kafka
topics and write them to external systems or sinks. They
facilitate streaming data from Kafka to databases, file systems,
message queues, or other systems.

Connectors consist of tasks that are responsible for performing the
actual data movement. Each connector may have one or more tasks running
in parallel to achieve parallelism and scalability.

c. Configuring Connectors

To configure Kafka Connect and its connectors, follow these steps:

● Start by defining the Kafka Connect worker configuration. This
configuration specifies the connection details for Kafka, the worker
properties, and other global settings.
● Create a connector configuration file for the specific connector you
want to use. The connector configuration file includes properties
related to the connector, such as the source/sink system
configuration, topics to read from or write to, and data
transformation options.
● Start the Kafka Connect distributed worker and provide the worker
configuration.

bin/connect-distributed.sh config/worker.properties

● Use the Kafka Connect REST API or a connector-specific command to
create, manage, and monitor connectors. The REST API provides
endpoints for creating, updating, and deleting connectors, as well
as retrieving status and configuration information.
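For example, assuming a Connect worker with its REST API on localhost:8083, you could create a simple file source connector with a request like the one below (the file path and topic name are placeholders):

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{"name": "local-file-source",
       "config": {"connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                  "tasks.max": "1",
                  "file": "/tmp/input.txt",
                  "topic": "file-events"}}'

The worker responds with the connector's configuration and status, and GET http://localhost:8083/connectors lists the connectors currently running.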

d. Example Connectors

Kafka Connect offers a wide range of pre-built connectors that you can
use out of the box. These connectors support integration with popular
systems such as databases, cloud storage services, messaging systems,
and more. Some examples of pre-built connectors include:

● JDBC Connector: Allows you to import and export data between Kafka
and relational databases using Java Database Connectivity (JDBC).
● Elasticsearch Connector: Enables indexing and querying Kafka data
in Elasticsearch, a distributed search and analytics engine.
● Amazon S3 Connector: Facilitates the integration between Kafka and
Amazon Simple Storage Service (S3), allowing you to store and
retrieve data in S3.
● Debezium Connectors: Provides connectors for capturing and
streaming change data from various databases, such as MySQL,
PostgreSQL, MongoDB, and Oracle, into Kafka.

These pre-built connectors serve as a starting point for common
integration scenarios. You can also develop custom connectors tailored
to your specific requirements.

e. Monitoring and Scaling Connectors

Kafka Connect provides mechanisms for monitoring and scaling connectors
to handle different workloads and ensure high availability.

● Monitoring: Kafka Connect exposes various metrics and monitoring
capabilities through JMX (Java Management Extensions) or REST
APIs. These metrics allow you to monitor the health, performance,
and status of the connectors, tasks, and workers.
● Scaling: Kafka Connect can scale horizontally by adding more
workers to handle higher loads and improve throughput. Multiple
workers can distribute the connector tasks across different nodes,
enabling parallelism and load balancing.

By leveraging Kafka Connect, you can easily integrate Kafka with
external systems, import data into Kafka from various sources, and
export data from Kafka to different sinks. The framework provides a
standardized and scalable approach for building data pipelines,
reducing complexity and effort in managing data movement.

8. Best Practices and Considerations


To effectively and efficiently use Apache Kafka, it's important to
follow best practices and consider certain factors that impact
performance, scalability, reliability, and security. In this section,
we'll explore some key best practices and considerations for using
Kafka.

a. Topic Design

● Partitioning: Carefully consider the partitioning strategy for
your topics. Partitioning impacts parallelism, scalability, and
message ordering. Distribute partitions evenly across brokers to
ensure balanced workload and avoid hotspots.
● Retention Policy: Set an appropriate retention policy based on
your data retention requirements. Consider factors such as storage
capacity, consumer lag, and data archival needs. Regularly monitor
and manage the retention policy to prevent data loss or
unnecessary resource consumption.
● Compact Topics: Use the compaction feature for topics that require
storing the latest version of each record based on a key. Compact
topics are useful for maintaining the latest state of entities or
storing change events.
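For instance, a compacted topic can be created by setting the cleanup.policy config at creation time (a sketch; the topic name is a placeholder):

bin/kafka-topics.sh --create --topic user-profiles --partitions 3 --replication-factor 1 --config cleanup.policy=compact --bootstrap-server localhost:9092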

b. Producer and Consumer Configuration

● Batching: Configure producers and consumers to use batching for
better performance. Batching allows sending or processing multiple
records in a single network request, reducing overhead. Tune the
batch size and linger time based on your workload characteristics.
● Compression: Enable compression in producers and consumers to
reduce network bandwidth and storage requirements. Choose a
compression algorithm based on the workload and available
resources.
● Consumer Group Management: Carefully manage consumer groups and
their offsets. Avoid using large consumer groups as it can result
in slower rebalancing and higher latency. Monitor consumer lag and
adjust the number of consumers and partitions as needed.
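For instance, with kafka-python these knobs map to producer parameters such as batch_size, linger_ms, and compression_type; the values below are illustrative rather than recommendations:

from kafka import KafkaProducer

# Producer tuned for throughput: larger batches, a short linger, compression
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         batch_size=32768,   # max bytes per partition batch
                         linger_ms=10,       # wait up to 10 ms to fill a batch
                         compression_type='gzip')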

c. Monitoring and Operations

● Metrics Monitoring: Monitor Kafka metrics to gain insights into
cluster health, broker performance, producer and consumer
throughput, and other key indicators. Utilize monitoring tools and
frameworks like Kafka's built-in metrics, Prometheus, or
third-party solutions.
● Alerting: Set up alerting mechanisms to proactively detect and
respond to anomalies or issues in the Kafka cluster. Configure
alerts based on metrics thresholds, consumer lag, partition
under-replication, or any other critical events.
● Backup and Disaster Recovery: Implement a robust backup and
disaster recovery strategy for Kafka. Regularly backup your Kafka
cluster configurations, offsets, and important topics. Test the
recovery process to ensure you can restore the cluster in case of
failures or disasters.

d. Security

● Encryption: Enable encryption for data in transit by using SSL/TLS
to secure communication between clients and brokers. Consider
using encryption at rest to protect data stored on disk.
● Authentication and Authorization: Implement authentication
mechanisms like SASL or SSL client authentication to ensure only
authorized clients can access Kafka. Configure access control lists
(ACLs) to enforce fine-grained authorization policies.
● Securing ZooKeeper: Protect the ZooKeeper ensemble used by Kafka
for metadata and coordination. Secure ZooKeeper by enabling
authentication, restricting access, and securing network
communication.
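On the client side, kafka-python producers and consumers accept TLS settings such as the following sketch; the broker address and certificate paths are placeholders for your own files:

from kafka import KafkaProducer

# Producer connecting over TLS
producer = KafkaProducer(bootstrap_servers='broker.example.com:9093',
                         security_protocol='SSL',
                         ssl_cafile='/path/to/ca.pem',
                         ssl_certfile='/path/to/client-cert.pem',
                         ssl_keyfile='/path/to/client-key.pem')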

These best practices and considerations will help you optimize the
performance, reliability, and security of your Apache Kafka deployment.
Regularly monitor and fine-tune your Kafka configuration based on your
specific requirements and workload characteristics.

By: Waleed Mousa
