Kafka Patterns and Anti-Patterns
Apache Kafka is a distributed streaming platform that originated at LinkedIn and has been a top-level Apache Software Foundation (ASF) project since 2012. At its core, Apache Kafka is a message broker that allows clients to publish and read streams of data (also called events). It has an ecosystem of open-source components which, when combined, help you store these data streams, process them, and integrate them with other parts of your system in a secure, reliable, and scalable manner. After a brief overview, this Refcard dives into select patterns and anti-patterns spanning the Kafka Client APIs, Kafka Connect, and Kafka Streams, covering topics such as reliable messaging, scalability, error handling, and more.

Figure 1: Topic and Partitions in Kafka
Source: https://kafka.apache.org/documentation/#introduction
OVERVIEW OF THE APACHE KAFKA ECOSYSTEM
A Kafka broker (also referred to as a node) is the fundamental building block that runs the Kafka JVM process. Although a single Kafka broker will suffice for development purposes, production systems typically have three or more brokers (odd numbers: 3, 5, 7, etc.) for high availability and scalability. These groups of Kafka brokers form a cluster, and within a cluster, each topic partition has a leader along with follower nodes that replicate data from the leader.
Here is a summary of the key projects that are part of core Kafka:

1. Kafka Client APIs – Producer and Consumer APIs that allow external systems to write data to and read data from Kafka topics, respectively. Kafka has client libraries in many programming languages, with the Java client being part of the core Kafka project.
2. Kafka Connect – Provides a high-level framework to build connectors that help integrate Kafka with external systems. They allow us to move data from external systems to Kafka topics (Source connector) and from Kafka topics into external systems (Sink connector). Popular examples of connectors include the JDBC connector and Debezium.
3. Kafka Streams – A standalone Java library that provides distributed stream processing primitives on top of data in Kafka topics. It provides high-level APIs (DSL and Processor) with which you can create topologies to execute stateless transformations (map, filter, etc.) as well as stateful computations (join, aggregations, etc.) on streaming data; a minimal example follows this list.
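To make the Kafka Streams item concrete, here is a minimal sketch of a topology that applies stateless transformations. The application ID, bootstrap address, and topic names are placeholders, not values from this Refcard:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class FilterMapExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-map-example"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")    // placeholder topic
                .filter((key, value) -> value != null)   // stateless: drop null values
                .mapValues(String::toUpperCase)          // stateless: transform values
                .to("output-topic");                     // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}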
COMMON APACHE KAFKA PATTERNS AND ANTI-PATTERNS

This section will cover some of the common patterns, along with their respective anti-patterns, for the Kafka Producer and Consumer APIs, Kafka Connect, and Kafka Streams.

KAFKA CLIENT API – PRODUCER

The Kafka Producer API sends data to topics in a Kafka cluster. Here are a couple of patterns and anti-patterns to consider:

RELIABLE PRODUCER

GOAL: While producing a message, you want to ensure that it has been sent to Kafka.
PATTERN: Use the acks=all configuration for the producer.
ANTI-PATTERN: Using the default configuration (acks=1).

The acks property allows producer applications to specify the number of acknowledgments that the leader node should have received before considering a request complete. If you don't provide one explicitly, acks=1 is used by default. The client application will receive an acknowledgment as soon as the leader node receives the message and writes it to its local log. If the current leader node crashes before the message has been replicated to follower nodes, the message will be lost.

If you set acks=all (or -1), your application only receives a successful confirmation when all the in-sync replicas in the cluster have acknowledged the message. There is a trade-off between latency and reliability/durability here: Waiting for acknowledgment from all the in-sync replicas will incur more time, but the message will not be lost as long as at least one of the in-sync replicas is available.

A related configuration is min.insync.replicas. Guidance on this topic will be covered later in this Refcard.
NO MORE DUPLICATES

GOAL: The producer needs to be idempotent because your application cannot tolerate duplicate messages.
PATTERN: Set enable.idempotence=true.
ANTI-PATTERN: Using a default configuration.

It is possible that the producer application may end up sending the same message to Kafka more than once. Imagine a scenario where the message is actually received by the leader (and replicated to in-sync replicas if acks=all is used), but the application does not receive the acknowledgement from the leader due to a request timeout, or maybe the leader node just crashed. The producer will try to resend the message; if it succeeds, you will end up with duplicate messages in Kafka. Depending upon your downstream systems, this may not be acceptable.

The Producer API provides a simple way to avoid this by using the enable.idempotence property (which is set to false by default). When set to true, the producer attaches a sequence number to every message. This is validated by the broker so that a message with a duplicate sequence number will get rejected.

From Apache Kafka 3.0 onwards, acks=all and enable.idempotence=true are set by default, thereby providing strong delivery guarantees for the producer.
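For versions prior to 3.0, or simply to make the intent explicit, here is a minimal producer sketch combining both patterns above. The bootstrap address and topic name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // broker rejects duplicate sends

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns a Future; get() surfaces any delivery failure
            producer.send(new ProducerRecord<>("orders", "key-1", "value-1")).get(); // "orders" is a placeholder
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}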
KAFKA CLIENT API – CONSUMER

With the Kafka Consumer API, applications can read data from topics in a Kafka cluster.

IDLE CONSUMER INSTANCES

GOAL: Scale out your data processing pipeline.
PATTERN: Run multiple instances of your consumer application.
ANTI-PATTERN: The number of consumer instances exceeds the number of topic partitions.

A Kafka consumer group is a set of consumers that ingest data from one or more topics. The topic partitions are load-balanced among the consumers in the group. This load distribution is adjusted on the fly when consumer instances are added to or removed from a consumer group. For example, if there are ten topic partitions and five consumers in a consumer group for that topic, Kafka will make sure that each consumer instance receives data from two topic partitions of the topic.
You can end up with a mismatch between the number of consumer instances and topic partitions. This could be due to incorrect topic configuration, wherein the number of partitions is set to one. Or maybe your consumer applications are packaged using Docker and operated on top of an orchestration platform such as Kubernetes, which can, in turn, be configured to auto-scale them.

Keep in mind: You might end up with more instances than partitions. You need to be mindful of the fact that such instances remain inactive and do not participate in processing data from Kafka. Thus, the degree of consumer parallelism is directly proportional to the number of topic partitions. In the best-case scenario, for a topic with N partitions, you can have N instances in a consumer group, each processing data from a single topic partition.

Figure 2: Inactive consumers
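To check whether a group has idle members, you can inspect partition assignments with the consumer groups tool that ships with Kafka; the bootstrap address and group name below are placeholders:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group --members

Each member is listed along with the number of partitions assigned to it; a member showing zero partitions is sitting idle.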
By default, the consumer commits offsets automatically in the background (enable.auto.commit=true), which can lead to data loss. Consider this scenario: The consumer app has read the messages for offsets 198, 199, and 200. The auto-commit process commits these offsets before the application is able to actually process these messages (perhaps through some transformation and storing the result in a downstream system), and the consumer app crashes. In this situation, the new consumer app instance will see that the last committed offset is 200 and will continue reading new messages from there on. Messages from offsets 198, 199, and 200 were effectively lost.

To have greater control over the commit process, you need to explicitly set enable.auto.commit to false and handle the commit process manually. The manual commit API offers synchronous and asynchronous options, and as expected, each of these has its trade-offs.

The code block below shows how to explicitly commit the offset for each message using the synchronous API:
try {
    while (running) {
        ConsumerRecords<String, String> records =
            consumer.poll(Duration.ofMillis(Long.MAX_VALUE));
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords =
                records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                process(record); // placeholder for your processing logic
                // commit the offset of the next message to be read
                consumer.commitSync(Collections.singletonMap(
                    partition, new OffsetAndMetadata(record.offset() + 1)));
            }
        }
    }
} finally {
    consumer.close();
}
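For higher throughput at the cost of weaker guarantees, the same commit can be issued with the asynchronous API. A minimal sketch: commitAsync() does not retry failed commits, so the callback below only logs the failure:

consumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        // the commit failed; offsets holds what we attempted to commit
        System.err.println("Offset commit failed for " + offsets + ": " + exception);
    }
});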
KAFKA CONNECT

Although JSON is a common message format, it does not have a strict schema associated with it. By design, Kafka producer and consumer apps are decoupled from each other. Imagine a scenario where your producer applications introduce additional fields to the JSON payload/events, and your downstream consumer applications are not equipped to handle that and hence fail; this can break your entire data processing pipeline.

For a production-grade Kafka Connect setup, it's imperative that you use a Schema Registry to provide a contract between producers and consumers while still keeping them decoupled. For source connectors, if you want data to be fetched from an external system and stored in Kafka as JSON, you should configure the connector to point to a Schema Registry and also use an appropriate converter. For example:

value.converter=<fully qualified class name of json schema converter implementation>
value.converter.schema.registry.url=<schema registry endpoint e.g. http://localhost:8081>

When reading data from the Kafka topic, the sink connector also needs the same configuration as above.
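As a concrete illustration, with Confluent's Schema Registry distribution, the placeholders above could be filled in as follows (the class name and endpoint assume Confluent's JSON Schema converter; adjust for your setup):

value.converter=io.confluent.connect.json.JsonSchemaConverter
value.converter.schema.registry.url=http://localhost:8081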
However, if you are not using a Schema Registry, the next best option is to use the JSON converter implementation, which is native to Kafka. In this case, you would configure your source and sink connectors as follows:

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true

Thanks to value.converter.schemas.enable=true, the source connector will add an embedded schema payload to each of your JSON messages, and the sink connector will respect that schema as well. An obvious drawback here is the fact that you have schema information in every message. This will increase the size of the message and can impact latency, performance, costs, etc. As always, this is a trade-off that you need to accept.
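For illustration, this is roughly what a message produced with value.converter.schemas.enable=true looks like on the topic; the field names and values are hypothetical:

{
  "schema": {
    "type": "struct",
    "fields": [
      { "field": "id", "type": "int64", "optional": false },
      { "field": "name", "type": "string", "optional": true }
    ],
    "optional": false
  },
  "payload": { "id": 42, "name": "jane" }
}

The schema portion is repeated in every message, which is exactly the size overhead described above.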
If the above is unacceptable, you will need to make a different trade-off with the following configuration:

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false

Now, your JSON messages will be treated as ordinary strings, hence prone to the aforementioned risks to your data processing pipeline. Evolving the structure of your messages will involve scrutinizing and (re)developing your consumer apps to ensure they don't break in response to changes; you need to constantly keep them in sync (manually).

Another thing to be careful about is using the same converter configuration for both source and sink connectors. Not doing so will cause issues. For example, if you produce messages without a schema and use value.converter.schemas.enable=true in your sink configuration, Kafka Connect will fail to process those messages.

ERROR HANDLING IN KAFKA CONNECT

GOAL: Handle errors in your Kafka Connect data pipeline.
PATTERN: Use a dead-letter queue.
ANTI-PATTERN: Using the default configuration, thereby ignoring errors.

When you're stitching together multiple systems using Kafka and building complex data processing pipelines, errors are inevitable. It's important to plan how you want to handle them, depending on your specific requirements. Apart from exceptional scenarios, you don't want your data pipeline to terminate just because there was an error. But, by default, Kafka Connect is configured to do exactly that:

errors.tolerance=none

It does what it says and does not tolerate any errors; the Kafka Connect task shuts down as soon as it encounters an error. To avoid this, you can use:

errors.tolerance=all

But this is not useful in isolation. You should also configure your connector to use a dead-letter queue, a topic to which Kafka Connect can automatically route messages it failed to process. You just need to provide a name for that topic in the Kafka Connect config:

errors.tolerance=all
errors.deadletterqueue.topic.name=<name of the topic>

Since it's a standard Kafka topic, you have flexibility in terms of how you want to introspect and potentially (re)process failed messages. Additionally, you would also want these failures to surface in your Kafka Connect logs. To enable this, add the following config:

errors.log.enable=true

An even better option is to embed the failure reason in the message headers. All you need is to add this configuration:

errors.deadletterqueue.context.headers.enable=true

This will provide additional context and details about the error so that you can use them in your re-processing logic.
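Putting the error-handling settings together, the relevant portion of a sink connector configuration might look like the following sketch; the topic name and replication factor value are placeholders (the replication factor key is a separate Kafka Connect config, which defaults to 3):

errors.tolerance=all
errors.deadletterqueue.topic.name=my-connector-dlq
errors.deadletterqueue.topic.replication.factor=1
errors.log.enable=true
errors.deadletterqueue.context.headers.enable=true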
KAFKA STREAMS

This section introduces some advanced options to help you use the Kafka Streams library in large-scale stream processing scenarios.

REBALANCES AND THEIR IMPACT ON INTERACTIVE QUERIES

GOAL: Minimize recovery/migration time for large state stores during a rebalance.
PATTERN: Use standby replicas.
ANTI-PATTERN: Using the default configuration.

Kafka Streams maintains local state stores for stateful operations and backs them up to changelog topics in Kafka. With the default configuration (num.standby.replicas=0), a rebalance that moves a stateful task to another instance forces that instance to rebuild the state store from the changelog topic before it can serve interactive queries, which can take a long time for large stores. Standby replicas keep warm copies of the state on other instances so that a failover or rebalance can complete quickly.
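A minimal sketch of enabling standby replicas via the num.standby.replicas setting, assuming the StreamsConfig import shown earlier; the application ID and bootstrap address are placeholders:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// keep one warm copy of each state store on another instance so a
// rebalanced task can resume without a full restore from the changelog
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

KafkaStreams streams = new KafkaStreams(topology, props); // topology built elsewhere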
A common requirement for stream processing apps is to access an external SQL database, often to enrich streaming data with additional information: for example, fetching customer details from an existing customers table to supplement and enrich a stream of order information. The obvious solution is to query the database to get the information and add it to the existing stream record:

public Customer getCustomerInfo(String custID) {
    // query the customers table in a database
}
...

Calling out to the database for every record, however, ties your topology's latency and availability to an external system; a common alternative is to bring the reference data into a Kafka topic (e.g., with a CDC connector such as Debezium) and join against it within the topology.
GENERAL

The following patterns apply to Kafka in general and are not specific to Kafka Streams, Kafka Connect, etc.

AUTOMATIC TOPIC CREATION – BOON OR BANE?

Automatic topic creation (the broker setting auto.create.topics.enable) is enabled by default, and it creates topics with default settings such as num.partitions=1 and default.replication.factor=1. Keeping automatic topic creation enabled also means that you can end up with unwanted topics in your cluster. The reason is that topics can be created as a side effect of a simple client request, e.g., a producer or consumer referencing a misspelled topic name.

Automatic topic creation creates topics with a cleanup policy set to delete. This means that if you wanted to create a log-compacted topic, you will be in for a surprise!
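One way to avoid both surprises, assuming you control the broker configuration (all names and sizes below are placeholders), is to disable automatic creation and create topics explicitly with the settings you actually want:

# broker configuration (e.g., server.properties)
auto.create.topics.enable=false

# create a log-compacted topic explicitly
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic customers --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact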
HOW MANY IN-SYNC REPLICAS DO YOU NEED?

GOAL: While producing a message, be sure that it has been sent to Kafka.
PATTERN: Combine acks=all with the min.insync.replicas topic configuration.
ANTI-PATTERN: Relying on the default configuration (min.insync.replicas=1).

min.insync.replicas specifies the minimum number of in-sync replicas that must acknowledge a write (when the producer uses acks=all) for it to be considered successful. For example, set min.insync.replicas=2 on a topic with a replication factor of three. This way, Kafka will wait for acknowledgement from two in-sync replica nodes (including the leader), which implies that you can withstand the loss of one broker before Kafka stops accepting writes (due to a lack of minimum in-sync replicas).
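For example, you could apply the setting to an existing topic with the configs tool that ships with Kafka; the bootstrap address and topic name are placeholders:

kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=2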
ADDITIONAL RESOURCES

• Kafka broker configuration – https://kafka.apache.org/
• How to build Kafka connectors – https://kafka.apache.org/documentation/#connect_development

ABHISHEK GUPTA, PRINCIPAL DEVELOPER ADVOCATE, AWS

Over the course of his career, Abhishek has worn multiple hats including engineering, product management, and developer advocacy. Most of his work has revolved around open-source technologies including distributed data systems and cloud-native platforms. Abhishek is also an open-source contributor and avid blogger.