

Apache Kafka Patterns and Anti-Patterns

ABHISHEK GUPTA
PRINCIPAL DEVELOPER ADVOCATE, AMAZON WEB SERVICES (AWS)

CONTENTS

•  Overview of the Apache Kafka Ecosystem
•  Common Apache Kafka Patterns and Anti-Patterns
   −  Kafka Client API – Producer
   −  Kafka Client API – Consumer
   −  Kafka Connect
   −  Kafka Streams
   −  General
•  Conclusion

Apache Kafka is a distributed streaming platform that originated at LinkedIn and has been a top-level Apache Software Foundation (ASF) project since 2012. At its core, Apache Kafka is a message broker that allows clients to publish and read streams of data (also called events). It has an ecosystem of open-source components, which, when combined, help you store these data streams, process them, and integrate them with other parts of your system in a secure, reliable, and scalable manner.

After a brief overview, this Refcard dives into select patterns and anti-patterns spanning the Kafka Client APIs, Kafka Connect, and Kafka Streams, covering topics such as reliable messaging, scalability, error handling, and more.

Figure 1: Topic and Partitions in Kafka (source: https://kafka.apache.org/documentation/#introduction)

OVERVIEW OF THE APACHE KAFKA ECOSYSTEM
A Kafka broker (also referred to as a node) is the fundamental building block that runs the Kafka JVM process. Although a single Kafka broker will suffice for development purposes, production systems typically have three or more brokers (odd numbers such as 3, 5, 7, etc.) for high availability and scalability. These Kafka brokers form a cluster; within a cluster, each topic partition has a leader broker along with follower brokers that replicate data from the leader.

Client applications send messages to Kafka, and each message consists of a key (which can be null) and a value. These messages are stored and organized into topics. They are written to a topic in an append-only fashion (much like a commit log), i.e., a new message always ends up at the end of a topic. A topic has one or more partitions, and each message is placed in a specific partition based on the result of a hash function applied to the key (if the key is null, the message is directed to a partition in a round-robin manner). Data in each of these partitions is replicated across the Kafka cluster.
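To make the key-to-partition mapping concrete, here is a minimal sketch of a Java producer publishing a keyed message. The broker address, topic name, and key are illustrative assumptions rather than values taken from this Refcard, and class/import scaffolding is omitted for brevity (the producer classes live in the org.apache.kafka.clients.producer package).

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Records with the same key (here, a customer ID) are hashed to the same partition;
    // records with a null key are spread across partitions instead.
    producer.send(new ProducerRecord<>("orders-topic", "customer-42", "order payload"));
}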


Here is a summary of the key projects that are part of core Kafka:

1. Kafka Client APIs — Producer and Consumer APIs that allow external systems to write data to and read data from Kafka topics, respectively. Kafka has client libraries in many programming languages, with the Java client being part of the core Kafka project.

2. Kafka Connect — Provides a high-level framework to build connectors that help integrate Kafka with external systems. They allow us to move data from external systems to Kafka topics (source connectors) and from Kafka topics into external systems (sink connectors). Popular examples of connectors include the JDBC connector and Debezium.

3. Kafka Streams — A standalone Java library that provides distributed stream processing primitives on top of data in Kafka topics. It provides high-level APIs (DSL and Processor) with which you can create topologies to execute stateless transformations (map, filter, etc.) as well as stateful computations (joins, aggregations, etc.) on streaming data.

COMMON APACHE KAFKA PATTERNS AND ANTI-PATTERNS

This section will cover some of the common patterns — along with their respective anti-patterns — for the Kafka Producer and Consumer APIs, Kafka Connect, and Kafka Streams.

KAFKA CLIENT API – PRODUCER

The Kafka Producer API sends data to topics in a Kafka cluster. Here are a couple of patterns and anti-patterns to consider:

RELIABLE PRODUCER

GOAL: While producing a message, you want to ensure that it has been sent to Kafka.
PATTERN: Use the acks=all configuration for the producer.
ANTI-PATTERN: Using the default configuration (acks=1).

The acks property allows producer applications to specify the number of acknowledgments that the leader node should have received before considering a request complete. If you don't provide one explicitly, acks=1 is used by default. The client application will receive an acknowledgment as soon as the leader node receives the message and writes it to its local log. If the leader node fails before the message has been replicated to follower nodes, data is lost.

If you set acks=all (or -1), your application only receives a successful confirmation when all the in-sync replicas in the cluster have acknowledged the message. There is a trade-off between latency and reliability/durability here: Waiting for acknowledgment from all the in-sync replicas will incur more time, but the message will not be lost as long as at least one of the in-sync replicas is available. A related configuration is min.insync.replicas; guidance on this topic will be covered later in this Refcard.

NO MORE DUPLICATES

GOAL: The producer needs to be idempotent because your application cannot tolerate duplicate messages.
PATTERN: Set enable.idempotence=true.
ANTI-PATTERN: Using the default configuration.

It is possible that the producer application may end up sending the same message to Kafka more than once. Imagine a scenario where the message is actually received by the leader (and replicated to in-sync replicas if acks=all is used), but the application does not receive the acknowledgement from the leader due to a request timeout, or maybe the leader node just crashed. The producer will try to resend the message — if it succeeds, you will end up with duplicate messages in Kafka. Depending upon your downstream systems, this may not be acceptable.

The Producer API provides a simple way to avoid this by using the enable.idempotence property (which is set to false by default in versions prior to 3.0). When set to true, the producer attaches a sequence number to every message. This is validated by the broker so that a message with a duplicate sequence number will get rejected.

From Apache Kafka 3.0 onwards, acks=all and enable.idempotence=true are set by default, thereby providing strong delivery guarantees for the producer.
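As a sketch of what this looks like in producer configuration, the two properties below can be added to the kind of Properties-based setup shown earlier; the values are the ones discussed in this section.

props.put("acks", "all");                 // wait for all in-sync replicas to acknowledge the write
props.put("enable.idempotence", "true");  // broker rejects messages carrying a duplicate sequence number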
KAFKA CLIENT API – CONSUMER

With the Kafka Consumer API, applications can read data from topics in a Kafka cluster.

IDLE CONSUMER INSTANCES

GOAL: Scale out your data processing pipeline.
PATTERN: Run multiple instances of your consumer application.
ANTI-PATTERN: The number of consumer instances is more than the number of topic partitions.

A Kafka consumer group is a set of consumers that ingest data from one or more topics. The topic partitions are load-balanced among consumers in the group. This load distribution is managed on the fly when new consumer instances are added to or removed from a consumer group. For example, if there are ten topic partitions and five consumers in a consumer group for that topic, Kafka will make sure that each consumer instance receives data from two topic partitions of the topic.
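As a rough sketch of this scaling pattern, every instance of the consumer application below joins the same consumer group and is assigned a subset of the topic's partitions. The group ID, topic name, and broker address are assumptions for illustration; class and import scaffolding is omitted.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
props.put("group.id", "order-processors");          // all instances share this group ID
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("orders-topic"));

while (true) {
    // each instance only receives records from the partitions assigned to it
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
}

With the ten-partition topic from the example above, an eleventh instance of this application would simply sit idle, which is exactly the anti-pattern described here.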


You can end up with a mismatch between the number of consumer instances and topic partitions. This could be due to incorrect topic configuration, wherein the number of partitions is set to one. Or, maybe your consumer applications are packaged using Docker and operated on top of an orchestration platform such as Kubernetes, which can, in turn, be configured to auto-scale them.

Keep in mind: You might end up with more instances than partitions. You need to be mindful of the fact that such instances remain inactive and do not participate in processing data from Kafka. Thus, the degree of consumer parallelism is directly proportional to the number of topic partitions. In the best-case scenario, for a topic with N partitions, you can have N instances in a consumer group, each processing data from a single topic partition.

Figure 2: Inactive consumers

COMMITTING OFFSETS: AUTOMATIC OR MANUAL?

GOAL: Avoid duplicates and/or data loss while processing data from Kafka.
PATTERN: Set enable.auto.commit to false and use manual offset management.
ANTI-PATTERN: Using the default configuration with automatic offset management.

Consumers acknowledge the receipt (and processing) of messages by committing the offset of the message they have read. By default, enable.auto.commit is set to true for consumer apps, which implies that the offsets are automatically committed asynchronously (for example, by a background thread in the Java consumer client) at regular intervals (defined by the auto.commit.interval.ms property that defaults to 5 seconds). While this is convenient, it allows for data loss and/or duplicate message processing.

Duplicate messages: Consider a scenario where the consumer app has read and processed messages from offsets 198, 199, and 200 of a topic partition — and the automatic commit process was able to successfully commit offset 198 but then crashed/shut down after that. This will trigger a rebalance to another consumer app instance (if available), and it will look for the last committed offset, which in this case was 198. Hence, the messages at offsets 199 and 200 will be redelivered to the consumer app.

Data loss: The consumer app has read the messages for offsets 198, 199, and 200. The auto-commit process commits these offsets before the application is able to actually process these messages (perhaps run them through some transformation and store the result in a downstream system), and the consumer app crashes. In this situation, the new consumer app instance will see that the last committed offset is 200 and will continue reading new messages from thereon. Messages from offsets 198, 199, and 200 were effectively lost.

To have greater control over the commit process, you need to explicitly set enable.auto.commit to false and handle the commit process manually. The manual commit API offers synchronous and asynchronous options, and as expected, each of these has its trade-offs.

The code block below shows how to explicitly commit offsets using the synchronous API, once per partition after that partition's records have been processed:

try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(Long.MAX_VALUE));
        for (TopicPartition partition : records.partitions()) {
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            for (ConsumerRecord<String, String> record : partitionRecords) {
                // process the record (here, simply print it)
                System.out.println(record.offset() + ": " + record.value());
            }
            // commit the offset of the next record to be read for this partition
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}
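Note that this pattern also requires disabling automatic commits in the consumer configuration, for example (a one-line sketch, continuing the Properties-based setup shown earlier):

props.put("enable.auto.commit", "false"); // offsets are committed only by the explicit commitSync() calls shown above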
KAFKA CONNECT

Thanks to the Kafka Connect API, there is a plethora of ready-to-use connectors. But you need to be careful about some of its caveats:

HANDLING JSON MESSAGES

GOAL: Read/write JSON messages from/to Kafka using Kafka Connect.
PATTERN: Use a Schema Registry and an appropriate JSON schema converter implementation.
ANTI-PATTERN: Embedding a schema with every JSON message or not enforcing a schema at all.


Although JSON is a common message format, it does not have a strict schema associated with it. By design, Kafka producer and consumer apps are decoupled from each other. Imagine a scenario where your producer applications introduce additional fields to the JSON payload/events, and your downstream consumer applications are not equipped to handle that and hence fail — this can break your entire data processing pipeline.

For a production-grade Kafka Connect setup, it's imperative that you use a Schema Registry to provide a contract between producers and consumers while still keeping them decoupled. For source connectors, if you want data to be fetched from an external system and stored in Kafka as JSON, you should configure the connector to point to a Schema Registry and also use an appropriate converter. For example:

value.converter=<fully qualified class name of json schema converter implementation>
value.converter.schema.registry.url=<schema registry endpoint e.g. http://localhost:8081>

When reading data from a Kafka topic, the sink connector also needs the same configuration as above.
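As a hedged illustration of what that configuration might look like when running Confluent Schema Registry with its JSON Schema converter (the converter class and endpoint below are assumptions tied to that particular setup, not the only valid choice):

value.converter=io.confluent.connect.json.JsonSchemaConverter
value.converter.schema.registry.url=http://localhost:8081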
Connect task shuts down as soon as it encounters an error. To avoid
However, if you are not using a Schema Registry, the next best option is this, you can use:
to use JSON converter implementation, which is native to Kafka. In this
case, you would configure your source and sink connectors as follows: errors.tolerance=all

value.converter=org.apache.kafka.connect.json.
But this is not useful in isolation. You should also configure your
JsonConverter
connector to use a dead-letter queue, a topic to which Kafka Connect
value.converter.schemas.enable=true
can automatically route messages it failed to process. You just need to
provide a name for that topic in the Kafka Connect config:
Thanks to value.converter.schemas.enable=true, the source
connector will add an embedded schema payload to each of your JSON
errors.tolerance=all
messages, and the sink connector will respect that schema as well. An errors.deadletterqueue.topic.name=<name of the
obvious drawback here is the fact that you have schema information topic>
in every message. This will increase the size of the message and can
impact latency, performance, costs, etc. As always, this is a trade-off Since it’s a standard Kafka topic, you have flexibility in terms of how
that you need to accept. you want to introspect and potentially (re)process failed messages.
Additionally, you would also want these to surface in your Kafka
If the above is unacceptable, you will need to make a different trade-off
Connect logs. To enable this, add the following config:
with the following configuration:

value.converter=org.apache.kafka.connect.json. errors.log.enable=true
JsonConverter
value.converter.schemas.enable=false An even better option would be to embed the failure reason in the
message. All you need is to add this configuration:
Now, your JSON messages will be treated as ordinary strings, hence
prone to the aforementioned risks to your data processing pipeline. errors.deadletterqueue.context.headers.enable=true
Evolving the structure of your messages will involve scrutinizing and (re)
developing your consumer apps to ensure they don’t break in response This will provide additional context and details about the error so that
to changes — you need to constantly keep them in-sync (manually). you can use it in your re-processing logic.
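Putting these error-handling settings together, a sink connector configuration might include a fragment like the one below; the dead-letter topic name and replication factor are illustrative choices rather than values mandated by this Refcard.

errors.tolerance=all
errors.deadletterqueue.topic.name=dlq-orders-sink
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true
errors.log.enable=true
errors.log.include.messages=true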


KAFKA STREAMS

This section introduces some advanced options to help you use the Kafka Streams library in large-scale stream processing scenarios.

REBALANCES AND THEIR IMPACT ON INTERACTIVE QUERIES

GOAL: Minimize recovery/migration time for large state stores during a rebalance.
PATTERN: Use standby replicas.
ANTI-PATTERN: Using the default configuration.

Kafka Streams provides state stores to support stateful stream processing semantics — these can be combined with interactive queries to build powerful applications whose local state can be accessed externally (by an RPC layer such as an HTTP or gRPC API).

These state stores are fault-tolerant since their data is replicated to changelog topics in Kafka, and updates to the state stores are tracked and kept up to date in Kafka. In case of failure or restart of a Kafka Streams app instance, new or existing instances fetch the state store data from Kafka. As a result, you can continue to query your application state using interactive queries.

However, depending on the data volume, these state stores can get quite large (in the order of tens of GBs). A rebalance event in such a case will result in a large amount of data being replayed and/or restored from the changelog topics — this can take a lot of time. During this timeframe, the state of your Kafka Streams apps is not available via interactive queries. It's similar to a "stop-the-world" pause during JVM garbage collection. Depending upon your use case, the non-availability of state stores might be unacceptable.

To minimize the downtime in such cases, you can enable standby replicas for your Kafka Streams application. By setting the num.standby.replicas config (defaults to 0), you can ask Kafka Streams to maintain additional instances that simply keep a backup of the state stores of your active app instances (by reading from the changelog topics in Kafka). In case of a rebalance due to restart or failure, these standby replicas act as "warm" backups and are available for serving interactive queries — this reduces the failover time.
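A minimal sketch of enabling standby replicas in the Kafka Streams configuration follows; the application ID and broker address are placeholders, not values prescribed here.

Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enrichment-app"); // assumed application ID
streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // assumed broker address
// keep one "warm" copy of each local state store on another app instance
streamsProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);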
STREAM-TABLE JOIN IN KAFKA STREAMS

GOAL: Enrich streaming data in your Kafka Streams application.
PATTERN: Use a stream-table join.
ANTI-PATTERN: Invoking external data store(s) for every event in the stream.

A common requirement for stream processing apps is to access an external SQL database, often to enrich streaming data with additional information — for example, fetching customer details from an existing customers table to supplement and enrich a stream of order information.

The obvious solution is to query the database to get the information and add it to the existing stream record:

public Customer getCustomerInfo(String custID) {
    //query customers table in a database
}
.....

//input KStream contains customer ID (String) and Order info (POJO)
KStream<String, Order> orders = builder.stream("orders-topic");

//enrich each order with a remote database lookup (the anti-pattern)
KStream<String, Order> enrichedOrders = orders.mapValues((custID, order) -> {
    Customer cust = getCustomerInfo(custID);
    order.setCustomerEmail(cust.getEmail());
    return order;
});

//write to new topic
enrichedOrders.to("orders-enriched-topic");

This is not a viable choice, especially for medium- to large-scale applications. The latency incurred by the database invocation for each and every record in your stream will most likely create pressure on downstream applications and affect the overall performance SLA of your system.

The preferred way of achieving this is via a stream-table join. First, you will need to source the data (and subsequent changes to it) from the SQL database into Kafka. This can be done by writing a traditional client application to query and push data into Kafka using the Producer API — but a better solution is to use a Kafka Connect connector such as the JDBC source or, even better, a CDC-based connector such as Debezium.

Once the data is in Kafka topics, you can use a KTable to read that data into the local state store. This also takes care of updating the local state store since we have a pipeline already created wherein database changes will be sent to Kafka. Now, our KStream can access this local state store to enrich the streaming data with additional content — this is much more efficient than remote database queries.

KStream<String, Order> orders = ...;
KTable<String, Customer> customers = ...;

KStream<String, Order> enriched = orders.join(customers,
    (order, cust) -> {
        order.setCustomerEmail(cust.getEmail());
        return order;
    }
);
enriched.to("orders-enriched-topic");
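The placeholders above could be wired up roughly as follows; this is a sketch that assumes the topic names used earlier and that appropriate default Serdes for Order and Customer are configured.

StreamsBuilder builder = new StreamsBuilder();

// stream of orders keyed by customer ID
KStream<String, Order> orders = builder.stream("orders-topic");

// table of customer data, continuously updated from the database via Kafka Connect/Debezium
KTable<String, Customer> customers = builder.table("customers-topic");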


GENERAL

The following patterns apply to Kafka in general and are not specific to Kafka Streams, Kafka Connect, etc.

AUTOMATIC TOPIC CREATION – BOON OR BANE?

GOAL: Create topics keeping reliability and high availability in mind.
PATTERN: Disable automatic topic creation and provide explicit configuration while creating topics.
ANTI-PATTERN: Relying on automatic topic creation.

Kafka topic configuration properties (such as replication factor, partition count, etc.) have a server default that you can optionally override on a per-topic basis. In the absence of explicit configuration, the server default is used. This is why you need to be mindful of the auto.create.topics.enable configuration of your Kafka broker. It is set to true by default, and it creates topics with default settings such as:

1. The replication factor is set to 1 — this is not good from a high-availability and reliability perspective. The recommended replication factor is 3, so that your system can tolerate the loss of two brokers.

2. The partition count is set to 1 — this severely limits the performance of your Kafka client apps. For example, you can only have one instance of a consumer app (in a consumer group).

Keeping automatic topic creation enabled also means that you can end up with unwanted topics in your cluster. The reason is that topics (that don't yet exist) referenced by a producer application and/or subscribed to by a consumer application will automatically get created.

Automatically created topics also have a cleanup policy set to delete. This means that if you wanted to create a log-compacted topic, you will be in for a surprise!
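For example, a topic can be created explicitly from code using the Kafka Admin API. The topic name, partition count, and cleanup policy below are illustrative choices rather than recommendations from this Refcard, and adminProps is assumed to contain the connection settings (bootstrap.servers, etc.) for your cluster.

AdminClient admin = AdminClient.create(adminProps);

NewTopic ordersTopic = new NewTopic("orders-topic", 6, (short) 3); // 6 partitions, replication factor 3
// optionally override topic-level settings, e.g., for a log-compacted topic:
ordersTopic.configs(Map.of("cleanup.policy", "compact"));

admin.createTopics(Collections.singleton(ordersTopic)).all().get(); // blocks until the topic is created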
HOW MANY IN-SYNC REPLICAS DO YOU NEED?

GOAL: While producing a message, be sure that it has been sent to Kafka.
PATTERN: Specify minimum in-sync replicas along with the acks configuration.
ANTI-PATTERN: Only relying on the acks configuration.

When tuning your producer application for strong reliability, the min.insync.replicas configuration works hand in hand with the acks property (discussed earlier in this Refcard). It's a broker-level configuration that can be overridden at the topic level and whose value is set to 1 by default.

As a rule of thumb, for a standard Kafka cluster with three brokers and a topic replication factor of 3, min.insync.replicas should be set to 2. This way, Kafka will wait for acknowledgement from two in-sync replica nodes (including the leader), which implies that you can withstand the loss of one broker before Kafka stops accepting writes (due to a lack of minimum in-sync replicas).
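The topic-level override can likewise be applied through the Admin API; the topic name is a placeholder, and admin is assumed to be an AdminClient created as in the earlier sketch.

ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders-topic");
AlterConfigOp setMinIsr = new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);

// apply the topic-level min.insync.replicas override
admin.incrementalAlterConfigs(Map.of(topic, List.of(setMinIsr))).all().get();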

CONCLUSION

Apache Kafka has a rich ecosystem of projects and APIs. Each of these offers a lot of flexibility in the form of configuration options such that you can tune these components to best fit your use case and requirements. This Refcard covered a few of them. I encourage you to refer to the official Apache Kafka documentation for a deep dive into these areas.

REFERENCES AND RESOURCES

•  Kafka Java client documentation – https://javadoc.io/doc/org.apache.kafka/kafka-clients/latest/index.html
•  Kafka Streams developer manual – https://kafka.apache.org/31/documentation/streams/developer-guide
•  Kafka broker configuration – https://kafka.apache.org/documentation/#brokerconfigs
•  Topic configurations – https://kafka.apache.org/documentation/#topicconfigs
•  Producer configurations – https://kafka.apache.org/documentation/#producerconfigs
•  Consumer configurations – https://kafka.apache.org/documentation/#consumerconfigs
•  How to build Kafka connectors – https://kafka.apache.org/documentation/#connect_development

ABHISHEK GUPTA, PRINCIPAL DEVELOPER ADVOCATE, AWS

Over the course of his career, Abhishek has worn multiple hats including engineering, product management, and developer advocacy. Most of his work has revolved around open-source technologies, including distributed data systems and cloud-native platforms. Abhishek is also an open-source contributor and avid blogger.
