Kafka Course


Kafka - Spark streaming - ESGI 2020.

● What is Kafka
● Motivations
● Kafka architecture
● Start Kafka
● Produce and consume messages


What is Kafka ?

● Kafka
● Confluent

LinkedIn before Kafka



LinkedIn after Kafka



What is Kafka ?

● MOM (Message Oriented Middleware)


● Used to publish and subscribe to streams of
records
● It’s Scalable
● It’s Polyglot
● It’s Fast



What about today ?

● Created by LinkedIn in 2009


● Open source since 2011
● Part of the Apache foundation
○ Very active community
○ Current version 2.3.0

● Spinoff company Confluent created in 2014


○ Jay Kreps, Neha Narkhede and Jun Rao
○ Created the Confluent Platform
○ Valued at several billion dollars ($2.5 billion as of 23/01/2019)



What is Kafka ?

● Message bus
○ Written in Scala
○ Heavily inspired by transaction logs
● Initially created at LinkedIn in 2010
○ Open sourced in 2011
○ Became an Apache top-level project in 2012
● Designed to support batch and real time analytics
● Performs very well, especially at very large scale
What is Confluent ?

● Founded in 2014 by the creators of Kafka


● Provides support, training etc. for Kafka
● Provides the Confluent Platform
○ A lot of products to work with Kafka, to produce messages, transform data etc.
Motivations

● Traditional systems
● The importance of real time
● The birth of Kafka

Traditional systems

● In a traditional system, data is dispatched across several data stores


○ A database, HDFS etc.
○ Each producer implements its own transformation logic and writes into the store
● Over time, the system grows
○ The codebase grows and becomes hard to maintain
Traditional systems

● At first, it is easy to connect several systems and several data sources to databases
Traditional systems

● But eventually it becomes hard to maintain


The importance of real time

● Batch processing is traditional and well known


○ We use this approach with Spark
○ Every day, week etc. I run my batch processing
● But it implies a strong restriction
○ I need to wait for the batch to finish before starting data analysis
The importance of real time

● Nowadays, it is really common to have real time processing needs


○ Fraud detection
○ Recommender systems
○ Log monitoring
○ Real-time feeds into HDFS
○ etc.
Kafka

● Kafka has been created to solve 2 issues


○ Simplify the architecture of data flows
○ Handle data streaming

● Kafka separates data production and consumption


○ In traditional systems, both are usually tied into one application
○ “Publish” / “Subscribe” concepts
Kafka

● Kafka is designed to work in a cluster


● A cluster is a set of instances (nodes) that know each other
Kafka

● Once the data is in Kafka, it can be read by several different consumers


○ A consumer that writes to HDFS, another one applying an alerting process etc.
● Increasing the number of consumers does not have any significant impact
on performance
● A consumer can be added without touching the producer
The architecture

● Fundamentals
● Producing messages
● Partitioning
● Consuming messages
● Zookeeper
Fundamentals

● Data sent to Kafka takes the form of messages


○ Each message is a key / value pair
○ By default, messages do not have any schema
● Each message is written in a topic
○ It is a way to group messages
○ Very close (conceptually) to a message queue
● Topics can be created in advance or dynamically by the producers
Fundamentals

● The 4 key components of Kafka are


○ Producers
○ Brokers
○ Consumers
○ Zookeeper
The producer

● It has the task of sending messages to the Kafka cluster


● One can write a producer in a lot of programming languages
○ Java, C, Python, Scala etc. In our case, it will be Scala
About messages

● It is a key / value pair


● Keys and values can be of any type
○ You provide a serializer to tell the producer how to transform the data into a byte array
● The key is optional
○ It is used for partitioning (see that soon)
○ Without any key provided, the message can be written in any partition
About partitioning

● Topics are split into partitions


● Each partition contains a subset of the topic’s messages
● Kafka uses the hash of the key to choose the partition where the message will
be written
● Partitions are dispatched over the whole cluster
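
As a rough illustration of that idea (a sketch only; Kafka's actual default partitioner uses the murmur2 hash), choosing a partition boils down to hashing the serialized key and taking the result modulo the number of partitions:

  // Sketch: hash the key bytes and map the result onto one of the partitions.
  // Kafka's real implementation uses murmur2, not Arrays.hashCode.
  def choosePartition(keyBytes: Array[Byte], numPartitions: Int): Int =
    (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions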
The broker

● The broker is the heart of Kafka


● It receives messages and persists them
● Highly performant (can handle several millions of messages per second)
The broker

● A Kafka cluster usually contains several brokers


○ For development / testing purpose, we may work with only one
● Each broker handles one or several partitions
○ Partitions are dispatched over the whole cluster
The consumer

● It reads messages from Kafka


● Several consumers can read the same topic
○ Each consumer will receive all messages from the topic (default behaviour)
● It receives messages by pulling them from Kafka
○ Other products push messages to the consumers
○ The main advantage of pulling is that it does not overload the consumer (backpressure)
○ The consumer reads at its own speed
Zookeeper

● Apache project
● It is a configuration centralisation tool
● It is used by Kafka’s internals
Global architecture
Kafka versus

● HDFS & RDBMS
● CAP Theorem

HDFS & RDBMS

● Kafka is similar to products like RabbitMQ


○ RabbitMQ pushes messages to consumers
● Kafka can be used as a database, by modifying the message retention
duration
○ It is not its main purpose
○ It is hard to manipulate messages individually
● It acts as a kind of orchestrator, feeding different services and different
databases
○ Such as HDFS
HDFS

● Distributed file system


● Scales extremely well
○ Even when the cluster is composed of more than a thousand nodes
● Not so true for Cassandra or MongoDB
○ Beyond a certain number of nodes, performance decreases
CAP theorem

● Consistency, Availability, Partition tolerance


● A distributed system can satisfy at most 2 of those properties
○ A RDBMS is CA: since it is not scalable, it is not concerned by P

            C   A   P
Kafka       X   X
MongoDB     X       X
Cassandra       X   X
HDFS        X       X
Advanced architecture

● Partitions
● Commit log
● Consumer group and offset
● Replicas
Partitions

● Each topic is divided into one or several partitions

● Partitions are distributed over all the brokers in the cluster
Partitions

● With partitions, we can scale. Data is no longer centralised but distributed
● Inside the same partition, data is read in the same order it is written.
Order is guaranteed within the partition
● On the other hand, from the point of view of the topic, there is no order
guarantee between messages (coming from different partitions)
● This is why it is important to choose the right key if order matters
Commit log

● The data of each partition is persisted in a commit log


● Commonly implemented with a file in “append only” mode
○ Thus, data is immutable and reads / writes are highly efficient
● Also used by classical RDBMS
○ To trace all the changes that happen on tables
Consumer group

● Several consumers can consume together as a consumer group


○ They will not read the same messages from a given topic
○ They will share messages, a message will be read only once in the group
● Each consumer will read from one or several partitions
● Data from a partition will be read by only one consumer in the group
Consumer group

Consumers in a group share partitions; single consumers consume all the partitions
Consumer group

● The number of useful consumers is limited by the number of partitions


○ A useful consumer receives data
○ The others do not; they stay idle
Offset

● For each consumer group and each partition, Kafka keeps an offset (an
integer)
● It is the position of the last element read by a given consumer group in a
given partition
Offset

● When a consumer asks for messages, Kafka looks up the offset it keeps
for this consumer group (in each partition of the requested topic) and sends
the corresponding messages
● When a consumer gets a message, it commits it
● When a consumer commits, Kafka increments the offset for the given
partition
● We can ask Kafka to read from a specific offset. Thus the consumer can
consume from wherever it wants
Replicas

● It is possible (and recommended) to replicate partitions


● Replicas are perfect copies of main partitions

[Diagram: Broker 1 hosts Topic-Partition-1 and Topic-Replica-2; Broker 2 hosts Topic-Partition-2 and Topic-Replica-1]
Replicas

● If a broker is down, the replica becomes the leader partition, and thus we
can still consume / produce messages

[Diagram: Topic-Replica-1 on Broker 2 has been promoted to Topic-Partition-1; Broker 2 now hosts both Topic-Partition-2 and Topic-Partition-1]
Produce and consume

● Start Kafka
● Dependencies
● Produce
● Consume
Start Kafka

● Download Zookeeper and Kafka
○ https://www-us.apache.org/dist/zookeeper/current/zookeeper-3.4.12.tar.gz
○ https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.0/kafka_2.12-2.1.0.tgz
● bin/zookeeper-server-start.sh config/zookeeper.properties
● bin/kafka-server-start.sh ./config/server.properties
Console

● Kafka provides command line tools to manipulate topics, consume messages etc.
● To create a topic
○ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Console

● To produce a message
○ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
● To consume a topic
○ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Scala dependencies

● Kafka is available as a dependency for Scala projects


○ One can use Maven or SBT
○ With Maven :
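
A minimal pom.xml fragment might look like the following; the kafka-clients artifact is the standard producer / consumer client, and the version (2.1.0, matching the download above) is an assumption to adapt:

  <!-- kafka-clients: producer / consumer API; version assumed to match the broker -->
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.1.0</version>
  </dependency>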
Scala dependencies

● Kafka is available as a dependency for Scala projects


○ One can use Maven or SBT
○ With SBT :
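
The equivalent build.sbt line might be (same assumption on the version):

  // kafka-clients: producer / consumer API; version assumed to match the broker
  libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.1.0"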
Produce

● To start, we need to instantiate a producer (see the sketch after the configuration slide below)


Produce

● Then we need to configure the producer. There are 3 mandatory properties:


○ The address of at least one broker
○ The serializers for the key and the value
● Serializers for common types are provided by Kafka, and we can define our own
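
A minimal sketch covering the two previous slides, assuming a broker on localhost:9092 and String keys and values:

  import java.util.Properties
  import org.apache.kafka.clients.producer.KafkaProducer

  val props = new Properties()
  // address of at least one broker
  props.put("bootstrap.servers", "localhost:9092")
  // serializers for the key and the value
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)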
Produce

● Kafka provides a utility class to simplify the configuration
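
In the Java client this utility class is ProducerConfig, which exposes the property names as constants (whether this is exactly the class the slide refers to is an assumption); the same configuration written with it:

  import org.apache.kafka.clients.producer.ProducerConfig
  import org.apache.kafka.common.serialization.StringSerializer

  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)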


Produce

● There are a lot of possible parameters


● Everything is documented
Produce

● To send a message
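
A sketch, reusing the producer above and the test topic created earlier; the key and value are arbitrary:

  import org.apache.kafka.clients.producer.ProducerRecord

  // a message is a key / value pair, targeted at a topic
  val record = new ProducerRecord[String, String]("test", "my-key", "my-value")
  producer.send(record)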
Produce

● The call to producer.send() is asynchronous (non blocking)


● It does not block the calling code
● To force a synchronous (blocking) call, we need to call producer.send().get()
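
For example (a sketch, reusing the record above): send() returns a java.util.concurrent.Future[RecordMetadata], and calling get() on it blocks until the broker acknowledges the write:

  // blocks until the record is acknowledged, then exposes partition and offset
  val metadata = producer.send(record).get()
  println(s"partition=${metadata.partition()} offset=${metadata.offset()}")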
Produce

● To get the result, there are two ways


● The call producer.send() returns a Future
○ Unfortunately, it is a Java Future, hard to use in Scala
Produce

● The method producer.send() can also take a callback function as parameter
● When the call completes, the callback function is invoked
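
A sketch of the callback variant, reusing the record from above:

  import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

  producer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null) exception.printStackTrace()   // the send failed
      else println(s"written to partition ${metadata.partition()} at offset ${metadata.offset()}")
  })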
Consume

● As for the producer, we need to instantiate the consumer (see the sketch after the configuration slide below)


Consume

● As for the producer, we need to configure the consumer


● Another parameter is mandatory: the group id
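
A minimal sketch covering the two previous slides; the group id my-group is arbitrary:

  import java.util.Properties
  import org.apache.kafka.clients.consumer.KafkaConsumer

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("group.id", "my-group")   // mandatory: the consumer group id

  val consumer = new KafkaConsumer[String, String](props)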
Consume

● Several other parameters can be set


● For example, the parameter enable.auto.commit is used to tell the
consumer if it has to commit automatically. Otherwise, it has to be done
manually
○ If the property is set to true, the consumer commits every auto.commit.interval.ms
(5000 ms by default)
● By default, enable.auto.commit is set to true
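
For example (sketch), to disable auto commit, or to keep it and tune the interval:

  props.put("enable.auto.commit", "false")      // commits will be done manually
  // or keep the default (true) and adjust the commit interval:
  props.put("auto.commit.interval.ms", "5000")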
Consume

● Then, we need to subscribe to the topics we wish to consume


● Kafka will then dispatch partitions among all the consumers of a given group
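
A sketch, subscribing to the test topic used earlier (the Java client expects a Java collection):

  import scala.collection.JavaConverters._

  consumer.subscribe(List("test").asJava)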
Consume

● Then we can fetch the results


● The call to poll is synchronous. If no message is available, the
consumer waits at most the duration given as a parameter before giving control
back to the caller
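
A sketch of a poll loop, waiting at most one second per call:

  import java.time.Duration
  import scala.collection.JavaConverters._

  while (true) {
    // blocks for at most 1 second if no message is available
    val records = consumer.poll(Duration.ofSeconds(1))
    for (record <- records.asScala)
      println(s"key=${record.key()} value=${record.value()} offset=${record.offset()}")
  }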
Consume

● If we set the parameter enable.auto.commit to false, we have to commit manually;
otherwise we will read the same messages indefinitely
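
With auto commit disabled, a synchronous commit after processing might look like this (sketch):

  // after processing the records returned by poll()
  consumer.commitSync()   // blocks until the offsets of the last poll are committed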
Consume

● We can also commit asynchronously
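
The asynchronous variant takes an optional callback (sketch):

  import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
  import org.apache.kafka.common.TopicPartition

  consumer.commitAsync(new OffsetCommitCallback {
    override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                            exception: Exception): Unit =
      if (exception != null) exception.printStackTrace()   // the commit failed
  })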


Confluent ecosystem

● Schema registry
○ Offers the possibility to apply schemas to messages
● Kafka Streams
○ High level library (offers a DSL) to transform data between topics
○ Plays the role of T in ETL
● Kafka Connect
○ Offers connectors to feed Kafka with data or to push data from Kafka to other
systems
■ There are connectors for HDFS, the file system, Cassandra etc.
○ Plays the role of E in ETL if the connector is a source and L if it is a sink
● etc.
Kafka Streams

● High level API to consume and produce messages between topics


○ Is used to transform data
○ Kafka Streams also offers a low level API. We will concentrate on the high level API
● It is an alternative to
○ Spark Streaming
○ Apache Storm
○ Akka Streams
○ etc.
Kafka Streams

● Kafka Streams has 2 concepts


● KStream
○ The topic is seen as a data flow, where each record is independent from the others
● KTable
○ Similar to a changelog. Each record is seen as an update (depending on the key)
● For example, I have a topic with two elements (“euro”, 5) and (“euro”, 1)
○ If I create a KStream on this topic and sum the values in euros, I will get 6
○ If I create a KTable, I will get 1
Kafka Streams

● Kafka Streams offers the usual high level functions:


○ map
○ filter
○ groupByKey
○ count
○ etc.
Kafka Streams

● Simple example
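
The original code is not reproduced here; a minimal sketch of what such an example could look like, using the kafka-streams-scala DSL (a separate dependency, "org.apache.kafka" %% "kafka-streams-scala") and hypothetical topic names, copying messages from one topic to another while upper-casing the values:

  import java.util.Properties
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")   // hypothetical application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-topic")   // KStream over the input topic
    .mapValues(_.toUpperCase)                     // transform each value
    .to("output-topic")                           // write the result to another topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()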
Kafka Streams

● Word count example
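
Likewise, a sketch of a word count under the same assumptions (Scala DSL, hypothetical topic names): each value is split into words, the words are grouped and counted into a KTable, and the counts are written back to a topic:

  import java.util.Properties
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")   // hypothetical application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("sentences")
    .flatMapValues(_.toLowerCase.split("\\W+").toSeq)   // one record per word
    .groupBy((_, word) => word)                         // re-key by word
    .count()                                            // KTable[String, Long]
    .toStream
    .to("word-counts")

  new KafkaStreams(builder.build(), props).start()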
