Kafka Course


Kafka - Spark streaming - ESGI 2020.

● What is Kafka
● Motivations
● Kafka architecture
● Start Kafka
● Produce and consume messages


What is Kafka ?

● Kafka
● Confluent

LinkedIn before Kafka



LinkedIn after Kafka



What is Kafka ?

● MOM (Message Oriented Middleware)


● Used to publish and subscribe to streams of
records
● It’s Scalable
● It’s Polyglot
● It’s Fast



What about today ?

● Created by LinkedIn in 2009


● Open source since 2011
● Part of the Apache foundation
○ Very active community
○ Current version 2.3.0

● Spinoff company Confluent created in 2014


○ Jay Kreps, Neha Narkhede and Jun Rao
○ Created the Confluent Platform
○ Valued at several billion dollars ($2.5 billion as of 23/01/2019)



What is Kafka ?

● Message bus
○ Written in Scala
○ Heavily inspired by transaction logs
● Initially created at LinkedIn in 2010
○ Open sourced in 2011
○ Became an Apache top-level project in 2012
● Designed to support batch and real time analytics
● Performs very well, especially at very large scale
What is Confluent ?

● Founded in 2014 by the creators of Kafka


● Provides support, training etc. for Kafka
● Provides the Confluent Platform
○ A lot of products to work with Kafka, to produce messages, transform data etc.
Motivations

● Traditional systems
● The importance of real time
● The birth of Kafka

Traditional systems

● In a traditional system, data is dispatched across several data stores


○ A database, HDFS etc.
○ Each producer implements its own transformation logic and writes into the store
● Over time, the system grows
○ The codebase grows and becomes hard to maintain
Traditional systems

● At first, it is easy to connect several systems and several data sources to databases
Traditional systems

● But eventually it becomes hard to maintain


The importance of real time

● Batch processing is traditional and well known


○ We use this approach with Spark
○ Every day, week etc. I run my batch processing
● But it implies a strong restriction
○ I need to wait for the batch to finish before starting data analysis
The importance of real time

● Nowadays, it is really common to have real time processing needs


○ Fraud detection
○ Recommender systems
○ Log monitoring
○ Real-time feeds into HDFS
○ etc.
Kafka

● Kafka has been created to solve 2 issues


○ Simplify the architecture of data flows
○ Handle data streaming

● Kafka separates data production and consumption


○ In traditional systems, both are usually tied into one application
○ “Publish” / “Subscribe” concepts
Kafka

● Kafka is designed to work in a cluster


● A cluster is a set of instances (nodes) that know each other
Kafka

● Once the data is in Kafka, it can be read by several different consumers


○ A consumer that writes to HDFS, another one applying an alerting process etc.
● Increasing the number of consumers does not have any significant impact
on performance
● A consumer can be added without touching the producer
The architecture

● Fundamentals
● Producing messages
● Partitioning
● Consuming messages
● Zookeeper
Fundamentals

● Data sent to Kafka takes the form of messages


○ Each message is a key / value pair
○ By default, messages do not have any schema
● Each message is written in a topic
○ It is a way to group messages
○ Very close (conceptually) to a message queue
● Topics can be created in advance or dynamically by the producers
Fundamentals

● The 4 key components of Kafka are


○ Producers
○ Brokers
○ Consumers
○ Zookeeper
The producer

● It has the task of sending messages to the Kafka cluster


● One can write a producer in a lot of programming languages
○ Java, C, Python, Scala etc. In our case, it will be Scala
About messages

● It is a key / value pair


● Keys and values can be of any type
○ You provide a serializer to tell the producer how to transform the data into a byte array
● The key is optional
○ It is used for partitioning (see that soon)
○ Without any key provided, the message can be written in any partition
About partitioning

● Topics are split into partitions


● Each partition contains a subset of the topic’s messages
● Kafka uses the hash of the key to choose the partition where the message will
be written
● Partitions are dispatched over the whole cluster
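
As a rough illustration of that idea (a sketch only; Kafka's actual default partitioner uses the murmur2 hash), choosing a partition boils down to hashing the serialized key and taking the result modulo the number of partitions:

  // Sketch: hash the key bytes and map the result onto one of the partitions.
  // Kafka's real implementation uses murmur2, not Arrays.hashCode.
  def choosePartition(keyBytes: Array[Byte], numPartitions: Int): Int =
    (java.util.Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions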
The broker

● The broker is the heart of Kafka


● It receives messages and persists them
● Highly performant (can handle several millions of messages per second)
The broker

● A Kafka cluster usually contains several brokers


○ For development / testing purpose, we may work with only one
● Each broker handles one or several partitions
○ Partitions are dispatched over the whole cluster
The consumer

● It reads messages from Kafka


● Several consumers can read the same topic
○ Each consumer will receive all messages from the topic (default behaviour)
● It receives messages by pulling them from Kafka
○ Other products push messages to the consumers
○ The main advantage of pulling is that it does not overload the consumer (backpressure)
○ The consumer reads at its own speed
Zookeeper

● Apache project
● It is a configuration centralisation tool
● It is used by Kafka’s internals
Global architecture
Kafka versus

● HDFS & RDBMS
● CAP Theorem

HDFS & RDBMS

● Kafka is similar to products like RabbitMQ


○ RabbitMQ pushes messages to consumers
● Kafka can be used as a database, by modifying the message retention
duration
○ It is not its main purpose
○ It is hard to manipulate messages individually
● It acts as a kind of orchestrator, feeding different services and different
databases
○ Such as HDFS
HDFS

● Distributed file system


● Scales extremely well
○ Even when the cluster is composed of more than a thousand nodes
● Not so true for Cassandra or MongoDB
○ Beyond a certain number of nodes, performance decreases
CAP theorem

● Consistency, Availability, Partition tolerance


● A distributed system can satisfy at most 2 of those properties
○ A RDBMS is CA: since it is not scalable, it is not concerned by P

            C   A   P
Kafka       X   X
MongoDB     X       X
Cassandra       X   X
HDFS        X       X
Advanced architecture

● Partitions
● Commit log
● Consumer group and offset
● Replicas
Partitions

● Each topic is divided into one or several partitions

● Partitions are distributed over all the brokers in the cluster
Partitions

● With partitions, we can scale. Data is no longer centralised but distributed
● Inside the same partition, data is read in the same order it is written.
Order is guaranteed within the partition
● On the other hand, from the point of view of the topic, there is no order
guarantee between messages (coming from different partitions)
● This is why it is important to choose the right key if order matters
Commit log

● The data of each partition is persisted in a commit log


● Commonly implemented with a file in “append only” mode
○ Thus, data is immutable and reads / writes are highly efficient
● Also used by classical RDBMS
○ To trace all the changes that happen on tables
Consumer group

● Several consumers can consume together as a consumer group


○ They will not read the same messages from a given topic
○ They will share messages, a message will be read only once in the group
● Each consumer will read from one or several partitions
● Data from a partition will be read by only one consumer in the group
Consumer group

Consumers in a group share partitions; single consumers consume all the partitions
Consumer group

● The number of useful consumers is limited by the number of partitions


○ A useful consumer receives data
○ The others do not; they stay idle
Offset

● For each consumer group and each partition, Kafka keeps an offset (an
integer)
● It is the position of the last element read by a given consumer group in a
given partition
Offset

● When a consumer asks for messages, Kafka looks up the offset it keeps
for this consumer group (in each partition of the requested topic) and sends
the corresponding messages
● When a consumer gets a message, it commits it
● When a consumer commits, Kafka increments the offset for the given
partition
● We can ask Kafka to read from a specific offset. Thus the consumer can
consume from wherever it wants
Replicas

● It is possible (and recommended) to replicate partitions


● Replicas are perfect copies of main partitions

[Diagram: Broker 1 hosts Topic-Partition-1 and Topic-Replica-2; Broker 2 hosts Topic-Partition-2 and Topic-Replica-1]
Replicas

● If a broker is down, the replica becomes the leader partition, and thus we
can still consume / produce messages

[Diagram: Topic-Replica-1 on Broker 2 has been promoted to Topic-Partition-1; Broker 2 now hosts both Topic-Partition-2 and Topic-Partition-1]
Produce and consume

● Start Kafka
● Dependencies
● Produce
● Consume
Start Kafka

● Download Zookeeper and Kafka
○ https://www-us.apache.org/dist/zookeeper/current/zookeeper-3.4.12.tar.gz
○ https://www.apache.org/dyn/closer.cgi?path=/kafka/2.1.0/kafka_2.12-2.1.0.tgz
● bin/zookeeper-server-start.sh config/zookeeper.properties
● bin/kafka-server-start.sh ./config/server.properties
Console

● Kafka provides command line tools to manipulate topics, consume messages etc.
● To create a topic
○ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Console

● To produce a message
○ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
● To consume a topic
○ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
Scala dependencies

● Kafka is available as a dependency for Scala projects


○ One can use Maven or SBT
○ With Maven :
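
A minimal pom.xml fragment might look like the following; the kafka-clients artifact is the standard producer / consumer client, and the version (2.1.0, matching the download above) is an assumption to adapt:

  <!-- kafka-clients: producer / consumer API; version assumed to match the broker -->
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.1.0</version>
  </dependency>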
Scala dependencies

● Kafka is available as a dependency for Scala projects


○ One can use Maven or SBT
○ With SBT :
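
The equivalent build.sbt line might be (same assumption on the version):

  // kafka-clients: producer / consumer API; version assumed to match the broker
  libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.1.0"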
Produce

● To start, we need to instantiate a producer (see the sketch after the configuration slide below)


Produce

● Then we need to configure the producer. There are 3 mandatory properties:


○ The address of at least one broker
○ The serializers for the key and the value
● Serializers for common types are provided by Kafka, and we can define our own
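
A minimal sketch covering the two previous slides, assuming a broker on localhost:9092 and String keys and values:

  import java.util.Properties
  import org.apache.kafka.clients.producer.KafkaProducer

  val props = new Properties()
  // address of at least one broker
  props.put("bootstrap.servers", "localhost:9092")
  // serializers for the key and the value
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)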
Produce

● Kafka provides a utility class to simplify the configuration
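
In the Java client this utility class is ProducerConfig, which exposes the property names as constants (whether this is exactly the class the slide refers to is an assumption); the same configuration written with it:

  import org.apache.kafka.clients.producer.ProducerConfig
  import org.apache.kafka.common.serialization.StringSerializer

  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)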


Produce

● There are a lot of possible parameters


● Everything is documented
Produce

● To send a message
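
A sketch, reusing the producer above and the test topic created earlier; the key and value are arbitrary:

  import org.apache.kafka.clients.producer.ProducerRecord

  // a message is a key / value pair, targeted at a topic
  val record = new ProducerRecord[String, String]("test", "my-key", "my-value")
  producer.send(record)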
Produce

● The call to producer.send() is asynchronous (non blocking)


● It does not block the calling code
● To force a synchronous (blocking) call, we need to call producer.send().get()
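
For example (a sketch, reusing the record above): send() returns a java.util.concurrent.Future[RecordMetadata], and calling get() on it blocks until the broker acknowledges the write:

  // blocks until the record is acknowledged, then exposes partition and offset
  val metadata = producer.send(record).get()
  println(s"partition=${metadata.partition()} offset=${metadata.offset()}")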
Produce

● To get the result, there are two ways


● The call producer.send() returns a Future
○ Unfortunately, it is a Java Future, hard to use in Scala
Produce

● The method producer.send() can also take a callback function as parameter
● When the call completes, the callback function is invoked
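
A sketch of the callback variant, reusing the record from above:

  import org.apache.kafka.clients.producer.{Callback, RecordMetadata}

  producer.send(record, new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null) exception.printStackTrace()   // the send failed
      else println(s"written to partition ${metadata.partition()} at offset ${metadata.offset()}")
  })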
Consume

● As for the producer, we need to instantiate the consumer (see the sketch after the configuration slide below)


Consume

● As for the producer, we need to configure the consumer


● Another parameter is mandatory: the group id
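
A minimal sketch covering the two previous slides; the group id my-group is arbitrary:

  import java.util.Properties
  import org.apache.kafka.clients.consumer.KafkaConsumer

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("group.id", "my-group")   // mandatory: the consumer group id

  val consumer = new KafkaConsumer[String, String](props)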
Consume

● Several other parameters can be set


● For example, the parameter enable.auto.commit is used to tell the
consumer if it has to commit automatically. Otherwise, it has to be done
manually
○ If the property is set to true, the consumer commits every auto.commit.interval.ms
(5000 ms by default)
● By default, enable.auto.commit is set to true
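
For example (sketch), to disable auto commit, or to keep it and tune the interval:

  props.put("enable.auto.commit", "false")      // commits will be done manually
  // or keep the default (true) and adjust the commit interval:
  props.put("auto.commit.interval.ms", "5000")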
Consume

● Then, we need to subscribe to the topics we wish to consume


● Kafka will then dispatch partitions among all the consumers of a given group
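
A sketch, subscribing to the test topic used earlier (the Java client expects a Java collection):

  import scala.collection.JavaConverters._

  consumer.subscribe(List("test").asJava)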
Consume

● Then we can fetch the results


● The call to poll is synchronous. If no message is available, the
consumer waits at most the duration given as a parameter before giving control
back to the caller
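
A sketch of a poll loop, waiting at most one second per call:

  import java.time.Duration
  import scala.collection.JavaConverters._

  while (true) {
    // blocks for at most 1 second if no message is available
    val records = consumer.poll(Duration.ofSeconds(1))
    for (record <- records.asScala)
      println(s"key=${record.key()} value=${record.value()} offset=${record.offset()}")
  }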
Consume

● If we set the parameter enable.auto.commit to false, we have to commit manually;
otherwise we will read the same messages indefinitely
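
With auto commit disabled, a synchronous commit after processing might look like this (sketch):

  // after processing the records returned by poll()
  consumer.commitSync()   // blocks until the offsets of the last poll are committed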
Consume

● We can also commit asynchronously
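
The asynchronous variant takes an optional callback (sketch):

  import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
  import org.apache.kafka.common.TopicPartition

  consumer.commitAsync(new OffsetCommitCallback {
    override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                            exception: Exception): Unit =
      if (exception != null) exception.printStackTrace()   // the commit failed
  })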


Confluent ecosystem

● Schema registry
○ Offers the possibility to apply schemas to messages
● Kafka Streams
○ High level library (offers a DSL) to transform data between topics
○ Plays the role of T in ETL
● Kafka Connect
○ Offers connectors to feed Kafka with data or to push data from Kafka to other
systems
■ There are connectors for HDFS, the file system, Cassandra etc.
○ Plays the role of E in ETL if the connector is a source and L if it is a sink
● etc.
Kafka Streams

● High level API to consume and produce messages between topics


○ Is used to transform data
○ Kafka Streams also offers a low level API. We will concentrate on the high level API
● It is an alternative to
○ Spark Streaming
○ Apache Storm
○ Akka Streams
○ etc.
Kafka Streams

● Kafka Streams has 2 concepts


● KStream
○ The topic is seen as a data flow, where each record is independent from the others
● KTable
○ Similar to a changelog. Each record is seen as an update (depending on the key)
● For example, I have a topic with two elements (“euro”, 5) and (“euro”, 1)
○ If I create a KStream on this topic and sum the values in euros, I will get 6
○ If I create a KTable, I will get 1
Kafka Streams

● Kafka Streams offers the usual high level functions:


○ map
○ filter
○ groupByKey
○ count
○ etc.
Kafka Streams

● Simple example
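
The original code is not reproduced here; a minimal sketch of what such an example could look like, using the kafka-streams-scala DSL (a separate dependency, "org.apache.kafka" %% "kafka-streams-scala") and hypothetical topic names, copying messages from one topic to another while upper-casing the values:

  import java.util.Properties
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")   // hypothetical application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-topic")   // KStream over the input topic
    .mapValues(_.toUpperCase)                     // transform each value
    .to("output-topic")                           // write the result to another topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()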
Kafka Streams

● Word count example
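
Likewise, a sketch of a word count under the same assumptions (Scala DSL, hypothetical topic names): each value is split into words, the words are grouped and counted into a KTable, and the counts are written back to a topic:

  import java.util.Properties
  import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
  import org.apache.kafka.streams.scala.StreamsBuilder
  import org.apache.kafka.streams.scala.ImplicitConversions._
  import org.apache.kafka.streams.scala.Serdes._

  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app")   // hypothetical application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("sentences")
    .flatMapValues(_.toLowerCase.split("\\W+").toSeq)   // one record per word
    .groupBy((_, word) => word)                         // re-key by word
    .count()                                            // KTable[String, Long]
    .toStream
    .to("word-counts")

  new KafkaStreams(builder.build(), props).start()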
