Professional Documents
Culture Documents
Apache Flink® Training: Intro
Apache Flink® Training: Intro
Apache Flink® Training: Intro
Intro
2
Stream Processing
process records
one-at-a-time
...
Your
Code
3
Distributed Stream Processing
...
...
...
● partitions input streams by
some key in the data
● distributes computation
across multiple instances
Your Your Your
Code Code Code ● each instance is
responsible for some key
range
4
Stateful Stream Processing
...
...
var x = …
update local
variables/structures
qwe
Your Your if (condition(x)) {
Code Code
…
}
Process
5
Stateful Stream Processing
...
...
var x = …
update local
variables/structures
qwe
Your Your if (condition(x)) {
Code Code
… ● embedded local state
} backend
6
About time ...
...
...
7
Traditional batch processing
t
● Continuously
... ingesting data
● Time-bounded
batch files
2017-06-14 2017-06-14 2017-06-13 2017-06-13
01:00am 00:00am 11:00pm 10:00pm
● Periodic batch
jobs
Batch
jobs
8
Traditional batch processing (II)
t
● Consider computing
... conversion metric
(# of A → B per hour)
9
The ideal way
accumulate state
10
The ideal way (II)
● Stateful stream processing as
a new paradigm to
continuously process
continuous data
Stateful Stream Processor
that handles ● Produce accurate results
11
Flink APIs and Runtime
12
Apache Flink Stack
Libraries
Runtime
Distributed Streaming Data Flow
13
Programming Model
Source Source
Computation stat
e
Sin
Sin k
k
14
Parallelism
Distributed
Execution
Levels of abstraction
18
Process Function
19
Data Stream API
stats.addSink(new RollingSink(path))
20
Table API & Stream SQL
21
Deployment Options
22
Local Execution
▪ Starts local Flink cluster
JVM
Task Task
▪ Behaves just like a regular Cluster Manager Manager
23
Remote Execution
Task Task
▪ Submit a Job to a Manager Manager
remotely running
cluster
Task Task
Manager Manager
▪ Monitor the status of
a job
Cluster
24
YARN Job Mode
▪ Brings up a Flink Resource Manager
cluster in YARN to run
a single job
Node Manager Node Manager
Task
Job Manager
▪ Better isolation than Manager
session mode
Node Manager Node Manager
Task Other
Manager Application
Client
YARN Cluster
25
YARN Session Mode
▪ Starts a Flink cluster in Resource Manager
YARN containers
▪ Multi-user scenario Node Manager Node Manager
▪ Easy to setup
Node Manager Node Manager
Task Other
Manager Application
Client
YARN Cluster
26
Other Deployment Options
▪ Apache Mesos
• Either with or without DC/OS
▪ Amazon Elastic MapReduce
• Available in EMR 5.1.0
▪ Google Compute Engine
• Available via bdutil
▪ Docker / Kubernetes
Flink in the real world
28
Flink community
Github
41 meetups
16,544 members
29
Powered by Flink
Zalando, one of the largest ecommerce King, the creators of Candy Crush Saga,
companies in Europe, uses Flink for real- uses Flink to provide data science teams
time business process monitoring. with real-time analytics.
Alibaba, the world's largest retailer, built a Bouygues Telecom uses Flink for real-time
Flink-based system (Blink) to optimize event processing over billions of Kafka
search rankings in real time. messages per day.
See more at
30
flink.apache.org/poweredby.html
31
Largest job has > 20 operators, runs on > 5000
vCores in 1000-node cluster, processes millions of
events per second
32
What is being built with Flink?
33
@
34
Flink Forward 2016
35
Flink Forward 2017
36
http://training.data-artisans.com/