
Apache Storm

Agenda
● What is Storm?
● Concepts
● How it works?
● Types of groupings
● Reliability
● Workers
● Tasks
● Hands on
Apache Storm
● Storm is a distributed real-time computation system.
● It is built to process streams.
● A stream is an unbounded sequence of incoming data.

Another option for processing streams is Apache Spark.


Concepts & Definitions
● Tuple : A tuple is a named list of values. The values in a tuple can be of
any type.
● Spout : A source of data that emits tuples, e.g. a file spout, a Kafka spout, a DB
spout, etc.
● Bolt : The processors of a Storm cluster. All processing is done here, e.g. filtering.
● Topology : The way spouts and bolts are wired together forms a topology.
● Stream : Unbounded data from a particular source.
● Grouping : Groupings decide which emitted tuple will land at which
task.
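To make the tuple definition concrete, here is a minimal plain-Java sketch of a named list of values. It mirrors the idea behind Storm's Tuple, not its real API; the class and method names are illustrative only.

```java
import java.util.List;

// Minimal illustration of a tuple: a named list of values of any type.
// This mirrors the concept behind Storm's Tuple, not its actual class.
public class MiniTuple {
    private final List<String> fields;
    private final List<Object> values;

    public MiniTuple(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    // Look up a value by its field name.
    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        MiniTuple t = new MiniTuple(
            List.of("word", "count"),
            List.of("storm", 42));
        System.out.println(t.getValueByField("word"));   // storm
        System.out.println(t.getValueByField("count"));  // 42
    }
}
```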
Topology

[Diagram: spouts emit tuples into bolts; bolts feed further bolts downstream, forming a directed graph.]

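In code, a topology like the one sketched above is wired together with Storm's TopologyBuilder. The spout and bolt class names below are hypothetical placeholders for user-defined components, and the snippet assumes the storm-client dependency is on the classpath:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// SentenceSpout, SplitBolt and CountBolt are hypothetical stand-ins
// for real spout/bolt implementations.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new SentenceSpout(), 2);   // source of tuples
builder.setBolt("split", new SplitBolt(), 4)
       .shuffleGrouping("sentences");                    // bolt subscribed to the spout
builder.setBolt("count", new CountBolt(), 4)
       .fieldsGrouping("split", new Fields("word"));     // bolt subscribed to another bolt
```

The built topology (`builder.createTopology()`) is then submitted with StormSubmitter, or run in a LocalCluster for testing.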

Types of Groupings
There are 8 built-in groupings in Storm:

1. Shuffle : Tuples are randomly distributed across the bolt's tasks in a way
such that each task is guaranteed to get an equal number of tuples.
2. Fields : The stream is partitioned by the fields specified in the grouping.
3. Partial Key : The stream is partitioned by the fields specified in the grouping,
like the Fields grouping, but tuples are load-balanced between two downstream
bolts, which provides better utilization of resources when the incoming data is
skewed.
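Conceptually, the Fields grouping routes each tuple by a hash of the selected field, so equal field values always reach the same task. A plain-Java illustration of that idea (not Storm's internal implementation):

```java
import java.util.Objects;

public class FieldsGroupingDemo {
    // Route a tuple to a task index by hashing its grouping field,
    // so equal field values always land on the same task.
    static int chooseTask(Object fieldValue, int numTasks) {
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word is always routed to the same downstream task.
        System.out.println(chooseTask("storm", tasks) == chooseTask("storm", tasks)); // true
    }
}
```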
4. All : The stream is replicated across all the bolt's tasks.
Types of Grouping - continued
5. Global : The entire stream goes to a single one of the bolt's tasks.
6. None : Equivalent to shuffle grouping.
7. Direct : This is a special kind of grouping. A stream grouped this way means
that the producer of the tuple decides which task of the consumer will receive
this tuple. Direct groupings can only be declared on streams that have been
declared as direct streams. Tuples emitted to a direct stream must be emitted
using one of the emitDirect methods.
8. Local or shuffle : If the target bolt has one or more tasks in the
same worker process, tuples will be shuffled to just those in-process tasks.
Otherwise, this acts like a normal shuffle grouping.
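Each grouping is declared when a bolt subscribes to a stream, via methods on the declarer returned by TopologyBuilder.setBolt. A sketch assuming the storm-client dependency; the component ids and bolt classes are placeholders:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// SentenceSpout, SplitBolt, etc. are hypothetical user-defined components.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new SentenceSpout());
builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("spout");        // random, even spread
builder.setBolt("count", new CountBolt(), 4)
       .fieldsGrouping("split", new Fields("word"));                          // partition by "word"
builder.setBolt("report", new ReportBolt()).globalGrouping("count");          // whole stream to one task
builder.setBolt("cache", new CacheBolt(), 4).allGrouping("split");            // replicate to every task
builder.setBolt("local", new LocalBolt(), 4).localOrShuffleGrouping("split"); // prefer in-worker tasks
```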
Reliability
● Storm guarantees that every tuple will be fully processed.
● Every topology has a "message timeout" associated with it. If Storm fails to
detect that a spout tuple has been completed within that timeout, then it fails
the tuple and replays it later.
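In a bolt, this guarantee is opted into by anchoring emitted tuples to the input tuple and then acking it, so Storm can track completion and replay the spout tuple on failure or timeout. A sketch using Storm's BaseRichBolt (assumes the storm-client dependency; the class name and field are illustrative):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch of a reliable bolt: anchor each emitted tuple to its input
// and ack the input, so Storm can track and replay on timeout.
public class UpperCaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");
        collector.emit(input, new Values(word.toUpperCase())); // anchored emit
        collector.ack(input);                                  // mark fully processed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}
```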
Workers
● Topologies run across one or more workers.
● Each worker process is a separate JVM and executes many tasks.
● Storm tries to spread tasks evenly across the cluster.
Tasks
● Each spout or bolt runs as one or more tasks across the cluster.
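Workers, parallelism, and tasks are configured per topology. A sketch assuming the storm-client dependency; the component id and bolt class are placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

// SplitBolt is a hypothetical user-defined bolt.
TopologyBuilder builder = new TopologyBuilder();
Config conf = new Config();
conf.setNumWorkers(3);            // 3 worker JVMs for this topology
conf.setMessageTimeoutSecs(30);   // replay tuples not fully processed within 30s

builder.setBolt("split", new SplitBolt(), 4)  // parallelism hint: 4 executors
       .setNumTasks(8);                       // 8 tasks spread across those executors
```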
Execution
