
Spark Jobs - Stage, Shuffle, Task, Slots
@Manjunath Angadi

https://www.linkedin.com/in/manjunathangadi

Introduction
Apache Spark is a powerful distributed processing framework that enables fast and
scalable data processing and analytics. Spark divides each job into smaller units
called stages, which are further broken down into tasks. The execution
of these tasks is distributed across a cluster of machines. In this document, we will
explore the concepts of stages, shuffle, tasks, and slots in Spark, along with a simple
example to illustrate their usage.

Job:
For each action, a job is created. Spark also builds the logical plans for the job and,
based on cost, the best plan is chosen for execution.
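
For instance, here is a minimal sketch (assuming a local SparkSession and a made-up DataFrame): the transformations are lazy, and only the count() action at the end creates a job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-example").getOrCreate()

# Transformations are lazy -- no job is created yet.
df = spark.range(1_000_000).filter("id % 2 = 0")

# The action triggers a job; Spark builds the logical plan, optimizes it,
# and picks the best physical plan to execute.
print(df.count())
```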

Stage:
The driver breaks the logical plan into stages, creating a stage for each wide
transformation. Each stage can have one or more transformations, but only one wide
transformation.

There are two types of stages:

Narrow stages: data transformations that are performed in parallel and do not
require shuffling or data movement (see the sketch after this list).

Example: select(), filter(), withColumn(), drop(), where()

Wide stages: data transformations that require shuffling or data movement across
partitions. A wide stage occurs when a transformation needs data from multiple
partitions to be grouped together, such as aggregations or joins. Wide stages
introduce a performance overhead due to the data movement involved.

Example: groupBy(), join(), cube(), reduceByKey(), rollup(), agg(), repartition(), etc.
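
A short sketch contrasting the two kinds of transformations (the employee rows and column names below are assumptions made purely for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()

employees = spark.createDataFrame(
    [("Asha", "Sales", 60000), ("Ravi", "Sales", 45000), ("Mia", "HR", 52000)],
    ["name", "dept", "salary"],
)

# Narrow transformations: each output partition depends only on one
# input partition, so no shuffle is needed.
narrow = (employees
          .filter(F.col("salary") > 50000)
          .withColumn("bonus", F.col("salary") * 0.1)
          .select("name", "dept", "bonus"))

# Wide transformation: groupBy() needs all rows for a dept together,
# so Spark inserts a shuffle (exchange) and starts a new stage here.
wide = narrow.groupBy("dept").agg(F.sum("bonus").alias("total_bonus"))

wide.show()
```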

Shuffle/sort:
At the end of each stage, a write exchange is created; the shuffle/sort runs on the
data and moves the output to the next stage.

At the start of the next stage, a read exchange is created to read the data from the
previous stage.

Note: Shuffling can be an expensive operation and should be minimized whenever
possible. Efficient data partitioning and the use of appropriate operations can help
reduce shuffling in Spark applications.

Example of stages and shuffle/sort:
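
One way to see these exchanges is to print the physical plan with explain(). In this illustrative sketch (the sample data is made up), the groupBy() produces an Exchange node, which marks the shuffle boundary between the two stages:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

orders = spark.createDataFrame(
    [(1, "A", 2), (2, "B", 1), (3, "A", 5)],
    ["order_id", "product_id", "quantity"],
)

# groupBy() forces a shuffle: the physical plan printed by explain()
# contains an "Exchange" node, the write/read boundary between stages.
totals = orders.groupBy("product_id").agg(F.sum("quantity").alias("total_qty"))
totals.explain()
```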

Task:
Tasks are the smallest unit of work in Spark. Each stage consists of multiple tasks
that are executed in parallel across a cluster of machines. Tasks are responsible for
processing a subset of the data and applying the required transformations or actions.

Spark automatically assigns tasks to executors based on the available resources in
the cluster. Each executor can run multiple tasks concurrently, depending on the
available CPU cores and memory.
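
As a rough illustration (the partition count here is chosen arbitrarily), each partition of a stage is processed by exactly one task:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tasks-example").getOrCreate()

# Ask for 8 partitions; each partition is processed by one task per stage.
df = spark.range(0, 10_000_000, numPartitions=8)

print(df.rdd.getNumPartitions())  # 8 partitions -> 8 tasks in this stage
print(df.count())                 # the action launches those tasks
```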

Slots:
A slot represents a unit of resources available on an executor to run a task. An
executor can have multiple slots, and each slot can execute one task at a time. The
number of slots per executor depends on the available CPU cores and memory
configuration.
Slots help determine the degree of parallelism in Spark. By default, Spark assigns
one slot per CPU core, but it can be configured manually based on the application's
requirements.
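
For example, when submitting an application to a cluster manager, the slot count per executor follows from the executor-core settings. The values below are purely illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative values only. With spark.executor.cores = 4 and the default
# spark.task.cpus = 1, each executor has 4 slots and can run up to
# 4 tasks concurrently.
spark = (SparkSession.builder
         .appName("slots-example")
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .getOrCreate())
```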

Example
Let's consider a simple example to illustrate the concepts of stages, shuffle, tasks,
and slots in Spark.

Suppose we have a dataset of customer orders containing order details such as
order ID, product ID, quantity, and price. Our goal is to calculate the total revenue
generated from each product.

1. Stage 1 - Narrow Stage: In the first stage, Spark reads the data from a
distributed storage system and applies a filter to select only the relevant product
IDs. This stage does not require shuffling as each executor can process the data
independently.

2. Stage 2 - Wide Stage (Shuffle): In the second stage, Spark performs a group
by operation on the filtered data to group the records by product ID. This stage
requires shuffling as data from multiple partitions needs to be grouped together
based on the product ID.

3. Stage 3 - Narrow Stage: In the final stage, Spark calculates the total revenue
for each product by multiplying the quantity and price. This stage does not
require shuffling as each executor can compute the revenue independently.
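
A minimal PySpark sketch of this pipeline (the column names and sample rows are assumptions made for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("revenue-example").getOrCreate()

orders = spark.createDataFrame(
    [(1, "P1", 2, 10.0), (2, "P2", 1, 25.0), (3, "P1", 4, 10.0)],
    ["order_id", "product_id", "quantity", "price"],
)

# Narrow: filter the relevant product IDs; no shuffle is needed.
relevant = orders.filter(F.col("product_id").isin("P1", "P2"))

# Wide: grouping by product_id shuffles rows so that all records for a
# product land in the same partition, then the revenue is aggregated.
revenue = (relevant
           .withColumn("revenue", F.col("quantity") * F.col("price"))
           .groupBy("product_id")
           .agg(F.sum("revenue").alias("total_revenue")))

revenue.show()  # the action that triggers the whole job
```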

Each stage is further divided into multiple tasks that are executed in parallel on
different machines. The number of tasks depends on the data partitioning and the
available resources in the cluster. Each task runs in a slot on an executor, and
multiple tasks can be executed concurrently on a single executor.

By dividing the computation into stages, Spark optimizes the execution plan and
provides fault tolerance by allowing the recomputation of failed tasks.
