Spark Jobs: Stages, Shuffle, Tasks, Slots
@Manjunath Angadi
https://www.linkedin.com/in/manjunathangadi
Introduction
Apache Spark is a powerful distributed processing framework that enables fast and
scalable data processing and analytics. Spark divides the processing tasks into
smaller units called stages, which are further broken down into tasks. The execution
of these tasks is distributed across a cluster of machines. In this document, we will
explore the concepts of stages, shuffle, tasks, and slots in Spark, along with a simple
example to illustrate their usage.
Job:
For each action, Spark creates a job. The driver builds the logical plans for the job and, based on cost, chooses the best plan for execution.
Stage:
The driver breaks the logical plan into stages, creating a new stage at each wide transformation. Each stage can contain one or more narrow transformations and at most one wide transformation.
Narrow stages: data transformations that are performed in parallel on each partition and do not require shuffling or data movement.
Wide stages: data transformations that require shuffling or data movement across
partitions. This occurs when a transformation needs data from multiple partitions
to be grouped together, such as aggregations or joins. Wide
stages introduce a performance overhead due to the data movement involved.
Shuffle/sort:
At the end of each stage, a write exchange is created: the shuffle/sort runs on the stage's output and moves the data to the next stage. At the start of the next stage, a read exchange is created to read the data produced by the previous stage.
Task:
Tasks are the smallest unit of work in Spark. Each stage consists of multiple tasks
that are executed in parallel across a cluster of machines. Tasks are responsible for
processing a subset of the data and applying the required transformations or actions.
Slots:
A slot represents a unit of resources available on an executor to run a task. An
executor can have multiple slots, and each slot can execute one task at a time. The
number of slots on an executor is determined by the cores allocated to it
(spark.executor.cores divided by spark.task.cpus, which defaults to 1).
Example
Let's consider a simple example to illustrate the concepts of stages, shuffle, tasks,
and slots in Spark.
1. Stage 1 - Narrow Stage: In the first stage, Spark reads the data from a
distributed storage system and applies a filter to select only the relevant product
IDs. This stage does not require shuffling as each executor can process the data
independently.
2. Stage 2 - Wide Stage (Shuffle): In the second stage, Spark performs a group
by operation on the filtered data to group the records by product ID. This stage
requires shuffling as data from multiple partitions needs to be grouped together
based on the product ID.
3. Stage 3 - Narrow Stage: In the final stage, Spark calculates the total revenue
for each product by multiplying the quantity and price. This stage does not
require shuffling as each executor can compute the revenue independently.
Each stage is further divided into multiple tasks that are executed in parallel on
different machines. The number of tasks depends on the data partitioning and the
available resources in the cluster. Each task runs in a slot on an executor, and
multiple tasks can be executed concurrently on a single executor.
By dividing the computation into stages, Spark optimizes the execution plan and
provides fault tolerance by allowing re-computation of failed tasks.