Apache Spark — Executors — How many tasks can my cluster run in parallel

swetha murali · 3 min read · Dec 21, 2023

Executors are responsible for actually executing the work that the driver assigns to them. All computation requires a certain amount of memory and CPU.

An executor is a single JVM process launched for a Spark application on a node. A node can have multiple executors.

Let's consider a 10-node cluster with 16 cores and 64 GB RAM on each node. Let's do the math based on these numbers throughout.

How many executors do we need for optimal performance?


Thin (slicing cores & RAM very thin)

1. Allot the bare minimum resources possible to each executor. In our case, we allot 1 core and 4 GB RAM to each executor.
2. Problem — no multithreading: multiple tasks cannot run within the same executor.

Fat (slicing cores & RAM too big)

1. Give the maximum possible resources to each executor. In our case, allot all 16 cores and 64 GB RAM to one executor, which also means only one executor for the entire node.

2. Problem — if an executor holds a huge amount of memory, garbage collection (removing unused objects from memory) takes a lot of time.

3. HDFS throughput suffers — reads and writes get slow.

Balanced approach — anything in extreme is not good

We want multithreading on each executor, while at the same time throughput should not suffer. It has been observed that HDFS achieves full write throughput at ~5 tasks per executor, which means we can allot ~5 cores to each executor for optimal performance.

Calculating the total number of executors in the cluster

We have a 10-node cluster with 16 CPU cores and 64 GB RAM on each node.

1 core goes to background processes and 1 GB RAM is given to the OS, leaving 15 cores and 63 GB RAM per node.

Doing an optimal cut, we can have 3 executors per node — 5 cores and 21 GB RAM each.
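The per-node carve-up above can be sketched in a few lines (a hypothetical helper; the 1 core / 1 GB reservations and ~5 cores per executor are the article's assumptions):

```python
# Per-node resource carve-up for the example cluster
# (16 cores, 64 GB RAM per node; ~5 cores per executor).
NODE_CORES, NODE_RAM_GB = 16, 64
CORES_PER_EXECUTOR = 5

cores_avail = NODE_CORES - 1           # 1 core reserved for background processes
ram_avail_gb = NODE_RAM_GB - 1         # 1 GB reserved for the OS

executors_per_node = cores_avail // CORES_PER_EXECUTOR    # 15 // 5 = 3
ram_per_executor_gb = ram_avail_gb // executors_per_node  # 63 // 3 = 21

print(executors_per_node, ram_per_executor_gb)  # 3 21
```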

Off-Heap Memory
This is memory outside the JVM heap space, used for off-heap storage
(like caching), reducing GC overhead.

off-heap memory = max(384 MB, 7% of executor memory)

In our case, 7% of executor memory (7% of 21 GB ≈ 1.5 GB) goes to off-heap memory, leaving 19 GB to be allotted to each executor.
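The overhead rule above, written out (a sketch using the article's 7% figure; Spark's actual default overhead fraction can differ by version and is configurable via `spark.executor.memoryOverhead`):

```python
def off_heap_overhead_gb(executor_mem_gb: float) -> float:
    """Overhead = max(384 MB, 7% of executor memory), per the rule above."""
    return max(0.384, 0.07 * executor_mem_gb)

overhead = off_heap_overhead_gb(21)  # 0.07 * 21 = 1.47 GB (~1.5 GB)
heap = 21 - overhead                 # ~19.5 GB, rounded down to 19 GB here
print(round(overhead, 2))  # 1.47
```

Note that the 384 MB floor only kicks in for small executors — e.g. a 2 GB executor would get 384 MB of overhead, since 7% of 2 GB is only 140 MB.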

Total Executors
Each executor has 5 cores and 19 GB RAM, with 3 executors per node, making 30 executors across the cluster.

One goes to the YARN Application Master, leaving 29 executors on the cluster.
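Translated into submit-time flags, the sizing above might look like this (a sketch; `my_app.py` is a placeholder, and it assumes a YARN cluster with dynamic allocation disabled):

```shell
spark-submit \
  --master yarn \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 19G \
  my_app.py
```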

How many tasks can run in parallel

Each executor can run 5 tasks in parallel (5 cores each), and with 3 executors on each node, 15 tasks can run in parallel on a node. Across the cluster, one executor is allotted to the YARN Application Master, so 29 × 5 = 145 tasks can run in parallel.
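End to end, the parallelism math is just the following (a minimal recap in code, using the numbers worked out above):

```python
NODES, EXECUTORS_PER_NODE, CORES_PER_EXECUTOR = 10, 3, 5

# One executor slot is given up to the YARN Application Master.
total_executors = NODES * EXECUTORS_PER_NODE - 1
parallel_tasks = total_executors * CORES_PER_EXECUTOR

print(total_executors, parallel_tasks)  # 29 145
```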

[Figure: Spark Executors — each node runs 3 executors (5 cores, 19 GB each); each executor handles 5 tasks/partitions; across the cluster there are 29 executors (one slot goes to the YARN Resource Manager), so 29 × 5 = 145 tasks/partitions run in parallel.]
