Apache Spark — Executors — How many tasks can my cluster run in parallel

swetha murali · 3 min read · Dec 21, 2023

Executors are responsible for actually executing the work that the driver assigns to them. All computation requires a certain amount of memory and CPU.

An executor is a single JVM process launched for a Spark application on a node. A node can have multiple executors.

Let's consider a 10-node cluster with 16 cores and 64 GB RAM on each node. Let's do the math based on these numbers throughout.

How many executors do we need for optimal performance?


Thin (slicing cores & RAM very thin)

1. Allot the bare minimum resources possible to each executor. In our case, we allot 1 core and 4 GB RAM to each executor.
2. Problem — no multithreading: multiple tasks cannot run within the same executor.

Fat (slicing cores & RAM too big)

1. Give the maximum possible resources to each executor. In our case, allot all 16 cores and 64 GB RAM to one executor, which also means only one executor for the entire node.

2. Problem — if an executor holds a huge amount of memory, garbage collection (removing unused objects from memory) takes a lot of time.

3. HDFS throughput suffers — reads and writes get slow.

Balanced approach — anything in extreme is not good

We want multithreading on each executor, while at the same time throughput should not suffer. It has been observed that HDFS achieves full write throughput at ~5 tasks per executor, which means we can allot ~5 cores to each executor for optimal performance.

Calculating the total number of executors in the cluster

We have a 10-node cluster with 16 CPU cores and 64 GB RAM on each node.

1 core goes to background processes and 1 GB RAM is given to the OS, leaving 15 cores and 63 GB RAM per node.

Doing an optimal cut, we can have 3 executors per node — 5 cores and 21 GB RAM each.
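The per-node carve-up above can be sketched in a few lines (a hypothetical helper; the 1 core / 1 GB reservations and ~5 cores per executor are the article's assumptions):

```python
# Per-node resource carve-up for the example cluster
# (16 cores, 64 GB RAM per node; ~5 cores per executor).
NODE_CORES, NODE_RAM_GB = 16, 64
CORES_PER_EXECUTOR = 5

cores_avail = NODE_CORES - 1           # 1 core reserved for background processes
ram_avail_gb = NODE_RAM_GB - 1         # 1 GB reserved for the OS

executors_per_node = cores_avail // CORES_PER_EXECUTOR    # 15 // 5 = 3
ram_per_executor_gb = ram_avail_gb // executors_per_node  # 63 // 3 = 21

print(executors_per_node, ram_per_executor_gb)  # 3 21
```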

Off-Heap Memory
This is memory outside the JVM heap space, used for off-heap storage
(like caching), reducing GC overhead.

off-heap memory = max(384 MB, 7% of executor memory)

In our case, 7% of executor memory (7% of 21 GB ≈ 1.5 GB) goes to off-heap memory, leaving 19 GB to be allotted to each executor.
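The overhead rule above, written out (a sketch using the article's 7% figure; Spark's actual default overhead fraction can differ by version and is configurable via `spark.executor.memoryOverhead`):

```python
def off_heap_overhead_gb(executor_mem_gb: float) -> float:
    """Overhead = max(384 MB, 7% of executor memory), per the rule above."""
    return max(0.384, 0.07 * executor_mem_gb)

overhead = off_heap_overhead_gb(21)  # 0.07 * 21 = 1.47 GB (~1.5 GB)
heap = 21 - overhead                 # ~19.5 GB, rounded down to 19 GB here
print(round(overhead, 2))  # 1.47
```

Note that the 384 MB floor only kicks in for small executors — e.g. a 2 GB executor would get 384 MB of overhead, since 7% of 2 GB is only 140 MB.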

Total Executors
Each executor has 5 cores and 19 GB RAM, with 3 executors per node, making 30 executors across the cluster.

One goes to the YARN Application Master, leaving 29 executors on the cluster.
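Translated into submit-time flags, the sizing above might look like this (a sketch; `my_app.py` is a placeholder, and it assumes a YARN cluster with dynamic allocation disabled):

```shell
spark-submit \
  --master yarn \
  --num-executors 29 \
  --executor-cores 5 \
  --executor-memory 19G \
  my_app.py
```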

How many tasks can run in parallel

Each executor can run 5 tasks in parallel (5 cores each), and with 3 executors on each node, 15 tasks can run in parallel on a node. Across the cluster, one executor is allotted to the YARN Application Master, so 29 × 5 = 145 tasks can run in parallel.
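End to end, the parallelism math is just the following (a minimal recap in code, using the numbers worked out above):

```python
NODES, EXECUTORS_PER_NODE, CORES_PER_EXECUTOR = 10, 3, 5

# One executor slot is given up to the YARN Application Master.
total_executors = NODES * EXECUTORS_PER_NODE - 1
parallel_tasks = total_executors * CORES_PER_EXECUTOR

print(total_executors, parallel_tasks)  # 29 145
```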

[Figure: Spark Executors — each node runs 3 executors (5 cores, 19 GB each); each executor handles 5 tasks/partitions; across the cluster there are 29 executors (one slot goes to the YARN Resource Manager), so 29 × 5 = 145 tasks/partitions run in parallel.]
