
Blendata enterprise v.3.5.x decoupled/dis-aggregated compute & storage architectural design.

Executive summary

The Blendata - enterprise architecture is a distributed design that couples multiple compute nodes (Linux x86_64 machines) together with the power of its engine, the optimized Apache Spark®. With the modernized decoupled/dis-aggregated compute & storage architecture, the engine depends heavily on storage capability. The questions that follow are how the compute nodes should be designed: should we provision a few big compute nodes (vertical), or many small compute nodes (horizontal)? And how much storage throughput and IOPS are needed? These are the questions this document mainly addresses.

The tests were run on the Amazon Web Services (AWS) public cloud to keep the results standardized. The environment comprised several AWS EC2 instance types for compute and AWS FSx for Lustre for storage. The test data was based on the industry-standard TPC-DS benchmark, using a subset of TPC-DS queries that represent real-world scenarios, including wide scans, aggregates, and joins.

Please note that, for the most accurate result, we recommend testing in your own environment to find the best configuration and design, since many factors may affect your hypothesis and results. However, this document aims to provide the best design guideline for the environment stated above. The results follow.

What is the best way to design compute nodes for decoupled/dis-aggregated compute and storage architecture, horizontal or vertical?

The horizontal design is the best architectural design and will give you almost ideal scalability. The problem with the vertical design is the bottleneck of the operating system and the storage client/gateway on each node. This storage bottleneck leads to slow performance on data scanning tasks (i.e., when a compute node pulls files from storage) and affects overall performance accordingly. Thus, we should leverage the high-concurrency capability of the storage by provisioning multiple compute nodes and mounting them to the same storage, so that each node simultaneously scans a smaller portion of the data, leading to higher performance.

How much vCPU/Core is needed per compute node?

The answer depends on your storage's client performance. Previously, when we deployed Blendata - enterprise on a traditional Hadoop architecture consisting of HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator; resource management), storage bandwidth was not a problem: compute and storage were tightly coupled, and the mechanism let the CPU read data directly from each local disk, resulting in high data locality with no concern about storage bandwidth/throughput. Now that compute is decoupled from storage, the storage must be mounted to the compute nodes through some storage protocol, such as HDFS (as a protocol only), NFS, S3a, or, in this case, the Lustre client. The client therefore acts like a single point of storage access on each node that requests must queue through. To calculate the ideal number of executor cores per node, divide the storage bandwidth per client by 300 MB/s (or 250 MB/s at minimum). Based on this test, with around 1,800 MB/s of storage bandwidth per client measured via fio, 6-8 executor cores per node are recommended.
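
As a quick illustration of that rule of thumb, here is a minimal Python sketch of the calculation; the 1,800 MB/s client bandwidth and the 250-300 MB/s per-core budget are the values observed in this test, not universal constants.

# Sketch: estimate executor cores per node from the measured storage-client bandwidth.
def executor_cores_per_node(client_bandwidth_mb_s: float,
                            per_core_mb_s: float = 300.0) -> int:
    # ideal cores per node = client bandwidth / per-core bandwidth budget
    return max(1, int(client_bandwidth_mb_s // per_core_mb_s))

# ~1,800 MB/s per Lustre client (fio sequential-read test from this document)
print(executor_cores_per_node(1800, 300))   # 6 cores (conservative)
print(executor_cores_per_node(1800, 250))   # 7 cores (toward the 8-core upper bound)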

How much storage throughput is needed?

Storage throughput is mostly consumed during the scanning stage (pulling data from storage). The higher the throughput, the faster and bigger the pool of executor cores that can scan in parallel. To keep it simple, based on our test results, each executor core bursts around 40-70 MB/s of throughput. You can use this guideline to estimate the required storage throughput (for example, 48 executor cores need roughly 48 x 60 MB/s = 2,880 MB/s of storage throughput). Please note that if the storage throughput is lower than this guideline, nothing breaks, but the compute cluster slows down (higher IOWait%).
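
The same guideline can be run in the other direction, from a planned executor count to the storage throughput to provision. A minimal sketch, assuming the 40-70 MB/s per-core burst range observed in these tests:

# Sketch: estimate required storage throughput from a planned executor-core count.
def storage_throughput_mb_s(executor_cores: int, per_core_mb_s: float = 60.0) -> float:
    # each executor core bursts roughly 40-70 MB/s during scans; 60 MB/s is a midpoint
    return executor_cores * per_core_mb_s

print(storage_throughput_mb_s(48))        # 2880.0 -> ~2,880 MB/s, as in the example above
print(storage_throughput_mb_s(48, 40.0))  # 1920.0 -> lower bound; under-provisioning
                                          # only raises IOwait, it does not break the job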

How much data is transferred between compute and storage?

Again, it depends on the data characteristics. Blendata - enterprise v3.5 bundles many advanced data management and data skipping techniques, such as data partitioning, predicate pushdown at the file level, the Parquet dictionary (data statistics; ranges of values), cost-based optimization, and dynamic partition pruning. All of these aim at one goal: to reduce the data transferred from storage to compute during the scanning stage as much as possible. However, how much data can be skipped depends directly on the characteristics of the data. For example, if the data has low cardinality (few unique values) over a large volume, not much can be skipped; if the data has high cardinality, the opposite is true. Based on our tests, the transferred data can be reduced by a factor of roughly 0.1x to 20x compared to the stored data.

We hope this guideline helps you craft your or your partner's big data platform in an optimal way. Any request or suggestion on this matter can be sent by email to support@blendata.co, or raised directly with your sales representative.

Introduction

Background

Nowadays, big data platforms leverage various algorithms, methods, and technologies to give users faster access to high-performance, parallel computing on big data at a lower environment cost. Blendata, as a big data technology company and the owner of Blendata - enterprise (BDE), the simplified big data platform, also aims to improve its technology by enhancing its engine and adopting modernized techniques, such as columnar storage in BDE v.2.x and adaptive query execution in BDE v.3.x. With the latest architecture, decoupled/dis-aggregated compute and storage, and the various data skipping techniques already included in the BDE platform (e.g., advanced file predicate pushdown, cost-based optimization, etc.), some of our customers or partners may ask how they should design their hardware for this new architecture, what the differences are between the old architecture (coupled compute & storage) and the new one, and so on. This guideline will help you answer questions related to architecture design and its requirements through references to the test scenarios described in this document.

Objectives
1. To find out what the best architecture is for dis-aggregated compute & storage.
2. To find the limitations and factors for sizing both compute & storage resource requirements.

Technologies & methodologies background

Traditional big data platform architecture (coupled/aggregated compute & storage)

5-10 years ago, big data platform adoption increased significantly due to the rise of data volume, variety, and velocity, and to the game-changing open-source technology "Hadoop", which made the cost of ownership more reasonable. Hadoop consists of two main components: the distributed file system layer called the Hadoop Distributed File System (HDFS), and the processing layer called MapReduce (later developed into YARN, the bundled pack of processing API & resource management). Besides being a technology that can leverage commodity hardware by integrating many machines into a single cluster (resulting in lower cost compared to other big data platform vendors that required proprietary, expensive hardware), the key concept of HDFS is to keep data as close as possible to the compute power. Data is split into chunks, called blocks, which are spread across many nodes in the cluster. The CPU in each node can then read data directly from the disks in the same node and process it simultaneously with the other nodes working on the same job. The result is a high-performance, parallel computing platform with high, near-infinite (ideally) throughput and scalability. The following is a high-level diagram of the Hadoop architecture.

Many organizations adopted this technology and enabled many successful data use cases. However, over time, some challenges emerged. We will not go into detail about the vendor lock-in issue and the license-model adjustments of the last 2-3 years, which resulted from some merged Hadoop-provider companies deciding to change their business model (that is the commercial side and not the purpose of this document). What we heard from customers who had already adopted Hadoop is how hard the technology is to maintain, and how poorly it scales compared to a cloud-based environment. You cannot scale only storage when you want more space for bigger, new data. You cannot scale only compute when you want to add a new report-generation job without new data arriving. With Hadoop, you can scale only by adding a new server/node containing CPU, memory, and storage, and it must have the same size and specs as the existing servers/nodes. This causes a lot of headaches and unused compute or storage resources, and it directly affects the total cost of ownership of the platform.

Decoupled/Dis-aggregated compute & storage architecture

Actually, this methodology is not new. Similar concepts can be found in legacy massively parallel processing (MPP) systems, in HPC clusters in the scientific and mathematics industries, and even in legacy computing applications built on servers and network-attached storage (NAS). The main reason this architecture was not the default standard in the previous decade was purely technology limitations. The following is a high-level diagram of the architecture.

As you can see from the picture above, the main channel between compute and storage is the network that bridges the resources together. This leads to a bottleneck when all data must be transferred to the compute nodes, so at first glance this architecture does not look like a fit for the big data era. Its benefits, however, still outweigh the cons because of its flexibility: users can scale only storage when they need to store more data, or scale only compute when they have more processing tasks or jobs. This kind of scaling is not possible in coupled, Hadoop-based architectures, which can only scale node by node.

From our observation, this architecture became popular and widely adopted thanks to the game-changing cloud-computing era. Users can store as much big data as they like on serverless blob storage such as AWS S3, Azure Blob Storage, and Google Cloud Storage, while spawning only the compute their jobs require, such as plain AWS EC2 instances or query services such as Athena. This architecture can handle big data storage and processing tasks for two major reasons: 1. storage nowadays is fast, offers high enough throughput, and can scale dramatically behind the scenes (including some "burst" features); 2. query/processing engine optimization techniques skip unnecessary data, so not all data needs to be transferred to the compute nodes every time. These two factors enable the decoupled architecture to handle big data storage and processing with high performance.

Blendata - enterprise architecture and core engine

The simplified big data platform from our company is no different in terms of big data architecture. Previously, we usually delivered our platform with a coupled/aggregated compute & storage architecture based on HDFS and YARN, the open-source stack that many big data solution providers adopted as the framework for storing large amounts of data. Nonetheless, with the arrival of the decoupled architecture, our engine optimizations that keep improving performance along the way, and the many enterprise storage products now available in the market, we decided to deliver the decoupled architecture to our recent customers, and the feedback has been impressive. Some customers have already saved up to 300% of TCO compared to the traditional coupled architecture just by switching to this new architecture.

One of the reasons we can provide a big data platform with a decoupled architecture is our core engine, the optimized Apache Spark®, the famous open-source in-memory big data processing framework that we specialize in and have heavily modified at its core. Here are some features (both provided by native Spark and enhanced by Blendata) that directly benefit a decoupled compute & storage architecture.

Blendata’s optimized Spark - Based on Apache Spark v.2.x


● Columnar storage: all data (internal tables) is stored in the Parquet file format by default.
● Data partitioning: process only the data in the selected partitions.
● Cost-based optimization: execute the job based on IOPS and throughput costs.
● Dynamic partition pruning: prune partitions with a dynamic strategy during planning and execution.
● Predicate pushdown: push filters down to the file level instead of transferring the data to the compute level.
● Adaptive query execution: execute the job with an adaptive task-distribution strategy.

These techniques are some of the enhancements that give us higher performance than native Apache Spark (around 1.5x), while still building on the standard Apache Spark framework, so the platform retains the ability to scale and to integrate with many technologies in the market in standard Spark ways.
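
To illustrate how two of the techniques above, data partitioning and predicate pushdown, reduce the data scanned in practice, here is a minimal PySpark sketch. The paths are hypothetical and the column names are borrowed from the TPC-DS store_sales table used later in this document; the behaviour shown is standard Apache Spark, which the BDE engine builds on.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skip-demo").getOrCreate()

# Write a table as Parquet, partitioned by the sale-date key (hypothetical paths).
sales = spark.read.parquet("/data/raw/store_sales")
sales.write.partitionBy("ss_sold_date_sk").parquet("/data/tables/store_sales")

# Read it back with two filters:
# - the filter on the partition column prunes whole directories (data partitioning),
# - the filter on a regular column is pushed down to the Parquet file/row-group level
#   (predicate pushdown),
# so only a fraction of the stored bytes travels from storage to compute.
df = (spark.read.parquet("/data/tables/store_sales")
        .filter("ss_sold_date_sk between 2450815 and 2451179")
        .filter("ss_item_sk = 1"))
df.explain()    # look for PartitionFilters / PushedFilters in the physical plan
print(df.count())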

Materials & Methods

Test environment

To get precise and accurate results, we tested on the famous public cloud, Amazon Web Services (AWS). The specification follows.

Compute nodes - AWS EC2 M5n series. We used several M5n sizes, including m5n.2xlarge, m5n.4xlarge, m5n.8xlarge, and m5n.16xlarge. Each node was mounted with 1 x 500 GB gp2 EBS volume for the system and local scratch path. The M5n sizes we used come with 25-75 Gbps network interfaces (depending on the M5n size).

Storage - we decided to use AWS FSx for Lustre. This storage lets us control throughput and scalability for our use case and yields performance comparable to enterprise on-premises storage.

Data - we used 2 groups of data: 1. TPC-DS 1 TB (scale factor 1024); TPC-DS is a widely used, common standard data set for big data query engines. 2. Our own generated data set of 500 million records.

Processing logic - we used 6 TPC-DS queries and 2 custom queries that represent most of the jobs we typically run on the data, including scanning (wide data scan; select and where statements), joins, aggregates, and complex joins. The statements follow.

Query 1: Wide scan

select * from store_sales where ss_item_sk = 1

Query 2: Full join

select count(*) from store_sales
join store_returns
on store_sales.ss_item_sk = store_returns.sr_item_sk
and store_sales.ss_ticket_number = store_returns.sr_ticket_number

Query 3: 2 Join

select
i_item_id,
ss_quantity,
ss_list_price,
ss_coupon_amt,
ss_sales_price,
ss_promo_sk,
ss_sold_date_sk
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and ss_sold_date_sk between 2450815 and 2451179

Query 4: 4 join

select
i_item_id,
ss_quantity,
ss_list_price,
ss_coupon_amt,
ss_sales_price
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and (p_channel_email = 'N'
or p_channel_event = 'N')
and d_year = 1998
and ss_sold_date_sk between 2450815 and 2451179

Query 5: 4 joins with aggregate function

select
i_item_id,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and (p_channel_email = 'N'
or p_channel_event = 'N')
and d_year = 1998
and ss_sold_date_sk between 2450815 and 2451179
group by
i_item_id

Query 6: 4 join with aggregate function with limit

select
i_item_id,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and (p_channel_email = 'N'
or p_channel_event = 'N')
and d_year = 1998
and ss_sold_date_sk between 2450815 and 2451179
group by
i_item_id
order by
i_item_id
limit 100

Query 7: wide scan 3 billion records (TPC-DS data, less push down)

select * from store_sales
where ss_customer_sk = 3895006

Query 8: wide scan 500 million records (mocked data)

select * from mocked_data
where anumber = '655209664351'

Execute the test

We want to answer two questions: what is the best way to scale the platform, horizontal (scale-out; many small nodes) or vertical (scale-up; a few big nodes), and what are the limits of each related factor (CPU, storage, network)? First, we need to set a baseline by testing a reference sizing. Here is the result.

Test 1: Set a baseline to find the best way to scale (horizontal vs vertical)
Testing environment
● Application node (Hera): m5n.2xlarge (8 cores, 32 GB) with 200 GB gp2 EBS
● HPC engine node (Zeus and Spark master): m5n.2xlarge (8 cores, 32 GB) with 200 GB gp2 EBS
● Worker node (executor): m5n.8xlarge (32 cores, 128 GB) with 500 GB gp2 EBS. Configured executors: 2 cores / 8 GB each, 12 instances. Total 24 vCPU and 96 GB of executors.
● Storage: FSx for Lustre, 2.4 TB provisioned, SSD type at 1,000 MB/s per TB. Total 2,400 MB/s throughput.
*Please note that the application and HPC engine nodes remain at this size for all test scenarios.
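
For reference, the executor layout above corresponds to a standard Apache Spark executor profile. The sketch below shows one way such a profile is typically expressed; the property names are standard Spark, but BDE may apply them through its own configuration rather than application code.

from pyspark.sql import SparkSession

# Sketch of the Test 1 executor profile: 12 executors x 2 cores x 8 GB = 24 vCores / 96 GB.
spark = (SparkSession.builder
         .appName("bde-sizing-test")
         .config("spark.executor.instances", "12")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "8g")
         .getOrCreate())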

Query | Time usage (s) | CPU consumption (avg. per node) | IOWait (avg. per node) | Outbound network bandwidth, nload (Mbps, avg + agg. all nodes) | Storage throughput, FSx (MB/s, aggregated) | Storage IOPS, FSx (Op/s, aggregated)
Query 1 | 138 | 40% | 22% | 8940 | 450 | 639
Query 2 | 660 | 50% | 30% | 3750 | 1220 | 2019
Query 3 | 102 | 70% | 10% | 6500 | 560 | 2000
Query 4 | 102 | 70% | 10% | 7000 | 825 | 2800
Query 5 | 150 | 70% | 40% | 7210 | 689 | 2300
Query 6 | 150 | 70% | 30% | 7150 | 817 | 2800
Query 7 | 144 | 20% | 24% | 8903 | 1190 | 2500
Query 8 | 11.2 | 40% | 20% | 145 | 1190 | 2500

Our first test produced some interesting points, such as high IOwait (20-40%) even though storage throughput had not yet reached its maximum, and network bandwidth had not reached its maximum capacity (25 Gbps) either. We therefore suspected the storage client performance (the client that mounts the network storage into the compute nodes, such as NFS or, in this case, the Lustre client), and decided to scale up the compute resources to test this hypothesis together with vertical vs horizontal scalability. The result follows.

Test 2: Prove the storage client bottleneck hypothesis with vertical expansion
● Worker node (executor): m5n.16xlarge (64 cores, 256 GB, 75 Gbps NIC) with 500 GB gp2 EBS. Configured executors: 2 cores / 8 GB each, 24 instances. Total 48 vCores and 192 GB of executors.
● Storage: FSx for Lustre, 2.4 TB provisioned, SSD type at 1,000 MB/s per TB. Total 2,400 MB/s throughput.

Query | Executor vCores | Time usage (s) | CPU consumption (avg. per node) | IOWait (avg. per node) | Outbound network bandwidth, nload (Mbps, avg + agg. all nodes) | Storage throughput, FSx (MB/s, aggregated) | Storage IOPS, FSx (Op/s, aggregated)
Query 1 | 24 | 138 | 40% | 22% | 8940 | 450 | 639
Query 1 | 48 | 270 | 15% | 40% | 4660 | 622 | 1056
Query 2 | 24 | 660 | 50% | 30% | 3750 | 1220 | 2019
Query 2 | 48 | 426 | 20% | 60% | 4660 | 600 | 2000
Query 3 | 24 | 102 | 70% | 10% | 6500 | 560 | 2000
Query 3 | 48 | 126 | 20% | 40% | 4670 | 609 | 2001
Query 4 | 24 | 102 | 70% | 10% | 7000 | 825 | 2800
Query 4 | 48 | 126 | 20% | 40% | 4670 | 622 | 2120
Query 5 | 24 | 150 | 70% | 40% | 7210 | 689 | 2300
Query 5 | 48 | 144 | 20% | 40% | 4670 | 622 | 2113
Query 6 | 24 | 150 | 70% | 30% | 7150 | 817 | 2800
Query 6 | 48 | 132 | 20% | 40% | 4670 | 613 | 2113
Query 7 | 24 | 144 | 20% | 24% | 8903 | 1190 | 2500
Query 7 | 48 | 270 | 15% | 40% | 4650 | 620 | 2000
Query 8 | 24 | 11.2 | 40% | 20% | 145 | 1190 | 2500
Query 8 | 48 | 41 | 15% | 50% | 4670 | 602 | 1100

As the scanning-type rows show, the 48-vCore configuration performs significantly slower on scanning queries (Query 1, 7, 8) than the previous, smaller 24-vCore configuration. It also did not deliver the near-linear speedup that the other query types should have shown (since all of them must scan and transfer some data from storage to the compute node). Drilling down into the results, we found that IOwait increased significantly from 20-30% to 40-50% even though storage throughput held steady.

Thus, our hypothesis remained the same: the storage client performance per node is the bottleneck in the first two scalability tests. So we changed the hardware environment from scale-up (one big node) to scale-out (several smaller nodes). The provisioned sizes and results follow.

Test 3: Prove the storage client bottleneck by scaling horizontally (scale-out)
- Worker nodes (executors): 2 x m5n.8xlarge (32 cores, 128 GB, 25 Gbps NIC) with 500 GB gp2 EBS each. Configured executors: 2 cores / 8 GB each, 24 instances. Total 48 vCores and 192 GB of executors.
- Storage: FSx for Lustre, 2.4 TB provisioned, SSD type at 1,000 MB/s per TB. Total 2,400 MB/s throughput.

“24+24” rows are results from this environment

Query | Executor vCores | Time usage (s) | CPU consumption (avg. per node) | IOWait (avg. per node) | Outbound network bandwidth, nload (Mbps, avg + agg. all nodes) | Storage throughput, FSx (MB/s, aggregated) | Storage IOPS, FSx (Op/s, aggregated)
Query 1 | 24 | 138 | 40% | 22% | 8940 | 450 | 639
Query 1 | 48 | 270 | 15% | 40% | 4660 | 622 | 1056
Query 1 | 24+24 | 90 | 40% | 20% | 13570 | 1540 | 2500
Query 2 | 24 | 660 | 50% | 30% | 3750 | 1220 | 2019
Query 2 | 48 | 426 | 20% | 60% | 4660 | 600 | 2000
Query 2 | 24+24 | 336 | 70% | 10% | 14000 | 1740 | 3000
Query 3 | 24 | 102 | 70% | 10% | 6500 | 560 | 2000
Query 3 | 48 | 126 | 20% | 40% | 4670 | 609 | 2001
Query 3 | 24+24 | 72 | 60% | 20% | 11110 | 890 | 3000
Query 4 | 24 | 102 | 70% | 10% | 7000 | 825 | 2800
Query 4 | 48 | 126 | 20% | 40% | 4670 | 622 | 2120
Query 4 | 24+24 | 78 | 60% | 20% | 11130 | 696 | 2440
Query 5 | 24 | 150 | 70% | 40% | 7210 | 689 | 2300
Query 5 | 48 | 144 | 20% | 40% | 4670 | 622 | 2113
Query 5 | 24+24 | 78 | 60% | 20% | 14910 | 388 | 1460
Query 6 | 24 | 150 | 70% | 30% | 7150 | 817 | 2800
Query 6 | 48 | 132 | 20% | 40% | 4670 | 613 | 2113
Query 6 | 24+24 | 78 | 60% | 20% | 16480 | 730 | 2640
Query 7 | 24 | 144 | 20% | 24% | 8903 | 1190 | 2500
Query 7 | 48 | 270 | 15% | 40% | 4650 | 620 | 2000
Query 7 | 24+24 | 84 | 40% | 20% | 13600 | 1100 | 2200
Query 8 | 24 | 11.2 | 40% | 20% | 145 | 1190 | 2500
Query 8 | 48 | 41 | 15% | 50% | 4670 | 602 | 1100
Query 8 | 24+24 | 11 | 60% | 12% | 14100 | 401 | 2200

The results from the new environment look promising, as all query types show near-linear scalability. This confirms our hypothesis about the storage client: we significantly increased performance just by changing the way we scale the compute nodes. The aggregated network bandwidth across all nodes also reached ~16.4 Gbps (1,6xx-2,0xx MB/s of throughput, around 8 Gbps per node), compared with roughly ~8 Gbps for the 24-vCore configuration.

However, since the storage maximum throughput is 2,400 MB/s, we had almost hit the threshold and could not scale the compute nodes beyond this setup. To prove that storage throughput has a direct impact on performance, we ran another test with one more compute node, as follows.

Test 4: Prove the storage throughput bottleneck
- Worker nodes (executors): 3 x m5n.8xlarge (32 cores, 128 GB, 25 Gbps NIC) with 500 GB gp2 EBS each. Configured executors: 2 cores / 8 GB each, 36 instances. Total 72 vCores and 288 GB of executors.
- Storage: FSx for Lustre, 2.4 TB provisioned, SSD type at 1,000 MB/s per TB. Total 2,400 MB/s throughput.

Query | Executor vCores | Time usage (s) | CPU consumption (avg. per node) | IOWait (avg. per node) | Outbound network bandwidth, nload (Mbps, avg + agg. all nodes) | Storage throughput, FSx (MB/s, aggregated) | Storage IOPS, FSx (Op/s, aggregated)
Query 1 | 24 | 138 | 40% | 22% | 8940 | 450 | 639
Query 1 | 24+24 | 90 | 40% | 20% | 13570 | 1540 | 2500
Query 1 | 24+24+24 | 90 | 25% | 40% | 18260 | 1780 | 3200
Query 2 | 24 | 660 | 50% | 30% | 3750 | 1220 | 2019
Query 2 | 24+24 | 336 | 70% | 10% | 14000 | 1740 | 3000
Query 2 | 24+24+24 | 336 | 20% | 45% | 31200 | 512 | 2660
Query 3 | 24 | 102 | 70% | 10% | 6500 | 560 | 2000
Query 3 | 24+24 | 72 | 60% | 20% | 11110 | 890 | 3000
Query 3 | 24+24+24 | 78 | 25% | 40% | 10500 | 827 | 2800
Query 4 | 24 | 102 | 70% | 10% | 7000 | 825 | 2800
Query 4 | 24+24 | 78 | 60% | 20% | 11130 | 696 | 2440
Query 4 | 24+24+24 | 72 | 25% | 35% | 8050 | 233 | 899
Query 5 | 24 | 150 | 70% | 40% | 7210 | 689 | 2300
Query 5 | 24+24 | 78 | 60% | 20% | 14910 | 388 | 1460
Query 5 | 24+24+24 | 72 | 25% | 35% | 12600 | 664 | 2600
Query 6 | 24 | 150 | 70% | 30% | 7150 | 817 | 2800
Query 6 | 24+24 | 78 | 60% | 20% | 16480 | 730 | 2640
Query 6 | 24+24+24 | 72 | 25% | 35% | 12600 | 664 | 2602
Query 7 | 24 | 144 | 20% | 24% | 8903 | 1190 | 2500
Query 7 | 24+24 | 84 | 40% | 20% | 13600 | 1100 | 2200
Query 7 | 24+24+24 | 78 | 17% | 40% | 15400 | 1550 | 2820
Query 8 | 24 | 11.2 | 40% | 20% | 145 | 1190 | 2500
Query 8 | 24+24 | 11 | 60% | 12% | 14100 | 401 | 2200
Query 8 | 24+24+24 | 12 | 25% | 35% | 18170 | 769 | 3800

As expected, 72 vCores (24+24+24) gave almost the same result as 48 vCores, and the IOwait percentage rose to around 40%.

So, to prove that storage throughput had reached its limit, we increased the FSx for Lustre throughput to 4,800 MB/s and re-ran the test with the same environment. The result follows.

Test 5: Increase storage throughput and re-test
- Worker nodes (executors): 3 x m5n.8xlarge (32 cores, 128 GB, 25 Gbps NIC) with 500 GB gp2 EBS each. Configured executors: 2 cores / 8 GB each, 36 instances. Total 72 vCores and 288 GB of executors.
- Storage: FSx for Lustre, 4.8 TB provisioned, SSD type at 1,000 MB/s per TB. Total 4,800 MB/s throughput.

Query | Executor vCores | Time usage (s) | CPU consumption (avg. per node) | IOWait (avg. per node) | Outbound network bandwidth, nload (Mbps, avg + agg. all nodes) | Storage throughput, FSx (MB/s, aggregated) | Storage IOPS, FSx (Op/s, aggregated)
Query 1 | 24 | 138 | 40% | 22% | 8940 | 450 | 639
Query 1 | 24+24 | 90 | 40% | 20% | 13570 | 1540 | 2500
Query 1 | 24+24+24 | 90 | 25% | 40% | 18260 | 1780 | 3200
Query 1 | 24+24+24 (2) | 52.3 | 30% | 30% | 24000 | N/A | N/A
Query 2 | 24 | 660 | 50% | 30% | 3750 | 1220 | 2019
Query 2 | 24+24 | 336 | 70% | 10% | 14000 | 1740 | 3000
Query 2 | 24+24+24 | 336 | 20% | 45% | 31200 | 512 | 2660
Query 2 | 24+24+24 (2) | 276 | 30% | 30% | 18000 | N/A | N/A
Query 3 | 24 | 102 | 70% | 10% | 6500 | 560 | 2000
Query 3 | 24+24 | 72 | 60% | 20% | 11110 | 890 | 3000
Query 3 | 24+24+24 | 78 | 25% | 40% | 10500 | 827 | 2800
Query 3 | 24+24+24 (2) | 44.2 | 40% | 20% | 18000 | N/A | N/A
Query 4 | 24 | 102 | 70% | 10% | 7000 | 825 | 2800
Query 4 | 24+24 | 78 | 60% | 20% | 11130 | 696 | 2440
Query 4 | 24+24+24 | 72 | 25% | 35% | 8050 | 233 | 899
Query 4 | 24+24+24 (2) | 46 | 55% | 15% | 18000 | N/A | N/A
Query 5 | 24 | 150 | 70% | 40% | 7210 | 689 | 2300
Query 5 | 24+24 | 78 | 60% | 20% | 14910 | 388 | 1460
Query 5 | 24+24+24 | 72 | 25% | 35% | 12600 | 664 | 2600
Query 5 | 24+24+24 (2) | 52 | 50% | 20% | 15000 | N/A | N/A
Query 6 | 24 | 150 | 70% | 30% | 7150 | 817 | 2800
Query 6 | 24+24 | 78 | 60% | 20% | 16480 | 730 | 2640
Query 6 | 24+24+24 | 72 | 25% | 35% | 12600 | 664 | 2602
Query 6 | 24+24+24 (2) | 47 | 50% | 20% | 15000 | N/A | N/A
Query 7 | 24 | 144 | 20% | 24% | 8903 | 1190 | 2500
Query 7 | 24+24 | 84 | 40% | 20% | 13600 | 1100 | 2200
Query 7 | 24+24+24 | 78 | 17% | 40% | 15400 | 1550 | 2820
Query 7 | 24+24+24 (2) | 55 | 30% | 30% | 24000 | N/A | N/A
Query 8 | 24 | 11.2 | 40% | 20% | 145 | 1190 | 2500
Query 8 | 24+24 | 11 | 60% | 12% | 14100 | 401 | 2200
Query 8 | 24+24+24 | 12 | 25% | 35% | 18170 | 769 | 3800
Query 8 | 24+24+24 (2) | 11 | 20% | 35% | 14600 | N/A | N/A

(2) = the same 72-vCore environment re-run against the 4,800 MB/s storage; FSx throughput and IOPS metrics were not available (N/A) for these runs.

As expected, processing times dropped below the 48-vCore results. Thus, we can conclude that storage throughput is one of the keys to performance. Please note that an issue with AWS FSx prevented us from seeing the throughput and IOPS metrics for this run. We also saw some minor issues where the IOwait percentage swung even within the same query and affected processing time; after we waited for quite a long time the problem disappeared, and we kept this result (without re-running the whole scenario).

Optimum configuration
After finding all the factors that contribute to performance, we need to find the optimum configuration, such as vCores per node and the storage and network throughput/bandwidth required. First, we tested storage throughput with the 'fio' tool to find the throughput per client. The command was 'fio --name=seqread --rw=read --direct=1 --ioengine=libaio --bs=1m --numjobs=100 --size=8m --runtime=120 --group_reporting'; the result follows.

Secondly, we ran multiple test scenarios to find the optimum executor configuration per node. The goal was to tune the configuration until each node's IOwait metric dropped to nearly 0%. We found that, for roughly 1,7xx MB/s of storage bandwidth per client, 6 executor vCores per node is the optimum configuration. To validate this calculation, we ran the same test scenarios again for each configuration to observe performance and scalability. The result follows.
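
As a hedged sketch of how that per-node IOwait target can be checked while tuning (reading /proc/stat directly on a Linux worker; tools such as iostat or sar report the same figure):

import time

def cpu_times():
    # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

def iowait_percent(interval_s: float = 5.0) -> float:
    """Average IOwait% on this node over the sampling interval."""
    a = cpu_times()
    time.sleep(interval_s)
    b = cpu_times()
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0   # index 4 = iowait

# Tuning goal from this test: keep this close to 0% while the queries are running.
print(f"iowait: {iowait_percent():.1f}%")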

Tests 6, 7, 8: Find the optimum configuration
- Worker nodes (executors): 2, 4, and 8 x m5n.2xlarge (8 cores, 32 GB, up to 25 Gbps NIC) with 500 GB gp2 EBS each. Configured executors: 2 cores / 8 GB each, with 6, 12, and 24 instances. Totals of 12, 24, and 48 vCores and 48, 96, and 192 GB of executors.
- Storage: FSx for Lustre, 4.8 TB provisioned, SSD type at 1,000 MB/s per TB. Total 4,800 MB/s throughput.

Query | Executor vCores (total) | Time usage (s) | CPU consumption (avg. per node) | IOWait (avg. per node) | Outbound network bandwidth, nload (Mbps, avg + agg. all nodes) | Storage throughput, FSx (MB/s, aggregated) | Storage IOPS, FSx (Op/s, aggregated)
Query 1 | 12 | 192 | 60% | 8% | 6600 | 1250 | 2000
Query 1 | 24 | 96 | 60% | 8% | 12000 | 550 | 3300
Query 1 | 48 | 51.9 | 60% | 8% | 25600 | 2500 | 4380
Query 2 | 12 | 990 | 70% | 8% | 6600 | 244 | 1091
Query 2 | 24 | 534 | 60% | 10% | 12000 | 406 | 1700
Query 2 | 48 | 354 | 70% | 10% | 11200 | 478 | 2200
Query 3 | 12 | 174 | 60% | 7% | 3400 | 544 | 2000
Query 3 | 24 | 96 | 60% | 12% | 6400 | 641 | 2460
Query 3 | 48 | 54 | 60% | 10% | 12800 | 544 | 2000
Query 4 | 12 | 174 | 60% | 7% | 3400 | 544 | 2000
Query 4 | 24 | 90 | 60% | 7% | 6600 | 835 | 3000
Query 4 | 48 | 53.6 | 60% | 10% | 13600 | 796 | 2600
Query 5 | 12 | 210 | 60% | 10% | 3400 | 456 | 1690
Query 5 | 24 | 114 | 70% | 10% | 6800 | 824 | 3000
Query 5 | 48 | 66 | 60% | 10% | 13600 | 478 | 1940
Query 6 | 12 | 210 | 60% | 10% | 3400 | 442 | 1570
Query 6 | 24 | 108 | 70% | 6% | 6800 | 870 | 2980
Query 6 | 48 | 58.2 | 60% | 10% | 13600 | 588 | 2180
Query 7 | 12 | 198 | 60% | 6% | 6000 | 857 | 2000
Query 7 | 24 | 102 | 60% | 6% | 12000 | 1000 | 2200
Query 7 | 48 | 52.7 | 60% | 8% | 24800 | 2250 | 3994
Query 8 | 12 | 26.8 | 45% | 12% | 7000 | 1200 | 3780
Query 8 | 24 | 13.6 | 55% | 7% | 12000 | 1000 | 2200
Query 8 | 48 | 7.6 | 60% | 8% | 27200 | 2400 | 3800

From the results, when each small node is configured to run only 6 executor cores, the outcome is great scalability and processing power. The IOwait metric also never crossed the 12% line, compared with the earlier 50% line. The system even performs equally or faster than the previous, bigger configurations, for example Q1, Q7, and Q8 at 48 vs 72 vCores.

Test Conclusion

Based on all 8 tests we performed and observed, we have grouped the questions into 4 topics, detailed below.

What is the best way to design compute nodes for decoupled/dis-aggregated compute and storage architecture, horizontal or vertical?

The horizontal design is the best architectural design and will give you almost ideal scalability. The problem with the vertical design is the bottleneck of the operating system and the storage client/gateway on each node. This storage bottleneck leads to slow performance on data scanning tasks (i.e., when a compute node pulls files from storage) and affects overall performance accordingly. As the tests show, one of the major limits on performance and scalability is IOwait, the time the CPU on each node must wait to receive the data it requested. Thus, we should leverage the high-concurrency capability of the storage by provisioning multiple compute nodes and mounting them to the same storage, so that each node simultaneously scans a smaller portion of the data, leading to higher performance.

How much vCPU/Core is needed per compute node?

The answer depends on your storage's client performance. Previously, when we deployed Blendata - enterprise on a traditional Hadoop architecture consisting of HDFS (Hadoop Distributed File System) and YARN (Yet Another Resource Negotiator; resource management), storage bandwidth was not a problem: compute and storage were tightly coupled, and the mechanism let the CPU read data directly from each local disk, resulting in high data locality with no concern about storage bandwidth/throughput. Now that compute is decoupled from storage, the storage must be mounted to the compute nodes through some storage protocol, such as HDFS (as a protocol only), NFS, S3a, or, in this case, the Lustre client. The client therefore acts like a single point of storage access on each node that requests must queue through. To calculate the ideal number of executor cores per node, divide the storage bandwidth per client by 300 MB/s (or 250 MB/s at minimum). This rule comes from all the tests, especially tests 6, 7, and 8, where we reduced IOwait below the 10% threshold and then calculated the number of CPU cores per unit of storage bandwidth. So, based on this test, with around 1,800 MB/s of storage bandwidth per client measured via fio, 6-8 executor cores per node are recommended.

How much storage throughput is needed?

Storage throughput is mostly consumed during the scanning stage (pulling data from storage). The higher the throughput, the faster and bigger the pool of executor cores that can scan in parallel. Look at queries 1, 7, and 8, which are all mostly scanning tasks: they always consume the most throughput/bandwidth compared to the other queries. (As you may have noticed across all tests, AWS FSx did not provide highly accurate throughput numbers; we recommend dividing the network bandwidth column by 8-10 to convert it to MB/s instead.) To keep it simple, based on our test results, each executor core bursts around 40-70 MB/s of throughput. You can use this guideline to estimate the required storage throughput (for example, 48 executor cores need roughly 48 x 60 MB/s = 2,880 MB/s of storage throughput). Please note that if the storage throughput is lower than this guideline, nothing breaks, but it will slow down the performance of the compute cluster (higher IOWait%).
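
A small sketch of that bandwidth-to-throughput conversion, using one of the figures from the tables above; the 8-10 divisor covers the bits-to-bytes conversion plus protocol overhead.

# Convert an observed nload figure (Mbps) into approximate storage throughput (MB/s),
# then into throughput per executor core, to compare against the 40-70 MB/s guideline.
def mbps_to_mb_s(mbps: float, divisor: float = 9.0) -> float:
    # divide by 8-10: 8 for bits -> bytes, a little more for protocol overhead
    return mbps / divisor

agg_mbps = 25600        # Query 1 at 48 vCores in tests 6-8 (aggregated nload figure)
total_cores = 48
mb_s = mbps_to_mb_s(agg_mbps)
print(f"~{mb_s:.0f} MB/s aggregated, ~{mb_s / total_cores:.0f} MB/s per executor core")
# ~2,844 MB/s aggregated, ~59 MB/s per core: inside the 40-70 MB/s guideline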

How much data is transferred between compute and storage?

Again, it depends on the data characteristics. Blendata - enterprise v3.5 bundles many advanced data management and data skipping techniques, such as data partitioning, predicate pushdown at the file level, the Parquet dictionary (data statistics; ranges of values), cost-based optimization, and dynamic partition pruning. All of these aim at one goal: to reduce the data transferred from storage to compute during the scanning stage as much as possible. However, how much data can be skipped depends directly on the characteristics of the data. For example, if the data has low cardinality (few unique values) over a large volume, not much can be skipped; if the data has high cardinality, the opposite is true. Based on our tests, the transferred data can be reduced by a factor of roughly 0.1x to 20x compared to the stored data.

Epilogue

With the rise of the decoupled/dis-aggregated compute & storage architecture for big data platforms, there are many challenges and questions, from technical to executive perspectives, about transitioning to or adopting a big data platform with this modernized architecture. We hope our information and guidelines help you design such a platform and reduce the effort needed to work through the technical challenges on Blendata's technologies and platforms.

Lastly, as a big data technology company, we believe there is a lot of room for improvement in big data and its related fields. The best architecture and technologies today may not be the ones that stand at the top tomorrow. We promise to keep providing you with the best technologies as a component that contributes to your success, turns your company into a data-driven organization, and helps you reach your ultimate goals.

We believe that there are always hidden opportunities within your data.

Blendata
