BDEv3.5 Perf. Benchmark and Architect Design
Executive summary
The test was run on the Amazon Web Services (AWS) public cloud so that the results are
based on a standardized environment. The environment comprised several types of AWS EC2
instances for compute and AWS FSx for Lustre for storage. The test data was based on
the industry-standard TPC-DS benchmark, using a subset of TPC-DS queries that represent
real-world scenarios, including wide scan, aggregate, and join.
Please note that, for accurate results, we recommend testing in your own environment to
find the best configuration and design, as many factors may affect your hypothesis and
results. This document nevertheless tries to provide the best design guideline for the
environment stated above. The results are as follows.
The horizontal (scale-out) design is the best architectural design and gives almost ideal
scalability. The problem with the vertical design is the bottleneck in the operating system
and the storage client/gateway on each node. This storage bottleneck slows down data
scanning tasks (that is, when a compute node pulls files from storage) and consequently
affects overall performance. We should therefore leverage the high-concurrency
capabilities of the storage by provisioning multiple compute nodes and mounting them to
the same storage. Each node then scans a smaller portion of the data simultaneously,
leading to higher performance.
Storage throughput is mostly required during the scanning stage (pulling data from
storage). The higher the throughput, the larger the pool of executor cores that can scan
in parallel, and the faster the scan. To keep it simple, our recommendation based on the
test results is that each executor core bursts to around 40-70 MB/s of throughput. You can
use this guideline to estimate the storage throughput needed. For example, if you want 48
executor cores, you need 48 x 60 MB/s = 2,880 MB/s of storage throughput. Please note
that if storage throughput is lower than the guideline, the system will not break, but the
compute cluster will slow down (higher IOwait%).
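The sizing rule above can be sketched as a small calculation. This is a minimal illustration rather than a Blendata tool: the function name `required_storage_throughput_mb_s` and its parameters are assumptions introduced here, while the 40-70 MB/s band and the 60 MB/s midpoint come from the guideline above.

```python
def required_storage_throughput_mb_s(executor_cores: int, mb_s_per_core: int = 60) -> int:
    """Estimate the storage throughput (MB/s) needed for a pool of executor cores.

    mb_s_per_core is the assumed burst per executor core; the guideline
    above suggests a value between 40 and 70 MB/s.
    """
    if executor_cores <= 0 or mb_s_per_core <= 0:
        raise ValueError("executor_cores and mb_s_per_core must be positive")
    return executor_cores * mb_s_per_core


# The worked example from the text: 48 executor cores at 60 MB/s each.
typical = required_storage_throughput_mb_s(48, 60)

# The same pool at both ends of the 40-70 MB/s band.
lower_bound = required_storage_throughput_mb_s(48, 40)
upper_bound = required_storage_throughput_mb_s(48, 70)

print(f"typical: {typical} MB/s, range: {lower_bound}-{upper_bound} MB/s")
```

Sizing against the upper end of the band leaves headroom for scan-heavy queries; sizing against the lower end risks the higher IOwait described above.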
Again, it depends on the data characteristics. Blendata enterprise v3.5 bundles many
advanced data management and data skipping techniques, such as data partitioning,
file-level predicate pushdown, Parquet dictionary (data statistics; a range of data),
cost-based optimization, and dynamic partition pruning. All of these aim at one goal: to
reduce data transfer from storage to compute during the scanning stage as much as
possible. However, the amount of data that can be skipped is directly related to the
characteristics of each data set. For example, if a large amount of data has low
cardinality (few unique values), not much of it can be skipped; if the data has high
cardinality, the opposite holds. Based on our test, the data transferred was reduced by a
factor of 0.1x to 20x compared to the stored data.
We hope that this guideline will help you craft your own or your partner's big data
platform in an optimal way. Any request or suggestion on this matter can be sent by email
to support@blendata.co or raised directly with your sales representative.
Blendata enterprise v.3.5.x decoupled/dis-aggregated
compute & storage architectural design.
Introduction
Objectives
1. To find the best architecture for dis-aggregated compute & storage.
2. To identify the limitations and factors that determine the amount of compute and
storage resources required.
Background
5-10 years ago, big data platform adoption increased significantly due to the rise in data
volume, variety, and velocity, and to the game-changing open-source technology “Hadoop”,
which made the cost of ownership more reasonable. Hadoop consists of 2 main components:
the distributed file system layer, called Hadoop Distributed File System (HDFS), and the
processing layer, called MapReduce (later developed into YARN, the bundled pack of
processing API and resource management). Besides leveraging commodity hardware by
integrating it into a single cluster (resulting in lower cost compared to other big data
platform vendors that needed proprietary, expensive hardware), the key concept of HDFS
is to keep data as close as possible to the compute power. Data is split into chunks,
called blocks, and these blocks are spread across many nodes in the cluster. The CPU in
each node can read the data directly from the disk in the same node and process it
simultaneously with the other nodes working on the same job. The result is a
high-performance parallel computing platform with high and (ideally) near-infinite
throughput and scalability. The following diagram shows a high-level Hadoop architecture.
Many organizations adopted this technology and enabled many successful data
use cases. Over time, however, some challenges emerged with the adoption of
this technology. We will not address in detail the vendor lock-in issue and the license
model adjustments of the past 2-3 years that happened due to some merged
This methodology is actually not new. You can find similar concepts in legacy
massively parallel processing (MPP) systems, in HPC clusters in scientific and
mathematical fields, and even in legacy computing applications based on servers and
network-attached storage (NAS). However, the main reason this architecture was not the
default standard in the previous decade is technology limitations alone. The following
diagram shows the high-level architecture.
As you can see from the picture above, the main channel between compute and storage is
the network that bridges the resources together. This leads to a bottleneck when
transferring all data into the compute nodes. Even so, although this architecture does
not look like a fit for the big data era, its benefits outweigh the drawbacks thanks to
its flexibility. Users can scale only storage if they need to store more data, or scale
only compute if they have more processing tasks or jobs. This kind of scaling is not
possible in coupled, Hadoop-based architectures, which can only be scaled node by node.
From our observation, this architecture became famous and widely adopted with the arrival
of a game-changing technology: cloud computing. Users can store as much big data as they
like on serverless blob storage such as AWS S3, Azure Blob Storage, and Google Cloud
Storage, while spawning only the compute that their jobs require, such as plain AWS EC2
instances or a query service such as Athena. This architecture is now feasible for big
data storage and processing for 2 major reasons:
1. Storage nowadays is fast, its throughput is high enough, and it can scale
dramatically behind the scenes (some offerings also provide “burst” functions).
2. Query/processing engine optimization techniques skip unnecessary data, so the engine
does not always need to transfer all data into the compute nodes.
These 2 factors enable decoupled architecture to handle big data storage and processing
tasks with high performance.
Our company's simplified big data platform is no different in terms of architecture.
Previously, we usually delivered our platform with a coupled/aggregated compute & storage
architecture based on HDFS and YARN, the open-source technologies that many big data
solution providers have adopted as the framework for storing large amounts of data.
However, with the arrival of decoupled architecture, our engine optimizations that keep
improving performance, and the many enterprise storage vendors already available in the
market, we decided to provide our recent customers with decoupled architecture, and the
feedback has been impressive. Some customers have already saved up to 300% of TCO
compared to traditional coupled architecture just by switching to this new architecture.
One of the reasons we can provide a big data platform with decoupled architecture
is our core engine: an optimized Apache Spark®, the well-known open-source in-memory big
data processing library that we specialize in and have modified extensively at its
core. Here are some features (both provided by native Spark and enhanced by Blendata)
that directly benefit decoupled compute & storage architecture.
These techniques are some of the enhancements that give us higher performance
than native Apache Spark (around 1.5x). Because the engine still builds on the standard
Apache Spark framework, it retains the ability to scale and to integrate with
many technologies in the market in standard Spark ways.
Test environment
To get reliable and accurate results, we tested on a well-known public cloud, Amazon
Web Services (AWS). The specification is as follows.
Compute nodes - AWS EC2 M5n series. We used several M5n sizes: M5n.2xlarge,
M5n.4xlarge, M5n.8xlarge, and M5n.16xlarge. Each node is mounted with 1 x 500GB gp2
EBS volume for the system and the local scratch path. The M5n instances we used come with
a network interface of up to 25 Gbps to 75 Gbps (depending on the M5n size).
Storage - We decided to use AWS FSx for Lustre. It lets us control throughput
and scalability for our use case, and yields performance comparable to
enterprise on-premises storage.
Data - We used 2 groups of data: 1. TPC-DS 1TB (1024 scale factor); TPC-DS is a widely
used, common standard data set for big data query engines. 2. Our own generated data set
of 500M records.
Processing logic - We used 6 TPC-DS queries and 2 custom queries that
represent most of the jobs we typically run to process data, including scanning (wide
scan; select and where statements), join, aggregate, and complex join. The statements are
as follows.
Query 3: 2 Join
select
i_item_id,
ss_quantity,
ss_list_price,
ss_coupon_amt,
ss_sales_price,
ss_promo_sk,
ss_sold_date_sk
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and ss_sold_date_sk between 2450815 and 2451179
Query 4: 4 Join
select
i_item_id,
ss_quantity,
ss_list_price,
ss_coupon_amt,
ss_sales_price
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and (p_channel_email = 'N'
or p_channel_event = 'N')
and d_year = 1998
and ss_sold_date_sk between 2450815 and 2451179
select
i_item_id,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and (p_channel_email = 'N'
or p_channel_event = 'N')
and d_year = 1998
and ss_sold_date_sk between 2450815 and 2451179
group by
i_item_id
select
i_item_id,
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4
from
store_sales
join customer_demographics on (store_sales.ss_cdemo_sk =
customer_demographics.cd_demo_sk)
join item on (store_sales.ss_item_sk = item.i_item_sk)
join promotion on (store_sales.ss_promo_sk = promotion.p_promo_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where
cd_gender = 'F'
and cd_marital_status = 'W'
and cd_education_status = 'Primary'
and (p_channel_email = 'N'
or p_channel_event = 'N')
Query 7: wide scan 3 billion records (TPC-DS data, less push down)
The tests aim to answer 2 questions. First, what is the best way to scale the platform:
horizontally (scale-out; many small nodes) or vertically (scale-up; a few big nodes)?
Second, what is the limitation of each related factor (CPU, storage, network)? We
therefore first set a baseline by testing on several sizings. The results are as follows.
Test 1: Set a baseline to find the best way to scale (horizontal vs vertical)
Testing environment
● Application node (Hera): M5n.2xlarge (8 cores, 32GB) with 200GB gp2 EBS
● HPC engine node (Zeus and Spark master): M5n.2xlarge (8 cores, 32GB) with 200GB
gp2 EBS
● Worker node (executor): M5n.8xlarge (32 cores, 128GB) with 500GB gp2 EBS.
Configured executors: 2 cores and 8GB each, 12 instances, for a total of 24 vCPU and
96GB.
● Storage: FSx for Lustre, 2.4TB provisioned, SSD type, 1,000 MB/s per 1TB, for a total
of 2,400 MB/s throughput
*Please note that the application and HPC engine nodes remain at this size for all
test scenarios.
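The totals in the Test 1 environment follow from simple arithmetic, sketched below. The property names `spark.executor.cores`, `spark.executor.memory`, and `spark.executor.instances` are standard Apache Spark settings; how Blendata actually applies them is not stated in this document, so the helper functions here are hypothetical. The storage calculation assumes the per-terabyte figure is in MB/s, which is what makes the stated 2,400 MB/s total work out.

```python
def executor_conf(cores: int, memory_gb: int, instances: int) -> dict:
    """Build standard Spark executor properties for a given executor shape."""
    return {
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{memory_gb}g",
        "spark.executor.instances": str(instances),
    }


def executor_totals(cores: int, memory_gb: int, instances: int) -> tuple:
    """Return (total vCPU, total memory in GB) used by all executors."""
    return cores * instances, memory_gb * instances


# Executor shape from Test 1: 2 cores and 8GB each, 12 instances.
conf = executor_conf(2, 8, 12)
total_vcpu, total_memory_gb = executor_totals(2, 8, 12)

# Worker node from Test 1: M5n.8xlarge with 32 cores and 128GB.
# Whatever the executors do not use is left for the OS and storage client.
spare_vcpu = 32 - total_vcpu
spare_memory_gb = 128 - total_memory_gb

# Storage from Test 1: 2,400GB provisioned at an assumed 1,000 MB/s per 1,000GB.
storage_throughput_mb_s = 2400 * 1000 // 1000

print(conf)
print(total_vcpu, total_memory_gb, spare_vcpu, spare_memory_gb, storage_throughput_mb_s)
```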
Our first test produced some interesting observations, such as high IOwait (20-40%) even
though neither storage throughput nor network bandwidth had reached its maximum capacity
(25 Gbps for the network). We therefore suspected the performance of the storage client
(the client that mounts network storage on the compute nodes, such as an NFS client or,
in this case, the Lustre client). We decided to scale up the compute resources to test
this hypothesis and to compare vertical and horizontal scalability. The results are as
follows.
As you can see from the cells highlighted in orange, the larger configuration performs
significantly slower on scanning-type queries (such as Queries 1, 7, and 8) than the
previous, smaller 24 vCores configuration. It also did not deliver the near-linear
speedup that it should have for the other types of queries (all of which need to scan and
transfer some data from storage to the compute nodes). When we drilled down into the
results, we found that IOwait increased significantly, from 20%-30% to 40%-50%, even
though storage throughput held steady.
Thus, our hypothesis remains the same: storage client performance per node is
the bottleneck in the first two scalability tests. We therefore changed the hardware
environment from scale-up (one big node) to scale-out (many smaller nodes). The
provisioned size and results are as follows.
The results from the new environment look promising, as all types of queries show
near-linear scalability. This supports our hypothesis about the storage client, since we
can significantly increase performance just by changing the way we scale compute
nodes. The aggregated bandwidth from all nodes also reached ~16.4 Gbps (1,6xx-2,0xx
MB/s throughput, around 8 Gbps per node), compared to ~8 Gbps for the 24 vCores
configuration.
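One way to put a number on "near-linear scalability" is to compare the measured speedup with the increase in cores. The sketch below is a generic calculation; the function names are assumptions introduced here, and the timings of 100 s and 52 s are made-up placeholders, not results from this test.

```python
def speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """How many times faster the scaled run is than the baseline run."""
    return baseline_seconds / scaled_seconds


def scaling_efficiency(baseline_seconds: float, scaled_seconds: float,
                       baseline_cores: int, scaled_cores: int) -> float:
    """Measured speedup divided by the ideal (linear) speedup; 1.0 is perfectly linear."""
    ideal_speedup = scaled_cores / baseline_cores
    return speedup(baseline_seconds, scaled_seconds) / ideal_speedup


# Hypothetical timings: a query taking 100 s on 24 vCores and 52 s on 48 vCores.
example_speedup = speedup(100.0, 52.0)
example_efficiency = scaling_efficiency(100.0, 52.0, 24, 48)

print(f"speedup: {example_speedup:.2f}x, efficiency: {example_efficiency:.0%}")
```

An efficiency close to 1.0 corresponds to the near-linear behaviour described above, while a value well below 1.0 (or a speedup below 1.0) corresponds to the scale-up result, where adding cores to a single node made scanning queries slower.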
However, the maximum storage throughput is 2,400 MB/s, which means we have almost hit
the ceiling and cannot scale compute nodes beyond this setup. To show
that storage throughput has a direct impact on performance, we performed another test
by adding another compute node, as follows.
Then, to confirm that storage throughput had reached its limit, we increased the FSx for
Lustre throughput to 4,800 MB/s and re-ran the test in the same environment. The results
are as follows.
As expected, processing time decreased compared with the 48 vCores run. We can therefore
conclude that storage throughput is one of the key factors in performance. Please note
that, due to some issues with AWS FSx, we could not see throughput and IOPS metrics. We
also saw some minor issues with the IOwait percentage, which fluctuated even within the
same query and affected processing time. However, after we waited for quite a long time,
the problems disappeared, so we are confident in this conclusion (although we did not
re-run the whole scenario).
Optimum configuration
Having identified the factors that contribute to performance, we need to find the
optimum configuration, such as vCores per node and the storage and network
throughput/bandwidth needed. First, we tested storage throughput with the ‘fio’ tool
to find the throughput per client. The command is ‘fio --name=seqread
--rw=read --direct=1 --ioengine=libaio --bs=1m --numjobs=100 --size=8m --runtime=120
--group_reporting’. The results are as follows.
Secondly, we ran multiple test scenarios to find the optimum executor configuration for
each node. The goal was to tune the configuration until each node's IOwait dropped to
nearly 0%. We found that, with a storage bandwidth of 17xx, 6 executor vCores per node is
the optimum configuration. To verify this, we ran the same test scenario again for each
configuration to observe performance and scalability. The results are as follows.
From the results, when each small node is configured to run only 6 executor cores, the
outcome shows great scalability and processing power. IOwait also stayed below 12%,
compared to the previous 50% level. The system also performs as fast as, or even faster
than, the previous larger configurations, for example on Q1, Q7, and Q8 at 48 vs 72
vCores.
Test Conclusion
Based on all 8 tests that we performed and observed, we group the findings into 4
topics. The details are as follows.
The horizontal (scale-out) design is the best architectural design and gives almost ideal
scalability. The problem with the vertical design is the bottleneck in the operating system
and the storage client/gateway on each node. This storage bottleneck slows down data
scanning tasks (that is, when a compute node pulls files from storage) and consequently
affects overall performance. As the tests show, one of the major obstacles to performance
and scalability is IOwait, the time the CPU on each node must wait to receive the data it
requested. We should therefore leverage the high-concurrency capabilities of the storage
by provisioning multiple compute nodes and mounting them to the same storage. Each node
then scans a smaller portion of the data simultaneously, leading to higher performance.
Storage throughput is mostly required during the scanning stage (pulling data from
storage). The higher the throughput, the larger the pool of executor cores that can scan
in parallel, and the faster the scan. You can see this in Queries 1, 7, and 8, which are
all major scanning tasks: they always consume the most throughput/bandwidth compared to
the other queries. (AWS FSx did not provide highly accurate throughput numbers, as you may
have noticed in all tests. We recommend dividing the value in the network bandwidth column
by 8-10 to convert it to MB/s instead.) To keep it simple, our recommendation based on the
test results is that each executor core bursts to around 40-70 MB/s of throughput. You can
use this guideline to estimate the storage throughput needed. For example, if you want 48
executor cores, you need 48 x 60 MB/s = 2,880 MB/s of storage throughput. Please note
that if storage throughput is lower than the guideline, the system will not break, but the
compute cluster will slow down (higher IOwait%).
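The bandwidth-to-throughput conversion mentioned above can be sketched as follows, using the ~16.4 Gbps aggregate observed in the scale-out test. The function name is an assumption introduced here. Dividing by 8 is the plain bits-to-bytes conversion; the document does not explain the divisor of 10, and this sketch simply treats it as a more conservative estimate.

```python
def gbps_to_mb_s_range(gbps: float) -> tuple:
    """Convert network bandwidth in Gbps to an estimated MB/s range.

    Returns (conservative, optimistic): the value divided by 10 and by 8.
    """
    megabits_per_second = gbps * 1000
    return megabits_per_second / 10, megabits_per_second / 8


# Aggregate bandwidth from the scale-out test: ~16.4 Gbps.
conservative_mb_s, optimistic_mb_s = gbps_to_mb_s_range(16.4)

# Share of the 2,400 MB/s provisioned storage throughput that this represents.
utilization_low = conservative_mb_s / 2400
utilization_high = optimistic_mb_s / 2400

print(f"{conservative_mb_s:.0f}-{optimistic_mb_s:.0f} MB/s, "
      f"{utilization_low:.0%}-{utilization_high:.0%} of provisioned throughput")
```

The resulting range of roughly 1,640 to 2,050 MB/s is consistent with the 1,6xx-2,0xx MB/s reported for that test, and shows how close the setup was to the 2,400 MB/s storage limit.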
Again, it depends on the data characteristics. Blendata enterprise v3.5 bundles many
advanced data management and data skipping techniques, such as data partitioning,
file-level predicate pushdown, Parquet dictionary (data statistics; a range of data),
cost-based optimization, and dynamic partition pruning. All of these aim at one goal: to
reduce data transfer from storage to compute during the scanning stage as much as
possible. However, the amount of data that can be skipped is directly related to the
characteristics of each data set. For example, if a large amount of data has low
cardinality (few unique values), not much of it can be skipped; if the data has high
cardinality, the opposite holds. Based on our test, the data transferred was reduced by a
factor of 0.1x to 20x compared to the stored data.
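As a rough illustration of how file-level data skipping reduces transfer, the sketch below prunes files using per-file minimum and maximum statistics. It is a generic pure-Python sketch of min/max pruning, not Blendata's or Spark's implementation; the `FileStats` type, file names, sizes, and value ranges are all hypothetical. Only the predicate range is taken from the `ss_sold_date_sk` filter used in the test queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column (for example ss_sold_date_sk)."""
    path: str
    min_value: int
    max_value: int
    size_mb: int


def files_to_scan(files, lower, upper):
    """Keep only files whose [min, max] range can overlap the predicate range."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Hypothetical files, each 512 MB, sorted by the filtered column.
files = [
    FileStats("part-0.parquet", 2450000, 2450499, 512),
    FileStats("part-1.parquet", 2450500, 2450999, 512),
    FileStats("part-2.parquet", 2451000, 2451499, 512),
    FileStats("part-3.parquet", 2451500, 2451999, 512),
]

# Predicate from the test queries: ss_sold_date_sk between 2450815 and 2451179.
selected = files_to_scan(files, 2450815, 2451179)

stored_mb = sum(f.size_mb for f in files)
scanned_mb = sum(f.size_mb for f in selected)
reduction_factor = stored_mb / scanned_mb

print([f.path for f in selected], f"{reduction_factor:.1f}x less data transferred")
```

How much can be skipped depends on how well the value ranges of the files separate, which is why the achievable reduction varies so much with the data characteristics.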
Epilogue
With the rise of decoupled/dis-aggregated compute & storage architecture for big data
platforms, many challenges and questions arise, from the technical to the executive
level, when transitioning to or adopting a big data platform with this modernized
architecture. We hope our information and guidelines help you design your platform and
overcome the technical challenges on Blendata's technologies and platforms with less
effort.
Lastly, as a big data technology company, we believe that there is a lot of room for
improvement in big data and its related fields. The best architecture and technologies
today may not be the ones that stand at the top tomorrow. We promise to keep providing you
with the best technologies, so that they contribute to your success, help turn your
company into a data-driven organization, and help you reach your ultimate goals.
We believe that there are always hidden opportunities within your data.
Blendata