Benchmarking Criteria for a Cloud Data Warehouse
Citation: Dashti, S. E., & Abdulzahra, A. R. (2023). Benchmarking Criteria for a Cloud Data Warehouse. Adv Bioeng Biomed Sci Res, 6(9), 148-158.
Abstract
The TeraSort benchmark and the YCSB benchmark are the two most widely used cloud computing benchmarks. Although these benchmarks are quite helpful, they were not designed with data warehouse systems and the associated OLAP technologies in mind. In this paper, we first introduce cloud computing and data warehouse systems. We then argue that the TPC-H benchmark, the best-known benchmark for decision support systems, conflicts both with the rationale for cloud computing (its scalability, elasticity, pay-per-use, and fault-tolerance characteristics) and with customer relationship management (end-user satisfaction, Quality of Service features). Finally, we propose updated specifications for a benchmark for cloud data warehouse systems. The proposed specifications should make it possible to fairly compare the products offered by various cloud system providers.
Keywords: Data Warehouse, OLAP, Cloud, TPC-H, Benchmark.
1. Introduction
By creating quantitative methods for a business to make optimal judgments and execute business knowledge discovery, business intelligence strives to enable better decision-making. Business intelligence frequently draws on data from data warehouse systems to provide historical, real-time, and forecast perspectives of corporate activities. However, data warehousing is quite expensive because it calls for specialists, sophisticated hardware, and cutting-edge software. Terabytes of data are being wasted by certain firms because they have insufficient human, software, and hardware resources for data analytics. With pay-as-you-go cloud computing now available, decision support systems have more potential than ever before.

Many research organizations, such as Forrester and Gartner, predict significant investment in cloud technology in the near future due to the rising cloud computing business. In fact, according to Forrester Research and the Gartner Group, the global market for cloud computing was predicted to reach US$241 billion by 2020 and US$150.1 billion in 2013, respectively [1,2]. Additionally, the market for business intelligence is still expanding, and information analysts are equipped with OLAP principles and related technology (Microsoft Analysis Services, Oracle Business Intelligence, Pentaho BI suite, SAP NetWeaver). Business intelligence platforms, analytical applications, and performance management software saw global revenue of US$12.2 billion in 2011, according to the most recent enterprise software report from Gartner.

In the broader global enterprise software industry, this is a 16.4% increase over 2010 revenue of US$10.5 billion, placing it as the year's second-fastest growing category. According to Gartner, the market for BI systems will continue to be one of the most rapidly expanding software sectors in most areas (refer to [3] for details). However, working with Big Data has its challenges; Ralph Kimball has described big data as a paradigm shift.

When considering data assets, we should consider where to get them, how to analyze them, and how to profit from the analysis's findings.

As a result, one of the main drivers of the growth of big data is financial, and decision support systems must address the four V-dimensions of big data: (i) volume, the challenge of managing large amounts of data; (ii) velocity, the challenge of how quickly data is analyzed; (iii) variety, the challenge of handling unstructured, semi-structured, and relational data; and (iv) veracity, the challenge of semantics and variability of meaning in language.

Recently, cloud computing has become quite popular, and many businesses now provide a range of public cloud computing services based on NoSQL, extended RDBMS, and standard relational DBMS technology. The cost to operate, maintain, and improve traditional software technologies can be fairly high. To handle big data analytics, two architectures have emerged: the extended RDBMS and the NoSQL technologies (the Apache Hadoop/MapReduce framework). Columnar storage and massively parallel processing (MPP) are architectural advancements for the extended RDBMS.
3. Cloud Computing
Cloud computing, according to the National Institute of Standards and Technology (NIST), is a pay-per-use model that enables easy, on-demand network access to a pool of configurable computing resources (such as networks, servers, storage, applications, and services) that can be quickly provisioned and released with little management work or service provider involvement [21]. We now review the five cloud characteristics, the three cloud service models, and the pricing strategies of cloud service providers (CSPs).

A) Cloud Characteristics
The cloud model is made up of three virtualized system characteristics: (1) broad network access – cloud computing is network based, and accessible from anywhere and from any standardized platform (i.e., desktop computers, mobile devices, ...); (2) resource pooling – the multi-tenancy aspect of clouds requires that multiple customers with disparate requirements be served by a single hardware infrastructure, and therefore virtualized resources (CPUs, memory, ...); (3) rapid elasticity – when an application's load increases (scale-up), it is anticipated that the additional resources can be (a) provisioned, possibly automatically, in a matter of minutes, and (b) released when load decreases (scale-down).

The cloud model includes two characteristics of on-demand computing services in addition to the aforementioned ones: (4) on-demand self-service – customers of cloud computing services anticipate on-demand, practically immediate access to resources; (5) measured service (a.k.a. pay as you go) – cloud services must be priced on a short-term basis (for example, by the hour), allowing customers to release resources as soon as they are no longer required. The different types of service should be metered accordingly (e.g., storage, processing, and bandwidth).

B) Cloud Service Models
Cloud services deliver Internet-based software, infrastructure, and storage, either as individual parts or as a whole platform. There are three main types of cloud service models. The first is Infrastructure as a Service (IaaS), which refers to the delivery of computer hardware (servers, networks, and storage) as a service by an IaaS provider; it may also include providing operating systems and virtualization technology to manage the resources. IaaS CSPs include GoGRID and Amazon Elastic Compute Cloud (EC2). The second is Platform as a Service (PaaS), which provides infrastructure together with a comprehensive collection of software, giving developers everything they need to create applications. Microsoft Azure Platform and Google AppEngine are two examples of PaaS CSPs. The third is Software as a Service (SaaS), in which a cloud service provider (CSP) offers commercial applications as a service. Google BigQuery and Amazon Relational Database Service are two examples of SaaS providers for data analytics.

C) Pricing Strategies
1) Compute cost: There are two ways that consumers can be charged for CPU costs. Instance-based: the customer is charged according to the number of instances allotted and the length of time each instance is used, whether or not the instances are fully utilized. Examples of CSPs that use this CPU pricing model are Windows Azure and Amazon AWS.

CPU cycle-based: the CSP bills the client according to how many CPU cycles the client's application uses. CloudSites and Google AppEngine are two CSPs that use this CPU pricing approach.

2) Storage cost: Every storage transaction requires CPU cycles, and there are two different ways that consumers are charged for storage. CPU cycle-based: a client is charged according to the number of CPU cycles needed to complete each storage operation; as a result, a complex operation may cost more than a simple one. Examples of CSPs that use this pricing model are Google AppEngine, Amazon SimpleDB, and CloudSites.

Number of operations: regardless of how complex each operation is, the CSP bases its charges on the total number of operations a customer performs. CSPs that fit this pricing model include, for example, Microsoft Azure Table.

3) Software license costs: The CSP might offer some software for free. Observe that while specific software, like database management systems or MapReduce implementations, is billed on an hourly basis, most operating systems are priced per instance and charged to customers.

4) Intra-network costs: The majority of providers offer unlimited intra-cloud network bandwidth usage. In essence, there is no information available on node interconnectivity within a data center. Be aware that intra-network bandwidth, for both SQL and NoSQL solutions, is crucial for the distributed processing of OLAP workloads.

5) WAN cost: Fees for accessing the wide-area delivery network are determined by how much data is sent to end users across the cloud's borders. Currently, the majority of providers charge about the same amounts for this service, with data upload being free and data download being paid.

6) SaaS services: SaaS analytics offers differ from IaaS and PaaS analytics offers, in that the price model takes the cost of the service into account. For example, BigQuery [17] bases the cost of storage resources on data volume, and the cost of workload processing on the quantity of bytes returned for each business question.

4. Data Warehouse Systems
Through the development of quantitative processes that enable a
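The instance-based and CPU cycle-based billing schemes described in the pricing discussion above can be sketched as follows. This is a minimal illustration only: the rates and usage figures are invented placeholders, not real CSP prices.

```python
# Illustrative comparison of the two CPU pricing models:
# instance-based billing charges for allotted instance-hours whether or
# not the instances are busy; cycle-based billing charges only for the
# CPU cycles actually consumed. All prices below are hypothetical.

def instance_based_cost(num_instances: int, hours: float,
                        price_per_instance_hour: float) -> float:
    """Pay per allotted instance-hour, regardless of utilization."""
    return num_instances * hours * price_per_instance_hour

def cycle_based_cost(cpu_cycles: int, price_per_megacycle: float) -> float:
    """Pay only for the CPU cycles the application consumes."""
    return (cpu_cycles / 1e6) * price_per_megacycle

# A lightly loaded application: 4 instances reserved for 10 hours,
# but only 2.4e9 CPU cycles actually consumed during that window.
fixed = instance_based_cost(4, 10.0, 0.10)       # 4 * 10 * $0.10 = $4.00
usage = cycle_based_cost(2_400_000_000, 0.001)   # 2400 Mcycles * $0.001 ≈ $2.40
print(f"instance-based: ${fixed:.2f}, cycle-based: ${usage:.2f}")
```

For an under-utilized workload the cycle-based scheme is cheaper, which is exactly the trade-off a benchmark pricing metric has to capture.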
Fig. 2.2: Response Times of OLAP queries across Cluster Size.
2) Scalability: Scalability is a system's capacity to increase overall throughput when faced with a heavier load and more hardware resources. Cloud services should ideally scale linearly, with a fixed cost per processed business question. The current TPC-H implementation gauges a system's ability to handle a static workload. We suggest that the benchmark for data warehousing evaluate the system being tested under a constant load and calculate the resulting throughput. To quantify this requirement, we can change the workload on a time scale, say every hour, and count the number of business questions answered throughout that time period.

While a non-scalable system answers fewer business questions under a larger load, a scalable system should keep the number of business questions handled within a time period the same.

3) Elasticity: To adapt the system capacity at runtime to the changing workload, elasticity adds and removes resources without affecting service. First, the metric should evaluate the system's ability to add or remove resources without affecting service, and if it does, it should report both the scaling latency (the time it takes for a system to scale up or down horizontally) and the scale-up cost (the price of newly acquired resources, in dollars) or the scale-down gain (the price of newly released resources, in dollars), as appropriate.

4) High Availability: The likelihood of a distributed storage system failing increases when data is scattered over several drives. There are many methods that can be used to construct highly available distributed data storage systems. They typically employ parity calculus or replication. The former makes use of systematic erasure codes, such as Tornado, Low-Density Parity-Check, and Reed-Solomon (RS) codes. Data management is simple with replication; however, replication always has a higher storage expense than systematic erasure codes.

Erasure codes can offer services with less storage overhead than replication methods when a specific level of availability is targeted. High availability with erasure codes reduces storage costs for data warehousing, especially for massive data of the write-once kind (i.e., not subject to delete refreshes). Data recovery, however, is trickier than with replication. Erasure codes have been examined and shown to be effective for grid systems and for highly available distributed storage systems [18, 19]. The storage space requirements of several file high-availability strategies, namely replication and erasure codes, are shown in Fig. 3. In our example, we show m = 4 blocks of a data file stored in such a way that any (n − m) = 2 missing blocks can be tolerated; the values n = 6 and m = 4 are used as an example. With replication, k copies of the entire file are stored in separate places. The group of data blocks is 2-available through replication with a redundancy overhead of 200%, versus the same group of data blocks being 2-available through erasure codes with a redundancy overhead of 50%.
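The replication-versus-erasure-code comparison above reduces to simple redundancy arithmetic. The following sketch uses the paper's own (n, m) = (6, 4) example; the helper functions themselves are hypothetical, introduced only for illustration.

```python
# Redundancy overhead of replication vs. a systematic (n, m) erasure
# code, where m data blocks are expanded into n coded blocks and any
# (n - m) missing blocks can be tolerated.

def replication_overhead(extra_copies: int) -> float:
    """k extra full copies cost k times the original size: k * 100% overhead."""
    return float(extra_copies)

def erasure_overhead(n: int, m: int) -> float:
    """An (n, m) code stores n blocks for m blocks of data: (n - m) / m overhead."""
    return (n - m) / m

# 2-available group of m = 4 data blocks, as in the Fig. 3 example:
print(replication_overhead(2))   # 2.0, i.e. 200% overhead (two extra copies)
print(erasure_overhead(6, 4))    # 0.5, i.e. 50% overhead (two parity blocks)
```

Both schemes tolerate two lost blocks, but the erasure code pays for two extra blocks rather than two extra copies of all four blocks, hence 50% versus 200%.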
|  | Data Warehouse System deployed in the cloud, OLAP workload | Operational System deployed in the cloud, Web-shop/OLTP-like workload |
|---|---|---|
| Goals from customer perspective | Fast upload and download of huge data sets; browse a multidimensional view of data in mere seconds | Good responsivity to the user's interactions (transaction branches) and validation of the interaction (transaction) |
| Added value | Horizontal scale-up: more Bytes processed per hour (+BpH) | More Web interactions processed per second (+wips) |
| Cost | High storage cost for the Data Warehouse System; complex and costly workload (CPU-, I/O-, and network-bound applications) | Storage cost required by the Operational System; simple workload |
| Metric of interest | $/BpH (cost of Bytes processed per hour) | $/wips (cost of Web interactions processed per second) |
| Recommended high-availability schema | Systematic erasure codes & replication | Replication |
| Distributed processing features | Intra-parallelism within a business question | Inter-parallelism among interactions |
| Risks under peak loads | The company may not make decisions in time ($) | Unexpected performance problems and loss of responsivity to end-users ($) |
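The two cost metrics contrasted above, $/BpH for the warehouse and $/wips for the operational system, can be illustrated with a small sketch. The hourly cost and throughput figures are invented for the example, not measurements.

```python
# Illustrative computation of the table's two metrics of interest:
# $/BpH (dollars per byte-per-hour of OLAP throughput) and
# $/wips (dollars per web interaction processed per second).

def dollars_per_bph(hourly_cost_usd: float, bytes_per_hour: float) -> float:
    """Cost of Bytes processed per hour: spend divided by scan throughput."""
    return hourly_cost_usd / bytes_per_hour

def dollars_per_wips(hourly_cost_usd: float, interactions_per_second: float) -> float:
    """Cost of Web interactions processed per second."""
    return hourly_cost_usd / interactions_per_second

# A hypothetical cluster costing $8/hour that scans 2 TB per hour:
print(dollars_per_bph(8.0, 2e12))    # 4e-12 dollars per byte-per-hour
# A hypothetical operational tier costing $8/hour sustaining 400 wips:
print(dollars_per_wips(8.0, 400.0))  # 0.02 dollars per wips
```

Under linear scaling, doubling the cluster should double BpH at double the hourly cost, leaving $/BpH constant, which is exactly the benchmark requirement stated in the scalability discussion.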
Adv Bioeng Biomed Sci Res, 2023 https://opastpublishers.com Volume 6 | Issue 9 | 158